I have data like this:

library(dplyr)

x <- tribble(~label_a, ~label_b, ~value,
    "a", "x", 1,
    "a", "x", 3,
    "a", "y", 6,
    "a", "y", 8,
    "b", "x", 10,
    "b", "x", 30,
    "b", "y", 60,
    "b", "y", 80
)

Using dplyr, I would like to group the data by label_a, compute the mean of value over only the rows where label_b is equal to "x", and add that mean as a new column on every row of the corresponding label_a group, to obtain something like this:

label_a | label_b | value | x_mean
----------------------------------
a       | x       | 1     | 2
a       | x       | 3     | 2
a       | y       | 6     | 2
a       | y       | 8     | 2
b       | x       | 10    | 20
b       | x       | 30    | 20
b       | y       | 60    | 20
b       | y       | 80    | 20

But I have no idea how to do this. Thank you in advance for your help.

thelatemail
  • Can you provide data in an R-friendly format? Perhaps either `data.frame(...)` or `dput(x)` would be the clearest. Thanks! – r2evans Jun 15 '19 at 23:49
  • I'm new here and I'm still learning how to use the interface. I have edited my question to be more readable. – Ali Jarbawi Jun 16 '19 at 00:00

2 Answers

I see two ways of doing this: an easy way that gets much harder to maintain if the labels have (or will get) more levels, and one that takes a little more code but should be more robust on larger data.

1: Easy but scales poorly

x %>%
  group_by(label_a) %>%
  mutate(x_mean = mean(if_else(label_b == "x", value, value[NA]), na.rm = TRUE)) %>%
  ungroup()
# # A tibble: 8 x 4
#   label_a label_b value x_mean
#   <chr>   <chr>   <dbl>  <dbl>
# 1 a       x           1      2
# 2 a       x           3      2
# 3 a       y           6      2
# 4 a       y           8      2
# 5 b       x          10     20
# 6 b       x          30     20
# 7 b       y          60     20
# 8 b       y          80     20

The crux of this one is that we take the mean of only the "x" values, omitting the rest. if_else gives a vectorized if/then-like statement, and value[NA] guarantees that we get the "right kind of NA", that is, one with the same type as value (R has several typed NAs, such as NA_real_, NA_integer_ and NA_character_).
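To make the "right kind of NA" point concrete, here is a minimal sketch. The vectors `value` and `label_b` below are illustrative stand-ins for one group of the data, not part of the original answer. It shows that indexing a numeric vector with `NA` yields numeric missing values, and that the typed constant `NA_real_` is an equivalent, more explicit spelling.

```r
library(dplyr)

# Illustrative stand-ins for the columns of a single label_a group
value   <- c(1, 3, 6, 8)
label_b <- c("x", "x", "y", "y")

class(NA)          # a bare NA is logical
class(value[NA])   # indexing with NA keeps the type of `value` (numeric)
length(value[NA])  # the logical NA index is recycled to the length of `value`

# Same masking as in the answer, written with the typed constant NA_real_
masked <- if_else(label_b == "x", value, NA_real_)
x_mean <- mean(masked, na.rm = TRUE)  # mean of the "x" rows only
```

Here `x_mean` is the average of 1 and 3, since the "y" rows were replaced by NA and then dropped by `na.rm = TRUE`.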

2: More robust

x %>%
  filter(label_b == "x") %>%
  group_by(label_a) %>%
  summarize(x_mean = mean(value)) %>%
  left_join(x, ., by = "label_a") %>%
  ungroup()
# # A tibble: 8 x 4
#   label_a label_b value x_mean
#   <chr>   <chr>   <dbl>  <dbl>
# 1 a       x           1      2
# 2 a       x           3      2
# 3 a       y           6      2
# 4 a       y           8      2
# 5 b       x          10     20
# 6 b       x          30     20
# 7 b       y          60     20
# 8 b       y          80     20

This one works by first creating a summary table:

x %>%
  filter(label_b == "x") %>%
  group_by(label_a) %>%
  summarize(x_mean = mean(value))
# # A tibble: 2 x 2
#   label_a x_mean
#   <chr>    <dbl>
# 1 a            2
# 2 b           20

and then merging it back into the original frame. In base R we would use merge; in SQL and dplyr-speak we use a join operator. (There are several examples of join-like operations in "How to join (merge) data frames (inner, outer, left, right)"; I encourage you to take a look and try some of the samples to see what is going on.)

It's not uncommon for the intermediate summary table to be stored in another variable and joined with the original frame in a different pipe. In the case here, I just did something like a "self-join" (though it is far from "itself" by the time we get to the join operation).
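As a sketch of that two-step variant, the summary can be kept in its own variable and joined afterwards. The names `x_means` and `result` are made up for illustration, and the data is the sample from the question.

```r
library(dplyr)

# Sample data from the question
x <- tribble(~label_a, ~label_b, ~value,
    "a", "x", 1,
    "a", "x", 3,
    "a", "y", 6,
    "a", "y", 8,
    "b", "x", 10,
    "b", "x", 30,
    "b", "y", 60,
    "b", "y", 80
)

# Step 1: one row per label_a, holding the mean of the "x" rows only
x_means <- x %>%
  filter(label_b == "x") %>%
  group_by(label_a) %>%
  summarize(x_mean = mean(value))

# Step 2: attach that mean to every row of the original frame
result <- left_join(x, x_means, by = "label_a")
```

Keeping `x_means` around makes it easy to inspect the summary or reuse it in other joins.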

r2evans

In case anyone looks for the data.table solution:

dt[, x_mean := mean(value[label_b == 'x'])
   , by = label_a]
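
The snippet above assumes that `dt` is already a data.table. A self-contained sketch, with the sample data from the question rebuilt by hand under that assumption, might look like this:

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dt <- data.table(
  label_a = rep(c("a", "b"), each = 4),
  label_b = rep(c("x", "x", "y", "y"), times = 2),
  value   = c(1, 3, 6, 8, 10, 30, 60, 80)
)

# Within each label_a group, average only the values whose label_b is "x"
# and assign the result by reference to a new column on every row
dt[, x_mean := mean(value[label_b == "x"]), by = label_a]
```

The same subsetting idiom, `mean(value[label_b == "x"])`, also works inside a grouped dplyr `mutate()`.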
Cole