
Well, this is embarrassing.

I'm trying to do something fairly straightforward: conduct a robustness check by seeing if the correlation between x and y is removed if the values for x are "mismatched" with y. I'm trying to do this by creating a third variable z which "mixes up" the existing values of 'x' at random. While this is a similar question to the one previously answered here, my data are in long form so I need to randomize WITHIN an id variable.

For example, my dataset might be:

x    y    id
1    4    1
1    5    1
2    8    1
2    8    1
3    12   1
3    11   1
4    16   1
4    15   1
1    4    2
1    5    2
2    8    2
2    8    2
3    12   2
3    11   2
4    16   2
4    15   2
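
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the example data above in R. The data frame name `mydf` is only an assumption, chosen to match the name used in the answer further down.

```r
# Rebuild the example data: two ids, each with x = 1..4 repeated twice.
mydf <- data.frame(
  x  = rep(rep(1:4, each = 2), times = 2),
  y  = rep(c(4, 5, 8, 8, 12, 11, 16, 15), times = 2),
  id = rep(1:2, each = 8)
)
```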

What I'd like to do is to create a new variable z which essentially "mixes up" the values of x (but is based on the actual values of x, NOT a random variable within a certain range):

x    y    id   z
1    4    1    2
1    5    1    3
2    8    1    1
2    8    1    4
3    12   1    4
3    11   1    3
4    16   1    2
4    15   1    1
1    4    2    1
1    5    2    1
2    8    2    3
2    8    2    3
3    12   2    4
3    11   2    4
4    16   2    2
4    15   2    2

How on earth do I do this? I started out thinking it was a simple task, but then got very very confused.

SUPER-DUPER-BONUS-QUESTION:

Finally, as the careful reader will note, the data are in long form (each id has 8 rows) but they are also grouped by x (which has 4 values per id). In other words, each person has 8 observed outcomes of y, but only 4 distinct values of the predictor x. In a perfect world, I'd be able to create a function where z mixed up the values of x within id, but never assigned a row its own value of x.

In other words, if x=1, then z=2,3, or 4 but NOT 1. It is a subtle difference, but a potentially meaningful one!

x    y    id   z
1    4    1    2
1    5    1    3
2    8    1    1
2    8    1    4
3    12   1    4
3    11   1    2
4    16   1    3
4    15   1    1
1    4    2    3
1    5    2    3
2    8    2    1
2    8    2    1
3    12   2    4
3    11   2    4
4    16   2    2
4    15   2    2
roody
  • Try `data$z<-sample(data$x)` – mrip Nov 11 '13 at 02:52
  • Ah, yes. So "embarrassing" WAS the right adjective. Thanks @Frank and @mrip! – roody Nov 11 '13 at 02:54
  • @Frank and @mrip: New question...how would I do this WITHIN values of another variable, i.e. there was a set of x values for a person with `id`? (I've modified the question above!) – roody Nov 11 '13 at 03:06
  • You sure you don't want to make a new question? I would convert the data to a `data.table` (requiring a package of the same name), and then do `DT[,z:=sample(x),by=id]`, but I won't post that here in case you want to make a new question. – Frank Nov 11 '13 at 03:22
  • or `plyr::ddply(data,"id",transform,z=sample(x))` – Ben Bolker Nov 11 '13 at 03:31
  • possible duplicate of [Random rows in dataframe in R](http://stackoverflow.com/questions/8273313/random-rows-in-dataframe-in-r) – CHP Nov 11 '13 at 03:54
  • Thanks everyone! To make things more clear for future readers, I am going to reframe my question so that it more clearly leads to the answer provided (and isn't a duplicate!). I also added an extension to this question (regarding more complicated logic), if anyone has thoughts :) – roody Nov 11 '13 at 04:46
  • @Frank-My Q actually became more complicated; in addition to sampling within ID, I need to make it so that the values of z != x (if x=1, then z=2,3, or 4). Do you have any ideas for an elegant `data.table` solution? – roody Nov 12 '13 at 04:24
  • @roody: Oops, that dash after my name on your last comment meant that I didn't get pinged. I guess you know, but just to clarify: if you have a new super-duper extension, it's probably best (for you, us, future visitors, the site) to write a new question elsewhere and leave the first alone... :) But anyway, I think there may be a way to do what you're looking for with `sample` and `combn` (though I don't have much experience with the latter). – Frank Nov 12 '13 at 12:46
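
The within-id shuffling suggested in the comments above can also be sketched in base R with `ave()`, with no extra packages. This is only an illustration: the data frame name `mydf` and the seed are assumptions, and the example data are rebuilt so the block runs on its own.

```r
# Example data from the question (rebuilt here so the block is self-contained).
mydf <- data.frame(
  x  = rep(rep(1:4, each = 2), times = 2),
  y  = rep(c(4, 5, 8, 8, 12, 11, 16, 15), times = 2),
  id = rep(1:2, each = 8)
)

set.seed(42)  # arbitrary seed, only so the shuffle is repeatable

# ave() applies FUN to the x values of each id separately and puts the
# results back in the original row order, so z is x shuffled within id.
mydf$z <- ave(mydf$x, mydf$id, FUN = sample)
```

Each id keeps exactly the same set of x values in z, just in a different order. Nothing here stops a row from receiving its own x value again.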

1 Answer


Update (actually, entirely new answer) for a new question

Nothing came to my mind immediately, so I thought I should just propose a while-based solution. This function checks whether any element of the shuffled result is equal to the value at the same position in the input vector. If so, it runs sample again and retries.

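# Reshuffle inVec until no element equals the original value at its position.
# Caution: this never returns if no such arrangement exists, for example
# when a single value makes up more than half of inVec.
Shuffled <- function(inVec) {
  repeat {
    Res <- sample(inVec)
    if (!any(Res == inVec)) break
  }
  Res
}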

set.seed(1)
mydf$z <- ave(mydf$x, mydf$id, FUN = Shuffled)

mydf
#    x  y id z
# 1  1  4  1 2
# 2  1  5  1 4
# 3  2  8  1 4
# 4  2  8  1 3
# 5  3 12  1 2
# 6  3 11  1 1
# 7  4 16  1 3
# 8  4 15  1 1
# 9  1  4  2 2
# 10 1  5  2 2
# 11 2  8  2 3
# 12 2  8  2 4
# 13 3 12  2 4
# 14 3 11  2 1
# 15 4 16  2 1
# 16 4 15  2 3

any(mydf$x == mydf$z)
# [1] FALSE
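
If the loop above appears to hang on real data, one possible cause is a group in which no valid arrangement exists, or in which valid arrangements are so rare that random retries almost never find one. The following is a minimal sketch of a bounded variant, not the original method: the names `ShuffledSafe` and `maxTries` are hypothetical, and the sketch assumes x contains no missing values.

```r
# Hypothetical variant of Shuffled() that gives up instead of looping forever.
ShuffledSafe <- function(inVec, maxTries = 1000) {
  # If one value fills more than half the vector, every shuffle must leave
  # at least one copy of it on a position that already holds that value.
  if (max(table(inVec)) > length(inVec) / 2) {
    stop("no shuffle can move every value away from itself")
  }
  for (i in seq_len(maxTries)) {
    # Shuffle positions rather than values, so a length-one numeric
    # vector is never expanded to 1:n by sample().
    Res <- inVec[sample.int(length(inVec))]
    if (!any(Res == inVec)) return(Res)
  }
  stop("no valid shuffle found in ", maxTries, " tries")
}

# Usage on data shaped like the question's example.
mydf <- data.frame(x = rep(rep(1:4, each = 2), times = 2),
                   id = rep(1:2, each = 8))
set.seed(1)
mydf$z <- ave(mydf$x, mydf$id, FUN = ShuffledSafe)
```

An error from `ShuffledSafe` then points at a group that cannot be shuffled this way, or only rarely, rather than leaving R running indefinitely.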
A5C1D2H2I1M1N2O1R2T1
  • Hi @Ananda - thanks for your answer! To resolve the duplication concern, I made my question more specific. I also added an extension of the logic to the question, if you have any thoughts. (Also, is this kind of editing bad etiquette? I'm happy to start a new question if so). – roody Nov 11 '13 at 04:48
  • @roody, within a certain time-frame, based on comments, I don't feel it's bad etiquette to edit your question, though your current question is *dramatically* different from your first one. – A5C1D2H2I1M1N2O1R2T1 Nov 11 '13 at 04:56
  • @roody, see my updated answer. I think it captures what you were looking for. – A5C1D2H2I1M1N2O1R2T1 Nov 11 '13 at 16:29
  • hi @Ananda! I follow conceptually, but when I try to run the function my computer is freezing. Is the following the correct way to apply your function using data.table? `mydf[, eval(Shuffled(c('x','id'))),]`? – roody Nov 12 '13 at 04:04
  • @roody, No. That is not the correct way for using "data.table". First your object has to be a `data.table`. (`library(data.table); DT <- data.table(mydf)`). Next, you have to use `:=` to create your new column: (`DT[, z := Shuffled(x), by = id]; DT`). – A5C1D2H2I1M1N2O1R2T1 Nov 12 '13 at 04:39
  • Hi @Ananda Mahto - I've tried both using base R and data.table, but my R is timing out and doesn't actually produce a new variable. Any thoughts on what I might be doing wrong? – roody Nov 15 '13 at 19:39