-2

I have a full sample from 1986-1995 each year and 1997-2015 every other year. And it looks like: enter image description here I also have a variable lists for each year and it looks like:enter image description here

My purpose is to subset the whole dataset into subsample datasets for each year by using their variable names in variable lists (2nd picture).

My attempt is as following:

name1986 <- var_name_list$X1986
name1987 <- var_name_list$X1987
name1988 <- var_name_list$X1988
name1989 <- var_name_list$X1989
name1990 <- var_name_list$X1990
name1991 <- var_name_list$X1991
name1992 <- var_name_list$X1992
name1993 <- var_name_list$X1993
name1994 <- var_name_list$X1994
name1995 <- var_name_list$X1995
name1996 <- var_name_list$X1996
name1997 <- var_name_list$X1997
name1999 <- var_name_list$X1999
name2001 <- var_name_list$X2001
name2003 <- var_name_list$X2003
name2005 <- var_name_list$X2005
name2007 <- var_name_list$X2007
name2009 <- var_name_list$X2009
name2011 <- var_name_list$X2011
name2013 <- var_name_list$X2013
name2015 <- var_name_list$X2015

And it works well: enter image description here However I got stuck in the following:

seq_yr = c(seq(1986,1996),seq(1997,2015,by=2))
for (Number in seq_yr){
  dataname <- sprintf("name",Number)
  file<- subset(data,select = c(ID,which(colnames(data) %in% dataname)))
  # assign value to the variable name x 
  assign(x=str_c("data.",Number),value=file,envir=.GlobalEnv)  
}

This successfully gives me a few subsample datasets but each of them only includes one variable. And this is what shows in my workspace:

enter image description here

Could you please let me know what went wrong with my code?

I would like to attach the part of my data in the following link:

https://drive.google.com/open?id=19KMl6eNsCEfcRJxWGoyWJSb34kPhmJaW

data4 is the data and header4 is the part of variable lists.

Thanks a lot!!! Please comment if you have any concerns regarding my question.

  • Please use `dput` to make your data reproducible. Then we can work off your existing data rather than create a new set. Nice writeup though. – Kamil Apr 03 '18 at 21:29
  • sorry about that. Let me try to revise this. – Fangzhu Zhou Apr 03 '18 at 21:37
  • Hi @Fangzhu, if your data file is that big, often times it's better to do a write.csv(). I tried copying and pasting your `dput` data and while I'm sure it will work, it was troublesome getting a good format. Also, you did not provide a file extension to your google drive files. Try a write.csv(df, "df_raw.csv") and name the other one `def_keys`, for example. – Kamil Apr 03 '18 at 22:07

1 Answers1

1

Conceptually, you'd like to do the following:

  1. Develop the base case
  2. Turn the base case into a function
  3. Scale for the generic solution

Let's go through them, but first, a reproducible example:

> df_raw <- data.frame(v1 = 1:10, v2 = 11:20, v3 = 21:30, v4 = 31:40, v5 = 41:50, v6 = 51:60)
> 
> df_keys <- data.frame(colNames = c('name1', 'name2', 'name3'),
+                       year1 = c('v1', 'v2', 'v3'),
+                       year2 = c('v4','v5','v6'))
> 
> 
> df_raw
   v1 v2 v3 v4 v5 v6
1   1 11 21 31 41 51
2   2 12 22 32 42 52
3   3 13 23 33 43 53
4   4 14 24 34 44 54
5   5 15 25 35 45 55
6   6 16 26 36 46 56
7   7 17 27 37 47 57
8   8 18 28 38 48 58
9   9 19 29 39 49 59
10 10 20 30 40 50 60
> df_keys
  colNames year1 year2
1    name1    v1    v4
2    name2    v2    v5
3    name3    v3    v6

Base Case

Always start with your base case. Create an empty data frame and change the names to the column names to make it readable. The first column of your df_key data frame shows variable names, and the second one shows one year of variable names.

> colNames <- df_keys[,1]
> yearNames <- df_keys[,2]
> new_df <- data.frame(matrix(ncol = length(yearNames), nrow = 0))
> new_df <- df_raw[,yearNames]
> names(new_df) <- colNames
> new_df
   name1 name2 name3
1      1    11    21
2      2    12    22
3      3    13    23
4      4    14    24
5      5    15    25
6      6    16    26
7      7    17    27
8      8    18    28
9      9    19    29
10    10    20    30

Create a Function

Honestly, this is the easiest step. Just put brackets around it and rename a few key names.

subDF <- function(df_raw, df_keys, nameIndex = 1, yearIndex = 2){

  colnames <- df_keys[,nameIndex]
  yearNames <- df_keys[,yearIndex]
  new_df <- data.frame(matrix(ncol = length(yearNames), nrow = 0))
  new_df <- df_raw[,yearNames]
  names(new_df) <- colNames
  new_df
}

> subDF(df_raw, df_keys)
   name1 name2 name3
1      1    11    21
2      2    12    22
3      3    13    23
4      4    14    24
5      5    15    25
6      6    16    26
7      7    17    27
8      8    18    28
9      9    19    29
10    10    20    30

Scale

This can be one of many methods. It's simple enough to have two statements.

> allYears <- seq(dim(df_keys)[2]-1)+1
> lapply(allyears, function(x) {subDF(df_raw, df_keys, yearIndex = x)})
[[1]]
   name1 name2 name3
1      1    11    21
2      2    12    22
3      3    13    23
4      4    14    24
5      5    15    25
6      6    16    26
7      7    17    27
8      8    18    28
9      9    19    29
10    10    20    30

[[2]]
   name1 name2 name3
1      1    11    21
2      2    12    22
3      3    13    23
4      4    14    24
5      5    15    25
6      6    16    26
7      7    17    27
8      8    18    28
9      9    19    29
10    10    20    30
Kamil
  • 412
  • 4
  • 11