
My aim was to write a function that splits a dataset into two parts (a training set and a test set) according to a given percentage, while preserving that percentage within each of the defined groups. Here is the function, to make this concrete:

split.g <- function (df, group, pc = 0.75) {
  group <- as.factor(df$group) 

  list.df.g <- list()
  list.df.g.train <- list()
  list.df.g.test <- list()
  for (i in 1 : length(levels(group))) {
    list.df.g[[i]] <- subset(df, group == levels(group)[i])
    list.df.g.train[[i]] <- list.df.g[[i]][sample(nrow(list.df.g[[i]]), round((nrow(list.df.g[[i]])*pc), 0), replace = FALSE), ]
    list.df.g.test[[i]] <- list.df.g[[i]][-(which(rownames(list.df.g[[i]]) %in% rownames(list.df.g.train[[i]]))), ]
  }

  list(do.call("rbind", list.df.g.train), do.call("rbind", list.df.g.test))

}

When I run this function with my data frame, I get the following error:

Error in list.df.g[[i]] <- subset(df, group == levels(group)[i]) :
  attempt to select less than one element

However, with a slight change to the function code, it works well:

split.g <- function (df, group, pc = 0.75) {
  group <- as.factor(df[, which(colnames(df) == group)])

  list.df.g <- list()
  list.df.g.train <- list()
  list.df.g.test <- list()
  for (i in 1 : length(levels(group))) {
    list.df.g[[i]] <- subset(df, group == levels(group)[i])
    list.df.g.train[[i]] <- list.df.g[[i]][sample(nrow(list.df.g[[i]]), round((nrow(list.df.g[[i]])*pc), 0), replace = FALSE), ]
    list.df.g.test[[i]] <- list.df.g[[i]][-(which(rownames(list.df.g[[i]]) %in% rownames(list.df.g.train[[i]]))), ]
  }

  list(do.call("rbind", list.df.g.train), do.call("rbind", list.df.g.test))

}

The only change is in the second line. With `$`, the function does not work, and I do not understand why. Does anybody have an answer?
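For reference, the difference between the two versions can be reproduced outside the function. The sketch below is only an illustration: the data frame and the column name `species` are made up, and it assumes the function is called with the column name as a string and that the data frame has no column literally named `group`. Under those assumptions, `df$group` returns `NULL`, so the factor has no levels and `1 : length(levels(group))` counts down from 1 to 0 instead of being empty. The last line uses `seq_along()`, which is a suggested alternative and not part of the original function.

```r
# Hypothetical example data; the column name "species" is an assumption.
df <- data.frame(species = c("a", "a", "b", "b"), value = 1:4)
group <- "species"

# `$` does not evaluate `group`; it looks for a column literally named "group".
by_dollar <- df$group
is.null(by_dollar)  # TRUE: df has no column called "group"

# A factor built from NULL has no levels, so 1:length(levels(...)) is 1:0,
# i.e. the two values 1 and 0. The loop therefore reaches
# list.df.g[[0]] <- ..., which is where the error is raised.
loop_index <- 1:length(levels(as.factor(by_dollar)))

# `[[` evaluates `group` and uses its value ("species") as the column name.
by_bracket <- as.factor(df[[group]])
levels(by_bracket)

# seq_along() gives an empty sequence for zero levels, so a loop over it
# would simply be skipped instead of failing at index 0.
safe_index <- seq_along(levels(as.factor(by_dollar)))
```

Here `df[[group]]` selects the same column as `df[, which(colnames(df) == group)]` in the working version.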

  • I would recommend using the `createDataPartition` function from the excellent `caret` package; see `library(caret); ?createDataPartition` – Silence Dogood Feb 21 '17 at 17:29
  • Be sure to read all the answers at the linked duplicate, not just the accepted one. The second answer in particular is probably the most directly relevant. – joran Feb 21 '17 at 17:31
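A rough sketch of what the `createDataPartition` suggestion might look like for a grouped 75/25 split. It assumes the `caret` package is installed; the data frame and the column name `species` are invented for illustration and do not come from the question.

```r
library(caret)  # assumes the caret package is installed

set.seed(1)

# Hypothetical data: 40 rows in group "a" and 20 rows in group "b".
df <- data.frame(species = rep(c("a", "b"), times = c(40, 20)),
                 value = rnorm(60))

# For a factor, createDataPartition() samples within each level, so the
# training share is roughly p in every group. list = FALSE returns the
# selected row indices as a one-column matrix instead of a list.
train_idx <- createDataPartition(factor(df$species), p = 0.75, list = FALSE)

train <- df[train_idx, ]
test <- df[-train_idx, ]

# Per-group row counts in each part.
table(train$species)
table(test$species)
```

The grouping column is passed as a vector here, so the column-name lookup issue from the question does not come up.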
