My aim is to write a function that splits a dataset in two (a training and a test dataset) by a given percentage, while keeping that percentage within each level of a grouping variable. Here is the function, to make it clearer:
split.g <- function (df, group, pc = 0.75) {
  group <- as.factor(df$group)
  list.df.g <- list()
  list.df.g.train <- list()
  list.df.g.test <- list()
  for (i in 1 : length(levels(group))) {
    list.df.g[[i]] <- subset(df, group == levels(group)[i])
    list.df.g.train[[i]] <- list.df.g[[i]][sample(nrow(list.df.g[[i]]), round((nrow(list.df.g[[i]])*pc), 0), replace = F), ]
    list.df.g.test[[i]] <- list.df.g[[i]][-(which(rownames(list.df.g[[i]]) %in% rownames(list.df.g.train[[i]]))), ]
  }
  list(do.call("rbind", list.df.g.train), do.call("rbind", list.df.g.test))
}
When I run this function with my dataframe I get the following error:
Error in list.df.g[[i]] <- subset(df, group == levels(group)[i]) :
attempt to select less than one element
However, with a slight change in the function code, it works well:
split.g <- function (df, group, pc = 0.75) {
  group <- as.factor(df[, which(colnames(df) == group)])
  list.df.g <- list()
  list.df.g.train <- list()
  list.df.g.test <- list()
  for (i in 1 : length(levels(group))) {
    list.df.g[[i]] <- subset(df, group == levels(group)[i])
    list.df.g.train[[i]] <- list.df.g[[i]][sample(nrow(list.df.g[[i]]), round((nrow(list.df.g[[i]])*pc), 0), replace = F), ]
    list.df.g.test[[i]] <- list.df.g[[i]][-(which(rownames(list.df.g[[i]]) %in% rownames(list.df.g.train[[i]]))), ]
  }
  list(do.call("rbind", list.df.g.train), do.call("rbind", list.df.g.test))
}
The only change is in the second line of the function body. With df$group the function fails, while with df[, which(colnames(df) == group)] it works, and I do not understand why. Does anybody have an explanation?
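To narrow the difference down, here is a minimal sketch of the two lookups outside the function. It assumes a toy data frame with made-up columns `species` and `value` (not my real data) and that the grouping column is passed as a character string, as in the working version. It suggests that `$` looks for a column literally named "group" rather than using the value of the variable, which would leave the factor with zero levels and make `1 : length(levels(group))` count down from 1 to 0:

```r
# Toy data frame; the column names are hypothetical, chosen for illustration.
df <- data.frame(species = c("a", "a", "b", "b"), value = 1:4)
group <- "species"  # column name held in a variable, as in the function call

# `$` does not evaluate its right-hand side: it looks for a column
# literally called "group", finds none, and returns NULL.
by_dollar <- df$group

# `[[` evaluates the variable and looks up the column "species".
by_bracket <- df[[group]]

# A factor built from NULL has zero levels ...
n_levels <- length(levels(as.factor(by_dollar)))

# ... so 1:n_levels is c(1, 0) rather than an empty sequence, and the
# loop eventually reaches i = 0, where list.df.g[[0]] <- ... fails.
idx <- 1:n_levels

# seq_len() yields an empty sequence when there are no levels.
safe_idx <- seq_len(n_levels)

print(is.null(by_dollar))
print(idx)
print(length(safe_idx))
```

If this reading is right, `df[[group]]` would be a shorter equivalent of `df[, which(colnames(df) == group)]`, and `seq_along(levels(group))` would be a safer loop range than `1 : length(levels(group))`.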