I have a list of files that look like this:

FILE10

Count_S10        
  GeneA      0.3      
  GeneB      0.4
  GeneC      0.9         
  GeneD      0.1       

.......................

FILE8

Count_S8        
  GeneA      0.22
  GeneB      0.76
  GeneC      0.2         
  GeneD      0.01       

.......................

FILE13

Count_S13        
  GeneA      0.2      
  GeneB      0.04
  GeneC      0.19         
  GeneD      0.111       

.......................

In total I have 100 files of 5000 rows each. The first column of each file has a header while the second column does not. Moreover, the files in the folder are not sorted in ascending order. I would simply like the following output:

FILE1

Gene_List   Count_S8   Count_S10   Count_S13
  GeneA      0.22         0.3         0.2
  GeneB      0.76         0.4         0.04
  GeneC      0.2          0.9         0.19
  GeneD      0.01         0.1         0.111        

Here, only Files 8, 10, 13 are shown as an example.

Can anyone help me please?

Thank you in advance

NewUsr_stat

1 Answer

I've got those example files saved in my "example_files" directory. First, get those files as a list:

files <- list.files(path = "example_files", full.names = TRUE)

> files
[1] "example_files/File10.txt" "example_files/File13.txt" "example_files/File8.txt" 

Get them ordered numerically (as in your expected output):

files <- files[order(as.numeric(gsub(".*File|\\.txt$", "", files)))]

> files
[1] "example_files/File8.txt"  "example_files/File10.txt" "example_files/File13.txt"
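To see why the numeric step is needed, here is a minimal sketch. The file names are made up to mirror the ones above, and `sub()` with a capture group is just one way to pull out the number. A plain `sort()` compares names character by character, so File10 ends up before File8:

```r
# Hypothetical file names mirroring the example directory
files <- c("example_files/File10.txt",
           "example_files/File13.txt",
           "example_files/File8.txt")

# Plain sort() is lexical: "1" sorts before "8", so File10 and File13 precede File8
lexical <- sort(files)

# Extract the digits between "File" and ".txt", then order by their numeric value
file_numbers <- as.numeric(sub(".*File([0-9]+)\\.txt$", "\\1", files))
numeric_order <- files[order(file_numbers)]

print(lexical)
print(numeric_order)
```

This assumes every name follows the `File<number>.txt` pattern; a name that does not match would give `NA` (with a warning) and be placed last by `order()`.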

This function takes the first line as the source name, then uses read.table to get the actual data, skipping the first line. It then assigns the names correctly for a merge later:

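read_file <- function(filename) {
  # The first line holds the sample name (e.g. "Count_S8"); trim stray whitespace
  sample_name <- trimws(readLines(filename, n = 1))
  # The remaining lines are whitespace-separated gene/value pairs
  df_ <- read.table(filename, skip = 1, sep = "", stringsAsFactors = FALSE)
  names(df_) <- c("Gene_List", sample_name)
  df_
}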
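As a rough illustration of the two reading steps inside that function, the sketch below writes a small throwaway file (the path and its contents are made up, in the same layout as the examples) and reads it back piece by piece:

```r
# Write a small throwaway file in the same layout as the examples (made-up content)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Count_S8        ",
             "  GeneA      0.22",
             "  GeneB      0.76"), tmp)

# Step 1: the first line is the sample name; trimws() drops trailing spaces
header <- trimws(readLines(tmp, n = 1))

# Step 2: skip that line and read the rest; sep = "" splits on any run of whitespace
body <- read.table(tmp, skip = 1, sep = "", stringsAsFactors = FALSE)
names(body) <- c("Gene_List", header)

str(body)
unlink(tmp)
```

Giving every data frame the same `Gene_List` key column and a unique value column is what lets the later merge line the files up by gene.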

Now, you can call that function over your list of files:

list_of_files <- lapply(files, read_file)

> list_of_files
[[1]]
  Gene_List Count_S8
1     GeneA     0.22
2     GeneB     0.76
3     GeneC     0.20
4     GeneD     0.01

[[2]]
  Gene_List Count_S10
1     GeneA       0.3
2     GeneB       0.4
3     GeneC       0.9
4     GeneD       0.1

[[3]]
  Gene_List Count_S13
1     GeneA     0.200
2     GeneB     0.040
3     GeneC     0.190
4     GeneD     0.111

Now, use Reduce and merge to merge the list together (as per this answer):

> Reduce(function(x, y) merge(x, y, all = TRUE), list_of_files)
  Gene_List Count_S8 Count_S10 Count_S13
1     GeneA     0.22       0.3     0.200
2     GeneB     0.76       0.4     0.040
3     GeneC     0.20       0.9     0.190
4     GeneD     0.01       0.1     0.111
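For reference, here is a self-contained sketch of the Reduce/merge step that also saves the result. The three data frames are made-up stand-ins for the output of `lapply(files, read_file)`, `merged_counts.txt` is a hypothetical file name, and the explicit `by = "Gene_List"` is my addition to make the merge key unambiguous:

```r
# Made-up stand-ins for the data frames returned by lapply(files, read_file)
list_of_files <- list(
  data.frame(Gene_List = c("GeneA", "GeneB"), Count_S8  = c(0.22, 0.76)),
  data.frame(Gene_List = c("GeneA", "GeneB"), Count_S10 = c(0.3, 0.4)),
  data.frame(Gene_List = c("GeneB", "GeneC"), Count_S13 = c(0.04, 0.19))
)

# Reduce folds the list left to right: merge(merge(df1, df2), df3).
# all = TRUE keeps genes that are missing from some files and fills the gaps with NA.
merged <- Reduce(function(x, y) merge(x, y, by = "Gene_List", all = TRUE),
                 list_of_files)
print(merged)

# Save as a tab-separated table; the file name here is hypothetical
out_file <- file.path(tempdir(), "merged_counts.txt")
write.table(merged, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

If every file is known to contain exactly the same genes, `all = TRUE` makes no difference; it only matters when some genes are absent from some files.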
Luke C