Extracting Specific Columns from Multiple Files & Writing to File Python

Question

I have seven tab-delimited files. Each file has the same number of columns with the same names, but different data. Below is a sample of what any of the seven files looks like:

 test_id gene_id gene    locus   sample_1        sample_2        status  value_1 value_2 log2(fold_change)
  000001     000001     ZZ 1:1   01  01   NOTEST  0       0       0       0       1       1       no

I am trying to read all seven files, extract the third, fourth and tenth columns (gene, locus, log2(fold_change)), and write those columns to a new file, so that the file looks something like this:

gene name   locus   log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)
ZZ  1:1         0     0     0     0

All the log2(fold_change) values are obtained from the tenth column of each of the seven files.

This is what I have so far. I need help constructing a more efficient, Pythonic way to accomplish the task above. Note that the code does not yet accomplish the task and still needs some work:

 import csv
 import glob
 from collections import defaultdict

 dicti = defaultdict(list)
 filetag = []

 def read_data(file, base):
     with open(file, 'r') as f:
         reader = csv.reader(f, delimiter='\t')
         for row in reader:
             # Skip the header row
             if 'test_id' not in row[0]:
                 dicti[row[2]].append((base, row))

 name_of_fold = raw_input("Folder name to stored output files in: ")
 for file in glob.glob("*.txt"):
     base = file[0:3] + "-log2(fold_change)"
     filetag.append(base)
     read_data(file, base)

 with open("output.txt", "w") as out:
     out.write("gene name" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
     for k, v in dicti.items():
         out.write(k + "\t" + v[1][1][3] + "\t" + "".join([int(z[0][0:3]) * "\t" + z[1][9] for z in v]) + "\n")

The code above runs, but it does not do what I want, and here is why. The problem is in the output code. I am writing a tab-delimited output file with the gene in the first column (k); v[1][1][3] is the locus of that particular gene. The part I am having a tough time coding is this piece of the output line:

 "".join([int(z[0][0:3]) * "\t" + z[1][9] for z in v])

I am trying to build a list of the fold changes from each of the seven files for that particular gene and locus, and then write each value to the correct column. To do that, I multiply the file number by "\t", which should ensure that the value goes to the right column. The problem is that when the value from the next file comes along, writing continues from where the previous write left off, which I don't want. I want the tab count to start again from the beginning of the line.

Here is what I mean, for instance:

 gene name   locus     log2(fold change) from file 1    .... log2(fold change) from file7 
 ZZ           1:3      0           
                             0

The first log2 value is placed according to its file number, for instance 2: I multiply "\t" by that number (2) and append the fold_change value, and it is recorded without a problem. But a later value, for instance from the seventh file, does not land in the seventh column, because its tabs are counted from where the previous write ended.

Darius · Accepted Answer

Here is my first approach:

import glob
import numpy as np

# Collect the inputs before output.txt is created, so the pattern does not match it.
# Here you can change the pattern of the file (e.g. 'file_experiment_*.txt')
fns = sorted(fn for fn in glob.glob('*.txt') if fn != 'output.txt')

with open('output.txt', 'w') as out:
    # Title row:
    titles = ['gene_name', 'locus'] + [str(file + 1) + '_log2(fold_change)' for file in range(len(fns))]
    out.write('\t'.join(titles) + '\n')
    # Data row (assumes each file holds a single data row, as in the sample):
    data = []
    for idx, fn in enumerate(fns):
        # dtype=str replaces np.str, which newer NumPy releases no longer provide
        file = np.genfromtxt(fn, skip_header=1, usecols=(2, 3, 9), dtype=str, autostrip=True)
        if idx == 0:
            data.extend([file[0], file[1]])
        data.append(file[2])
    out.write('\t'.join(data))

Content of the created file output.txt (Note: I created just three files for testing):

gene_name   locus   1_log2(fold_change) 2_log2(fold_change) 3_log2(fold_change)
ZZ  1:1 0   0   0
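
For files with more than one data row, or where a gene is missing from some files, one option is to key on (gene, locus) and give every input file a fixed slot in the output row. That sidesteps the tab-counting problem described in the question. The following is a minimal sketch rather than a drop-in solution: the function name `merge_fold_changes` and its parameters are hypothetical, and it assumes every input is tab-delimited with a single header row and has gene, locus and log2(fold_change) in columns 3, 4 and 10.

```python
import csv
from collections import OrderedDict


def merge_fold_changes(paths, out_path, cols=(2, 3, 9)):
    """Merge one value column from several tab-delimited files (hypothetical helper)."""
    gene_col, locus_col, value_col = cols
    merged = OrderedDict()  # (gene, locus) -> list with one slot per input file
    for idx, path in enumerate(paths):
        with open(path) as handle:
            reader = csv.reader(handle, delimiter='\t')
            next(reader, None)  # skip the header row
            for row in reader:
                if len(row) <= value_col:
                    continue  # ignore blank or truncated lines
                key = (row[gene_col].strip(), row[locus_col].strip())
                # A gene seen for the first time gets empty slots for all files
                slots = merged.setdefault(key, [''] * len(paths))
                slots[idx] = row[value_col].strip()
    with open(out_path, 'w') as out:
        header = ['gene_name', 'locus']
        header += ['%d_log2(fold_change)' % (i + 1) for i in range(len(paths))]
        out.write('\t'.join(header) + '\n')
        for (gene, locus), slots in merged.items():
            out.write('\t'.join([gene, locus] + slots) + '\n')
```

It could be called as `merge_fold_changes(sorted(glob.glob('*.txt')), 'merged.tsv')` after importing glob; the output name here is made up, and giving it a different extension keeps it from being picked up as an input on the next run.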

Kaladin · Answer

I am using re instead of csv. The main problem with your code is the for loop that writes the output to the file. Here is the complete code; I hope it solves your problem.
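
To see why the original write loop misplaces values, note that the joined pieces are simply concatenated, so each run of tabs is counted from the end of the previous value and not from the start of the line. A small illustration, with made-up file numbers and values:

```python
# (file_number, value) pairs for one gene; the numbers are made up for illustration
entries = [(2, 'a'), (5, 'b')]

# Approach from the question: prefix each value with file_number tabs.
# The tab runs add up, so 'b' ends up 2 + 5 = 7 fields in, not 5.
cumulative = ''.join(n * '\t' + value for n, value in entries)

# Fixed slots: one cell per file, filled by position, then joined once.
slots = [''] * 7
for n, value in entries:
    slots[n - 1] = value
aligned = '\t'.join(slots)
```

Splitting `cumulative` on tabs puts 'b' at index 7, while in `aligned` it sits at index 4, the slot for the fifth file.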

import collections
import glob
import re

dicti = collections.defaultdict(list)
filetag = []
# Splits a line into whitespace-separated fields
r = re.compile(r'([^\s]*)\s*')

def read_data(file, base):
  with open(file, 'r') as f:
    for row in f:
      row = r.findall(row.strip())[:-1]  # drop the trailing empty match
      # Skip blank lines and the header row
      if row and 'test_id' not in row[0]:
        dicti[row[2]].append((base, row))

def main():
  name_of_fold = raw_input("Folder name to stored output files in: ")
  # Sort the files so the data columns match the header order
  for file in sorted(glob.glob("*.txt")):
    base = file[0:3] + "-log2(fold_change)"
    filetag.append(base)
    read_data(file, base)

  with open("output", "w") as out:
    # Header: every field is left-justified in a 30-character column
    header = ["genename", "locus"] + filetag
    out.write(''.join('{0:<30}'.format(field) for field in header))
    out.write('\n')
    for key in dicti:
      entries = dicti[key]
      # The locus comes from the first entry; fold changes follow in file order
      fields = [key, entries[0][1][3]] + [z[1][9] for z in entries]
      out.write(''.join('{0:<30}'.format(field) for field in fields))
      out.write('\n')

if __name__ == '__main__':
  main()

I also changed the name of the output file from output.txt to output, because the former may interfere with the code, which reads all .txt files. Below is the output I got, which I assume is the format you wanted. Thanks.

gene name   locus   1.t-log2(fold_change)   2.t-log2(fold_change)    3.t-log2(fold_change)  4.t-log2(fold_change)   5.t-log2(fold_change)   6.t-log2(fold_change)   7.t-log2(fold_change)
ZZ  1:1             0           0           0           0           0           0           0
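
The aligned columns in this output come from the `<30` format specification, which left-justifies a field and pads it with spaces to 30 characters. A minimal sketch of that padding in isolation, where the helper name `format_row` and its `width` parameter are hypothetical:

```python
def format_row(fields, width=30):
    """Left-justify each field in a fixed-width column (hypothetical helper)."""
    padded = ('{0:<{1}}'.format(field, width) for field in fields)
    # Strip the padding after the last field so lines have no trailing spaces
    return ''.join(padded).rstrip()
```

Because the helper joins however many fields it is given, it does not depend on the number of input files. Fixed-width output is easy to read by eye, but tools that expect tab-delimited input will generally need the tab-joined form.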

Tags:

python-2.7

aBiologist
