Suppose you have an input formatted like this:
id____value1____value2...valueN
1____hello____world...something
2________goodnight...world
the 4 '_' are supposed to be '/t'
So far, I get something like this: the first item has an {ID:1, value1:hello, value2:world,...,valueN:something} whereas the second item has {ID:2, value1: , value2:goodnight, ... , valueN: world}
I want my final representation for the 2nd item to be: {ID:2, value1:n/a , value2:goodnight, ... , valueN: world}
I have written a script in Python to read the file line by line, but I want to be able to check whether a '/t' is followed by another '/t', and then insert the 'n/a' value.
My code so far is this:
def myFunc():
list = []
with open(file, 'r') as f:
header = f.readline() # Store the header of the file for future reference.(maybe). Don't commend out.
for line in f:
for i in range(len(line)):
if line[i] == '\t':
if line[i+1] == '\t':
line[:i] + "n/a" + line[i:]
list.append(line) # iterate through the file and store it's values on the list.
return list
Depending a bit on how you want to use the list at the end of the day, you could also use the csv module for something which will be a bit more flexible for cases where more than one column might come without entries;
import csv
with open(file, 'r') as f:
reader = csv.reader(f, delimiter='\t')
header = next(reader)
list = [[x if x else 'n/a' for x in line] for line in reader]
Now list will be a list of lists, each of which contains the actual items.
In [11]: print(header)
['id', 'value1', 'value2', 'value3']
In [12]: print(list)
[['1', 'hello', 'world', 'something'], ['2', 'n/a', 'goodnight', 'world']]
Edit added after the comments below:
A slight modification of the method above (using Python 2.7+ dictionary comprehensions) will land you a dictionary;
import csv
with open(file, 'r') as f:
reader = csv.reader(f, delimiter='\t')
header = next(reader)
list = [{header[i]: line[i] if line[i] else 'n/a' for i in range(len(header))} for line in reader]
print(list)
# [{'value1': 'hello', 'value3': 'something', 'id': '1', 'value2': 'world'}, {'value1': 'n/a', 'value3': 'world', 'id': '2', 'value2': 'goodnight'}]
You ask if this is cleaner or not, and this will probably depend quite a bit on how you intend to use the result down the line. The dictionary approach gives you something which is easier to read if you decide to inspect the result.
If you are in a situation where you need to perform a lot of data mangling on your file, you might be interested in the pandas DataFrame data structure which is made for this sort of stuff. If you are not in that situation though, that approach might just be completely overkill. A couple of simple examples of what it does (note for instance that it takes care of your original 'n/a' issue by default):
In [1]: import pandas as pd
In [5]: df = pd.read_csv('testfile', delimiter='\t') # Or whatever your file is called
In [6]: df = df.set_index('id')
In [7]: df
Out[7]:
value1 value2 value3
id
1 hello world something
2 NaN goodnight world
In [8]: df[df['value3'] == 'something'] # Find all rows with a given value3
Out[8]:
value1 value2 value3
id
1 hello world something
In [10]: df[df['value2'] == 'goodnight'] # Find all rows with a given value2
Out[10]:
value1 value2 value3
id
2 NaN goodnight world
In [11]: df['value1'] # Show only value1
Out[11]:
id
1 hello
2 NaN
Name: value1, dtype: object
Basically any operation on table you can come up with has a natural approach in pandas.
Why not replace?
for line in f:
line.replace('\t\t','n/a')
Anywhere there are two adjacent \t values, you will have 'n/a' instead. As @DeepSpace points out, f isn't actually changing so you'll have to append line to your list or do something to keep track of your results.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With