Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replacing missing values of an input Python

Tags:

python

string

Suppose you have an input formatted like this:

id____value1____value2...valueN
1____hello____world...something
2________goodnight...world

the 4 '_' are supposed to be '/t'

So far, I get something like this: the first item has an {ID:1, value1:hello, value2:world,...,valueN:something} whereas the second item has {ID:2, value1: , value2:goodnight, ... , valueN: world} I want my final representation for the 2nd item to be: {ID:2, value1:n/a , value2:goodnight, ... , valueN: world}

I have written a script in Python to read the file line by line, but I want to be able to check whether a '/t' is followed by another '/t', and then insert the 'n/a' value.

My code so far is this:

def myFunc():
    list = []
        with open(file, 'r') as f:
            header = f.readline()    # Store the header of the file for future reference.(maybe). Don't commend out.
            for line in f:
                for i in range(len(line)):
                    if line[i] == '\t':
                        if line[i+1] == '\t':
                            line[:i] + "n/a" + line[i:]
                list.append(line)   # iterate through the file and store it's values on the list.
        return list
like image 621
Pavlos Panteliadis Avatar asked Apr 26 '26 20:04

Pavlos Panteliadis


2 Answers

Depending a bit on how you want to use the list at the end of the day, you could also use the csv module for something which will be a bit more flexible for cases where more than one column might come without entries;

import csv

with open(file, 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    list = [[x if x else 'n/a' for x in line] for line in reader]

Now list will be a list of lists, each of which contains the actual items.

In [11]: print(header)
['id', 'value1', 'value2', 'value3']

In [12]: print(list)
[['1', 'hello', 'world', 'something'], ['2', 'n/a', 'goodnight', 'world']]

Edit added after the comments below:

A slight modification of the method above (using Python 2.7+ dictionary comprehensions) will land you a dictionary;

import csv

with open(file, 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    list = [{header[i]: line[i] if line[i] else 'n/a' for i in range(len(header))} for line in reader]

print(list)
# [{'value1': 'hello', 'value3': 'something', 'id': '1', 'value2': 'world'}, {'value1': 'n/a', 'value3': 'world', 'id': '2', 'value2': 'goodnight'}]

You ask if this is cleaner or not, and this will probably depend quite a bit on how you intend to use the result down the line. The dictionary approach gives you something which is easier to read if you decide to inspect the result.

If you are in a situation where you need to perform a lot of data mangling on your file, you might be interested in the pandas DataFrame data structure which is made for this sort of stuff. If you are not in that situation though, that approach might just be completely overkill. A couple of simple examples of what it does (note for instance that it takes care of your original 'n/a' issue by default):

In [1]: import pandas as pd

In [5]: df = pd.read_csv('testfile', delimiter='\t')  # Or whatever your file is called

In [6]: df = df.set_index('id')

In [7]: df
Out[7]:
   value1     value2     value3
id
1   hello      world  something
2     NaN  goodnight      world

In [8]: df[df['value3'] == 'something']  # Find all rows with a given value3
Out[8]:
   value1 value2     value3
id
1   hello  world  something

In [10]: df[df['value2'] == 'goodnight']  # Find all rows with a given value2
Out[10]:
   value1     value2 value3
id
2     NaN  goodnight  world

In [11]: df['value1']  # Show only value1
Out[11]:
id
1    hello
2      NaN
Name: value1, dtype: object

Basically any operation on table you can come up with has a natural approach in pandas.

like image 141
fuglede Avatar answered Apr 28 '26 10:04

fuglede


Why not replace?

for line in f:
     line.replace('\t\t','n/a')

Anywhere there are two adjacent \t values, you will have 'n/a' instead. As @DeepSpace points out, f isn't actually changing so you'll have to append line to your list or do something to keep track of your results.

like image 29
benten Avatar answered Apr 28 '26 10:04

benten