I have a large (~10 GB) comma-delimited file. Each row starts with a two-character code identifying the row type, since each row is a different type of event. Currently I read the file into R, use a regex to split it into pieces by code, and then write the resulting objects out to flat files.
I'm curious whether there's a more direct way to do this in Python, bash, sed/awk, etc.: read a row, determine its type, and append the row to the appropriate flat file (there will be 7 in total).
Data looks like this:
01,[email protected],20140101120000,campaign1
02,201420140101123000,123321,Xjq12090,TX
02,201420140101123000,123321,Xjq12090,AK
...
Any suggestions would be appreciated.
Using awk you can do:
awk -F, '{fn=$1 ".txt"; print > fn}' file
If you want to keep it clean by closing all file handles at the end, use this awk:
awk -F, '!($1 in files){files[$1]=$1 ".txt"} {print > files[$1]}
END {for (f in files) close(files[f])}' file
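As a quick sanity check, here is the one-liner run against a small sample in the question's format (the filename sample.csv is just for illustration):

```shell
# Create a small sample file in the question's format
cat > sample.csv <<'EOF'
01,[email protected],20140101120000,campaign1
02,201420140101123000,123321,Xjq12090,TX
02,201420140101123000,123321,Xjq12090,AK
EOF

# Split rows into per-code files keyed on the first field
awk -F, '{fn=$1 ".txt"; print > fn}' sample.csv

# 01.txt now holds the single 01 row, 02.txt the two 02 rows
wc -l 01.txt 02.txt
```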
If you don't care about performance, or trust your OS/filesystem/drive's disk caching:
with open('hugedata.txt') as infile:
    for line in infile:
        with open(line[:2] + '.txt', 'a') as outfile:
            outfile.write(line)
However, constantly reopening and re-closing (and therefore flushing) the files means you never get the benefit of buffering, and there's only so much a disk cache can do to compensate. So you might want to consider pre-opening all the files; since there are only 7 of them, that's pretty easy:
files = {'{:02d}'.format(i): open('{:02d}.txt'.format(i), 'w') for i in range(1, 8)}
try:
    with open('hugedata.txt') as infile:
        for line in infile:
            files[line[:2]].write(line)
finally:
    for file in files.values():
        file.close()
Or, more robustly:
class OutputFiles(dict):
    # defaultdict won't work here, because its factory takes no
    # arguments; __missing__ lets us open a file named after the key.
    def __missing__(self, key):
        f = self[key] = open(key + '.txt', 'w')
        return f

files = OutputFiles()
try:
    with open('hugedata.txt') as infile:
        for line in infile:
            files[line[:2]].write(line)
finally:
    for file in files.values():
        file.close()
(You can write a with statement that does the closing automatically, but it'll be different in different Python versions; this is a bit clunky, but works with everything from 2.4 to 3.5, and probably beyond, and since you haven't told us your platform or Python version, it seemed safer.)
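For Python 3.3+ you can get the automatic closing with contextlib.ExitStack; a minimal sketch (the small demo input stands in for the real hugedata.txt):

```python
import contextlib

# Demo input -- a stand-in for the real ~10 GB hugedata.txt
with open('hugedata.txt', 'w') as f:
    f.write('01,[email protected],20140101120000,campaign1\n'
            '02,201420140101123000,123321,Xjq12090,TX\n')

with contextlib.ExitStack() as stack:
    infile = stack.enter_context(open('hugedata.txt'))
    files = {}
    for line in infile:
        code = line[:2]
        if code not in files:
            # Opened lazily; ExitStack closes it on exit, even on error
            files[code] = stack.enter_context(open(code + '.txt', 'w'))
        files[code].write(line)
```

Everything registered on the stack, input and outputs alike, is closed when the with block exits, whether normally or via an exception.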