Given the >4 GB file myfile.gz, I need to zcat it into a pipe for consumption by Teradata's fastload. I also need to count the number of lines in the file, and ideally I want to make only a single pass through it. I use awk to print each line ($0) to stdout and, in awk's END clause, write the number of rows (awk's NR variable) to a separate output file (outfile).
I've managed to do this using awk, but I'd like to know if a more Pythonic way exists.
#!/usr/bin/env python
from subprocess import Popen, PIPE
from os import path
the_file = "/path/to/file/myfile.gz"
outfile = "/tmp/%s.count" % path.basename(the_file)
cmd = ["-c",'zcat %s | awk \'{print $0} END {print NR > "%s"} \' ' % (the_file, outfile)]
zcat_proc = Popen(cmd, stdout = PIPE, shell=True)
The pipe is later consumed by a call to Teradata's fastload, which reads from
"/dev/fd/" + str(zcat_proc.stdout.fileno())
This works, but I'd like to know if it's possible to skip awk and take better advantage of Python. I'm also open to other methods. I have multiple large files that I need to process in this manner.
There's no need for either zcat or awk. Counting the lines in a gzipped file can be done with
import gzip
nlines = sum(1 for ln in gzip.open("/path/to/file/myfile.gz"))
If you want to do something else with the lines, such as pass them to a different process, do
nlines = 0
for ln in gzip.open("/path/to/file/myfile.gz"):
    nlines += 1
    # pass the line to the other process
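To make the "pass the line to the other process" step concrete, here is a minimal sketch that writes each decompressed line to a child process's stdin while counting. The function name stream_and_count and the consumer_cmd parameter are made up for illustration; whether FastLoad can read its data from stdin is an assumption that would need checking against its documentation.

```python
import gzip
import subprocess


def stream_and_count(gz_path, consumer_cmd):
    """Decompress gz_path into consumer_cmd's stdin.

    Returns (line_count, consumer_exit_code). consumer_cmd is a
    hypothetical placeholder for the real loader invocation.
    """
    nlines = 0
    consumer = subprocess.Popen(consumer_cmd, stdin=subprocess.PIPE)
    try:
        # Binary mode avoids decoding and re-encoding every line.
        with gzip.open(gz_path, "rb") as src:
            for line in src:
                consumer.stdin.write(line)
                nlines += 1
    finally:
        # Closing stdin signals end-of-input to the consumer.
        consumer.stdin.close()
    return nlines, consumer.wait()
```

The file is read once: every line is counted in the same loop that forwards it, so no second pass and no awk are needed.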
Counting lines and decompressing gzip-compressed files can both be done easily with Python's standard library, and everything can happen in a single pass:
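import gzip, os, subprocess

fifo_path = "path/to/fastload-fifo"
os.mkfifo(fifo_path)
# Placeholder invocation. Start the reader first: opening a FIFO for
# writing blocks until another process has opened it for reading.
fastload = subprocess.Popen(["fastload", "--read-from", fifo_path])
nlines = 0
try:
    with gzip.open("/path/to/file/myfile.gz", "rb") as f, \
         open(fifo_path, "wb") as fastload_fifo:
        # enumerate(f, 1) leaves nlines at the count of the last line seen
        for nlines, line in enumerate(f, 1):
            fastload_fifo.write(line)
    fastload.wait()  # the FIFO is closed by now, so the reader sees EOF
finally:
    os.unlink(fifo_path)
print("Number of lines", nlines)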
I don't know how to invoke FastLoad, so substitute the correct parameters in the invocation; the "--read-from" option above is only a placeholder.
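If the loader has to be given a path, as with the "/dev/fd/" approach in the question, an anonymous pipe avoids creating a FIFO on disk. The following is a rough sketch under a few assumptions: Python 3 on a system that provides /dev/fd (Linux, for example), and a hypothetical make_cmd callback that builds the loader's argument list from the path. The names feed_via_dev_fd and make_cmd are invented for this example.

```python
import gzip
import os
import subprocess


def feed_via_dev_fd(gz_path, make_cmd):
    """Stream gz_path into a pipe that the child opens as /dev/fd/N.

    make_cmd is a hypothetical callback: it receives the /dev/fd path
    and returns the argument list for the consumer process.
    Returns (line_count, consumer_exit_code).
    """
    read_fd, write_fd = os.pipe()
    # pass_fds keeps only the read end open in the child, so the child
    # sees EOF as soon as the parent closes the write end.
    consumer = subprocess.Popen(make_cmd("/dev/fd/%d" % read_fd),
                                pass_fds=(read_fd,))
    os.close(read_fd)  # the parent only writes
    nlines = 0
    with os.fdopen(write_fd, "wb") as sink, gzip.open(gz_path, "rb") as src:
        for line in src:
            sink.write(line)
            nlines += 1
    return nlines, consumer.wait()
```

A call might look like feed_via_dev_fd("/path/to/file/myfile.gz", lambda p: ["fastload", "--read-from", p]), where the "--read-from" option is the same placeholder as above, not a documented FastLoad flag.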