I'm working on a project that involves parsing a large CSV-formatted file in Perl, and I'm looking to make things more efficient.
My approach has been to split() the file by lines first, and then split() each line again by commas to get the fields. But this is suboptimal, since at least two passes over the data are required (once to split by lines, then once again for each line). This is a very large file, so cutting processing in half would be a significant improvement to the entire application.
My question is: what is the most time-efficient means of parsing a large CSV file using only built-in tools?
Note: Each line has a varying number of tokens, so we can't just ignore lines and split by commas only. Also, we can assume fields will contain only alphanumeric ASCII data (no special characters or other tricks). Also, I don't want to get into parallel processing, although it might work effectively.
edit
It can only involve built-in tools that ship with Perl 5.8. For bureaucratic reasons, I cannot use any third-party modules (even if hosted on CPAN).
another edit
Let's assume that our solution is only allowed to deal with the file data once it is entirely loaded into memory.
yet another edit
I just grasped how stupid this question is. Sorry for wasting your time. Voting to close.
The right way to do it -- by an order of magnitude -- is to use Text::CSV_XS. It will be much faster and much more robust than anything you're likely to do on your own. If you're determined to use only core functionality, you have a couple of options depending on speed vs robustness.
About the fastest you'll get for pure-Perl is to read the file line by line and then naively split the data:
my $file = 'somefile.csv';
my @data;
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split(/,/, $line);
    push @data, \@fields;
}
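This will fail if any fields contain embedded commas. A more robust (but slower) approach would be to use Text::ParseWords, which is a core module, so it is available without installing anything from CPAN. To do that, load the module with use Text::ParseWords; and replace the split with this: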
my @fields = Text::ParseWords::parse_line(',', 0, $line);
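To make the difference between the two approaches concrete, here is a minimal sketch that runs both on the same input. The sample data and the variable names ($csv, @naive, @robust) are made up for illustration. It also assumes the data is already held in memory as a single string, and reads it through a filehandle opened on a scalar reference, which Perl 5.8 supports when built with PerlIO (the default).

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);   # core module, nothing to install

# Hypothetical sample data, already in memory as one string.
# The second line has a quoted field with an embedded comma.
my $csv = qq{a1,b2,c3\nd4,"e5,x",f6,g7\n};

# Open a read-only filehandle on the scalar itself, so the same
# line-by-line loop works on in-memory data.
open(my $fh, '<', \$csv) or die "Can't open in-memory file [$!]\n";

my (@naive, @robust);
while (my $line = <$fh>) {
    chomp $line;
    push @naive,  [ split(/,/, $line) ];           # splits inside the quotes too
    push @robust, [ parse_line(',', 0, $line) ];   # respects quotes, then strips them
}
close($fh);

printf "naive:  %d fields\n", scalar @{ $naive[1] };
printf "robust: %d fields\n", scalar @{ $robust[1] };
```

For the second sample line, split returns five fields (the quoted field is cut in two), while parse_line returns four, with e5,x kept together. If the fields really are plain alphanumeric, as the question states, the two give the same result and split is the cheaper choice.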