
Length of an XML file

Tags: file, unix, size

I have an XML file that is 31 GB in size, and I need to find the total number of lines in it. I know that wc -l will give me this, but it takes too long to run. Is there a faster way to count the lines in a large file?

sameer karjatkar asked Dec 13 '25 20:12



2 Answers

31 GB is a really big text file. I bet it would compress down to about 1.5 GB. I would create these files in a compressed format to begin with; then you can stream a decompressed version of the file through wc. This will greatly reduce the amount of disk I/O needed to process the file, at the cost of some CPU time for decompression. gzip can read and write compressed streams.
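As a rough sketch of the streaming idea, the Python snippet below counts newline bytes in a gzip-compressed file without ever writing the decompressed data to disk. The function name count_lines_gz and the 1 MiB chunk size are illustrative assumptions, not anything from the question; it also assumes the file was written with gzip in the first place.

```python
import gzip


def count_lines_gz(path, chunk_size=1 << 20):
    """Count newline bytes in a gzip file by streaming it in chunks.

    Hypothetical helper: the name and default chunk size are assumptions.
    """
    total = 0
    # gzip.open in binary mode yields decompressed bytes on the fly,
    # so only one chunk is held in memory at a time.
    with gzip.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

Calling count_lines_gz("data.xml.gz") on a hypothetical compressed copy of the file should give the same number that wc -l reports for the uncompressed data, since both count newline characters.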

But I would also make the following comments:

  • Line counts are not really that informative for XML, since whitespace between elements is ignored (except in mixed content). What do you really want to know about the dataset? I bet counting elements would be more useful.
  • Make sure your XML file is not unnecessarily redundant. For example, are you repeating the same namespace declarations all over the document?
  • Perhaps XML is not the best way to represent this document. If it is, try looking into something like Fast Infoset.
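If counting elements turns out to be the real goal, one possible approach is a streaming parser that never builds the whole tree. The sketch below uses xml.etree.ElementTree.iterparse from the Python standard library; the function name count_elements is an illustrative assumption, and the input is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(source):
    """Count elements per tag name while streaming through the document.

    Hypothetical helper: `source` may be a file name or a binary file object.
    """
    counts = Counter()
    # An "end" event fires once per element, after it is fully parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        # Drop the element's children, text, and attributes to limit memory.
        elem.clear()
    return counts
```

Note that the emptied direct children of the root still accumulate under the root element, so on a file this large memory use would grow slowly rather than stay perfectly flat. Tags in a namespace are reported in the {uri}name form that ElementTree uses.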
BeWarned answered Dec 16 '25 01:12



If all you need is the line count, wc -l will be as fast as anything else.

The problem is the 31 GB text file.
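To illustrate why there is little room to beat it, here is a minimal Python sketch of the work any line counter has to do: read every byte once and count the newline characters. The function name count_lines and the chunk size are assumptions for illustration; this is not how wc is actually implemented, only the same kind of sequential scan.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes by reading the file sequentially in chunks.

    Hypothetical helper: the name and default chunk size are assumptions.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # bytes.count runs in C, so the scan itself is cheap;
            # reading the data from disk is the likely bottleneck.
            total += chunk.count(b"\n")
    return total
```

Whatever tool is used, all 31 GB still have to be read from disk, which is why the file size rather than the counting tool is the likely bottleneck.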

Joe Koberg answered Dec 16 '25 01:12
