Text::Balanced and multiline xml

Tags: xml, perl

Seems like I'm a little bit lost.

I need to parse a large (about 100 MB) and quite ugly XML file. If I use parsefile, it returns an error (junk after document element), but it happily parses smaller elements of the file.

So I decided to break the file into elements and parse them. Since parsing XML with regular expressions is discouraged (well, I tried it anyway, but I get duplicate results), I tried Text::Balanced.

Something like

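use strict;
use warnings;
use Text::Balanced qw/extract_tagged/;

open my $fh, '<', 'file' or die "Cannot open file: $!";
while (my $line = <$fh>) {
    # In scalar context, extract_tagged returns the extracted "<tag>...</tag>" substring
    my $result = extract_tagged($line, "<tag>");
    print $result if defined $result;
}
close $fh;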

works just fine, so I can extract tagged entries that fit on one line. With something bigger, however,

use Text::Balanced qw/extract_tagged/;
use File::Slurp;

my $text = read_file("file");   # slurp the whole file into one string
my $result = extract_tagged($text, "<tag>");
print $result;

does not work. It reads the file but cannot find a tagged item in it.

So the question is: how do I extract everything between given tags without XML::Parser? And I really, really need to avoid chomping it if possible.
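As far as the Text::Balanced documentation describes it, extract_tagged only matches at the current position of the string, after skipping an optional prefix that defaults to whitespace. A slurped file that starts with anything other than <tag> therefore yields nothing. A minimal sketch that passes an explicit prefix pattern to skip ahead to the next opening tag follows; the helper name extract_all_tagged and the assumption that the opening tag carries no attributes are illustrative, not from the original post.

```perl
use strict;
use warnings;
use Text::Balanced qw/extract_tagged/;

# Hypothetical helper: returns every "<$tag>...</$tag>" block found in $text.
# Assumes the opening tag carries no attributes.
sub extract_all_tagged {
    my ($text, $tag) = @_;
    my $open   = '<' . quotemeta($tag) . '>';
    # Prefix pattern: skip anything (newlines included) up to the next opening tag.
    my $prefix = "(?s).*?(?=$open)";

    my @found;
    while (1) {
        # List context returns (extracted, remainder, skipped prefix, ...).
        # The closing tag is left undef so that it is derived from the opening tag.
        my ($extracted, $remainder) = extract_tagged($text, $open, undef, $prefix);
        last unless defined $extracted && length $extracted;
        push @found, $extracted;
        $text = $remainder;    # continue after the block just extracted
    }
    return @found;
}

my $sample = "junk before\n<tag>\n  first\n</tag>\nmore junk <tag>second</tag> trailing\n";
print "$_\n---\n" for extract_all_tagged($sample, 'tag');
```

Reassigning the remainder copies the string on every match, which is simple but may be slow on a 100 MB input; it is meant as an illustration of the prefix argument rather than a tuned solution.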

P.S. Searching only turns up regex guides, heredoc how-tos, and anything but what I'm looking for.

P.P.S. I'm a moron: I've been trying to parse an invalid file. Still curious how to chop up a file if the parser fails, though.


bvr's answer was close: it really does retrieve some data, but not if the top-level tag is missing.
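If the "junk after document element" error comes from several top-level elements with no enclosing root (an assumption; the post does not show the file), XML::LibXML's parse_balanced_chunk may be worth a try, since it parses a well-balanced chunk into a document fragment without requiring a single root. A minimal sketch, where the helper name parse_rootless and the sample input are made up for illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse XML text that has no single root element.
sub parse_rootless {
    my ($xml) = @_;
    # Drop a leading XML declaration, if present (assumption: at most one, at the start).
    $xml =~ s/\A\s*<\?xml[^>]*\?>//;
    my $parser = XML::LibXML->new;
    # Returns an XML::LibXML::DocumentFragment; dies if the chunk is not well balanced.
    return $parser->parse_balanced_chunk($xml);
}

my $fragment = parse_rootless("<tag>first</tag>\n<tag>second</tag>\n");
for my $node (grep { $_->nodeName eq 'tag' } $fragment->childNodes) {
    print $node->textContent, "\n";
}
```

This still requires each element to be well formed on its own, so it addresses a missing root, not broken markup inside the elements.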

Roman Grazhdan asked Oct 15 '25 20:10



1 Answer

For broken XML, I would try setting the recover option of XML::LibXML. It makes the parser ignore parsing errors and continue.
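A minimal sketch of what that could look like, assuming XML::LibXML is installed; the helper name parse_leniently and the deliberately broken sample string are illustrative only:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse a string while tolerating errors.
sub parse_leniently {
    my ($xml) = @_;
    # recover => 1 tries to recover and reports problems as warnings;
    # recover => 2 recovers silently.
    return XML::LibXML->load_xml(string => $xml, recover => 2);
}

# The second <item> is never closed, so a strict parse would fail here.
my $doc = parse_leniently('<root><item>one</item><item>two</root>');
print $doc->toString, "\n";
```

For a file on disk, load_xml also accepts location => $path in place of string. How much of the tree survives depends on where the parser gives up, so the result should be checked rather than trusted.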

bvr answered Oct 18 '25 04:10
