Text::Balanced and multiline xml

Question

Seems like I'm a little bit lost.

I need to parse a large (about 100 MB) and quite ugly XML file. If I use parsefile, it returns an error (junk after document element), but it happily parses smaller elements of the file.

So I decided to break the file into elements and parse them one by one. Since parsing XML with regular expressions is discouraged (I tried it anyway, but got duplicate results), I tried Text::Balanced.

Something like

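use strict;
use warnings;
use Text::Balanced qw/extract_tagged/;

open my $fh, '<', 'file' or die "Cannot open file: $!";
while (my $line = <$fh>) {
    # Scalar context: returns the extracted element, or undef on failure
    my $result = extract_tagged($line, "<tag>");
    print $result if defined $result;
}
close $fh;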

works just fine, so I can extract tagged entries that fit on one line. With something bigger, however,

use Text::Balanced qw/extract_tagged/;
use File::Slurp;

my $text = read_file("file");
my $result = extract_tagged($text, "<tag>");
print $result;

does not work. It reads the file, but it cannot find a tagged item there.

So the question is how do I extract anything between given tags without XML::Parser? And I really really need to avoid chomping it if possible.

P.S. Searching returns regex guides, heredoc how-tos and anything but what I'm looking for.

P.P.S. I'm a moron: I've been trying to parse an invalid file. I'm still curious how to chop up a file when the parser fails, though.

bvr's answer was close: it does retrieve some data, but not if the top-level tag is missing.
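For the case where there is no single top-level tag, here is a rough sketch of pulling multi-line elements out of a slurped string with Text::Balanced. It relies on two documented features of extract_tagged: by default it only skips leading whitespace before looking for the opening tag, so a prefix pattern is needed to skip other leading content, and an undefined closing tag is derived from the opening one. The helper name extract_all_tagged and the sample data are made up for illustration, and this is not a substitute for a real XML parser.

```perl
use strict;
use warnings;
use Text::Balanced qw/extract_tagged/;

# Hypothetical helper: return every <$tag>...</$tag> element found in $text.
sub extract_all_tagged {
    my ($text, $tag) = @_;
    my $open = quotemeta "<$tag>";
    # Prefix pattern: skip anything (newlines included) up to the next opening tag.
    my $skip = qr/.*?(?=$open)/s;
    my @elements;
    while (1) {
        # List context: returns the extracted element and the remaining text.
        my ($element, $remainder) = extract_tagged($text, $open, undef, $skip);
        last unless defined $element && length $element;
        push @elements, $element;
        $text = $remainder;    # continue after the element just found
    }
    return @elements;
}

# Made-up sample with junk between elements and one multi-line element.
my $sample = "junk\n<tag>\n  <a>1</a>\n</tag>\nmore junk\n<tag>two</tag>\n";
print "$_\n---\n" for extract_all_tagged($sample, 'tag');
```

Copying the remainder on every iteration is simple but can get slow on a 100 MB string, so each extracted element would probably be handed to the parser right away rather than collected in a list.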

bvr · Accepted Answer

For broken XML, I would try setting the recover option in XML::LibXML. It makes the parser ignore parsing errors and continue.
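A minimal sketch of what that might look like, assuming XML::LibXML is installed. The helper name parse_leniently and the malformed sample string are invented for illustration. The value recover => 2 behaves like recover => 1 but suppresses the warnings.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse an XML string in recovery mode.
# recover => 2 recovers from errors like recover => 1, but without warnings.
sub parse_leniently {
    my ($xml) = @_;
    return XML::LibXML->load_xml(string => $xml, recover => 2);
}

# Deliberately malformed sample: the second <item> is never closed.
my $doc = parse_leniently('<root><item>one</item><item>two</root>');
print $_->textContent, "\n" for $doc->findnodes('//item');
```

For a file on disk, the same option can be passed as load_xml(location => 'file', recover => 2). How much of a badly broken document survives depends on where the parser gives up, so the result is worth checking.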

Tags: xml, perl

Roman Grazhdan
