Can NLTK's XMLCorpusReader be used on a multi-file corpus?

Question

I'm trying to use NLTK to do some work on the New York Times Annotated Corpus, which contains one XML file per article (in the News Industry Text Format, NITF).

I can parse individual documents with no problem like so:

from nltk.corpus.reader import XMLCorpusReader
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')

I need to work on the whole corpus though. I tried doing this:

reader = XMLCorpusReader('corpora/nytimes', r'.*')

but this doesn't create a usable reader object. For instance,

len(reader.words())

fails with

raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string

How do I read this corpus into NLTK?

I'm new to NLTK so any help is greatly appreciated.

machine yearning · Accepted Answer

I'm no NLTK expert, so there may be an easier way to do this, but naively I would suggest that you use Python's glob module. It supports Unix-style pathname pattern expansion.

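from glob import glob
# '**' with recursive=True descends into the year/month/day subdirectories.
texts = glob('nltk_data/corpora/nytimes/**/*.xml', recursive=True)

That gives you a list of the paths of all files matching the pattern. Then, depending on how many of them you want or need to have open at once, you could do:

import os
from nltk.corpus.reader import XMLCorpusReader
root = 'nltk_data/corpora/nytimes'
for item_path in texts:
    # glob returns full paths; the reader wants a name relative to its root.
    reader = XMLCorpusReader(root, os.path.relpath(item_path, root))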

As @waffle paradox suggested, you can also whittle this list of texts down to suit your specific needs.
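As far as I can tell, the TypeError in the question comes from calling words() with no argument on a reader that holds more than one file: words() and xml() work on a single file identifier at a time. So another option is to keep one reader for the whole corpus and loop over reader.fileids(). Here is a minimal sketch of that idea. The helper name words_per_file and the default pattern r'.*\.xml' are my own choices for illustration, not part of NLTK.

```python
from nltk.corpus.reader import XMLCorpusReader


def words_per_file(root, pattern=r'.*\.xml'):
    """Yield (fileid, words) for every file under root whose path matches pattern."""
    # A regex passed as fileids is matched against paths relative to root,
    # including files in subdirectories.
    reader = XMLCorpusReader(root, pattern)
    for fileid in reader.fileids():
        # words() and xml() take one file identifier at a time.
        yield fileid, reader.words(fileid)


if __name__ == '__main__':
    # Count the tokens in the whole corpus, one article at a time.
    total = 0
    for fileid, words in words_per_file('nltk_data/corpora/nytimes'):
        total += len(words)
    print(total)
```

Because the helper is a generator, only one article's word list is held in memory at a time, which may matter for a corpus of this size.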

Tags: python, xml, nlp, nltk
