
Can NLTK's XMLCorpusReader be used on a multi-file corpus?

Tags: python, xml, nlp, nltk

I'm trying to use NLTK to do some work on the New York Times Annotated Corpus, which contains one XML file per article (in the News Industry Text Format, NITF).

I can parse individual documents with no problem like so:

from nltk.corpus.reader import XMLCorpusReader
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')

I need to work on the whole corpus though. I tried doing this:

reader = XMLCorpusReader('corpora/nytimes', r'.*')

but this doesn't create a usable reader object. For instance,

len(reader.words())

raises

raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string

How do I read this corpus into NLTK?

I'm new to NLTK so any help is greatly appreciated.

NAD asked Oct 24 '25 01:10


1 Answer

I'm no NLTK expert, so there may be an easier way to do this, but naively I would suggest that you use Python's glob module. It supports Unix-style pathname pattern expansion.

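from glob import glob
# The articles sit in year/month/day subdirectories, so search recursively for .xml files
texts = glob('nltk_data/corpora/nytimes/**/*.xml', recursive=True)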

That gives you a list of the paths that match the pattern. Then, depending on how many files you want or need to have open at once, you could do:

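import os
from nltk.corpus.reader import XMLCorpusReader
for item_path in texts:
    # glob returns full paths, so split each one into a root directory and a file name
    root, fileid = os.path.split(item_path)
    reader = XMLCorpusReader(root, [fileid])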
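
Another option, sketched here rather than tested against the real corpus: keep the single multi-file reader from the question and ask it for one file at a time. The TypeError appears because words() is called without a file identifier while the reader holds many files, so passing each entry of reader.fileids() explicitly should avoid it. The helper names iter_corpus_words and build_nyt_reader are made up for this example, and the default root path is an assumption based on the paths in the question.

```python
def iter_corpus_words(reader):
    """Yield (fileid, words) pairs, asking the reader for one file at a time."""
    for fileid in reader.fileids():
        # An explicit fileid avoids the "single file identifier" TypeError.
        yield fileid, reader.words(fileid)


def build_nyt_reader(root="nltk_data/corpora/nytimes"):
    """Build one reader over every .xml file under root (path is an assumption)."""
    # Imported here so the helper above can be used without NLTK installed.
    from nltk.corpus.reader import XMLCorpusReader
    return XMLCorpusReader(root, r".*\.xml")


# Usage (needs NLTK and the corpus on disk):
#   reader = build_nyt_reader()
#   total = sum(len(words) for _, words in iter_corpus_words(reader))
```

Because iter_corpus_words is a generator, only one article's word list is held in memory at a time, which matters for a corpus of this size.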

As @waffle paradox suggested, you can also whittle this list of texts down to suit your specific needs.
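
One way to do that whittling, as a rough sketch: filter the paths by their directory components. This assumes the year/month/day layout seen in the question's path (.../1987/01/01/0000000.xml) holds for the whole corpus; filter_by_date is a hypothetical helper, not part of NLTK or glob.

```python
from pathlib import PurePath


def filter_by_date(paths, year=None, month=None):
    """Keep paths of the form .../YYYY/MM/DD/file.xml that match year and month."""
    selected = []
    for path in paths:
        parts = PurePath(path).parts
        if len(parts) < 4:
            continue  # too short to contain year/month/day/file
        file_year, file_month = parts[-4], parts[-3]
        if year is not None and file_year != year:
            continue
        if month is not None and file_month != month:
            continue
        selected.append(path)
    return selected
```

For example, filter_by_date(texts, year='1987', month='01') would keep only the January 1987 articles, provided texts holds file paths in that layout.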

machine yearning answered Oct 25 '25 16:10