I'm trying to use NLTK to do some work on the New York Times Annotated Corpus, which contains an XML file for each article (in the News Industry Text Format, NITF).
I can parse individual documents with no problem like so:
from nltk.corpus.reader import XMLCorpusReader
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')
I need to work on the whole corpus though. I tried doing this:
reader = XMLCorpusReader('corpora/nytimes', r'.*')
but this doesn't give me a usable reader object. For instance,
len(reader.words())
raises
raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string
How do I read this corpus into NLTK?
I'm new to NLTK so any help is greatly appreciated.
I'm no NLTK expert, so there may be an easier way to do this, but naively I would suggest that you use Python's glob module. It supports Unix-style pathname pattern expansion.
from glob import glob
# '**' with recursive=True descends into the year/month/day subdirectories
texts = glob('nltk_data/corpora/nytimes/**/*.xml', recursive=True)
That gives you a list of the paths that match the pattern. Then, depending on how many of them you want or need to have open at once, you could do:
import os
from nltk.corpus.reader import XMLCorpusReader
root = 'nltk_data/corpora/nytimes/'
for item_path in texts:
    # fileids are resolved relative to the corpus root, so strip the root prefix
    reader = XMLCorpusReader(root, [os.path.relpath(item_path, root)])
As suggested by @waffle paradox, you can also whittle this list of texts down to suit your specific needs.
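For what it's worth, the TypeError in the question seems to come from calling words() without a file identifier on a reader that holds more than one file, so another option may be to keep a single reader over the whole corpus and pass each fileid explicitly. The following is only a sketch under that assumption: count_words_per_file and its prefix parameter are hypothetical names made up for illustration, and the r'.*\.xml' pattern assumes every article file ends in .xml.

```python
from nltk.corpus.reader import XMLCorpusReader


def count_words_per_file(root, prefix=''):
    """Return {fileid: word count} for XML files under root.

    Hypothetical helper: `prefix` narrows the corpus down to fileids
    that start with it, e.g. '1987/01/' for one month.
    """
    # The pattern is matched against paths relative to root, so files in
    # nested year/month/day directories are picked up as well.
    reader = XMLCorpusReader(root, r'.*\.xml')
    counts = {}
    for fileid in reader.fileids():
        if fileid.startswith(prefix):
            # With several files in the reader, words() has to be told
            # which single file to read.
            counts[fileid] = len(reader.words(fileid))
    return counts
```

With the corpus location from the question, count_words_per_file('nltk_data/corpora/nytimes', prefix='1987/01/') would give the per-article word counts for January 1987, and summing the values gives the total.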