I'm using the following regex. It's supposed to find the string 'U.S.A.', but it only gets 'A.'. Does anyone know what's wrong?
#INPUT
import re
text = 'That U.S.A. poster-print costs $12.40...'
print(re.findall(r'([A-Z]\.)+', text))
#OUTPUT
['A.']
Expected Output:
['U.S.A.']
I'm following the NLTK Book, Chapter 3.7 here; it has a set of regexes, but they just don't work. I've tried this in both Python 2.7 and 3.4.
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
nltk.regexp_tokenize() should work the same as re.findall(); I think somehow my Python does not recognize the regex as expected. The regex listed above outputs this:
[('', '', ''),
('A.', '', ''),
('', '-print', ''),
('', '', ''),
('', '', '.40'),
('', '', '')]
Possibly, it has something to do with how regexes were previously compiled using nltk.internals.compile_regexp_to_noncapturing(), which was abolished in v3.1 (see here).
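The effect can be seen with plain re and the one-line pattern from the question: with a capturing group, re.findall() returns what the group captured on its last repetition ('A.'), while a non-capturing group gives back the whole match. (This is just standard re behaviour, shown here for illustration.)
>>> import re
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> re.findall(r'([A-Z]\.)+', text)    # capturing group: findall() returns the group's last capture
['A.']
>>> re.findall(r'(?:[A-Z]\.)+', text)  # non-capturing group: findall() returns the whole match
['U.S.A.']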
>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-@&*] # special characters with meanings
... '''
>>>
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
But it doesn't work in NLTK v3.1:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-@&*] # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
With a slight modification of how you define your regex groups, you can get the same pattern to work in NLTK v3.1, using this regex:
pattern = r"""(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
|\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
|(?:[+/\-@&*]) # special characters with meanings
"""
In code:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*]) # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Without NLTK, using Python's re module, we see that the old patterns with capturing groups make re.findall() return the group captures instead of the full matches:
>>> pattern1 = r"""(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... |\w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... |[+/\-@&*] # special characters with meanings
... |\S\w* # any sequence of word characters#
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*]) # special characters with meanings
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
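If you want to keep old patterns that use capturing groups, one workaround is to rewrite them before compiling, which is roughly what nltk.internals.compile_regexp_to_noncapturing() used to do for you. The helper below is only a rough sketch (the name to_noncapturing is mine, and it does not handle parentheses inside character classes or doubled backslashes):
>>> import re
>>> def to_noncapturing(pattern):
...     # Turn plain '(' into '(?:', leaving '\(' and '(?...' constructs alone.
...     return re.sub(r'(?<!\\)\((?!\?)', '(?:', pattern)
...
>>> re.findall(to_noncapturing(r'([A-Z]\.)+'), 'That U.S.A. poster-print costs $12.40...')
['U.S.A.']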
Note: the change in how NLTK's RegexpTokenizer compiles its regexes also makes the examples on NLTK's Regular Expression Tokenizer page obsolete.