I'm using the following regex. It's supposed to find the string 'U.S.A.', but it only gets 'A.'. Does anyone know what's wrong?
#INPUT
import re
text = 'That U.S.A. poster-print costs $12.40...'
print(re.findall(r'([A-Z]\.)+', text))
#OUTPUT
['A.']
Expected Output:
['U.S.A.']
I'm following the NLTK Book, Chapter 3.7 here; it has a set of regexes, but they just don't work. I've tried this in both Python 2.7 and 3.4.
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
nltk.regexp_tokenize() should work the same as re.findall(); I think somehow my Python does not recognize the regex as expected. The regex listed above outputs this:
[('', '', ''),
('A.', '', ''),
('', '-print', ''),
('', '', ''),
('', '', '.40'),
('', '', '')]
Possibly, it has something to do with how regexes were previously compiled using nltk.internals.compile_regexp_to_noncapturing(), which was abolished in v3.1 (see here).
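The effect can be seen with plain re and the one-line pattern from the question: with a capturing group, re.findall() returns what the group captured on its last repetition ('A.'), while a non-capturing group gives back the whole match. (This is just standard re behaviour, shown here for illustration.)
>>> import re
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> re.findall(r'([A-Z]\.)+', text)    # capturing group: findall() returns the group's last capture
['A.']
>>> re.findall(r'(?:[A-Z]\.)+', text)  # non-capturing group: findall() returns the whole match
['U.S.A.']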
>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-@&*] # special characters with meanings
... '''
>>>
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
But it doesn't work in NLTK v3.1:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-@&*] # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
With a slight modification of how you define your regex groups, you can get the same pattern to work in NLTK v3.1, using this regex:
pattern = r"""(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
|\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
|(?:[+/\-@&*]) # special characters with meanings
"""
In code:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*]) # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Without NLTK, using Python's re module, we see that the old patterns with capturing groups make re.findall() return the group captures instead of the full matches:
>>> pattern1 = r"""(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... |\w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... |[+/\-@&*] # special characters with meanings
... |\S\w* # any sequence of word characters#
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*]) # special characters with meanings
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
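If you want to keep old patterns that use capturing groups, one workaround is to rewrite them before compiling, which is roughly what nltk.internals.compile_regexp_to_noncapturing() used to do for you. The helper below is only a rough sketch (the name to_noncapturing is mine, and it does not handle parentheses inside character classes or doubled backslashes):
>>> import re
>>> def to_noncapturing(pattern):
...     # Turn plain '(' into '(?:', leaving '\(' and '(?...' constructs alone.
...     return re.sub(r'(?<!\\)\((?!\?)', '(?:', pattern)
...
>>> re.findall(to_noncapturing(r'([A-Z]\.)+'), 'That U.S.A. poster-print costs $12.40...')
['U.S.A.']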
Note: the change in how NLTK's RegexpTokenizer compiles its regexes also makes the examples on NLTK's Regular Expression Tokenizer page obsolete.