I am trying to work out how to check whether a string contains only letters (from any language) in Python 2.7. I have tried this code:
# -*- coding: utf-8 -*-
import re

def main():
    regexp1 = re.compile('[^\W\d_]+', re.IGNORECASE | re.UNICODE)
    regexp2 = re.compile('[\p{L}]+', re.IGNORECASE | re.UNICODE)
    print("1", regexp1.search(u"test"))
    print("2", regexp1.search(u'äö'))
    print("3", regexp1.search(u'...'))
    print("4", regexp1.search(u'9a'))
    print("5", regexp1.search(u'New / York'))
    print("6", regexp2.search(u"test"))
    print("7", regexp2.search(u'äö'))
    print("8", regexp2.search(u'...'))
    print("9", regexp2.search(u'9a'))
    print("10", regexp2.search(u'New / York'))

if __name__ == '__main__':
    main()
Output:
('1', <_sre.SRE_Match object at 0x02ACF678>)
('2', <_sre.SRE_Match object at 0x02ACF678>)
('3', None)
('4', <_sre.SRE_Match object at 0x02ACF678>)
('5', <_sre.SRE_Match object at 0x02ACF678>)
('6', None)
('7', None)
('8', None)
('9', None)
('10', None)
I want a regex that matches only string №1 and string №2 (strings made up solely of letters, from any language). But right now it matches any string that contains letters, even if it also contains digits or /.
I have also tried to use the \p{L} regex, but it does not work at all. I tried these regexes: [\p{L}]+, (\p{L})+, \p{L}.
regexp1 is a good start. The problem is that regexp1 matches strings that contain at least one letter, not strings that contain only letters. Try this:
regexp1 = re.compile('^[^\W\d_]+$', re.IGNORECASE | re.UNICODE)
This "anchors" the match both to the beginning and to the end of the string, meaning that it won't be able to just match the "New" part of "New / York".
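As a quick check of the anchored pattern, here is a minimal sketch that runs it against the five sample strings from the question. The helper name is_letters_only is made up for illustration, and the non-ASCII sample is written with escapes so the snippet does not depend on the source file encoding.

```python
import re

# Anchored pattern from the answer above: one or more "word" characters
# that are neither digits nor underscore, spanning the string.
LETTERS_ONLY = re.compile(r'^[^\W\d_]+$', re.IGNORECASE | re.UNICODE)


def is_letters_only(text):
    """Hypothetical helper: True if the anchored pattern matches text."""
    return LETTERS_ONLY.search(text) is not None


# The five inputs from the question; u"\xe4\xf6" is the string 'äö'.
samples = [u"test", u"\xe4\xf6", u"...", u"9a", u"New / York"]
results = [is_letters_only(s) for s in samples]
# results == [True, True, False, False, False]
```

Only the first two samples match, which is the behaviour asked for in the question.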
The Python re module doesn't seem to have any support for character classes like \p{L}, but there is a third-party regex module that does. See the docs at https://pypi.python.org/pypi/regex/. However, I can't speak to the performance or standards-compliance of that module.
The third-party regex module is recommended in the re docs for more functionality and better Unicode support. In particular, it supports \p patterns, so \p{L}+ should work fine with regex regexes, matching any sequence of Unicode letter characters.
However, you should be cautious: a combining diacritic is not a letter. You can alter your regex to accept combining marks, or normalize your input to NFC form so that some combining marks are merged into the preceding letter, but first you should think very carefully about your definition of "contains only letters".
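To make the combining-mark caveat concrete, here is a small sketch using only the standard library. It assumes the stdlib pattern [^\W\d_] as a stand-in for \p{L} (the two are not identical in general), and the variable names are illustrative.

```python
import re
import unicodedata

# Stand-in for \p{L}+ using the stdlib re module (an assumption of this sketch).
LETTERS_ONLY = re.compile(r'^[^\W\d_]+$', re.UNICODE)

decomposed = u"a\u0308"  # 'a' followed by U+0308 COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E4

# The combining mark is in category "Mn" (nonspacing mark), not a letter category.
mark_category = unicodedata.category(u"\u0308")

# The decomposed form is rejected because of the mark; the NFC form is accepted.
matches_decomposed = LETTERS_ONLY.search(decomposed) is not None  # False
matches_composed = LETTERS_ONLY.search(composed) is not None      # True
```

Normalization only helps where a precomposed character exists; a base letter with a mark that has no composed form would still be rejected, which is why the definition of "only letters" matters.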
Also, search only checks whether the string contains a match for the regex, not whether the entire string matches the regex. I would recommend fullmatch for matching the entire string, but that's only in Python 3.4+. For 2.7, I would say to anchor the regex:
^\p{L}+$
except that $ can match right before a trailing newline, so you should still examine the match object to see if it represents a whole-string match or if it stops before a trailing newline.
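The trailing-newline behaviour can be seen with the standard re module. The sketch below again assumes [^\W\d_] in place of \p{L}, and is_whole_string_match is a hypothetical helper that examines the match object as described.

```python
import re

# Stdlib stand-in for ^\p{L}+$ (an assumption of this sketch).
ANCHORED = re.compile(r'^[^\W\d_]+$', re.UNICODE)


def is_whole_string_match(text):
    """Hypothetical helper: accept only matches that end at the end of text."""
    m = ANCHORED.search(text)
    # $ may match just before a trailing newline, so compare the end offset.
    return m is not None and m.end() == len(text)


with_newline = u"abc\n"
raw_match = ANCHORED.search(with_newline)              # matches: $ stops before "\n"
checked = is_whole_string_match(with_newline)          # False
plain = is_whole_string_match(u"abc")                  # True
```

In the standard re module, replacing $ with \Z in the pattern is another way to rule out the trailing newline, since \Z matches only at the very end of the string.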