How can I match an alpha character with a regular expression. I want a character that is in \w but is not in \d. I want it unicode compatible that's why I cannot use [a-zA-Z].
Using m option allows it to match newline as well. Matches any single character in brackets. Matches 0 or more occurrences of preceding expression. Matches 1 or more occurrence of preceding expression.
match() function of re in Python will search the regular expression pattern and return the first occurrence. The Python RegEx Match method checks for a match only at the beginning of the string. So, if a match is found in the first line, it returns the match object.
The Match-zero-or-more Operator ( * ) This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. `*' represents this operator. For example, `o*' matches any string made up of zero or more `o' s.
The regex \w is equivalent to [A-Za-z0-9_] , matches alphanumeric characters and underscore.
Your first two sentences contradict each other. "in \w but is not in \d" includes underscore. I'm assuming from your third sentence that you don't want underscore.
Using a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:
(1) characters that are not matched by \w (i.e. don't want anything that's not alpha, digits, or underscore) => \W
(2) digits => \d
(3) underscore => _
So what we don't want is anything in the character class [\W\d_] and consequently what we do want is anything in the character class [^\W\d_]
Here's a simple example (Python 2.6).
>>> import re >>> rx = re.compile("[^\W\d_]+", re.UNICODE) >>> rx.findall(u"abc_def,k9") [u'abc', u'def', u'k'] Further exploration reveals a few quirks of this approach:
>>> import unicodedata as ucd >>> allsorts =u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021" >>> for x in allsorts: ... print repr(x), ucd.category(x), ucd.name(x) ... u'\u0473' Ll CYRILLIC SMALL LETTER FITA u'\u0660' Nd ARABIC-INDIC DIGIT ZERO u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU u'\u24e8' So CIRCLED LATIN SMALL LETTER Y u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A u'\u3020' So POSTAL MARK FACE u'\u3021' Nl HANGZHOU NUMERAL ONE >>> rx.findall(allsorts) [u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021'] U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w) but it appears that Python interprets "digit" to mean "decimal digit" (category Nd) so it doesn't match \d
U+2438 (CIRCLED LATIN SMALL LETTER Y) doesn't match \w
All CJK ideographs are classed as "letters" and thus match \w
Whether any of the above 3 points are a concern or not, that approach is the best you will get out of the re module as currently released. Syntax like \p{letter} is in the future.
What about:
\p{L} You can to use this document as reference: Unicode Regular Expressions
EDIT: Seems Python doesn't handle Unicode expressions. Take a look into this link: Handling Accented Characters with Python Regular Expressions -- [A-Z] just isn't good enough (no longer active, link to internet archive)
Another references:
For posterity, here are the examples on the blog:
import re string = 'riché' print string riché richre = re.compile('([A-z]+)') match = richre.match(string) print match.groups() ('rich',) richre = re.compile('(\w+)',re.LOCALE) match = richre.match(string) print match.groups() ('rich',) richre = re.compile('([é\w]+)') match = richre.match(string) print match.groups() ('rich\xe9',) richre = re.compile('([\xe9\w]+)') match = richre.match(string) print match.groups() ('rich\xe9',) richre = re.compile('([\xe9-\xf8\w]+)') match = richre.match(string) print match.groups() ('rich\xe9',) string = 'richéñ' match = richre.match(string) print match.groups() ('rich\xe9\xf1',) richre = re.compile('([\u00E9-\u00F8\w]+)') print match.groups() ('rich\xe9\xf1',) matched = match.group(1) print matched richéñ
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With