How can I match an alpha character with a regular expression? I want a character that is in <code>\w</code> but not in <code>\d</code>. It has to be Unicode-compatible, which is why I cannot use <code>[a-zA-Z]</code>.

python-re: How do I match an alpha character

2 Answers

Your first two sentences contradict each other. "in \w but is not in \d" includes underscore. I'm assuming from your third sentence that you don't want underscore.

Using a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:

(1) characters that are not matched by \w (i.e. don't want anything that's not alpha, digits, or underscore) => \W
(2) digits => \d
(3) underscore => _

So what we don't want is anything in the character class [\W\d_], and consequently what we do want is anything in the complementary class [^\W\d_].

Here's a simple example (Python 2.6).

<pre class="prettyprint"><code>>>> import re
>>> rx = re.compile(r"[^\W\d_]+", re.UNICODE)
>>> rx.findall(u"abc_def,k9")
[u'abc', u'def', u'k']
</code></pre>
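On Python 3 the same character class works without the flag, because str patterns are Unicode-aware by default. A minimal sketch, assuming Python 3; the name `letters` and the second sample string are illustrative and not part of the original answer:

```python
import re

# Same class as above: word characters minus digits and underscore.
# In Python 3, str patterns are Unicode-aware by default, so re.UNICODE is not needed.
letters = re.compile(r"[^\W\d_]+")

ascii_runs = letters.findall("abc_def,k9")
# Illustrative sample containing accented letters (U+00E9 and U+00F1).
accented_runs = letters.findall("rich\u00e9_\u00f19")

print(ascii_runs)
print(accented_runs)
```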

Further exploration reveals a few quirks of this approach:

<pre class="prettyprint"><code>>>> import unicodedata as ucd
>>> allsorts = u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
>>> for x in allsorts:
...     print repr(x), ucd.category(x), ucd.name(x)
...
u'\u0473' Ll CYRILLIC SMALL LETTER FITA
u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
u'\u3020' So POSTAL MARK FACE
u'\u3021' Nl HANGZHOU NUMERAL ONE
>>> rx.findall(allsorts)
[u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']
</code></pre>

U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w), but it appears that Python interprets "digit" to mean "decimal digit" (category Nd), so it doesn't match \d and therefore slips through [^\W\d_].

U+24E8 (CIRCLED LATIN SMALL LETTER Y) is category So, a symbol rather than a letter, so it doesn't match \w.

All CJK ideographs are classed as "letters" (U+4E0A above is category Lo) and thus match \w.
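These observations can be checked character by character. The following is a sketch assuming Python 3 (where str is Unicode and \w and \d are Unicode-aware by default); the helper name `classify` is made up for illustration:

```python
import re
import unicodedata

allsorts = "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"

def classify(ch):
    """Return (general category, matches the word class, matches the digit class)."""
    return (
        unicodedata.category(ch),
        re.fullmatch(r"\w", ch) is not None,
        re.fullmatch(r"\d", ch) is not None,
    )

# Map each sample character to its classification and print one line per character.
report = {ch: classify(ch) for ch in allsorts}
for ch, (category, is_word, is_digit) in report.items():
    print(f"U+{ord(ch):04X} {category} word={is_word} digit={is_digit}")
```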

Whether any of the above 3 points are a concern or not, that approach is the best you will get out of the re module as currently released. Syntax like \p{letter} is in the future.

186

answered Oct 05 '22 02:10

John Machin

What about:

\p{L}

You can use this document as a reference: Unicode Regular Expressions

EDIT: It seems Python's re module doesn't handle Unicode property expressions such as \p{L}. Take a look at this link: Handling Accented Characters with Python Regular Expressions -- [A-Z] just isn't good enough (no longer active; link to the Internet Archive)
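Because re does not accept \p{L}, one workaround is to test each character's general category directly. A rough sketch, assuming Python 3; `is_letter` and `letter_runs` are hypothetical helper names, not part of any library:

```python
import unicodedata
from itertools import groupby

def is_letter(ch):
    """True if the general category is Lu, Ll, Lt, Lm or Lo."""
    return unicodedata.category(ch).startswith("L")

def letter_runs(text):
    """Return maximal runs of consecutive letters in text."""
    return ["".join(run) for keep, run in groupby(text, key=is_letter) if keep]

allsorts = "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
print(letter_runs("abc_def,k9"))
print(letter_runs(allsorts))
```

Unlike [^\W\d_], this filter leaves out U+3021, whose category is Nl rather than a letter category.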

Other references:

re.UNICODE
python and regular expression with unicode
Unicode Technical Standard #18: Unicode Regular Expressions

For posterity, here are the examples on the blog:

<pre class="prettyprint"><code># Python 2 (print statement, byte strings); outputs are shown as comments
import re

string = 'riché'
print string
# riché

richre = re.compile('([A-z]+)')
match = richre.match(string)
print match.groups()
# ('rich',)

richre = re.compile('(\w+)', re.LOCALE)
match = richre.match(string)
print match.groups()
# ('rich',)

richre = re.compile('([é\w]+)')
match = richre.match(string)
print match.groups()
# ('rich\xe9',)

richre = re.compile('([\xe9\w]+)')
match = richre.match(string)
print match.groups()
# ('rich\xe9',)

richre = re.compile('([\xe9-\xf8\w]+)')
match = richre.match(string)
print match.groups()
# ('rich\xe9',)

string = 'richéñ'
match = richre.match(string)
print match.groups()
# ('rich\xe9\xf1',)

richre = re.compile('([\u00E9-\u00F8\w]+)')
# match is not recomputed here, so this prints the previous result
print match.groups()
# ('rich\xe9\xf1',)

matched = match.group(1)
print matched
# richéñ
</code></pre>
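For comparison, here is a sketch of the same experiment on Python 3, where \w in a str pattern matches Unicode letters by default. This is not from the blog, and the variable names are illustrative:

```python
import re

word = "rich\u00e9\u00f1"  # 'rich' followed by U+00E9 and U+00F1

# An ASCII-only range stops at the first accented letter.
ascii_only = re.match(r"([A-Za-z]+)", word).group(1)

# \w in a str pattern is Unicode-aware by default in Python 3.
unicode_word = re.match(r"(\w+)", word).group(1)

# re.ASCII restores the ASCII-only meaning of \w.
ascii_flag = re.match(r"(\w+)", word, re.ASCII).group(1)

print(ascii_only, unicode_word, ascii_flag)
```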

answered Oct 05 '22 02:10

Rubens Farias


Tags: python, regex, regex-negation, unicode

basaundi
