Regex for price doesn't work

Question

I need a regex which matches any number followed by a string which consists of digits, spaces, dots and commas followed by "Kč" or "Eur".

The problem is that my regex sometimes doesn't find all such strings.

((\d[., \d]+)(Kč|Eur))

For example:

re.findall("""((\d[., \d]+)(Kč|Eur))""","Letenky od 12 932 Kč",flags=re.IGNORECASE)

returns nothing instead of [(12 932 Kč,12 932,Kč)]

Do you know what is wrong with the regex?

Wiktor Stribiżew · Accepted Answer

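Your input string contains a decomposed letter, that is, a base letter c followed by a combining caron (U+030C), whereas the regex contains the precomposed letter č with the Unicode code point U+010D. The two render identically but are different character sequences, so the pattern does not match.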
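To make the difference visible, here is a minimal sketch that builds both spellings of "Kč" from explicit escapes and inspects them. The variable names are illustrative only.

```python
import unicodedata

# Decomposed form: 'K', 'c', then COMBINING CARON (three code points).
decomposed = "Kc\u030C"
# Precomposed form: 'K', then LATIN SMALL LETTER C WITH CARON (two code points).
precomposed = "K\u010D"

# The strings look the same when printed but do not compare equal.
print(decomposed == precomposed)
print(len(decomposed), len(precomposed))

# List the Unicode name of every code point in each string.
print([unicodedata.name(ch) for ch in decomposed])
print([unicodedata.name(ch) for ch in precomposed])
```

Printing the code point names like this is a quick way to check which form a scraped string actually uses.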

You may use

(\d(?:[., \d]*\d)?)\s*(K(?:c\u030C|\u010D)|Eur)

Or

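(\d[., \d]*)\s*(K(?:č|č)|Eur)

(Here the first č is c followed by the combining caron U+030C and the second is the precomposed U+010D; they look the same on screen.)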

See the regex (second regex demo) and Python demo.

Pattern details

\d - a digit
(?:[., \d]*\d)? - an optional occurrence of
- [., \d]* - zero or more digits, spaces, . or ,
- \d - a digit
\s* - zero or more whitespace characters
(K(?:c\u030C|\u010D)|Eur) - a capturing group matching either K followed by c\u030C (c plus combining caron) or \u010D (precomposed č), or Eur.
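As a rough check of the first pattern, the sketch below runs it against both spellings of the sample string from the question. The constant and variable names are made up for illustration, and re.IGNORECASE is left out to keep the example small.

```python
import re

# The escape-based pattern from the answer; re understands \uXXXX in str patterns.
PRICE_RE = re.compile(r"(\d(?:[., \d]*\d)?)\s*(K(?:c\u030C|\u010D)|Eur)")

# The same sentence with a decomposed and a precomposed "č".
decomposed_text = "Letenky od 12 932 Kc\u030C"
precomposed_text = "Letenky od 12 932 K\u010D"

# findall returns one (amount, currency) tuple per match.
matches_decomposed = PRICE_RE.findall(decomposed_text)
matches_precomposed = PRICE_RE.findall(precomposed_text)

print(matches_decomposed)
print(matches_precomposed)
```

Because the amount group must end in a digit, the captured amount has no trailing space, unlike the original ((\d[., \d]+)(Kč|Eur)) pattern.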

When defining the currency regex, use CZK = ['Czk','K(?:č|č)'] or CZK = ['Czk', r'K(?:c\u030C|\u010D)'].
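How the CZK list is consumed is not shown in the question, so the following is only a sketch under the assumption that the alternatives are joined with | into one currency group. The EUR list, the sample text and the other names are hypothetical.

```python
import re

# Currency alternatives; the CZK list is taken from the answer above.
CZK = ['Czk', r'K(?:c\u030C|\u010D)']
EUR = ['Eur']  # hypothetical companion list, not from the original post

# Assumption: all alternatives are joined into a single capturing group.
currency = "|".join(CZK + EUR)
price_re = re.compile(
    r"(\d(?:[., \d]*\d)?)\s*(" + currency + ")",
    flags=re.IGNORECASE,
)

# Made-up sample text mixing a decomposed "Kč", "CZK" and "eur".
text = "od 12 932 Kc\u030C, 1.500 CZK nebo 99,90 eur"
results = [m.groups() for m in price_re.finditer(text)]
print(results)
```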

Aankhen · Answer

As Wiktor Stribiżew commented, the Kč in your regexp is different from the Kč in your text. You can use the unicodedata module to normalize both:

>>> import re
>>> re.findall("""((\d[., \d]+)(Kč|Eur))""", "Letenky od 12 932 Kč", flags=re.IGNORECASE)
[]
>>> import unicodedata
>>> re.findall(unicodedata.normalize("NFD", """((\d[., \d]+)(Kč|Eur))"""), unicodedata.normalize("NFD", "Letenky od 12 932 Kč"), flags=re.IGNORECASE)
[('12 932 Kč', '12 932 ', 'Kč')]
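A variation on the same idea, sketched here as an assumption rather than as part of the original answer: write the pattern once with the precomposed letter and normalize only the input text to NFC. The helper name find_prices is hypothetical.

```python
import re
import unicodedata

# Pattern written with the precomposed č (U+010D) only.
PRICE_RE = re.compile(r"(\d(?:[., \d]*\d)?)\s*(K\u010D|Eur)", flags=re.IGNORECASE)

def find_prices(text):
    """Return (amount, currency) tuples, whichever way č is encoded in text."""
    # NFC composes 'c' + COMBINING CARON into the single code point U+010D.
    normalized = unicodedata.normalize("NFC", text)
    return PRICE_RE.findall(normalized)

print(find_prices("Letenky od 12 932 Kc\u030C"))
print(find_prices("Letenky od 12 932 K\u010D"))
```

Note that the returned strings come from the normalized copy, so the currency is always in NFC form even when the input was decomposed.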

Tags:

python

regex

Milano
