I need a regex which matches any number followed by a string which consists of digits, spaces, dots and commas followed by "Kč" or "Eur".
The problem is that my regex sometimes doesn't find all such strings.
((\d[., \d]+)(Kč|Eur))
For example:
re.findall("""((\d[., \d]+)(Kč|Eur))""","Letenky od 12 932 Kč",flags=re.IGNORECASE)
returns an empty list instead of [('12 932 Kč', '12 932 ', 'Kč')]
Do you know what is wrong with the regex?
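Your input string contains a decomposed letter: a base c followed by a combining caron (U+030C). The regex contains the precomposed letter č, a single Unicode code point U+010D (\u010D). The two render identically but do not compare equal, so the pattern never matches.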
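To make the mismatch visible, here is a small sketch that builds both spellings of "Kč" from explicit escapes and compares them. The variable names are only illustrative.

```python
import unicodedata

decomposed = "Kc\u030C"   # 'c' followed by U+030C COMBINING CARON
precomposed = "K\u010D"   # single code point U+010D

# The strings look the same on screen but differ code point by code point.
print(decomposed == precomposed)          # False
print(len(decomposed), len(precomposed))  # 3 2

# Normalizing to a common form makes them equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Printing `[hex(ord(ch)) for ch in text]` on the real input is a quick way to check which form it actually contains.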
You may use
(\d(?:[., \d]*\d)?)\s*(K(?:c\u030C|\u010D)|Eur)
Or
(\d[., \d]*)\s*(K(?:č|č)|Eur)
See the regex (second regex demo) and Python demo.
Pattern details
\d - a digit
(?:[., \d]*\d)? - an optional occurrence of
    [., \d]* - zero or more digits, spaces, . or ,
    \d - a digit
\s* - 0 or more whitespaces
(?:K(?:c\u030C|\u010D)|Eur) - either K followed with either c\u030C or \u010D, or Eur values.
When defining the currency regex, use CZK = ['Czk','K(?:č|č)'] or CZK = ['Czk', r'K(?:c\u030C|\u010D)'].
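As a quick check of the first pattern, the sketch below compiles it and runs it against both spellings of the currency. The name PRICE_RE and the sample strings (including the Eur one) are made up for illustration. The pattern is a raw string, so the \uXXXX escapes are interpreted by the re module itself.

```python
import re

# Pattern from the answer; accepts decomposed and precomposed "č".
PRICE_RE = re.compile(
    r"(\d(?:[., \d]*\d)?)\s*(K(?:c\u030C|\u010D)|Eur)",
    re.IGNORECASE,
)

samples = [
    "Letenky od 12 932 Kc\u030C",  # decomposed form, as in the question's input
    "Letenky od 12 932 K\u010D",   # precomposed form
    "Flights from 1,250.50 Eur",   # hypothetical Eur example
]

for text in samples:
    # Each match is a (number, currency) tuple; the number has no trailing space
    # because the optional group must end in a digit.
    print(PRICE_RE.findall(text))
```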
As Wiktor Stribiżew commented, the Kč in your regexp is different from the Kč in your text. You can use the unicodedata module to normalize both:
>>> import re
>>> re.findall("""((\d[., \d]+)(Kč|Eur))""", "Letenky od 12 932 Kč", flags=re.IGNORECASE)
[]
>>> import unicodedata
>>> re.findall(unicodedata.normalize("NFD", """((\d[., \d]+)(Kč|Eur))"""), unicodedata.normalize("NFD", "Letenky od 12 932 Kč"), flags=re.IGNORECASE)
[('12 932 Kč', '12 932 ', 'Kč')]
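The normalization step can be wrapped in a small helper so that pattern and text always share one form. This is only a sketch: findall_normalized is a hypothetical name, and defaulting to NFC rather than the NFD used above is an assumption (either works as long as both sides use the same form).

```python
import re
import unicodedata

def findall_normalized(pattern, text, form="NFC", flags=0):
    """Normalize pattern and text to the same Unicode form, then run re.findall."""
    return re.findall(
        unicodedata.normalize(form, pattern),
        unicodedata.normalize(form, text),
        flags=flags,
    )

# The literal letter is spliced in as a real character (not a regex escape),
# so that normalization can actually change it.
pattern = r"((\d[., \d]+)(K" + "\u010D" + r"|Eur))"  # precomposed in the pattern
text = "Letenky od 12 932 Kc\u030C"                  # decomposed in the text

print(findall_normalized(pattern, text, flags=re.IGNORECASE))
```

Note that the returned strings are in the normalized form, not necessarily the form of the original input.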