
I get a FutureWarning when using a regex on a string. finditer should work with multiple named groups and give a dictionary back

This is my code below. When I run it I get this warning:

c:\Users\renne\Documents\Code\Text Analysis\Assignment1.1C.py:27: FutureWarning: Possible nested set at position 54
  for item in re.finditer("(?P<host>\d{3}[.]\d{3}[.]\d{3}[.]\d{3})(?P<user_name>[[\w]+\d{4}]|[-])(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})(?P<request>[A-Z]+ \S* HTTP/\d[.]\d)", logdata):

I don't know how to solve this. I have looked over my code a few times and can't figure out the problem.

I used a single line from the test data instead of the entire txt file to make testing easier. Once this works I'll change logdata = '...' to read from the file.

import re

logdata = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'

dict = {}
expression = """
    (?P<host>\d{3}[.]\d{3}[.]\d{3}[.]\d{3})
    (?P<user_name>[[\w]+\d{4}]|[-])
    (?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})
    (?P<request>[A-Z]+ \S* HTTP/\d[.]\d)
"""

for item in re.finditer("(?P<host>\d{3}[.]\d{3}[.]\d{3}[.]\d{3})(?P<user_name>[[\w]+\d{4}]|[-])(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})(?P<request>[A-Z]+ \S* HTTP/\d[.]\d)", logdata):
    print(item.groupdict()['host'])

print(item.groupdict())
DaViD Renneboog asked Nov 20 '25 16:11



1 Answer

You get the warning because you have a pair of unescaped square brackets inside another pair of unescaped square brackets. See the re documentation:

Support of nested sets and set operations as in Unicode Technical Standard #18 might be added in the future. This would change the syntax, so to facilitate this change a FutureWarning will be raised in ambiguous cases for the time being. That includes sets starting with a literal '[' or containing literal character sequences '--', '&&', '~~', and '||'. To avoid a warning escape them with a backslash.
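To see the behavior the documentation describes, here is a minimal sketch that compiles a pattern and collects any FutureWarning it triggers. The helper name nested_set_warnings is made up for this illustration; it relies only on the standard re and warnings modules.

```python
import re
import warnings


def nested_set_warnings(pattern):
    """Compile `pattern` and return the FutureWarning messages it triggers.

    Hypothetical helper for illustration only.
    """
    re.purge()  # clear the regex cache so the pattern is parsed again
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return [str(w.message) for w in caught
            if issubclass(w.category, FutureWarning)]


# The set starts with a literal '[', so the parser warns.
print(nested_set_warnings(r"[[\w]+\d{4}]"))

# Escaping the inner '[' with a backslash avoids the warning.
print(nested_set_warnings(r"[\[\w]+\d{4}]"))
```

The first call should report a "Possible nested set" message and the second an empty list. Escaping only silences the warning; it does not make the pattern match what was intended.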

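The [[\w]+\d{4}] part is wrong: it matches one or more [ or word characters (with [[\w]+), then four digits (with \d{4}), and then a literal ] character (with ]). You need to remove all the square brackets here.

You can use the following pattern, which also adds the literal separators between the fields (the " - " after the host, the brackets around the timestamp, and the opening quote before the request) that the original pattern left out: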
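A small sketch of that difference, using the user name from the question. The variable names broken and fixed are just labels for this example, and the FutureWarning is silenced here only so the broken pattern can be compiled quietly.

```python
import re
import warnings

# Compile the original sub-pattern, ignoring the FutureWarning on purpose.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    broken = re.compile(r"[[\w]+\d{4}]")

# The same idea with all square brackets removed.
fixed = re.compile(r"\w+\d{4}|-")

# The broken pattern insists on a trailing literal ']'.
print(broken.fullmatch("feest6811"))    # None
print(broken.fullmatch("feest6811]"))   # a match object
print(broken.fullmatch("[[x1234]"))     # a match object: '[' is allowed by [[\w]+

# The fixed pattern accepts the plain user name and the '-' placeholder.
print(fixed.fullmatch("feest6811"))     # a match object
print(fixed.fullmatch("-"))             # a match object
```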

r'(?P<host>\d{3}\.\d{3}\.\d{3}\.\d{3}) - (?P<user_name>\w+\d{4}|-) \[(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})] "(?P<request>[A-Z]+ \S* HTTP/\d\.\d)'

See the regex demo.
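Putting it together, the sketch below runs the corrected pattern through re.finditer and collects one dictionary per log line. The first log line is the one from the question; the second is a made-up line added only to exercise the '-' user name. The names LOG_PATTERN and parse_log are assumptions for this example.

```python
import re

# The corrected pattern, split across adjacent raw strings for readability.
LOG_PATTERN = re.compile(
    r'(?P<host>\d{3}\.\d{3}\.\d{3}\.\d{3}) - '
    r'(?P<user_name>\w+\d{4}|-) '
    r'\[(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})] '
    r'"(?P<request>[A-Z]+ \S* HTTP/\d\.\d)'
)


def parse_log(text):
    """Return one dict of named groups per matching log entry."""
    return [match.groupdict() for match in LOG_PATTERN.finditer(text)]


logdata = (
    # Line taken from the question.
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] '
    '"POST /incentivize HTTP/1.1" 302 4622\n'
    # Hypothetical line, invented here to show the '-' user name.
    '197.109.106.249 - - [21/Jun/2019:15:45:25 -0700] '
    '"DELETE /virtual/solutions HTTP/2.0" 302 3926\n'
)

for entry in parse_log(logdata):
    print(entry)
```

Note that \d{3} requires exactly three digits in every part of the address, so hosts such as 10.0.0.1 would not match; \d{1,3} is a possible relaxation if the real data needs it.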

If you encounter this warning in other scenarios, you may need to fix it differently:

  • When you need to match a literal [ or ] and use them inside square brackets, escape ] and do not escape [. E.g. [a-zA-Z[\]] matches an ASCII letter, [ or ]. You may also keep ] unescaped if put at the start of a character class: []A-Za-z[] = [a-zA-Z[\]] = [][a-zA-Z].
  • When you want to match a literal [ or ] outside of square brackets (character class), you need to escape [ and keep ] unescaped. E.g. \[[0-9]+] matches [, then one or more digits and then a ] char.
  • Note that wrapping a single shorthand class or character in a character class is considered bad practice and can cause misunderstandings that in turn lead to issues like this one. Instead of [\w]+, always use \w+.
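The bracket rules in the list above can be checked with a short sketch. FutureWarning is turned into an error inside the block, so compilation would fail if any of these patterns triggered the nested-set warning. The sample strings are made up for the illustration.

```python
import re
import warnings

re.purge()  # make sure the patterns are parsed fresh, not taken from the cache
with warnings.catch_warnings():
    # Any FutureWarning raised while compiling becomes an exception.
    warnings.simplefilter("error", FutureWarning)
    classes = [
        re.compile(r"[a-zA-Z[\]]"),  # ']' escaped, '[' left unescaped
        re.compile(r"[]A-Za-z[]"),   # ']' first in the class, so it is literal
        re.compile(r"[][a-zA-Z]"),   # ']' first, then a literal '['
    ]
    # Outside a character class: escape '[' and leave ']' as is.
    bracketed_number = re.compile(r"\[[0-9]+]")

sample = "a[Z]9"
for cls in classes:
    print(cls.pattern, cls.findall(sample))  # each finds ['a', '[', 'Z', ']']

print(bracketed_number.findall("x[42] y[7]"))  # ['[42]', '[7]']
```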
Wiktor Stribiżew answered Nov 23 '25 05:11
