This is my code below. When I run it, I get this warning:
c:\Users\renne\Documents\Code\Text Analysis\Assignment1.1C.py:27: FutureWarning: Possible nested set at position 54
  for item in re.finditer("(?P<host>\d{3}[.]\d{3}[.]\d{3}[.]\d{3})(?P<user_name>[[\w]+\d{4}]|[-])(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})(?P<request>[A-Z]+ \S* HTTP/\d[.]\d)", logdata):
I don't know how to solve this. I have looked over my code a few times and can't figure out the problem.
I used a random string from the test data instead of the entire txt file to make testing easier. When this works, I'll change logdata = '...' to a file read.
import re
logdata = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
dict = {}
expression = """
(?P<host>\d{3}[.]\d{3}[.]\d{3}[.]\d{3})
(?P<user_name>[[\w]+\d{4}]|[-])
(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})
(?P<request>[A-Z]+ \S* HTTP/\d[.]\d)
"""
for item in re.finditer("(?P<host>\d{3}[.]\d{3}[.]\d{3}[.]\d{3})(?P<user_name>[[\w]+\d{4}]|[-])(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})(?P<request>[A-Z]+ \S* HTTP/\d[.]\d)", logdata):
    print(item.groupdict()['host'])
    print(item.groupdict())
You get the warning because you have a pair of unescaped square brackets inside a pair of unescaped square brackets. See the re documentation:
Support of nested sets and set operations as in Unicode Technical Standard #18 might be added in the future. This would change the syntax, so to facilitate this change a FutureWarning will be raised in ambiguous cases for the time being. That includes sets starting with a literal '[' or containing literal character sequences '--', '&&', '~~', and '||'. To avoid a warning escape them with a backslash.
The [[\w]+\d{4}] part is wrong: it matches one or more [ or word chars (with [[\w]+), then four digits (with \d{4}), and then a literal ] char (with ]). You need to remove all square brackets here.
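To see this behaviour directly, here is a minimal sketch that compiles only the flawed fragment while recording warnings. The variable names are illustrative, and it assumes a Python version that still emits this FutureWarning.

```python
import re
import warnings

flawed = r"[[\w]+\d{4}]"  # the user_name fragment from the question

re.purge()  # clear the compiled-pattern cache so the warning is emitted again
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pattern = re.compile(flawed)

# Keep only the FutureWarning entries ("Possible nested set ...")
nested_set_warnings = [w for w in caught if issubclass(w.category, FutureWarning)]
print(len(nested_set_warnings))

# [[\w]+ accepts "[" or word chars, and the final "]" is a literal,
# so a bracketed string matches in full ...
print(bool(pattern.fullmatch("[feest6811]")))
# ... while the plain user name from the log line does not.
print(bool(pattern.fullmatch("feest6811")))
```

So the fragment both triggers the warning and fails to match the actual user name, which is why removing the brackets fixes both problems.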
You can use
r'(?P<host>\d{3}\.\d{3}\.\d{3}\.\d{3}) - (?P<user_name>\w+\d{4}|-) \[(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})] "(?P<request>[A-Z]+ \S* HTTP/\d\.\d)'
See the regex demo.
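As a quick check, the suggested pattern can be run against the sample line from the question. This is only a sketch: the pattern is split across adjacent raw strings for readability, and the names pattern and records are illustrative.

```python
import re

logdata = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'

# Adjacent raw strings are concatenated into one pattern
pattern = re.compile(
    r'(?P<host>\d{3}\.\d{3}\.\d{3}\.\d{3}) - (?P<user_name>\w+\d{4}|-) '
    r'\[(?P<time>\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})] '
    r'"(?P<request>[A-Z]+ \S* HTTP/\d\.\d)'
)

# One dict of named groups per matching log entry
records = [m.groupdict() for m in pattern.finditer(logdata)]
for record in records:
    print(record)
```

Note that \d{3} only accepts octets with exactly three digits, as in the question's pattern; real log files may need \d{1,3} instead.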
If you encounter this error in other scenarios, you may need to fix it differently:
- To match [ or ] inside square brackets (a character class), escape ] and do not escape [. E.g. [a-zA-Z[\]] matches an ASCII letter, [ or ]. You may also keep ] unescaped if it is put at the start of the character class: []A-Za-z[] = [a-zA-Z[\]] = [][a-zA-Z].
- To match [ or ] outside of square brackets, escape [ and keep ] unescaped. E.g. \[[0-9]+] matches [, then one or more digits and then a ] char.
- Instead of wrapping a single shorthand class in brackets, as in [\w]+, always use \w+.
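The bracket rules above can be checked with a short sketch that turns FutureWarning into an error, so compilation would fail if any of these patterns were ambiguous. The sample strings and variable names are made up for illustration.

```python
import re
import warnings

with warnings.catch_warnings():
    # Raise instead of warn, so an ambiguous pattern would stop the script
    warnings.simplefilter("error", FutureWarning)
    inside_escaped = re.compile(r"[a-zA-Z[\]]")  # "]" escaped, "[" not escaped
    inside_leading = re.compile(r"[][a-zA-Z]")   # "]" placed first, unescaped
    outside = re.compile(r"\[[0-9]+]")           # "[" escaped outside a class

# Both character classes accept letters, "[" and "]"
print(inside_escaped.findall("a[1]"))
print(inside_leading.findall("a[1]"))

# Literal brackets around one or more digits
print(outside.findall("x[42] y[7]"))

# [\w]+ and \w+ find the same tokens; the brackets add nothing
same_tokens = re.findall(r"[\w]+", "feest6811 - x") == re.findall(r"\w+", "feest6811 - x")
print(same_tokens)
```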