I am trying to create a dictionary from a given string, which can be of the format
key1:value1 key2:value2
However, picking out the value is a problem, as the string may sometimes contain
key1: value1
key1: "value has space"
A key is identified by the pattern something:
I tried the following:
def tokenize(msg):
    legit_args = [i for i in msg if ":" in i]
    print(legit_args)
    dline = dict(item.split(":") for item in legit_args)
    return dline
The above only works for values without spaces.
Then I tried the following:
def tokenize2(msg):
    try:
        #return {k: v for k, v in re.findall(r'(?=\S|^)(.+?): (\S+)', msg)}
        return dict(token.split(':') for token in shlex.split(msg))
    except:
        return {}
This works well with key:"something given like this",
but it still needs some changes to work. The issue is shown below:
>>> msg = 'key1: "this is value1 " key2:this is value2 key3: this is value3'
>>> import shlex
>>> dict(token.split(':') for token in shlex.split(msg))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: dictionary update sequence element #1 has length 1; 2 is required
>>> shlex.split(msg) # problem is here i think
['key1:', 'this is value1 ', 'key2:this', 'is', 'value2', 'key3:', 'this', 'is', 'value3']
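To make the failure concrete, here is a small sketch (not part of the original attempt) that lists the tokens from shlex.split that contain no colon. Those are the tokens that cannot be split into a key and a value, which is what triggers the ValueError. The variable names are made up for illustration.

```python
import shlex

msg = 'key1: "this is value1 " key2:this is value2 key3: this is value3'

# shlex.split keeps quoted text together but still splits unquoted text on spaces.
tokens = shlex.split(msg)

# Tokens without a colon yield a one-element list from split(':'),
# which dict() cannot use as a (key, value) pair.
colonless = [t for t in tokens if ":" not in t]
print(colonless)
```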
Would you please try something like:
import re
s = "key1: \"this is value1 \" key2:this is value2 key3: this is value3"
d = {}
for m in re.findall(r'\w+:\s*(?:\w+(?:\s+\w+)*(?=\s|$)|"[^"]+")', s):
    key, val = re.split(r':\s*', m)
    d[key] = val.strip('"')
print(d)
Output:
{'key3': 'this is value3', 'key2': 'this is value2', 'key1': 'this is value1 '}
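For readability, the same pattern can also be written with re.VERBOSE, which lets each part carry its own comment. This is only a sketch of the regex used above, not a different approach; the name PAIR is made up for illustration.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE (PAIR is a hypothetical name).
PAIR = re.compile(r'''
    \w+ : \s*                # key, colon, optional whitespace
    (?:
        \w+ (?: \s+ \w+ )*   # unquoted value: words separated by whitespace
        (?= \s | $ )         # ...followed by whitespace or end of string
      |
        "[^"]+"              # or a double-quoted value
    )
''', re.VERBOSE)

s = 'key1: "this is value1 " key2:this is value2 key3: this is value3'
d = {}
for m in PAIR.findall(s):
    key, val = re.split(r':\s*', m)
    d[key] = val.strip('"')
print(d)
```

The lookahead is what stops an unquoted value from swallowing the next key: a word followed directly by a colon is not followed by whitespace, so the match backs off to the previous word.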
Explanation of the regex:
\w+:\s*
matches a word followed by a colon and possible (zero or more) whitespaces.
(?: ... )
composes a non-capturing group.
\w+(?:\s+\w+)*(?=\s|$)
matches one or more words followed by a whitespace or end of the string.
|
alternates the regex pattern.
"[^"]+"
matches a string enclosed by double quotes.

[Edit]
If you want to handle fancy quotes
(aka curly quotes or smart quotes), please try:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
s = "key1: \"this is value1 \" key2:this is value2 key3: this is value3 title: “incorrect title” title2: “incorrect title2” key4:10.20.30.40"
d = {}
for m in re.findall(r'\w+:\s*(?:[\w.]+(?:\s+[\w.]+)*(?=\s|$)|"[^"]+"|“.+?”)', s):
    key, val = re.split(r':\s*', m)
    d[key] = val.replace('“', '"').replace('”', '"').strip('"')
print(d)
Output:
{'title': 'incorrect title', 'key3': 'this is value3', 'key2': 'this is value2', 'key1': 'this is value1 ', 'key4': '10.20.30.40', 'title2': 'incorrect title2'}
[Edit2]
The following code now allows colon(s) in the values:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
s = "key1: \"this is value1 \" key2:this is value2 key3: this is value3 title: “incorrect title” title2: “incorrect title2” key4:10.20.30.40 key5:\"value having:colon\""
d = {}
for m in re.findall(r'\w+:\s*(?:[\w.]+(?:\s+[\w.]+)*(?=\s|$)|"[^"]+"|“.+?”)', s):
    key, val = re.split(r':\s*', m, 1)
    d[key] = val.replace('“', '"').replace('”', '"').strip('"')
print(d)
Output:
{'title': 'incorrect title', 'key3': 'this is value3', 'key2': 'this is value2', 'key1': 'this is value1 ', 'key5': 'value having:colon', 'key4': '10.20.30.40', 'title2': 'incorrect title2'}
The modification is applied in the line:
key, val = re.split(r':\s*', m, 1)
The third argument, 1, is passed as maxsplit to limit the number of splits to one, so only the first colon separates the key from the value.
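If it helps, the final version can be wrapped in a reusable function. The following is a minimal sketch under a few assumptions: the names PAIR_RE and parse_pairs are made up for illustration, and maxsplit is passed by keyword, since passing it positionally to re.split is deprecated as of Python 3.13.

```python
import re

# Pattern copied from the code above; PAIR_RE and parse_pairs are hypothetical names.
PAIR_RE = re.compile(r'\w+:\s*(?:[\w.]+(?:\s+[\w.]+)*(?=\s|$)|"[^"]+"|“.+?”)')

def parse_pairs(text):
    """Return a dict of the key/value pairs found in text."""
    result = {}
    for m in PAIR_RE.findall(text):
        # Split on the first colon only, so colons inside the value survive.
        key, val = re.split(r':\s*', m, maxsplit=1)
        # Normalize curly quotes to straight quotes, then strip the quotes.
        result[key] = val.replace('“', '"').replace('”', '"').strip('"')
    return result

sample = 'key4:10.20.30.40 key5:"value having:colon" title: “incorrect title”'
parsed = parse_pairs(sample)
print(parsed)
```

Note that only quoted values may contain a colon here. The unquoted branch of the pattern still accepts only word characters and dots.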