I'm trying to capture data from input like:
...
10 79 QUANT. DE ITENS A FORNECER O N 9 0 67 75
E' a quantidade de itens que o fornecedor consegue suprir
o cliente para uma determinada data. As casa decimais estao
definidas no campo 022 (unid. casas decimais).
11 24 DATA ENTREGA/EMBARQUE DO ITEM O N 6 0 76 81
Data de entrega/embarque do item. Nos casos em que este cam-
po nao contiver a data, seu conteudo devera ser ajustado en-
tre as partes.
...
My goal is to capture: ('10', '79', 'QUANT. DE ITENS A FORNECER', 'O', 'N', '9', '0', '67', '75') and so on...
My first try was to loop over the lines and capture the fields as follows:
def parse_line(line):
    pattern = r"\s(\d{1,6}|\w{1})\s"  # do not capture the description
    if re.search(pattern, line):
        tab_find = re.findall(pattern, line, re.DOTALL|re.UNICODE)
        if len(tab_find) > 6:
            return tab_find
My second try was to split the line and fill in the expected fields:
def ugly_parsing(line):
    result = [None] * 9  # init list
    tab_r = list(filter(None, re.split(r"\s", line)))  # ignore ''
    keys = [0, 1, -1, -2, -3, -4, -5, -6]
    for i in keys:
        result[i] = tab_r[i]
    result[2] = " ".join(tab_r[2:-6])
    return result
Ignoring the description is OK, but when the description contains a single letter, my regex does not work.
Just translate that line into a regex, with all the required numbers and characters, and give whatever remains to the description. You can do this using a non-greedy match: (.+?).
import re

p = re.compile(r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$")
for line in text.splitlines():  # text holds the input shown in the question
    m = p.match(line)
    if m:
        print(m.groups())
Output is
('10', '79', 'QUANT. DE ITENS A FORNECER', 'O', 'N', '9', '0', '67', '75')
('11', '24', 'DATA ENTREGA/EMBARQUE DO ITEM', 'O', 'N', '6', '0', '76', '81')
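For a self-contained check, here is a minimal sketch that runs the same pattern over the excerpt from the question. The names `text` and `rows` are only illustrative. The description lines do not start with two numbers, so they should simply be skipped.

```python
import re

# Excerpt copied from the question (header lines plus description lines).
text = """\
10 79 QUANT. DE ITENS A FORNECER O N 9 0 67 75
E' a quantidade de itens que o fornecedor consegue suprir
o cliente para uma determinada data. As casa decimais estao
definidas no campo 022 (unid. casas decimais).
11 24 DATA ENTREGA/EMBARQUE DO ITEM O N 6 0 76 81
Data de entrega/embarque do item. Nos casos em que este cam-
po nao contiver a data, seu conteudo devera ser ajustado en-
tre as partes.
"""

p = re.compile(r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$")

# Keep only the lines that match; each match yields a 9-tuple of fields.
rows = [m.groups() for m in map(p.match, text.splitlines()) if m]
for row in rows:
    print(row)
```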
I'm not sure whether it is more readable, but you could also build that large regex from smaller parts, e.g. "^" + r"(\d+)\s+" * 2 + "(.+?)" + r"\s+(\w+)" * 6 + "$" or "^" + r"\s+".join([r"(\d+)"] * 2 + ["(.+?)"] + [r"(\w+)"] * 6) + "$"
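As a quick sanity check, the sketch below builds the pattern both ways and compares the results with the hand-written version. The variable names are made up for illustration; the sample line is the one from the question.

```python
import re

# Hand-written pattern, as used in the loop above.
full = r"^(\d+)\s+(\d+)\s+(.+?)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)$"

# Variant 1: repeat the sub-patterns with string multiplication.
built_mul = "^" + r"(\d+)\s+" * 2 + "(.+?)" + r"\s+(\w+)" * 6 + "$"

# Variant 2: list the nine groups and join them with the separator.
built_join = "^" + r"\s+".join([r"(\d+)"] * 2 + ["(.+?)"] + [r"(\w+)"] * 6) + "$"

# Both constructions should produce exactly the same pattern string.
print(built_mul == full, built_join == full)

line = "10 79 QUANT. DE ITENS A FORNECER O N 9 0 67 75"
fields = re.match(built_join, line).groups()
print(fields)
```

Note the raw-string prefix on every piece that contains a backslash; without it, recent Python versions warn about the invalid escape sequence "\s".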
Or, depending on your input, you could split on something other than single spaces, such as two or more spaces \s{2,} (as suggested in the comments) or tabs, but this could cause problems if the description contains those, too. Relying on a fixed number of fields "around" the description might be more reliable.
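A minimal sketch of the splitting approach, assuming a hypothetical layout in which columns are separated by two or more spaces (the sample as pasted in the question only shows single spaces, so both lines below are invented for illustration):

```python
import re

# Hypothetical layout: columns separated by 2+ spaces, while words inside
# the description are separated by a single space.
line = "10  79  QUANT. DE ITENS A FORNECER  O  N  9  0  67  75"
fields = re.split(r"\s{2,}", line.strip())
print(fields)

# Caveat: if the description itself contains a run of 2+ spaces,
# it is broken into extra fields and the column count no longer fits.
bad = "10  79  QUANT. DE  ITENS  O  N  9  0  67  75"
bad_fields = re.split(r"\s{2,}", bad.strip())
print(len(bad_fields))
```

Checking that the split yields exactly nine fields is a cheap way to detect lines where this assumption breaks.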