I have a file, and parts of it look like this:
string                     0     1     10
string with white space    0     10    30
string9 with number 9      10    20    50
string_ with underline     10    50    1
(string with parentese)    50    20    100
I need to parse each line into something like:
[[string, 0, 1, 10], ...]
As you can see above, the first part can be pretty much anything, and the only way I can think of to parse this is to accept anything until I reach two whitespace characters; after that, it is just numbers.
But I cannot find this "UNTIL" functionality in the pyparsing documentation.
The following code sample achieves what you want (with improvements over the previous version suggested by @PaulMcGuire):
from __future__ import print_function
from pyparsing import CharsNotIn, Group, LineEnd, OneOrMore, Word, ZeroOrMore
from pyparsing import delimitedList, nums
SPACE_CHARS = ' \t'
word = CharsNotIn(SPACE_CHARS)
space = Word(SPACE_CHARS, exact=1)
label = delimitedList(word, delim=space, combine=True)
# an alternative construction for 'label' (needs Combine imported as well) could be:
# label = Combine(word + ZeroOrMore(space + word))
value = Word(nums)
line = label('label') + Group(OneOrMore(value))('values') + LineEnd().suppress()
text = """
string                     0     1     10
string with white space    0     10    30
string9 with number 9      10    20    50
string_ with underline     10    50    1
(string with parentese)    50    20    100
""".strip()
print('input text:\n', text, '\nparsed text:\n', sep='\n')
for line_tokens, start_location, end_location in line.scanString(text):
    print(line_tokens.dump())
giving the following output:
input text:
string                     0     1     10
string with white space    0     10    30
string9 with number 9      10    20    50
string_ with underline     10    50    1
(string with parentese)    50    20    100
parsed text:
['string', ['0', '1', '10']]
- label: string
- values: ['0', '1', '10']
['string with white space', ['0', '10', '30']]
- label: string with white space
- values: ['0', '10', '30']
['string9 with number 9', ['10', '20', '50']]
- label: string9 with number 9
- values: ['10', '20', '50']
['string_ with underline', ['10', '50', '1']]
- label: string_ with underline
- values: ['10', '50', '1']
['(string with parentese)', ['50', '20', '100']]
- label: (string with parentese)
- values: ['50', '20', '100']
The parsed values can also be collected into a dictionary, with the first column (named label in the example above) as the key and the list of the remaining columns (named values above) as the value, using the following dict comprehension:
{label: values.asList() for label, values in line.searchString(text)}
where line and text are the variables from the example above, generating the following result:
{'(string with parentese)': ['50', '20', '100'],
'string': ['0', '1', '10'],
'string with white space': ['0', '10', '30'],
'string9 with number 9': ['10', '20', '50'],
'string_ with underline': ['10', '50', '1']}
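For the sake of completeness, this one doesn't use pyparsing. It splits each line on runs of two or more whitespace characters:
import re
# 'text' is the multi-line input string from the example above
lines = re.compile(r"\r?\n").split(text)
pattern = re.compile(r"\s\s+")  # two or more whitespace characters
for raw_line in lines:
    print(pattern.split(raw_line))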
#['string', '0', '1', '10']
#['string with white space', '0', '10', '30']
#['string9 with number 9', '10', '20', '50']
#['string_ with underline', '10', '50', '1']
#['(string with parentese)', '50', '20', '100']
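Both approaches above leave the numbers as strings, while the question asks for something like [[string, 0, 1, 10], ...]. The following is a minimal sketch of one way to get there with the standard library only. It assumes the label is separated from the numbers by at least two whitespace characters and that the numbers are non-negative integers; the names ROW and parse_rows are made up for this illustration and do not come from either answer.

```python
import re

# label:  shortest run of characters that is followed by 2+ whitespace characters
# values: whitespace-separated integers running to the end of the line
ROW = re.compile(r"^(?P<label>\S.*?)\s{2,}(?P<values>\d+(?:\s+\d+)*)\s*$")


def parse_rows(text):
    """Return one [label, int, int, ...] list per matching line of text."""
    rows = []
    for raw_line in text.splitlines():
        match = ROW.match(raw_line.strip())
        if match is None:
            continue  # skip blank lines and lines that do not fit the layout
        numbers = [int(n) for n in match.group("values").split()]
        rows.append([match.group("label")] + numbers)
    return rows


sample = """
string                     0   1   10
string9 with number 9      10  20  50
(string with parentese)    50  20  100
"""
print(parse_rows(sample))
```

The lazy quantifier in the label group plays the role of the "until": it grows the label one character at a time and stops at the first gap of two or more whitespace characters that is followed only by numbers. Unlike the plain split, this also tolerates numbers separated by a single space.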