I need to extract some values from a multi-line string (which I read from the text body of emails). I want to be able to feed patterns to my parser so that I can add support for different emails later. I came up with the following:
#!/usr/bin/env ruby
text1 = <<-eos
Lorem ipsum dolor sit amet,
Name: Pepe Manuel Periquita
Email: [email protected]
Sisters: 1
Brothers: 3
Children: 2
Lorem ipsum dolor sit amet
eos
pattern1 = {
  :exp => /Name:[\s]*(.*?)$\s*
           Email:[\s]*(.*?)$\s*
           Sisters:[\s]*(.*?)$\s*
           Brothers:[\s]*(.*?)$\s*
           Children:[\s]*(.*?)$/mx,
  :blk => lambda do |m|
    m.flatten!
    {:name => m[0],
     :email => m[1],
     :total => m.drop(2).inject(0){|sum,item| sum + item.to_i}}
  end
}
# Scan on text returns
#[["Pepe Manuel Periquita", "[email protected]", "1", "3", "2"]]
def do_parse text, pattern
  data = pattern[:blk].call(text.scan(pattern[:exp]))
  puts data.inspect
end
do_parse text1, pattern1
# ./text_parser.rb
# {:email=>"[email protected]", :total=>6, :name=>"Pepe Manuel Periquita"}
So, I define the pattern as a regular expression paired with a block that builds a hash from the matches. The "parser" simply takes the text and applies the rules by executing the block on the result of matching the regular expression against the text with scan.
At the moment I have to parse emails with the format shown in text1, but later I would like to add patterns as easily as possible to extract data from different emails (the format will be fixed for each type of email). Therefore I would like to simplify the pattern by moving as much as possible into the "parser". The code above works and extracts the data, but most of the work lives in the pattern...
Is this the right way to go?
Could it be simplified, or would you suggest a different / better solution for this problem?
Update
I updated the parser following Tonttu's solution, so the pattern hash is now:
pattern2 = {
  :exp => /^(.+?):\s*(.+)$/,
  :blk => lambda do |m|
    r = Hash[m.map{|x| [x[0].downcase.to_sym, x[1]]}]
    {:name => r[:name],
     :email => r[:email],
     :total => r[:children].to_i + r[:brothers].to_i + r[:sisters].to_i}
  end
}
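One way to move more of the work out of the pattern, as a rough sketch rather than a definitive design: let the parser do the scan and build the symbol-keyed hash, so the block only describes how to combine the fields. The names parse_with, family_pattern and sample are made up for this illustration, and the sample text is a shortened copy of text1.

```ruby
# Hypothetical parser: scans the text with the pattern's regex, builds a
# symbol-keyed hash from the "Key: value" pairs, then calls the block on it.
def parse_with(text, pattern)
  fields = text.scan(pattern[:exp]).map do |key, value|
    [key.strip.downcase.to_sym, value.strip]
  end.to_h
  pattern[:blk].call(fields)
end

# The pattern now only says which fields matter and how to combine them.
family_pattern = {
  :exp => /^(.+?):\s*(.+)$/,
  :blk => lambda do |fields|
    {:name  => fields[:name],
     :total => [:sisters, :brothers, :children].sum { |k| fields[k].to_i }}
  end
}

sample = <<~EOS
  Lorem ipsum dolor sit amet,
  Name: Pepe Manuel Periquita
  Sisters: 1
  Brothers: 3
  Children: 2
  Lorem ipsum dolor sit amet
EOS

result = parse_with(sample, family_pattern)
puts result.inspect
```

With this split, a new email type would only need its own regex and a short block. A field that is missing from the text comes back as nil, and nil.to_i is 0, so the total still adds up.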
Maybe something like this is generic enough?
pp Hash[*text1.scan(/^(.+?):\s(.+)$/).map{|x|
  [x[0].downcase.to_sym, x[1]]
}.flatten]
=>
{:sisters=>"1",
:brothers=>"3",
:children=>"2",
:name=>"Pepe Manuel Periquita",
:email=>"[email protected]"}
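If the one-liner above turns out to be generic enough, it could be wrapped in a small helper so every email type goes through the same code. This is only a sketch: parse_fields and sample are hypothetical names, the regex is slightly stricter than the one above (it will not let the key or the value run across a line break), and it assumes Ruby 2.3 or later for the squiggly heredoc.

```ruby
# Hypothetical helper: collect every "Key: value" line into a hash with
# lowercase symbol keys. Lines without a colon are skipped.
def parse_fields(text)
  text.scan(/^([^:\n]+):[ \t]*(.*)$/).each_with_object({}) do |(key, value), fields|
    fields[key.strip.downcase.to_sym] = value.strip
  end
end

sample = <<~EOS
  Lorem ipsum dolor sit amet,
  Name: Pepe Manuel Periquita
  Sisters: 1
  Brothers: 3
  Children: 2
  Lorem ipsum dolor sit amet
EOS

fields = parse_fields(sample)
total = fields[:sisters].to_i + fields[:brothers].to_i + fields[:children].to_i
puts fields.inspect
puts total
```

Note that a key containing spaces, such as "First Name", would become the symbol :"first name", so the calling code may want to normalize keys further.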