I have a large XML file with a field that I want to split on spaces.
I do the following to save the split parts into a and b:
components = a.split(' ')
a = components[0]
b = components[1]
Some values are split correctly, but others are not, even though they all contain spaces. For example, when I try to split 'Maria Canada', it is not split on the space.
I am not sure why. If I open the file in Vim and copy the text that fails to split, I can split it correctly in the Ruby interactive shell:
'Maria (Canada)'.split(' ')
=> ["Maria", "(Canada)"]
UPDATE
OK, the reason is NBSP. I printed the lines that don't split to the console by raising errors, then copied the text and pasted it into irb. The copied text can't be split in irb either, nor can I strip that space:
>> ' '.strip
=> " "
I then ran ord and found that the space is an NBSP character (its code is 160):
>> ' '.ord
=> 160
So the XML file contains both regular spaces and NBSP characters. I think Vim automatically converts NBSP to spaces, which is why the text I copied from Vim no longer contained an NBSP.
Now I just need to figure out how to deal with NBSP.
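The check described in the update can also be scripted instead of copying text between the console and irb. This is only a minimal sketch: the variable name `name` and the sample value are made up for illustration, and the NBSP is written as the escape "\u00A0" so it stays visible in the source.

```ruby
# Hypothetical sample value; the NBSP (U+00A0) is written as an escape
# so it cannot be silently replaced by a regular space.
name = "Maria\u00A0(Canada)"

# Print every character with its code point.
# A regular space is 32, an NBSP is 160.
name.each_char { |ch| puts "#{ch.inspect} => #{ch.ord}" }

# Quick yes/no check for an NBSP anywhere in the string.
has_nbsp = name.include?("\u00A0")
puts "contains NBSP: #{has_nbsp}"
```

Printing with inspect (or p) rather than puts helps here, because a bare NBSP looks identical to a space in most terminals.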
You should split on all whitespace, including the non-ASCII kind:
a, b = str.split(/[[:space:]]/)
I'm assuming you are using Ruby 1.9+ and that your str has the right encoding (e.g. UTF-8). As explained in the regex reference, \s matches only ASCII whitespace, while [[:space:]] matches all Unicode whitespace (the same goes for \d vs. [[:digit:]], and so on).
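To make the difference concrete, here is a small sketch comparing the two character classes on a string that contains an NBSP. The sample string and the variable names (`str`, `padded`, `trimmed`) are assumptions for illustration, not taken from the question's XML file.

```ruby
# Hypothetical input with an NBSP (U+00A0) between the two words.
str = "Maria\u00A0(Canada)"

# \s is ASCII-only, so the NBSP is not treated as a separator
# and the string comes back as a single element.
ascii_parts = str.split(/\s/)
p ascii_parts.length

# [[:space:]] is Unicode-aware and matches the NBSP.
# The + collapses runs of consecutive whitespace.
a, b = str.split(/[[:space:]]+/)
p a
p b

# String#strip leaves NBSP in place; trimming with the same
# character class removes it from both ends.
padded  = "\u00A0Maria\u00A0"
trimmed = padded.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
p trimmed
```

If the NBSP characters are not wanted anywhere in the data, another option is to normalize them once, for example with str.tr("\u00A0", " "), and keep using the plain split(' ') afterwards.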