
Can't split/strip by space in a string in Ruby because it's an NBSP character

I have a large XML file with a field that I want to split on spaces.

So I do the following to save the split data into a and b:

components = a.split(' ')
a = components[0]
b = components[1]

Some values are split correctly, but others are not, even though they all appear to contain spaces. For example, when I try to split 'Maria Canada', it does not split on the space.

I am not sure why. If I open the file in Vim and copy the text that fails to split, it splits correctly in the Ruby interactive shell:

'Maria (Canada)'.split(' ')
 => ["Maria","(Canada)"]

UPDATE

OK, the reason is NBSP. I printed the lines that don't split to the console by raising errors, then copied the text and pasted it into irb. The copied text can't be split in irb either, nor can I strip that space.

>> ' '.strip
=> " "

I then ran ord and found that the space is an NBSP character (its code point is 160):

>> ' '.ord
=> 160

So the XML file contains both regular spaces and NBSP characters. I think Vim automatically converts NBSP to spaces, which is why the text I copied from Vim no longer contained NBSP.
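
The diagnosis above can be reproduced without copying text from a terminal or editor. The following is a minimal sketch, assuming a UTF-8 string; the sample value and variable names are made up for illustration, and "\u00A0" is simply the escaped form of NBSP:

```ruby
# Hypothetical sample value with an NBSP (U+00A0) between the two words.
sample = "Maria\u00A0(Canada)"

# split(' ') only treats ASCII whitespace as a separator,
# so the string comes back as a single element.
parts = sample.split(' ')

# Collect the code point of every character matched by the
# Unicode-aware POSIX class, to see which kind of space is present.
space_codepoints = sample.scan(/[[:space:]]/).map(&:ord)

puts parts.length             # number of pieces after split(' ')
puts space_codepoints.inspect # code points of the whitespace found
```

A code point of 32 indicates a regular space, while 160 indicates NBSP.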

Now I just need to figure out how to deal with NBSP.

lulalala asked Nov 25 '25 11:11



1 Answer

You should split on all whitespace characters, including the non-ASCII ones:

a, b = str.split(/[[:space:]]/)

I'm assuming you are using Ruby 1.9+ and that your str has the right encoding (e.g. UTF-8). As explained in the regex reference, \s matches only ASCII whitespace, while [[:space:]] matches all Unicode whitespace (the same goes for \d vs. [[:digit:]], etc.).
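
To make the difference between \s and [[:space:]] concrete, here is a short sketch, assuming a UTF-8 string. The sample values and variable names are hypothetical, and the trailing + in the pattern is an optional addition (not part of the one-liner above) that collapses runs of mixed whitespace. Since the question also mentions strip, the sketch includes one possible Unicode-aware trim built with gsub:

```ruby
# Hypothetical input with an NBSP (U+00A0) between the two words.
str = "Maria\u00A0(Canada)"

# \s covers only ASCII whitespace, so the string is not split.
ascii_split = str.split(/\s/)

# [[:space:]] also matches NBSP, so the split succeeds.
a, b = str.split(/[[:space:]]+/)

# strip leaves NBSP in place; trimming with [[:space:]] removes it.
padded  = "\u00A0 Maria \u00A0"
trimmed = padded.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')

puts ascii_split.length
puts a
puts b
puts trimmed
```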

Marc-André Lafortune answered Nov 28 '25 02:11



