I want to split Brazilian names into parts. However, in names like the ones below, "de", "da" (and others) are not separate parts: they always go with the following word. So a normal split doesn't work.
test1 = "Francisco da Sousa Rodrigues" #special split
test2 = "Emiliano Rodrigo Carrasco" #normal split
test3 = "Alberto de Francia" #special split
test4 = "Bruno Rezende" #normal split
My expected output would be:
[Francisco, da Sousa, Rodrigues] #1
[Emiliano, Rodrigo, Carrasco] #2
[Alberto, de Francia] #3
[Bruno, Rezende] #4
For the special cases I tried this pattern:
PATTERN = re.compile(r"\s(?=[da, de, do, dos, das])")
re.split(PATTERN, test1) (...)
but the output is not what I expected:
['Francisco', 'da Sousa Rodrigues'] #1
['Alberto', 'de Francia'] #3
Any idea how to fix it? Is there a way to just use one pattern for both "normal" and "special" case?
Will the names always be written in the "canonical" way, i.e. with every part capitalised except for da, de, do, ...?
In that case, you can use that fact:
>>> import re
>>> for t in (test1, test2, test3, test4):
...     print(re.findall(r"(?:[a-z]+ )?[A-Z]\w+", t, re.UNICODE))
['Francisco', 'da Sousa', 'Rodrigues']
['Emiliano', 'Rodrigo', 'Carrasco']
['Alberto', 'de Francia']
['Bruno', 'Rezende']
>>>
The "right" way to do what you want (apart from not doing it at all) would be a negative lookbehind: split on a space that isn't preceded by any of da, de, do, ... . Sadly, this can't (AFAIK) be written as one plain alternation, because re requires lookbehinds to be of fixed width. If no names end in these syllables, which you really can't assume, you could pad the alternatives to the same width and do this:
PATTERN = re.compile(r"(?<! da| de| do|dos|das)\s")
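One possible workaround that stays within the standard re module is to chain several negative lookbehinds, one per particle, since each of them is fixed-width on its own. The sketch below also adds a \b so that a name merely ending in one of the syllables (e.g. "Ricardo") is still split. The names PARTICLES and split_name are only illustrative, and the sketch assumes the parts are separated by single spaces.

```python
import re

# Hypothetical helper names; the particle list is taken from the question.
PARTICLES = ("da", "de", "do", "dos", "das")

# One fixed-width lookbehind per particle, chained together.
# \b requires the particle to be a whole word, not the tail of a longer name.
PATTERN = re.compile("".join(rf"(?<!\b{p})" for p in PARTICLES) + r"\s+")

def split_name(name):
    """Split on whitespace that does not directly follow a particle."""
    return PATTERN.split(name)

print(split_name("Francisco da Sousa Rodrigues"))
print(split_name("Ricardo de Souza"))
```

The first call should give ['Francisco', 'da Sousa', 'Rodrigues'] and the second ['Ricardo', 'de Souza'].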
You may or may not occasionally stumble upon cases that don't work: if the first letter is an accented character (or the article, hypothetically, contained one), the findall pattern above will match incorrectly. To fix this, you won't get around using an external library: regex.
Your new findall will look like this then:
regex.findall(r"(?:\p{Ll}+ )?\p{Lu}\w+", "Luiz Ângelo de Urzêda")
The \p{Ll} refers to any lowercase letter, and \p{Lu} to any uppercase letter.
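If installing regex is not an option, the same capitalisation idea can be sketched without any regular expression at all, using str.islower(), which is Unicode-aware in Python 3. This is a different technique from the findall above, not a drop-in equivalent: it assumes every particle is written entirely in lowercase and every other part starts with an uppercase letter. The function name split_name is hypothetical.

```python
def split_name(name):
    """Attach each all-lowercase token to the capitalised token that follows it."""
    parts = []
    pending = []  # lowercase particles waiting for the next capitalised word
    for token in name.split():
        if token.islower():
            pending.append(token)
        else:
            parts.append(" ".join(pending + [token]))
            pending = []
    # A trailing particle with nothing after it is kept as its own part.
    if pending:
        parts.append(" ".join(pending))
    return parts

print(split_name("Luiz Ângelo de Urzêda"))
```

For the accented example this should print ['Luiz', 'Ângelo', 'de Urzêda'].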
With the regex.split() function from Python's regex library, which offers additional functionality:
installation:
pip install regex
usage:
import regex as re
test_names = ["Francisco da Sousa Rodrigues", "Emiliano Rodrigo Carrasco",
              "Alberto de Francia", "Bruno Rezende"]
for n in test_names:
    print(re.split(r'(?<!das?|de|dos?)\s+', n))
The output:
['Francisco', 'da Sousa', 'Rodrigues']
['Emiliano', 'Rodrigo', 'Carrasco']
['Alberto', 'de Francia']
['Bruno', 'Rezende']
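(?<!das?|de|dos?)\s+ - the negative lookbehind assertion (?<!...) ensures that the whitespace \s+ is not preceded by one of the special cases da|das|de|do|dos. Note that the lookbehind is not anchored to a word boundary, so the space after a name that merely ends in one of these syllables (e.g. "Ricardo") is not split either; prefixing the alternatives with \b, as in (?<!\b(?:das?|de|dos?))\s+, should avoid that.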
https://pypi.python.org/pypi/regex/
You may use this regex in findall with an optional group:
(?:(?:da|de|do|dos|das)\s+)?\S+
Here the group (?:da|de|do|dos|das), together with the one or more whitespace characters that follow it, is made optional.
Code Example:
import re

test1 = "Francisco da Sousa Rodrigues" #special split
test2 = "Emiliano Rodrigo Carrasco" #normal split
test3 = "Alberto de Francia" #special split
test4 = "Bruno Rezende" #normal split
PATTERN = re.compile(r'(?:(?:da|de|do|dos|das)\s+)?\S+')
>>> print(re.findall(PATTERN, test1))
['Francisco', 'da Sousa', 'Rodrigues']
>>> print(re.findall(PATTERN, test2))
['Emiliano', 'Rodrigo', 'Carrasco']
>>> print(re.findall(PATTERN, test3))
['Alberto', 'de Francia']
>>> print(re.findall(PATTERN, test4))
['Bruno', 'Rezende']
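Since the question mentions "and others", it may be handy to build the same optional-group pattern from a list of particles instead of hard-coding it. The following is only a sketch; make_name_pattern is a hypothetical helper name, and the particle list is whatever you decide to pass in.

```python
import re

def make_name_pattern(particles):
    """Build the optional-particle findall pattern from a list of particles."""
    # Longest first, so "dos" is tried before "do" (backtracking would cope anyway).
    ordered = sorted(particles, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    return re.compile(rf"(?:(?:{alternatives})\s+)?\S+")

PATTERN = make_name_pattern(["da", "de", "do", "dos", "das"])
print(PATTERN.findall("Francisco da Sousa Rodrigues"))
print(PATTERN.findall("Alberto de Francia"))
```

Extending the list (for example with "e" or "di") then only means changing the argument, not the pattern itself.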