I am working with a data frame that contains letters, special characters, and digits. My goal is to extract all letters and the first digit. The digits always occur at the end, after the letters and special characters; however, some letters may appear after special characters. See the example below:
import pandas as pd
d = {'col1': ['A./B. 1234', 'CDEF/G5.','AB./C23']}
df = pd.DataFrame(data=d)
print(df)
# col1
# 0 A./B. 1234
# 1 CDEF/G5.
# 2 AB./C23
I looked up many variants, but I do not know how to handle special characters such as . and / and the like.
df.col1.str.extract(r'([A-Za-z\d]+)')
# 0
# 0 A
# 1 CDEF
# 2 AB
This gives me all the letters and digits until it reaches a special character. Eventually I would like to get the following output:
AB1
CDEFG5
ABC2
I am new to regex.
You need to extract all the characters up to and including the first digit, and then replace any non-letter/digit characters with an empty string:
d = {'col1': ['A./B. 1234', 'CDEF/G5.','AB./C23']}
df = pd.DataFrame(data=d)
df.col1.str.extract(r'^([^\d]+\d)').replace('[^A-Za-z0-9]', '', regex=True)
Output:
0
0 AB1
1 CDEFG5
2 ABC2
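To see the two steps in isolation, here is a minimal sketch using only Python's built-in re module rather than pandas. The helper name letters_and_first_digit is made up for illustration, and the sketch assumes, as in the question, that each string starts with a non-digit character:

```python
import re

def letters_and_first_digit(text):
    # Step 1: capture everything from the start up to and including the first digit.
    # \D+ requires at least one non-digit before that digit.
    match = re.match(r'^(\D+\d)', text)
    if match is None:
        return None  # no digit found, or the string starts with a digit
    # Step 2: remove every character that is not an ASCII letter or digit.
    return re.sub(r'[^A-Za-z0-9]', '', match.group(1))

samples = ['A./B. 1234', 'CDEF/G5.', 'AB./C23']
results = [letters_and_first_digit(s) for s in samples]
print(results)  # ['AB1', 'CDEFG5', 'ABC2']
```

The pandas one-liner above does the same thing column-wise: str.extract performs step 1 and replace(..., regex=True) performs step 2.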
Another method:
s=df['col1'].str.extractall("([a-zA-Z0-9])")[0]
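# keep letters and any character that directly follows a letter, then join per original row
s[s.str.isalpha()|s.shift().str.isalpha()].groupby(level=0).sum()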
0 AB1
1 CDEFG5
2 ABC2
Name: 0, dtype: object
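The idea behind this second approach can be sketched per string in plain Python. This is only an illustration of the logic, not the pandas code itself; the function name keep_letters_and_following_digit is hypothetical:

```python
import re

def keep_letters_and_following_digit(text):
    # Pull out the letters and digits only, in order (what extractall does per row).
    chars = re.findall(r'[A-Za-z0-9]', text)
    kept = []
    previous = None
    for ch in chars:
        # Keep every letter, plus any character whose predecessor is a letter.
        if ch.isalpha() or (previous is not None and previous.isalpha()):
            kept.append(ch)
        previous = ch
    return ''.join(kept)

samples = ['A./B. 1234', 'CDEF/G5.', 'AB./C23']
results = [keep_letters_and_following_digit(s) for s in samples]
print(results)  # ['AB1', 'CDEFG5', 'ABC2']
```

Note that this relies on the digits appearing only at the end: for a string such as 'A1B2', every digit follows a letter, so all of them would be kept.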