I have a dataframe where one column consists of strings that have three patterns:
1) Upper case letters only: APPLE COMPANY
2) Upper case letters and ends with the letters AS: CAR COMPANY AS
3) Upper and lower case letters: John Smith
df = pd.DataFrame({'NAME': ['APPLE COMPANY', 'CAR COMPANY AS', 'John Smith']})
NAME ...
0 APPLE COMPANY ...
1 CAR COMPANY AS ...
2 John Smith ...
3 ... ...
How can I remove the rows that match pattern 1), i.e. keep only the rows that satisfy 2) or 3)? In other words, how can I drop rows whose string contains only UPPER case letters and does not end with AS, while keeping rows that end with AS or that mix UPPER and LOWER case letters?
I came up with this:
df['NAME'].str.findall(r"(^[A-Z ':]+$)")
df['NAME'].str.findall('AS')
The first one extracts strings with only upper case letters, but the second one only finds AS. If there are methods other than regex, I am happy to try those as well.
Expected outcome is:
NAME ...
1 CAR COMPANY AS ...
2 John Smith ...
3 ... ...
This regex should work:
^(?:[A-Z ':]+ AS|.*[a-z].*)$
It matches either one of these:
[A-Z ':]+ AS - The case of all uppercase letters followed by AS
.*[a-z].* - The case of lowercase letters
One way would be:
df['temp']=df['NAME'].str.extract("(^[A-Z ':]+$)")
s1=df['temp']==df["NAME"]
s2=~df['NAME'].str.endswith('AS')
print(df.loc[~(s1&s2), 'NAME'])
O/P:
1 CAR COMPANY AS
2 John Smith
Name: NAME, dtype: object
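For completeness, the regex from the first answer can be used directly as a boolean mask, which avoids the temporary column. The sketch below assumes the three sample rows from the question. The second variant is an assumption on my part rather than something either answer proposed: it relies on `str.isupper()` and `str.endswith()` instead of a regex, so it treats any string without lowercase letters (including ones with digits or punctuation) as "upper case only".

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({'NAME': ['APPLE COMPANY', 'CAR COMPANY AS', 'John Smith']})

# Regex from the first answer: an all-uppercase name ending in " AS",
# or any name containing at least one lowercase letter.
pattern = r"^(?:[A-Z ':]+ AS|.*[a-z].*)$"

# str.match anchors at the start of each string; na=False treats
# missing values as non-matching so the mask stays strictly boolean.
keep = df['NAME'].str.match(pattern, na=False)
result = df[keep]
print(result)

# Non-regex variant (an assumption, not from the answers):
# drop rows that are entirely upper case unless they end with " AS".
is_upper = df['NAME'].str.isupper()
ends_with_as = df['NAME'].str.endswith(' AS')
result2 = df[~is_upper | ends_with_as]
print(result2)
```

Both masks keep rows 1 and 2 for the sample data. Note that the non-regex variant checks for " AS" with a leading space, so a name such as "TEXAS" would not count as ending with AS.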