I need to find special characters in an entire dataframe.
In the dataframe below, some columns contain special characters. How can I find which columns contain special characters?
I want to display the text for each column if it contains special characters.
You can set up an alphabet of valid characters, for example
import string
alphabet = string.ascii_letters+string.punctuation
Which is
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
And just use
df.col.str.strip(alphabet).astype(bool).any()
For example,
df = pd.DataFrame({'col1':['abc', 'hello?'], 'col2': ['ÃÉG', 'Ç']})
     col1 col2
0     abc  ÃÉG
1  hello?    Ç
Then, with the above alphabet,
df.col1.str.strip(alphabet).astype(bool).any()
False
df.col2.str.strip(alphabet).astype(bool).any()
True
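To report which columns contain special characters (as asked in the question), you can run the same check over every column. A minimal sketch, assuming all columns hold strings and using the alphabet defined above:
import string
import pandas as pd

alphabet = string.ascii_letters + string.punctuation
df = pd.DataFrame({'col1': ['abc', 'hello?'], 'col2': ['ÃÉG', 'Ç']})

# Print a message for each column that contains at least one character
# outside the allowed alphabet
for col in df.columns:
    if df[col].str.strip(alphabet).astype(bool).any():
        print(f'{col} contains special characters')
For the example dataframe this prints only "col2 contains special characters".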
The notion of special characters can be very tricky, because it depends on your interpretation. For example, you might or might not consider # to be a special character. Also, some languages (such as Portuguese) have characters like ã and é that others (such as English) do not.
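If your interpretation differs, you can simply adjust the alphabet. As an illustrative sketch (the extra characters below are only an assumption about what you might want to allow):
import string

# Treat digits, whitespace and a few accented letters as valid as well
alphabet = string.ascii_letters + string.punctuation
alphabet += string.digits + string.whitespace
alphabet += 'ãéÃÉçÇ'
With this extended alphabet, df.col2.str.strip(alphabet).astype(bool).any() from the example above returns False, since Ã, É and Ç are now considered valid.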
To remove unwanted characters from dataframe columns, you can use a regex:
import re

def strip_character(dataCol):
    # Keep only the characters listed inside the class; everything else is removed.
    # Note that '+-=' forms a character range, which also keeps the digits 0-9
    # (escape the hyphen if that is not intended).
    r = re.compile(r'[^a-zA-Z !@#$%&*_+-=|\:";<>,./()[\]{}\']')
    return r.sub('', dataCol)

# 'dataCol' is the source column and 'resultCol' the cleaned column
df['resultCol'] = df['dataCol'].apply(strip_character)
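As a quick illustration, applied to the example dataframe from the first answer (the output column name 'col2_clean' is just an assumption):
df['col2_clean'] = df['col2'].apply(strip_character)
# 'ÃÉG' becomes 'G' and 'Ç' becomes an empty string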