
How to convert/decode a pandas.Series of mixed bytes/strings into string or utf-8

I would like to solve the problem in two possible cases:

  1. Where I don't know whether the Series of strings is going to be UTF-8 or bytes beforehand.

  2. Where the strings in a pd.Series are a mix of bytes and UTF-8.

I'd guess both cases have the same solution.

Currently for:

b = pd.Series(['123', '434,', 'fgd', 'aas', b'442321'])
b.str.decode('utf-8')

This gives NaN wherever the string was already UTF-8. Or are they automatically ASCII? Can I pass the errors parameter to decode so that a string stays "undecoded" when it is already UTF-8, for example? The docstring doesn't provide much information.

Or is there a better way to accomplish this?

Alternatively, is there a string method in pandas like .str.decode which instead just returns a True/False when a string is bytes or UTF-8?

EDIT:

One option I can think of is:

b = pd.Series(['123', '434,', 'fgd', 'aas', b'442321'])
converted = b.str.decode('utf-8')
b.loc[~converted.isnull()] = converted

Is this the recommended way, then? It seems a bit roundabout. What would be more elegant is a way to check, for every element of a Series, whether it is bytes, and get back a boolean array marking where that is the case.

Marses asked Jan 27 '26 22:01


2 Answers

This will slow things down for a large Series, but you can pass apply a callable that uses a conditional expression to decode only the bytes elements:

>>> b.apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)
0       123
1      434,
2       fgd
3       aas
4    442321
dtype: object
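
For the True/False check the question asks about, one possible sketch is to map isinstance over the Series and then reuse that mask so only the bytes entries are decoded. The names is_bytes and decoded are illustrative, and this assumes the Series holds only str and bytes objects.

```python
import pandas as pd

b = pd.Series(['123', '434,', 'fgd', 'aas', b'442321'])

# Boolean mask: True where the element is a bytes object
is_bytes = b.map(lambda x: isinstance(x, bytes)).astype(bool)

# Decode only the bytes entries; str entries are left untouched
decoded = b.copy()
decoded[is_bytes] = b[is_bytes].str.decode('utf-8')
```

The mask can also be inspected on its own, for example with is_bytes.any(), to find out whether a Series contains any bytes at all before deciding to decode.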

Looking at the source for .str.decode() is instructive: it just applies _na_map(f, arr) over the Series, where the function f is f = lambda x: x.decode(encoding, errors). Because str has no decode method, calling f on a str raises an AttributeError, which is caught and replaced with NaN. This happens in str_decode().

>>> # These are pandas internals and may not be importable in other pandas versions.
>>> from pandas.core.strings import str_decode
>>> from pandas.core.strings import _cpython_optimized_encoders

>>> "utf-8" in _cpython_optimized_encoders
True
>>> str_decode(b, "utf-8")
array([nan, nan, nan, nan, '442321'], dtype=object)

>>> from pandas.core.strings import _na_map
>>> f = lambda x: x.decode("utf-8")
>>> _na_map(f, b)
array([nan, nan, nan, nan, '442321'], dtype=object)
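
The NaN behaviour described above can be imitated in plain Python without touching pandas internals. This is only a rough sketch: decode_or_nan is a hypothetical stand-in, not the pandas implementation, and it simply mirrors the idea of catching the exception and substituting a missing value.

```python
import math

def decode_or_nan(values, encoding="utf-8", errors="strict"):
    """Hypothetical stand-in for the mapping step: decode, or NaN on failure."""
    out = []
    for x in values:
        try:
            # bytes has .decode(); str does not, so this raises AttributeError
            out.append(x.decode(encoding, errors))
        except (TypeError, AttributeError):
            out.append(float("nan"))
    return out

result = decode_or_nan(['123', '434,', 'fgd', 'aas', b'442321'])
```

Only the last element is bytes, so it is the only one that decodes; the four str elements fall into the except branch and come back as NaN.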
Brad Solomon answered Jan 29 '26 12:01


The problem is still open in the pandas issue tracker.

It is caused by these lines:

  except (TypeError, AttributeError):
      return na_value

You can work around it by adding fillna:

b.str.decode('utf-8').fillna(b)
Out[237]: 
0       123
1      434,
2       fgd
3       aas
4    442321
dtype: object
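
On the errors parameter raised in the question: it controls how invalid byte sequences are handled while decoding bytes, and it does not affect elements that are already str. A minimal sketch that wraps the fillna workaround and forwards errors could look like this; decode_mixed is a hypothetical helper name, not a pandas API, and the sample data is made up for illustration.

```python
import pandas as pd

def decode_mixed(series, encoding="utf-8", errors="strict"):
    # decode_mixed is a hypothetical helper, not part of pandas
    decoded = series.str.decode(encoding, errors=errors)
    # str elements came back as NaN; restore them from the original Series
    return decoded.fillna(series)

# One str, one valid UTF-8 bytes value, one invalid UTF-8 bytes value
mixed = pd.Series(['abc', b'caf\xc3\xa9', b'\xff'])
result = decode_mixed(mixed, errors="replace")
```

With errors="replace", the invalid byte becomes the Unicode replacement character instead of raising. Note that genuinely missing values in the input stay missing, because fillna has nothing to restore them from.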
BENY answered Jan 29 '26 12:01


