I cannot understand why a Series created using dtype=str results in this:
In [2]: pandas.Series(index=range(2), dtype=str)
Out[2]: 
0    NaN
1    NaN
dtype: object
but a DataFrame created using dtype=str results in this:
In [3]: pandas.DataFrame(index=range(2), columns=[0], dtype=str)
Out[3]: 
   0
0  n
1  n
Why strings with just the letter "n"?
Why this difference between Series and DataFrame?
And where is this documented?!
This is now fixed in master and shouldn't be an issue from pandas 0.17.0 onwards.
In short, both DataFrames and Series create an empty NumPy array and fill it with np.nan values, but DataFrame uses the passed str dtype for this array while Series overrides it with the 'O' (object) dtype.
When no values are passed in, the __init__ method of both classes assigns an empty dictionary as the default data: data = {}.
After testing what type of object data is, the Series constructor falls back to generating an array of np.nan values, but using NumPy's 'O' dtype (not the str dtype):
np.empty(n, dtype='O') # later filled with np.nan
The 'O' datatype is capable of holding any type object, so np.nan causes no issues here.
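To make the Series case concrete, here is a minimal sketch of the same two steps in plain NumPy. It only mimics what the constructor is described as doing above, it is not the pandas source, and the names n and values are illustrative.

```python
import numpy as np

n = 2  # illustrative length, standing in for len(index)

# Object arrays store references to arbitrary Python objects.
values = np.empty(n, dtype='O')

# Each slot now holds the float np.nan itself; nothing is converted to text.
values.fill(np.nan)

print(values.dtype)
print([type(x).__name__ for x in values])
```

Because the elements stay floats, the resulting Series shows NaN and reports dtype object, as in the question.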
DataFrame's __init__ method also ends up using np.empty and then filling the empty array with np.nan. The difference is that the specified str datatype is used (and not the 'O' datatype). The code essentially reads as follows:
v = np.empty(len(index), dtype=str)
v.fill(np.nan)
Now, when created with the str datatype, the array returned by np.empty has the NumPy dtype '<U1' (i.e. one Unicode character), and so after the fill v becomes:
array(['n', 'n'], dtype='<U1')
since n is the first letter of nan (np.nan is represented as just nan) and a '<U1' array keeps only one character per element.
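The truncation can be reproduced directly in NumPy. The sketch below assumes Python 3 and a reasonably recent NumPy; the second array, wide, is not part of the pandas code and is there only to show that the full text survives when the dtype has room for three characters.

```python
import numpy as np

# dtype=str with no length gives a one-character unicode dtype.
v = np.empty(2, dtype=str)
v.fill(np.nan)  # np.nan is converted to the text 'nan', then cut to fit

# Hypothetical comparison: a three-character dtype keeps the whole word.
wide = np.empty(2, dtype='<U3')
wide.fill(np.nan)

print(v.dtype, v.tolist())
print(wide.dtype, wide.tolist())
```

So the "n" values are not a special marker; they are what is left of the string 'nan' once it is squeezed into a single character.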