Loading STATA file: Categorical values must be unique

Question

I am trying to load the .dta file behind this zip file into pandas, but I immediately get an error. I also have Stata available, but the error message tells me nothing more, such as which column is at fault, so I have no clue what to do.

How can I load the file into pandas?

>>> df = pd.read_stata('cepr_org_2014.dta')

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.15.2-py2.7-macosx-10.9-x86_64.egg/pandas/io/stata.py", line 69, in read_stata
    order_categoricals)
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.15.2-py2.7-macosx-10.9-x86_64.egg/pandas/io/stata.py", line 1315, in data
    cat_data.categories = categories
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.15.2-py2.7-macosx-10.9-x86_64.egg/pandas/core/categorical.py", line 442, in _set_categories
    categories = self._validate_categories(categories)
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.15.2-py2.7-macosx-10.9-x86_64.egg/pandas/core/categorical.py", line 437, in _validate_categories
    raise ValueError('Categorical categories must be unique')
ValueError: Categorical categories must be unique

fernandezcuesta · Accepted Answer

Load the file with pandas.read_stata('cepr_org_2014.dta', convert_categoricals=False, convert_missing=True) and have a look at what the data looks like. Optionally, debugging with ipdb, as suggested in the comments on the question, shows that there is a duplicate category in your data.

loikein · Answer

I know this is an old question, but I recently ran into a similar problem, and there seems to be no structured solution out there (including in the pandas documentation, I'm afraid), so this is my attempt at a more or less reusable answer.

The file link no longer works, so I assume that it is the same file as cepr_org_2014.dta on the CPS ORG Data page.

Method 1

Recent pandas versions report more detailed error messages for this particular problem (pandas-dev/pandas Issue #13923). So when I tried the same command, this is the error message:

ValueError: 
Value labels for column smsastat14 are not unique. These cannot be converted to
pandas categoricals.

Either read the file with `convert_categoricals` set to False or use the
low level interface in `StataReader` to separately read the values and the
value_labels.

The repeated labels are:
--------------------------------------------------------------------------------
Bridgeport-Stamford-Norwalk, CT
Hartford-West Hartford-East Hartford, CT
Portland-South Portland, ME
Burlington-South Burlington, VT
Worcester, MA-CT
Bangor, ME

This means that in the column smsastat14, several data codes have been mapped to the same label. I do not know whether this is against Stata best practice, but there are a lot of datasets in the wild with the same problem.
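To see why pandas refuses such a column, here is a minimal sketch. The numeric codes are made up for illustration; only the label text is taken from the error message above. Categories of a pandas categorical have to be unique, so building a categorical dtype from labels that repeat fails:

```python
import pandas as pd

# Hypothetical codes; two of them share the label "Bangor, ME".
labels = {1: "Bangor, ME", 2: "Bangor, ME", 3: "Worcester, MA-CT"}

message = None
try:
    # The categories would be the label texts, and those are not unique.
    pd.CategoricalDtype(categories=list(labels.values()))
except ValueError as err:
    message = str(err)

print(message)
```

The codes themselves are distinct, which is why reading them as plain numbers with convert_categoricals=False works.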

(By the way, I was very puzzled by the sentence "use the low level interface in `StataReader` to separately read the values and the value_labels", since no such method is mentioned anywhere in the documentation that I am aware of. The pandas.io.stata.StataReader docs all seem to focus on creating Stata datasets rather than reading them.)

Now, if you use the convert_categoricals=False flag, you will be able to get a DataFrame, but with numerical codes, like this:

df["smsastat14"]

# 0             NaN
# 1             NaN
# 2             NaN
# 3             NaN
# 4             NaN
#            ...   
# 317051    46520.0
# 317052    26180.0
# 317053    26180.0
# 317054    26180.0
# 317055    26180.0
# Name: smsastat14, Length: 317056, dtype: float32

The missing piece is how to convert the codes back to their labels. Here is one simple way to do it. (Series.map seems to be much faster than replace, credit to this answer, but map turns any code that has no label into NaN, whereas replace leaves such codes untouched.)

import pandas as pd

with pd.io.stata.StataReader("cepr_org_2014.dta") as sr:
    value_labels = sr.value_labels()

df = pd.read_stata(
    "cepr_org_2014.dta",
    convert_categoricals=False,
)

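# Replace the numeric codes with their labels, one column at a time.
# Assigning the result back avoids calling an inplace method on df[col],
# which does not update df when pandas copy-on-write is enabled.
for col in df:
    if col in value_labels:
        df[col] = df[col].replace(value_labels[col])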

Now you have the labels. Note that the column is of object type, so you may want to convert it to categorical or string before conducting any analysis.

df["smsastat14"]

# 0                        NaN
# 1                        NaN
# 2                        NaN
# 3                        NaN
# 4                        NaN
#                  ...        
# 317051    Urban Honolulu, HI
# 317052         Honolulu, HI*
# 317053         Honolulu, HI*
# 317054         Honolulu, HI*
# 317055         Honolulu, HI*
# Name: smsastat14, Length: 317056, dtype: object
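The conversion to categorical mentioned above could look like the following sketch. It uses a tiny hand-made Series in place of the real column (the values are copied from the output above), so treat it as an illustration rather than the result on the actual file:

```python
import pandas as pd

# Stand-in for df["smsastat14"] after the codes were replaced by labels.
labelled = pd.Series(
    ["Urban Honolulu, HI", "Honolulu, HI*", None, "Honolulu, HI*"],
    name="smsastat14",
)

# astype("category") builds an unordered categorical from the distinct labels;
# missing values stay missing and are not turned into a category.
as_category = labelled.astype("category")

print(as_category.dtype)
print(list(as_category.cat.categories))
```

Unlike the categoricals that read_stata builds, this one is unordered and its categories are sorted alphabetically rather than by Stata code.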

Method 2

Another, slightly more involved way is to modify the value_labels dictionary directly, so that the (ordered) categorical structure is preserved: perfectly for unmodified columns, partially for modified columns. Since this is undocumented behaviour, please note that it could change in future versions.

First look at the value_labels and find the repeated values.

with pd.io.stata.StataReader("cepr_org_2014.dta") as sr:
    value_labels = sr.value_labels()
    print(value_labels["smsastat14"])

Since it is a dictionary with non-unique values, I have yet to find a way of locating them other than pasting the output into a text editor and searching. After I replaced all repeated values in this column, another column popped up with the same problem, so we deal with that one as well.
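As an alternative to searching by hand, the repeated labels can be located by inverting the dictionary. This is a minimal sketch: find_repeated_labels is a hypothetical helper name, and the codes in the sample are made up; with the real file you would pass value_labels["smsastat14"] instead.

```python
from collections import defaultdict


def find_repeated_labels(code_to_label):
    """Return {label: [codes]} for every label attached to more than one code."""
    codes_by_label = defaultdict(list)
    for code, label in code_to_label.items():
        codes_by_label[label].append(code)
    return {
        label: sorted(codes)
        for label, codes in codes_by_label.items()
        if len(codes) > 1
    }


# Made-up codes for illustration only.
sample = {
    10: "Bangor, ME",
    20: "Abilene, TX",
    30: "Bangor, ME",
    40: "Worcester, MA-CT",
    50: "Worcester, MA-CT",
}
repeated = find_repeated_labels(sample)
print(repeated)
```

Running the helper over every column in value_labels would show all problem columns at once, instead of discovering them one error at a time.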

The importing code:

with pd.io.stata.StataReader("cepr_org_2014.dta") as sr:
    value_labels = sr.value_labels()
    value_labels["smsastat14"][71950] = "Bridgeport-Stamford-Norwalk, CT (1)"
    value_labels["smsastat14"][73450] = "Hartford-West Hartford-East Hartford, CT (1)"
    value_labels["smsastat14"][76750] = "Portland-South Portland, ME (1)"
    value_labels["smsastat14"][72400] = "Burlington-South Burlington, VT (1)"
    value_labels["smsastat14"][79600] = "Worcester, MA-CT (1)"
    value_labels["smsastat14"][70750] = "Bangor, ME (1)"
    value_labels["reltoref"][12] = "Nonrelative (1)"
    df = sr.read()

The columns keep their original types, apart from our small tweaks:

df["smsastat14"]

# 0                        NaN
# 1                        NaN
# 2                        NaN
# 3                        NaN
# 4                        NaN
#                  ...        
# 317051    Urban Honolulu, HI
# 317052         Honolulu, HI*
# 317053         Honolulu, HI*
# 317054         Honolulu, HI*
# 317055         Honolulu, HI*
# Name: smsastat14, Length: 317056, dtype: category
# Categories (326, object): [0.0 < 'Abilene, TX' < 'Akron, OH' < 'Albany, GA' ... 'Rochester-Dover, NH-ME' < 'Springfield, MA-CT' < 'Waterbury, CT' < 'Worcester, MA-CT (1)']
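The hand-written renames above could also be generated. The sketch below uses a hypothetical helper, make_labels_unique, which keeps the plain label for the lowest code and appends " (1)", " (2)", and so on to later repeats. Which code keeps the plain label is my assumption and may differ from the manual edits shown earlier. The codes in the sample are made up.

```python
from collections import Counter


def make_labels_unique(code_to_label):
    """Return a copy in which repeated labels get a numeric suffix.

    Codes are visited in ascending order; the first code with a given
    label keeps it, later ones become 'label (1)', 'label (2)', ...
    """
    seen = Counter()
    unique = {}
    for code in sorted(code_to_label):
        label = code_to_label[code]
        count = seen[label]
        unique[code] = label if count == 0 else f"{label} ({count})"
        seen[label] += 1
    return unique


# Made-up codes for illustration only.
sample = {10: "Bangor, ME", 30: "Bangor, ME", 20: "Abilene, TX"}
fixed = make_labels_unique(sample)
print(fixed)
```

To use this with the reader, the dictionaries returned by sr.value_labels() would have to be updated in place before calling sr.read(), for example with value_labels[col].update(make_labels_unique(value_labels[col])). That relies on the same undocumented behaviour as the manual version. The helper also does not check whether a generated label such as "Bangor, ME (1)" already exists in the column.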

Tags: python, pandas
FooBar
