Mapping values in place (for example with Gender) from string to int in Pandas dataframe [duplicate]

Question

I have a Dataframe called df_base that looks like this. As you can see, there's a column called Sex that's male or female. I want to map these values to 0 and 1, respectively.

+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
|   | PassengerId | Survived | Pclass |                       Name                        |  Sex   | Age | SibSp | Parch |      Ticket      |  Fare   | Cabin | Embarked |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
| 0 |           1 |        0 |      3 | Braund, Mr. Owen Harris                           | male   |  22 |     1 |     0 | A/5 21171        |    7.25 | NaN   | S        |
| 1 |           2 |        1 |      1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female |  38 |     1 |     0 | PC 17599         | 71.2833 | C85   | C        |
| 2 |           3 |        1 |      3 | Heikkinen, Miss. Laina                            | female |  26 |     0 |     0 | STON/O2. 3101282 |   7.925 | NaN   | S        |
| 3 |           4 |        1 |      1 | Futrelle, Mrs. Jacques Heath (Lily May Peel)      | female |  35 |     1 |     0 | 113803           |    53.1 | C123  | S        |
| 4 |           5 |        0 |      3 | Allen, Mr. William Henry                          | male   |  35 |     0 |     0 | 373450           |    8.05 | NaN   | S        |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+

There's a few methods that I've seen dotted about on StackOverflow but I'm wondering what the most efficient is to perform the following mapping:

+---------+---------+
| Old Sex | New Sex |
+---------+---------+
| male    |       0 |
| female  |       1 |
| female  |       1 |
| female  |       1 |
| male    |       0 |
+---------+---------+

I'm using this:

df_base['Sex'].replace(['male','female'],[0,1],inplace=True)

... but I can't help but feel as though this is a little shoddy. Is there a better way of doing this? There's also using .loc but that loops around the rows of the Dataframe, so is less efficient, right?

jezrael · Accepted Answer

I think here is better/faster use map by dictionary if only male and female exist in column Sex:

df_base['Sex'] = df_base['Sex'].map(dict(zip(['male','female'],[0,1]))

What is same like:

df_base['Sex'] = df_base['Sex'].map({'male': 0,'female': 1})

Solution if exist only female and male values is cast boolean mask to integers True/False to 1,0:

df_base['Sex'] = (df_base['Sex'] == 'female').astype(int)

Performance:

np.random.seed(2019)

import perfplot    

def ma(df):
    df = df.copy()
    df['Sex_new'] = df['Sex'].map({'male': 0,'female': 1})
    return df

def rep1(df):
    df = df.copy()
    df['Sex'] = df['Sex'].replace(['male','female'],[0,1])
    return df

def nwhere(df):
    df = df.copy()
    df['Sex_new'] = np.where(df['Sex'] == 'male', 0, 1)
    return df

def mask1(df):
    df = df.copy()
    df['Sex_new'] = (df['Sex'] == 'female').astype(int)
    return df

def mask2(df):
    df = df.copy()
    df['Sex_new'] = (df['Sex'].values == 'female').astype(int)
    return df


def make_df(n):
    df = pd.DataFrame({'Sex': np.random.choice(['male','female'], size=n)})

    return df

perfplot.show(
    setup=make_df,
    kernels=[ma,  rep1, nwhere, mask1, mask2],
    n_range=[2**k for k in range(2, 18)],
    logx=True,
    logy=True,
    equality_check=False,  # rows may appear in different order
    xlabel='len(df)')

Conclusion:

If replace only 2 values is slowiest replace, numpy.where, map and mask are similar. For improve performance compare by numpy array with .values.
Also all depends of data, so best test with real data.

Niels Henkens · Answer

My instinct would have suggested to use .map(), but I made a comparison between your solution and map, based on a dataframe with 1500 random male/female values.

%timeit df_base['Sex_new'] = df_base['Sex'].map({'male': 0,'female': 1})
1000 loops, best of 3: 653 µs per loop

Edited Based on coldspeeds comment, and because reassigning it is a better comparison with the others:

%timeit df_base['Sex_new'] = df_base['Sex'].replace(['male','female'],[0,1])
1000 loops, best of 3: 968 µs per loop

So actually slower .map()...!

~~So based on this example, your 'shoddy' solution seems faster than .map()...~~

Edit

pygo's solution:

%timeit df_base['Sex_new'] = np.where(df_base['Sex'] == 'male', 0, 1)
1000 loops, best of 3: 331 µs per loop

So faster!

Jezrael's solution with .astype(int):

%timeit df_base['Sex_new'] = (df_base['Sex'] == 'female').astype(int)
1000 loops, best of 3: 388 µs per loop

So also faster than .map() and .replace().

Mapping values in place (for example with Gender) from string to int in Pandas dataframe [duplicate]

Tags:

python

pandas

mapping

Aaraeus

2 Answers

jezrael

Niels Henkens

Recent Activity

Donate For Us

Mapping values in place (for example with Gender) from string to int in Pandas dataframe [duplicate]

Tags:

python

pandas

mapping

Aaraeus

2 Answers

jezrael

Niels Henkens

Related questions

Recent Activity

Donate For Us