Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Mapping values in place (for example with Gender) from string to int in Pandas dataframe [duplicate]

I have a Dataframe called df_base that looks like this. As you can see, there's a column called Sex that's male or female. I want to map these values to 0 and 1, respectively.

+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
|   | PassengerId | Survived | Pclass |                       Name                        |  Sex   | Age | SibSp | Parch |      Ticket      |  Fare   | Cabin | Embarked |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
| 0 |           1 |        0 |      3 | Braund, Mr. Owen Harris                           | male   |  22 |     1 |     0 | A/5 21171        |    7.25 | NaN   | S        |
| 1 |           2 |        1 |      1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female |  38 |     1 |     0 | PC 17599         | 71.2833 | C85   | C        |
| 2 |           3 |        1 |      3 | Heikkinen, Miss. Laina                            | female |  26 |     0 |     0 | STON/O2. 3101282 |   7.925 | NaN   | S        |
| 3 |           4 |        1 |      1 | Futrelle, Mrs. Jacques Heath (Lily May Peel)      | female |  35 |     1 |     0 | 113803           |    53.1 | C123  | S        |
| 4 |           5 |        0 |      3 | Allen, Mr. William Henry                          | male   |  35 |     0 |     0 | 373450           |    8.05 | NaN   | S        |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+

There's a few methods that I've seen dotted about on StackOverflow but I'm wondering what the most efficient is to perform the following mapping:

+---------+---------+
| Old Sex | New Sex |
+---------+---------+
| male    |       0 |
| female  |       1 |
| female  |       1 |
| female  |       1 |
| male    |       0 |
+---------+---------+

I'm using this:

df_base['Sex'].replace(['male','female'],[0,1],inplace=True)

... but I can't help but feel as though this is a little shoddy. Is there a better way of doing this? There's also using .loc but that loops around the rows of the Dataframe, so is less efficient, right?

like image 506
Aaraeus Avatar asked Dec 30 '25 01:12

Aaraeus


2 Answers

I think here is better/faster use map by dictionary if only male and female exist in column Sex:

df_base['Sex'] = df_base['Sex'].map(dict(zip(['male','female'],[0,1]))

What is same like:

df_base['Sex'] = df_base['Sex'].map({'male': 0,'female': 1})

Solution if exist only female and male values is cast boolean mask to integers True/False to 1,0:

df_base['Sex'] = (df_base['Sex'] == 'female').astype(int)

Performance:

np.random.seed(2019)

import perfplot    

def ma(df):
    df = df.copy()
    df['Sex_new'] = df['Sex'].map({'male': 0,'female': 1})
    return df

def rep1(df):
    df = df.copy()
    df['Sex'] = df['Sex'].replace(['male','female'],[0,1])
    return df

def nwhere(df):
    df = df.copy()
    df['Sex_new'] = np.where(df['Sex'] == 'male', 0, 1)
    return df

def mask1(df):
    df = df.copy()
    df['Sex_new'] = (df['Sex'] == 'female').astype(int)
    return df

def mask2(df):
    df = df.copy()
    df['Sex_new'] = (df['Sex'].values == 'female').astype(int)
    return df


def make_df(n):
    df = pd.DataFrame({'Sex': np.random.choice(['male','female'], size=n)})

    return df

perfplot.show(
    setup=make_df,
    kernels=[ma,  rep1, nwhere, mask1, mask2],
    n_range=[2**k for k in range(2, 18)],
    logx=True,
    logy=True,
    equality_check=False,  # rows may appear in different order
    xlabel='len(df)')

pic

Conclusion:

If replace only 2 values is slowiest replace, numpy.where, map and mask are similar. For improve performance compare by numpy array with .values.
Also all depends of data, so best test with real data.

like image 183
jezrael Avatar answered Jan 01 '26 18:01

jezrael


My instinct would have suggested to use .map(), but I made a comparison between your solution and map, based on a dataframe with 1500 random male/female values.

%timeit df_base['Sex_new'] = df_base['Sex'].map({'male': 0,'female': 1})
1000 loops, best of 3: 653 µs per loop

Edited Based on coldspeeds comment, and because reassigning it is a better comparison with the others:

%timeit df_base['Sex_new'] = df_base['Sex'].replace(['male','female'],[0,1])
1000 loops, best of 3: 968 µs per loop

So actually slower .map()...!

So based on this example, your 'shoddy' solution seems faster than .map()...

Edit

pygo's solution:

%timeit df_base['Sex_new'] = np.where(df_base['Sex'] == 'male', 0, 1)
1000 loops, best of 3: 331 µs per loop

So faster!

Jezrael's solution with .astype(int):

%timeit df_base['Sex_new'] = (df_base['Sex'] == 'female').astype(int)
1000 loops, best of 3: 388 µs per loop

So also faster than .map() and .replace().

like image 30
Niels Henkens Avatar answered Jan 01 '26 17:01

Niels Henkens



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!