How to convert strings of ranges (bins) into numerical values that can then be used with Seaborn visualisations

Question

I'm working with Python 3.7 in Jupyter Notebooks. I'm currently exploring some survey data in the form of a pandas DataFrame imported from a .CSV file. I would like to explore it further with some Seaborn visualisations; however, the numerical data has been gathered in the form of bins (such as age ranges), stored as string values.

Is there a way to convert these columns (Age and Approximate Household Income) into numerical values, which could then be used with Seaborn? I've searched, but my wording only seems to return methods for creating age bins from columns that already hold numerical values. I'm really looking for how to convert the string values into numerical bin values.

Also, does anybody have some tips on how I could improve my search method? What would have been the ideal wording when searching for a solution to something like this?

Here is a sample from the dataframe, using df.head(5).to_dict(), with values changed for anonymity purposes.

{'Age': {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'},
 'Ethnicity': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
 'Approximate Household Income': {0: '$175,000 - $199,999',
  1: '$75,000 - $99,999',
  2: '$25,000 - $49,999',
  3: '$50,000 - $74,999',
  4: nan},
 'Highest Level of Education Completed': {0: 'Four Year College Degree',
  1: 'Four Year College Degree',
  2: 'Jr College/Associates Degree',
  3: 'Jr College/Associates Degree',
  4: 'Four Year College Degree'},
 '2020 Candidate Choice': {0: 'Joe Biden',
  1: 'Joe Biden',
  2: 'Donald Trump',
  3: 'Joe Biden',
  4: 'Donald Trump'},
 '2016 Candidate Choice': {0: 'Hillary Clinton',
  1: 'Third Party',
  2: 'Donald Trump',
  3: 'Hillary Clinton',
  4: 'Third Party'},
 'Party Registration 2020': {0: 'Independent',
  1: 'No Party',
  2: 'No Party',
  3: 'Independent',
  4: 'Independent'},
 'Registered State for Voting': {0: 'Colorado',
  1: 'Virginia',
  2: 'California',
  3: 'North Carolina',
  4: 'Oregon'}}

Alex · Accepted Answer

You can use some of the pandas Series.str methods.

Smaller example dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Age": {0: "45-54", 1: "35-44", 2: "45-54", 3: "45-54", 4: "55-64"},
        "Ethnicity": {0: "White", 1: "White", 2: "White", 3: "White", 4: "White"},
        "Approximate Household Income": {
            0: "$175,000 - $199,999",
            1: "$75,000 - $99,999",
            2: "$25,000 - $49,999",
            3: "$50,000 - $74,999",
            4: np.nan,
        },
    }
)
#      Age Ethnicity Approximate Household Income
# 0  45-54     White          $175,000 - $199,999
# 1  35-44     White            $75,000 - $99,999
# 2  45-54     White            $25,000 - $49,999
# 3  45-54     White            $50,000 - $74,999
# 4  55-64     White                          NaN

We can iterate through a list of columns and chain these methods to parse the ranges, all within the pandas.DataFrame.

Methods we will use, in order:

Series.str.replace - remove the thousands-separator commas
Series.str.extract - extract the two numbers from each string; the regex skips any leading "$", captures the first run of digits, skips any hyphens, whitespace or "$" in between, then captures the second run of digits
Series.astype - convert the extracted numbers to floats
DataFrame.rename - rename the new columns
DataFrame.join - add the extracted numbers back on to the original DataFrame
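
Before applying the pattern to whole columns, it can help to check what it captures on single strings. Here is a minimal sketch using only Python's built-in re module. The helper name parse_range is made up for illustration, and the "65+" input is a hypothetical open-ended bin that does not appear in the sample data:

```python
import re

# Same pattern as in the pandas version of this answer.
PATTERN = re.compile(r"^[$]*(\d+)[-\s$]*(\d+)$")


def parse_range(text):
    """Return (lower, upper) as floats, or None if the text does not match.

    parse_range is a hypothetical helper, not part of pandas.
    """
    # Strip thousands separators first, as Series.str.replace does.
    match = PATTERN.match(text.replace(",", ""))
    if match is None:
        return None
    lower, upper = match.groups()
    return float(lower), float(upper)


print(parse_range("45-54"))                # (45.0, 54.0)
print(parse_range("$175,000 - $199,999"))  # (175000.0, 199999.0)
print(parse_range("65+"))                  # None (hypothetical open-ended bin)
```

In the pandas version, rows that do not match the pattern come back from Series.str.extract as NaN rather than None, so it is worth checking for unexpected NaN values if your data contains bin labels in other formats. The loop below applies the same pattern to whole columns.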

for col in ["Age", "Approximate Household Income"]:
    df = df.join(
        df[col]
        .str.replace(",", "", regex=False)
        .str.extract(pat=r"^[$]*(\d+)[-\s$]*(\d+)$")
        .astype("float")
        .rename({0: f"{col}_lower", 1: f"{col}_upper"}, axis="columns")
    )
#      Age Ethnicity Approximate Household Income  Age_lower  Age_upper  \
# 0  45-54     White          $175,000 - $199,999       45.0       54.0   
# 1  35-44     White            $75,000 - $99,999       35.0       44.0   
# 2  45-54     White            $25,000 - $49,999       45.0       54.0   
# 3  45-54     White            $50,000 - $74,999       45.0       54.0   
# 4  55-64     White                          NaN       55.0       64.0   
# 
#    Approximate Household Income_lower  Approximate Household Income_upper  
# 0                            175000.0                            199999.0  
# 1                             75000.0                             99999.0  
# 2                             25000.0                             49999.0  
# 3                             50000.0                             74999.0  
# 4                                 NaN                                 NaN
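
Once the bounds are numeric, one option for plotting is to collapse each bin to a single representative value. The sketch below uses the bin midpoint. This is an assumption about how you want to summarise a range, not something the data itself tells you; the _mid column names are made up for illustration, and the small frame is only a stand-in for the parsed columns produced by the loop above:

```python
import numpy as np
import pandas as pd

# Stand-in for the parsed columns created by the loop above.
df = pd.DataFrame(
    {
        "Age_lower": [45.0, 35.0, 45.0, 45.0, 55.0],
        "Age_upper": [54.0, 44.0, 54.0, 54.0, 64.0],
        "Approximate Household Income_lower": [175000.0, 75000.0, 25000.0, 50000.0, np.nan],
        "Approximate Household Income_upper": [199999.0, 99999.0, 49999.0, 74999.0, np.nan],
    }
)

for col in ["Age", "Approximate Household Income"]:
    # Midpoint of each bin; a missing bound gives a missing midpoint.
    df[f"{col}_mid"] = (df[f"{col}_lower"] + df[f"{col}_upper"]) / 2

print(df[["Age_mid", "Approximate Household Income_mid"]])
```

A numeric column such as Age_mid can then be passed to Seaborn, for instance as the y variable of sns.boxplot with a categorical column such as 2020 Candidate Choice on x. Bear in mind that midpoints only approximate the underlying values, since every respondent in a bin gets the same number.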

Tags: python, pandas, visualization, seaborn

DanMack