So, I'm working with Python 3.7 in Jupyter Notebooks. I'm currently exploring some survey data in the form of a Pandas DataFrame imported from a .CSV file. I would like to explore it further with some Seaborn visualisations; however, the numerical data has been gathered in the form of age bins, using string values.
Is there a way I could go about converting these columns (Age and Approximate Household Income) into numerical values, which could then be used with Seaborn? I've attempted searches, but my wording only seems to return methods for creating age bins from columns with numerical values. I'm really looking for how I'd convert string values into numerical age bin values.
Also, does anybody have tips on how I could improve my search method? What would have been the ideal wording when searching for a solution to something like this?
Here is a sample from the dataframe, using df.head(5).to_dict(), with values changed for anonymity purposes.
'Age': {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'},
'Ethnicity': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
'Approximate Household Income': {0: '$175,000 - $199,999',
1: '$75,000 - $99,999',
2: '$25,000 - $49,999',
3: '$50,000 - $74,999',
4: nan},
'Highest Level of Education Completed': {0: 'Four Year College Degree',
1: 'Four Year College Degree',
2: 'Jr College/Associates Degree',
3: 'Jr College/Associates Degree',
4: 'Four Year College Degree'},
'2020 Candidate Choice': {0: 'Joe Biden',
1: 'Joe Biden',
2: 'Donald Trump',
3: 'Joe Biden',
4: 'Donald Trump'},
'2016 Candidate Choice': {0: 'Hillary Clinton',
1: 'Third Party',
2: 'Donald Trump',
3: 'Hillary Clinton',
4: 'Third Party'},
'Party Registration 2020': {0: 'Independent',
1: 'No Party',
2: 'No Party',
3: 'Independent',
4: 'Independent'},
'Registered State for Voting': {0: 'Colorado',
1: 'Virginia',
2: 'California',
3: 'North Carolina',
4: 'Oregon'}
You can use some of the pandas Series.str methods.
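As a quick illustration of what the Series.str accessor does, here is a minimal sketch on a made-up Series (the values and the names income and no_commas are assumptions for illustration, not taken from the survey). The string method is applied element by element, and a missing value stays missing instead of raising an error.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; the second entry is missing on purpose.
income = pd.Series(["$175,000 - $199,999", np.nan, "$25,000 - $49,999"])

# Literal (non-regex) replacement applied to every string in the Series.
# The missing entry is passed through as a missing value.
no_commas = income.str.replace(",", "", regex=False)

print(no_commas)
```

This is why a row with no income answer can go through the whole chain of string methods and simply end up as NaN.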
Smaller example dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "Age": {0: "45-54", 1: "35-44", 2: "45-54", 3: "45-54", 4: "55-64"},
        "Ethnicity": {0: "White", 1: "White", 2: "White", 3: "White", 4: "White"},
        "Approximate Household Income": {
            0: "$175,000 - $199,999",
            1: "$75,000 - $99,999",
            2: "$25,000 - $49,999",
            3: "$50,000 - $74,999",
            4: np.nan,
        },
    }
)
#      Age Ethnicity Approximate Household Income
# 0  45-54     White          $175,000 - $199,999
# 1  35-44     White            $75,000 - $99,999
# 2  45-54     White            $25,000 - $49,999
# 3  45-54     White            $50,000 - $74,999
# 4  55-64     White                          NaN
We can iterate through a list of columns and chain these methods to parse the ranges, all within the pandas.DataFrame:
Methods we will use, in order:
Series.str.replace - replace commas with nothing
Series.str.extract - extract the lower and upper numbers of each range; the regex allows an optional leading $, captures a run of digits, skips any mix of -, whitespace and $, then captures a second run of digits
Series.astype - convert the extracted numbers to floats
DataFrame.rename - rename the new columns
DataFrame.join - add the extracted numbers back on to the original DataFrame
for col in ["Age", "Approximate Household Income"]:
    df = df.join(
        df[col]
        # "175,000" -> "175000" so that each number is one run of digits
        .str.replace(",", "", regex=False)
        # two capture groups -> two columns (0 and 1); non-matching rows become NaN
        .str.extract(pat=r"^[$]*(\d+)[-\s$]*(\d+)$")
        .astype("float")
        .rename({0: f"{col}_lower", 1: f"{col}_upper"}, axis="columns")
    )
#      Age Ethnicity Approximate Household Income  Age_lower  Age_upper  \
# 0  45-54     White          $175,000 - $199,999       45.0       54.0
# 1  35-44     White            $75,000 - $99,999       35.0       44.0
# 2  45-54     White            $25,000 - $49,999       45.0       54.0
# 3  45-54     White            $50,000 - $74,999       45.0       54.0
# 4  55-64     White                          NaN       55.0       64.0
#
#    Approximate Household Income_lower  Approximate Household Income_upper
# 0                            175000.0                            199999.0
# 1                             75000.0                             99999.0
# 2                             25000.0                             49999.0
# 3                             50000.0                             74999.0
# 4                                 NaN                                 NaN
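Since the goal was to plot these with Seaborn, one possible next step is to collapse each pair of bounds into a single number per row. The sketch below assumes the midpoint of a bin is an acceptable stand-in for the true value, which is a modelling choice rather than something the data guarantees. The frame named parsed and the "_mid" column names are hypothetical; the values are copied from the output above so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-in for the parsed columns shown above (values copied from that output).
parsed = pd.DataFrame(
    {
        "Age_lower": [45.0, 35.0, 45.0, 45.0, 55.0],
        "Age_upper": [54.0, 44.0, 54.0, 54.0, 64.0],
        "Approximate Household Income_lower": [175000.0, 75000.0, 25000.0, 50000.0, np.nan],
        "Approximate Household Income_upper": [199999.0, 99999.0, 49999.0, 74999.0, np.nan],
    }
)

for col in ["Age", "Approximate Household Income"]:
    # Hypothetical "_mid" column: the average of the two bounds.
    # If either bound is NaN, the result is NaN as well.
    parsed[f"{col}_mid"] = (parsed[f"{col}_lower"] + parsed[f"{col}_upper"]) / 2

print(parsed[["Age_mid", "Approximate Household Income_mid"]])
```

The two "_mid" columns are plain floats, so they could be passed to something like sns.scatterplot(data=parsed, x="Age_mid", y="Approximate Household Income_mid"). Any bin that does not follow the "number - number" shape would not match the regex and would stay NaN, so it would need separate handling.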