How to turn a series of strings from a pandas column into integers

I have a pandas DataFrame with a column of dates in the format "2016-05-03". These are strings. I need to split each one at the hyphen ('-'), keep only the year (element [0]), and convert it from a string to an int.

This is what I have tried to turn the string into an integer:

tyc.startDate = tyc.startDate.astype(np.int64) 

But it returns an error:

ValueError: invalid literal for int() with base 10: '2015-06-01'

and this is what I've done for splitting:

tyc.startDate.str.split('-')[0]

and

tyc.startDate.str.split('-', [0]) 

But this isn't working either: it splits and returns a list for each row of the column, in the form ['2015', '06', '01'], and I want just the year!

I'm sure there is a simple way to split at '-', take position 0, convert it to int, and put the result into the df as a new column. Please help!

asked Nov 04 '25 20:11 by s.23


1 Answer

I believe your data contains NaNs or some values that are not valid dates:

import numpy as np
import pandas as pd

tyc = pd.DataFrame({'startDate':['2016-05-03','2017-05-03', np.nan],
                    'col':[1,2,3]})
print (tyc)
   col   startDate
0    1  2016-05-03
1    2  2017-05-03
2    3         NaN

Use str[0] to return the first list element for each row. There is a problem, though: the column contains NaNs, which cannot be converted to int (by design), so the output is floats:

print (tyc.startDate.str.split('-').str[0].astype(float))
0    2016.0
1    2017.0
2       NaN
Name: startDate, dtype: float64
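
To make the difference between the question's `[0]` and `.str[0]` concrete, here is a minimal sketch. It assumes the default integer index, and the variable names (`s`, `parts`, `first_row`, `years`, `years_float`) are only illustrative:

```python
import numpy as np
import pandas as pd

# Same sample values as above, held in a standalone Series.
s = pd.Series(['2016-05-03', '2017-05-03', np.nan], name='startDate')

# Each non-missing row becomes a list such as ['2016', '05', '03'].
parts = s.str.split('-')

# Plain [0] is a lookup by index label on the Series itself,
# so it returns the whole list stored in row 0.
first_row = parts[0]

# .str[0] indexes into the list of every row and keeps NaN as NaN.
years = parts.str[0]

# NaN cannot live in a NumPy integer column, so cast to float.
years_float = years.astype(float)
print(years_float)
```

So `str.split('-')[0]` picks a row, while `str.split('-').str[0]` picks the first piece of every row.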

Another solution is to convert to datetime with to_datetime and extract the year with dt.year:

print (pd.to_datetime(tyc.startDate, errors='coerce'))
0   2016-05-03
1   2017-05-03
2          NaT
Name: startDate, dtype: datetime64[ns]

print (pd.to_datetime(tyc.startDate, errors='coerce').dt.year)
0    2016.0
1    2017.0
2       NaN
Name: startDate, dtype: float64
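
The same approach also covers values that are not dates at all. The sketch below uses a made-up value `'unknown'` (not from the question's data) and passes an explicit `format`, on the assumption that all valid dates look like "2016-05-03":

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing a valid date, a non-date string and a missing value.
raw = pd.Series(['2016-05-03', 'unknown', np.nan], name='startDate')

# errors='coerce' turns anything that does not match the format into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# .dt.year yields NaN for every NaT row, so the result is a float column.
years = parsed.dt.year
print(years)
```

Both the non-date string and the missing value end up as NaN in the year column, so they can be handled together afterwards.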

Solutions for removing the NaNs, starting from the year stored in a new column:

tyc['year'] = pd.to_datetime(tyc.startDate, errors='coerce').dt.year
print (tyc)
   col   startDate    year
0    1  2016-05-03  2016.0
1    2  2017-05-03  2017.0
2    3         NaN     NaN

1. Remove all rows with NaNs using dropna and then cast to int:

tyc = tyc.dropna(subset=['year'])
tyc['year'] = tyc['year'].astype(int)
print (tyc)
   col   startDate  year
0    1  2016-05-03  2016
1    2  2017-05-03  2017

2. Replace NaNs with some int value, such as 1, using fillna and then cast to int:

tyc['year'] = tyc['year'].fillna(1).astype(int)
print (tyc)
   col   startDate  year
0    1  2016-05-03  2016
1    2  2017-05-03  2017
2    3         NaN     1
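
If neither dropping rows nor inventing a placeholder value is acceptable, one further option, not part of the two solutions above, is pandas' nullable integer dtype `Int64` (capital "I"). This is a sketch that assumes a pandas version with nullable integer support (1.0 or later should be fine):

```python
import numpy as np
import pandas as pd

tyc = pd.DataFrame({'startDate': ['2016-05-03', '2017-05-03', np.nan],
                    'col': [1, 2, 3]})

# 'Int64' is the nullable integer extension dtype: years are stored as
# integers and missing values are kept as <NA> instead of forcing floats.
tyc['year'] = (pd.to_datetime(tyc['startDate'], errors='coerce')
                 .dt.year
                 .astype('Int64'))
print(tyc)
```

All three rows are kept, the valid years are integers, and the missing one stays missing. Note that some code expecting plain NumPy integers may need the column converted before use.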

answered Nov 07 '25 11:11 by jezrael


