Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split strings in a Series, convert to array and average the values

Tags:

python

pandas

I have a Pandas Series that has these unique values:

array(['17', '19', '21', '20', '22', '23', '12', '13', '15', '24', '25',
       '18', '16', '14', '26', '11', '10', '12/16', '27', '10/14',
       '16/22', '16/21', '13/17', '14/19', '11/15', '10/15', '15/21',
       '13/19', '13/18', '32', '28', '12/15', '29', '42', '30', '31',
       '34', '46', '11/14', '18/25', '19/26', '17/24', '19/24', '17/23',
       '13/16', '11/16', '15/20', '36', '17/25', '19/25', '17/22',
       '18/26', '39', '41', '35', '50', '9/13', '33', '10/13', '9/12',
       '93/37', '14/20', '10/16', '14/18', '16/23', '37', '9/11', '37/94',
       '20/54', '22/31', '22/30', '23/33', '44', '40', '50/95', '38',
       '16/24', '15/23', '15/22', '18/23', '16/20', '37/98', '19/27',
       '38/88', '23/31', '14/22', '45', '39/117', '28/76', '33/82',
       '15/19', '23/30', '47', '46/115', '14/21', '17/18', '25/50',
       '12/18', '12/17', '21/28', '20/27', '26/58', '22/67', '22/47',
       '25/51', '35/83', '39/86', '31/72', '24/56', '30/80', '32/85',
       '42/106', '40/99', '30/51', '21/43', '52', '56', '25/53', '34/83',
       '30/71', '27/64', '35/111', '26/62', '32/84', '39/95', '18/24',
       '22/29', '42/97', '48', '55', '58', '39/99', '49', '43', '40/103',
       '22/46', '54/133', '25/54', '36/83', '29/72', '28/67', '35/109',
       '25/62', '14/17', '42/110', '52/119', '20/60', '46/105', '25/56',
       '27/65', '25/74', '21/49', '29/71', '26/59', '27/62'], dtype=object)

The ones that have the '/', I want to split these into arrays and then average their values. One simpler but a flawed approach is to simply extract the first value: master_data["Cmb MPG"].str.split('/').str[0].astype('int8')

However, what I truly require is the two values being averaged. I have tried several commands and this one:

np.array(master_data["Cmb MPG"].str.split('/')).astype('int8').mean()

Should ideally do the job, but I get a ValueError followed by a TypeError:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[88], line 1
----> 1 np.array(master_data["Cmb MPG"].str.split('/')).astype('int8')

ValueError: setting an array element with a sequence.

The slice() method returns a Series but it won't proceed either with the splitting of strings.

What is required is:

'18/25' ---> [18, 25] ---> 22 (rounded)
like image 244
shiv_90 Avatar asked Dec 07 '25 08:12

shiv_90


2 Answers

Example

we need minimal-reproducible-example to answer your question.

import pandas as pd
s = pd.Series(['18/25', '10', '12/16', '27', '10/14'])

s

0    18/25
1       10
2    12/16
3       27
4    10/14
dtype: object

Code

out = (s.str.split('/', expand=True).astype('float').mean(axis=1)
        .round().astype('int'))

out

0    22
1    10
2    14
3    27
4    12
dtype: int32
like image 110
Panda Kim Avatar answered Dec 11 '25 10:12

Panda Kim


I would use extractall and groupby.mean:

s = pd.Series(['10', '12/16', '27', '10/14', '16/22', '16/21', '13/17'])

out = (s.str.extractall(r'(\d+)')[0].astype(int).groupby(level=0).mean()
        .round().astype(int)
       )

You could also go with split and mean, but this generates a more expensive intermediate and will not scale as well if you have many items (1/2/3/4/5):

out = (s.str.split('/', expand=True).astype(float).mean(axis=1)
        .round().astype(int)
      )

Output:

0    10
1    14
2    27
3    12
4    19
5    18
6    15
dtype: int64
like image 22
mozway Avatar answered Dec 11 '25 11:12

mozway