Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to group by the partial value of the column in pandas?

Tags:

python

pandas

I have some data in the pandas data frame as following where I converted the currency and the value earlier to the USD from CYN (Chinese Yuan)

      currency   port  supplier_id     value
0         USD  CNAQG           35  118.8344
1         USD  CNAQG           19  121.0082
2         USD  CNAQG           49   86.9520
3         USD  CNAQG           54  112.3130
4         USD  CNAQG          113  113.7622
5         USD  CNAQG            5  114.4868
6         USD  CNAQG           55  111.5884
7         USD  CNAQG           81  117.3852
8         USD  CNAQG            2  111.5884
6651      USD  USTPA           14  420.0000
6652      USD  USTPA           56  420.0000
6653      USD  USTPA          113  420.0000
6654      USD  USTPA            5  500.0000
6655      USD  USTPA           55  500.0000
6656      USD  USTPA          193  390.0000
6657      USD  USTPA           74  450.0000
6658      USD  USTPA           35  420.0000
6659      USD  USTPA           54  420.0000
6660      USD  USTPA          231  450.0000

The df.info() prints the following,

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6652 entries, 0 to 6660
Data columns (total 4 columns):
currency       6652 non-null object
port           6652 non-null object
supplier_id    6652 non-null int64
value          6652 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 259.8+ KB
None

The first 2 letter of the port indicates country and I have a map for that,

COUNTRIES = {
    "CN": "CHINA",
    "US": "USA"
}

I would like to group the data based on the country where the port is situated and the intention will be to visualize the list of values per country in a meaningful way. I appreciate any suggestion of what kind of graph would be appropriate for the job done.

like image 600
Heisenberg Avatar asked Dec 21 '25 07:12

Heisenberg


1 Answers

I think you need pivot and then plot by plot.bar or plot:

df1 = pd.pivot(index=df['supplier_id'], 
               columns = df['port'].str[:2].map(COUNTRIES),
               values=df['value']).fillna(0)
print (df1)
port            CHINA    USA
supplier_id                 
2            111.5884    0.0
5            114.4868  500.0
14             0.0000  420.0
19           121.0082    0.0
35           118.8344  420.0
49            86.9520    0.0
54           112.3130  420.0
55           111.5884  500.0
56             0.0000  420.0
74             0.0000  450.0
81           117.3852    0.0
113          113.7622  420.0
193            0.0000  390.0
231            0.0000  450.0

df1.plot.bar()

df1.plot()

But if error:

ValueError: Index contains duplicate entries, cannot reshape

then need pivot_table with some aggregate function mean, sum... (default function is 'mean'):

print (df)
     currency   port  supplier_id     value
0         USD  CNAQG           35  118.8344
1         USD  CNAQG           19  121.0082
2         USD  CNAQG           49   86.9520
3         USD  CNAQG           54  112.3130
4         USD  CNAQG          113  113.7622
5         USD  CNAQG            5  114.4868
6         USD  CNAQG           55  111.5884
7         USD  CNAQG           81  117.3852
8         USD  CNAQG            2  111.5884
6651      USD  USTPA           14  420.0000
6652      USD  USTPA           56  420.0000
6653      USD  USTPA          113  420.0000
6654      USD  USTPA            5  500.0000
6655      USD  USTPA           55  500.0000
6656      USD  USTPA          193  390.0000
6657      USD  USTPA           74  450.0000
6658      USD  USTPA           35  420.0000
6659      USD  USTPA           54  420.0000
6660      USD  USTPA          231  450.0000 <-duplicates for USTPA, 231
6660      USD  USTPA          231  800.0000 <-duplicates for USTPA, 231
COUNTRIES = {
    "CN": "CHINA",
    "US": "USA"
}

df1 = pd.pivot_table(df, 
                     index='supplier_id', 
                     columns = df['port'].str[:2].map(COUNTRIES),
                     values='value', 
                     aggfunc=np.mean, 
                     fill_value=0)
print (df1)
port            CHINA  USA
supplier_id               
2            111.5884    0
5            114.4868  500
14             0.0000  420
19           121.0082    0
35           118.8344  420
49            86.9520    0
54           112.3130  420
55           111.5884  500
56             0.0000  420
74             0.0000  450
81           117.3852    0
113          113.7622  420
193            0.0000  390
231            0.0000  625 <-mean (450 + 800) /2

df1.plot.bar()

Alternative solution with groupby and mean:

df1 = df.groupby(['supplier_id', df['port'].str[:2].map(COUNTRIES)])['value']
        .mean()
        .unstack(fill_value=0)
print (df1)
port            CHINA    USA
supplier_id                 
2            111.5884    0.0
5            114.4868  500.0
14             0.0000  420.0
19           121.0082    0.0
35           118.8344  420.0
49            86.9520    0.0
54           112.3130  420.0
55           111.5884  500.0
56             0.0000  420.0
74             0.0000  450.0
81           117.3852    0.0
113          113.7622  420.0
193            0.0000  390.0
231            0.0000  625.0
like image 185
jezrael Avatar answered Dec 23 '25 21:12

jezrael