Say I have MultiIndex series as
date foo
2006-01-01 1 12931926.310
3 11084049.460
5 10812205.359
7 9031510.239
9 5324054.903
2007-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
If it weren't a MultiIndex, I could select those with year 2007 through df.loc['2007']. How do I do this here? My natural guess was df.loc['2007', :], but that gave my an empty Series([], name: FINLWT21, dtype: float64).
Ultimatively, I am also interested in replacing all rows for different dates than 2007 with the rows of the year 2007.
That is, my expected output is
date foo
2006-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
2007-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
I tried to implement @unutbu's solution, but
mySeries.loc[dateIndex.year != 2007] = mySeries.loc[dateIndex.year == 2007]
will naturally set the values (due to non-existence on the RHS) to NaN. Usually, these issues are fixed by
mySeries.loc[dateIndex.year != 2007] = mySeries.loc[dateIndex.year == 2007].values
, but given that I have 10 values (and more in my real dataset) on the lefthand side, but only 5 on the right, I get
ValueError: cannot set using a list-like indexer with a different length than the value
The only alternative that comes to my mind now is to iterate over the first index and then use the previous command for each subgroup, but that doesn't appear to be the most efficient solution.
Given the Series
In [207]: series
Out[212]:
date foo
2006-01-01 1 12931926.310
3 11084049.460
5 10812205.359
7 9031510.239
9 5324054.903
2007-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
Name: val, dtype: float64
you could extract the date index with
dateindex = series.index.get_level_values('date')
# Ensure the dateindex is a DatetimeIndex (as opposed to a plain Index)
dateindex = pd.DatetimeIndex(dateindex)
Now selection of the rows where the year equals 2007 can be done with a boolean condition:
# select rows where year equals 2007
series2007 = series.loc[dateindex.year == 2007]
If the foo values cycle through the same values in the same order for each date,
then you could replace all the values in the series with those from 2007 with
N = len(series)/len(series2007)
series[:] = np.tile(series.loc[dateindex.year == 2007].values, N)
An advantage of using np.tile and .values is that it will generate the desired array of values relatively quickly. A (possible) disadvantage is that this is ignoring the index so it relies on the assumption that the foo values cycle through the same values in the same order for each date.
The more robust (but slower) way is to use a join:
df = series.reset_index('date')
df2007 = df.loc[dateindex.year==2007]
df = df.join(df2007, rsuffix='_2007')
df = df[['date', 'val_2007']]
df = df.set_index(['date'], append=True)
df = df.swaplevel(0,1).sort_index()
yields
In [304]: df.swaplevel(0,1).sort_index()
Out[304]:
val_2007
date foo
2006-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
2007-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
2008-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With