Suppose I observe the weights of each stock in a set of portfolios. I would like to compute the "active share" measure, defined as active_share_{ij} = active_share_{ji} = \sum_{k=1}^{N} | \omega^k_i - \omega^k_j |, where N is the number of all stocks that exist in the market, \omega^k_i is the weight of stock k in portfolio i, and the vertical bars denote the absolute value. Note that the sum runs over all stocks in the market: if, for example, portfolio i holds stock k while portfolio j does not, then \omega^k_j = 0 and the term evaluates to | \omega^k_i - 0 |.
In my data structure, I observe the three relevant pieces of information: portfolio, stock, and weight. In my MWE, this looks like:
portfolio stock weight
0 P1 stock1 0.3
1 P1 stock2 0.6
2 P1 stock3 0.1
3 P2 stock2 0.1
4 P2 stock3 0.2
5 P2 stock4 0.3
6 P2 stock5 0.1
7 P2 stock8 0.3
8 P3 stock3 0.4
9 P3 stock4 0.6
I would like to produce
active_share
portfolio_i portfolio_j
P1 P1 0.0
P2 1.6
P3 1.8
P2 P1 1.6
P2 0.0
P3 1.0
P3 P1 1.8
P2 1.0
P3 0.0
Let's look at the calculation of the pair (P1, P2) as an example: |0.3-0| + |0.6-0.1| + |0.1-0.2| + |0-0.3| + |0-0.1| + |0-0.3| = 1.6
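As a sanity check, here is the same calculation in plain Python (a minimal sketch; the two dicts are simply the P1 and P2 rows of the MWE data):
p1 = {'stock1': 0.3, 'stock2': 0.6, 'stock3': 0.1}
p2 = {'stock2': 0.1, 'stock3': 0.2, 'stock4': 0.3, 'stock5': 0.1, 'stock8': 0.3}
# sum |w^k_i - w^k_j| over the union of stocks; a missing stock counts as weight 0
universe = p1.keys() | p2.keys()
active_share = sum(abs(p1.get(k, 0) - p2.get(k, 0)) for k in universe)
print(round(active_share, 10))  # 1.6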
My few lines of code below produce the correct results. However, I am facing a very large dataset and need to improve performance.
Do you see a way to optimize the computation? For example, by exploiting the fact that the relation is symmetric, i.e. that the measure for (i, j) is the same as for (j, i)?
MWE:
import pandas as pd

# generate example data
data = {
    'portfolio': ['P1', 'P1', 'P1', 'P2', 'P2', 'P2', 'P2', 'P2', 'P3', 'P3'],
    'stock': ['stock1', 'stock2', 'stock3', 'stock2', 'stock3', 'stock4', 'stock5', 'stock8', 'stock3', 'stock4'],
    'weight': [0.3, 0.6, 0.1, 0.1, 0.2, 0.3, 0.1, 0.3, 0.4, 0.6]}
df = pd.DataFrame(data)
# helper dataframe: integer group key per portfolio
df['key'] = df.groupby('portfolio').ngroup()
portfolio_list = df['portfolio'].unique()
groups = df['key'].unique()
index = pd.MultiIndex.from_product([portfolio_list, groups], names=['portfolio','key'])
join = pd.DataFrame(index=index, columns=[]).reset_index()
# cross join: replicate each holding of portfolio i once per counterparty portfolio j
composite = df.merge(join, on='key', suffixes=('_i','_j'), how='inner')
composite = composite.loc[:,composite.columns!='key']
composite.rename(columns={'weight':'weight_i'}, inplace=True)
# look up the counterparty's weight for the same stock (NaN if portfolio j does not hold it)
composite = composite.merge(df, left_on=['portfolio_j','stock'], right_on=['portfolio','stock'], how='left', suffixes=('_i','_j'))
composite = composite.loc[:,~composite.columns.isin(['portfolio','key'])]
composite.rename(columns={'weight':'weight_j'}, inplace=True)
# compute the sum of absolute differences over the overlapping holdings
composite['abs_weight_difference'] = (composite['weight_i'] - composite['weight_j']).abs()
result = composite.groupby(['portfolio_i','portfolio_j'])['abs_weight_difference'].sum().to_frame('sum_overlap')
# compute the sum of weights of stocks that are in portfolio i but not in j
result['sum_unique_i'] = composite.loc[composite['weight_j'].isnull()].groupby(['portfolio_i','portfolio_j'])['weight_i'].sum()
# add sum of weights of stocks (1) in the portfolio overlap, (2) distinct to portfolio i, (3) distinct to portfolio j
result = result.reset_index()
result = result.merge(result, left_on=['portfolio_i','portfolio_j'], right_on=['portfolio_j','portfolio_i'])
result = result.loc[:,['portfolio_i_x','portfolio_j_x','sum_overlap_x','sum_unique_i_x','sum_unique_i_y']]
result.columns = ['sum_unique_j' if col=='sum_unique_i_y' else col[:-2] for col in result.columns]
result = result.fillna(0)
result.set_index(['portfolio_i','portfolio_j'], inplace=True)
result = result.sum(axis=1).to_frame(name='active_share')
Edit: currently, this takes 5.82 minutes on a small subset of my dataset (as a future reference point).
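A reference timing like this can be reproduced with a plain stopwatch around the pipeline (a sketch using only the standard library; place your own computation in the timed block):
import time
start = time.perf_counter()
# ... run the pairwise active-share computation here ...
print(f'elapsed: {time.perf_counter() - start:.2f} s')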
You can pivot the data first, then use broadcasting to compute the full pairwise matrix:
import numpy as np

# wide matrix: one row per portfolio, one column per stock, missing holdings filled with 0
a = df.pivot_table(index='portfolio',
                   columns='stock',
                   values='weight',
                   fill_value=0)
idx = a.index
a = a.to_numpy()
# broadcast to a (n_portfolios, n_portfolios, n_stocks) cube, then reduce over stocks
pd.DataFrame(np.abs(a[:, None] - a[None, :]).sum(axis=-1),
             index=idx, columns=idx)
Output:
portfolio P1 P2 P3
portfolio
P1 0.0 1.6 1.8
P2 1.6 0.0 1.0
P3 1.8 1.0 0.0
From there, you can stack to get the expected output in the question:
pd.DataFrame(np.abs(a[:,None] - a[None,:]).sum(axis=-1),
index=idx, columns=idx).stack()
gives:
portfolio portfolio
P1 P1 0.0
P2 1.6
P3 1.8
P2 P1 1.6
P2 0.0
P3 1.0
P3 P1 1.8
P2 1.0
P3 0.0
dtype: float64
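Since the measure is symmetric, you could also evaluate each unordered pair only once with SciPy's pairwise-distance helper: the 'cityblock' (L1) metric is exactly the sum of absolute weight differences. A sketch, assuming SciPy is available and a is the pivoted weight matrix from above:
from scipy.spatial.distance import pdist, squareform
# pdist evaluates each unordered pair (i, j) once, i.e. only the upper triangle
condensed = pdist(a, metric='cityblock')  # L1 distance = sum of |w^k_i - w^k_j|
result = pd.DataFrame(squareform(condensed), index=idx, columns=idx)  # mirror back to a full matrix
result.stack() then gives the same long format as above.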
Update: since you have too many stocks/portfolios for full broadcasting, you can instead loop over the stocks, which reduces the memory requirement:
a = df.pivot_table(index='portfolio',
                   columns='stock',
                   values='weight',
                   fill_value=0)
idx = a.index
# accumulate the pairwise |difference| matrix one stock at a time
ret = pd.DataFrame(0.0, index=idx, columns=idx)  # float init avoids dtype upcast on +=
for col in a:
    u = a[col].to_numpy()
    ret += np.abs(u - u[:, None])
Outcome:
portfolio P1 P2 P3
portfolio
P1 0.0 1.6 1.8
P2 1.6 0.0 1.0
P3 1.8 1.0 0.0
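If the one-column-at-a-time loop is too slow, a middle ground is to broadcast over chunks of stocks; the chunk size below is an arbitrary assumption to trade memory against Python-loop overhead:
vals = a.to_numpy()
ret = np.zeros((len(idx), len(idx)))
chunk = 256  # tune to your memory budget
for s in range(0, vals.shape[1], chunk):
    block = vals[:, s:s + chunk]
    # broadcast only over one chunk of stocks at a time, then reduce
    ret += np.abs(block[:, None, :] - block[None, :, :]).sum(axis=-1)
ret = pd.DataFrame(ret, index=idx, columns=idx)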