I have a data set, shown below, that records which book is sold by which shop.
import pandas as pd
books = {'shop': ["A", "B", "C", "D", "E", "A", "B", "C", "D",],
'book_id': [1, 1, 2, 3, 3, 3, 4, 5, 1,]
}
df = pd.DataFrame(books, columns = ['shop', 'book_id'])
Here is the print:
shop book_id
0 A 1
1 B 1
2 C 2
3 D 3
4 E 3
5 A 3
6 B 4
7 C 5
8 D 1
Now I want to calculate the Jaccard index for each pair of shops. For instance, take shop A and shop B. Three different books are sold by A or B (book 1, book 3, and book 4), but only one book is sold by both shops (book 1). So the Jaccard index here should be 33.3% (1/3).
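The arithmetic for shops A and B can be checked with plain Python sets. This is only a minimal sketch: `jaccard` is a helper name introduced here for illustration, and the two sets are read off the data above.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Books sold by each shop, taken from the data above
shop_a = {1, 3}
shop_b = {1, 4}

# One shared book out of three distinct books
print(round(jaccard(shop_a, shop_b) * 100, 2))  # 33.33
```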
Here is the sample of the desired data:
result = {'shop_1': ["A", "B", "A", "C", "A", "D", "A", "E",],
'shop_2': ["B", "A", "C", "A", "D", "A", "E", "A",],
'jaccard': [33.3, 33.33, 0, 0, 100, 100, 50, 50,]
}
desired_df = pd.DataFrame(result, columns = ['shop_1', 'shop_2', 'jaccard'])
Here is the print:
shop_1 shop_2 jaccard
0 A B 33.30
1 B A 33.33
2 A C 0.00
3 C A 0.00
4 A D 100.00
5 D A 100.00
6 A E 50.00
7 E A 50.00
. . . .
. . . .
. . . .
Can someone help me do this? Is there a library that implements the Jaccard index?
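If your data is not too big, you can use a broadcasting approach:
# shop x book indicator matrix (1 if the shop sells the book, else 0);
# clip guards against duplicate (shop, book_id) rows producing counts > 1
books = pd.crosstab(df.shop, df.book_id).clip(upper=1)
# underlying numpy array
arr = books.values
# size of the union of the book sets for every pair of shops
union = (arr[None, ...] | arr[:, None, :]).sum(-1)
# intersection sizes (matrix product) divided by union sizes
output = (books @ books.T) / union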
Output:
shop A B C D E
shop
A 1.000000 0.333333 0.0 1.000000 0.5
B 0.333333 1.000000 0.0 0.333333 0.0
C 0.000000 0.000000 1.0 0.000000 0.0
D 1.000000 0.333333 0.0 1.000000 0.5
E 0.500000 0.000000 0.0 0.500000 1.0
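The broadcasting step builds a shops x shops x books intermediate array, which is why it only suits data that is not too big. A possible alternative, sketched here under the assumption that each shop sells at least one book, gets the union size from |A ∪ B| = |A| + |B| - |A ∩ B| and so stays two-dimensional. The names `ind`, `inter`, `sizes` and `jaccard_matrix` are introduced for this illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'shop': ["A", "B", "C", "D", "E", "A", "B", "C", "D"],
    'book_id': [1, 1, 2, 3, 3, 3, 4, 5, 1],
})

# 0/1 indicator matrix: one row per shop, one column per book
ind = pd.crosstab(df.shop, df.book_id).clip(upper=1)

# Pairwise intersection sizes via a matrix product
inter = ind @ ind.T

# Number of distinct books per shop
sizes = ind.sum(axis=1).to_numpy()

# |A ∪ B| = |A| + |B| - |A ∩ B|, computed for all pairs at once
union = (sizes[:, None] + sizes[None, :]) - inter.to_numpy()

jaccard_matrix = inter / union
print(jaccard_matrix.round(3))
```

This should reproduce the matrix shown above while using memory proportional to shops x shops rather than shops x shops x books.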
To match your expected output:
output = (output.stack().rename_axis(['shop_1','shop_2'])
.reset_index(name='jaccard')
.query('shop_1 != shop_2')
)
Output:
shop_1 shop_2 jaccard
1 A B 0.333333
2 A C 0.000000
3 A D 1.000000
4 A E 0.500000
5 B A 0.333333
7 B C 0.000000
8 B D 0.333333
9 B E 0.000000
10 C A 0.000000
11 C B 0.000000
13 C D 0.000000
14 C E 0.000000
15 D A 1.000000
16 D B 0.333333
17 D C 0.000000
19 D E 0.500000
20 E A 0.500000
21 E B 0.000000
22 E C 0.000000
23 E D 0.500000
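On the question of a library: SciPy's `pdist` supports a `'jaccard'` metric, which returns the Jaccard distance (1 minus the index) on boolean rows. The sketch below is one way to use it, assuming SciPy is installed; it also scales the result to percentages as in the desired output. The names `tab`, `sim` and `pairs` are introduced here for illustration.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame({
    'shop': ["A", "B", "C", "D", "E", "A", "B", "C", "D"],
    'book_id': [1, 1, 2, 3, 3, 3, 4, 5, 1],
})

tab = pd.crosstab(df.shop, df.book_id)

# pdist gives the Jaccard *distance* between boolean rows (condensed form);
# squareform expands it to a full shops x shops matrix
dist = squareform(pdist(tab.to_numpy() > 0, metric='jaccard'))

# Similarity (the Jaccard index) is 1 - distance
sim = pd.DataFrame(1 - dist, index=tab.index, columns=tab.index)

# Long format without self-pairs, expressed in percent
pairs = (sim.stack()
            .rename_axis(['shop_1', 'shop_2'])
            .reset_index(name='jaccard')
            .query('shop_1 != shop_2')
            .assign(jaccard=lambda d: (d['jaccard'] * 100).round(2)))
print(pairs)
```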