I'm trying to compute a correlation in pandas and it's giving me a bit of difficulty. Essentially I want to answer the following question: given a dataframe of sentences and scores, which word correlates best with a higher score? Which correlates worst?
Trivial example:
Sentence | Score
"hello there" | 100
"hello kid" | 95
"there kid" | 5
I'm expecting to see a high correlation value here for the word "hello" and score. Hopefully this makes sense -- if this is possible natively in Pandas I'd really appreciate knowing!
If anything is unclear please let me know.
I'm not sure that pandas is what you're looking for, but yes, you can:
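import pandas as pd

df = pd.DataFrame([["hello there", 100],
                   ["hello kid", 95],
                   ["there kid", 5]],
                  columns=['Sentence', 'Score'])

# get_dummies(sep=' ') builds one 0/1 column per word;
# corrwith() correlates each of those columns with the score.
# Dividing by the max only rescales the scores and does not change the correlation.
s_corr = df.Sentence.str.get_dummies(sep=' ').corrwith(df.Score / df.Score.max())
print(s_corr)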
This prints:
hello 0.998906
kid -0.539949
there -0.458957
For details, see the pandas help for str.get_dummies() and corrwith().
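To answer the "best" and "worst" parts of the question directly, the correlation Series can be queried with idxmax() and idxmin(). The following is a minimal sketch on the same toy data; the names word_flags, word_corr, best_word and worst_word are my own, and it correlates against the raw score, which should give the same values because Pearson correlation does not depend on the scale of the score.

```python
import pandas as pd

df = pd.DataFrame(
    [["hello there", 100], ["hello kid", 95], ["there kid", 5]],
    columns=["Sentence", "Score"],
)

# One 0/1 indicator column per distinct word (split on spaces).
word_flags = df["Sentence"].str.get_dummies(sep=" ")

# Pearson correlation of each word's indicator column with the score.
word_corr = word_flags.corrwith(df["Score"])

# Index labels of the largest and smallest correlation values.
best_word = word_corr.idxmax()
worst_word = word_corr.idxmin()

print(best_word, worst_word)  # prints: hello kid
```

With only three rows these numbers are not statistically meaningful; on real data you would probably also want to drop words that appear in very few sentences before ranking them.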