I have a text file (tab separated) and I need to calculate the probability and entropy for each column in the text file. Here is what my text file looks like:
aaa 0.0520852296    0.1648703511    0.1648703511
bbb 0.1062639955    0.1632039268    0.1632039268
ccc 1.4112745088    4.3654577641    4.3654577641
ddd 0.4992644913    0.1648703511    0.1648703511
eeee    0.169058175 0.1632039268    0.1632039268
and so I can calculate the probability using the following code:
import pandas as pd
f=open(mydata,'r')
df = pd.DataFrame(pd.read_csv(f, sep='\t', header=None, names=['val1', 'val2', 'val3']))
print(df)
df.loc[:,"val1":"val3"] = df.loc[:,"val1":"val3"].div(df.sum(axis=0), axis=1)
print(df)
which outputs,
aaa 0.0232736716    0.0328321936    0.0328321936
bbb 0.0474828153    0.0325003428    0.0325003428
ccc 0.6306113983    0.8693349271    0.8693349271
ddd 0.2230904597    0.0328321936    0.0328321936
eeee    0.0755416551    0.0325003428    0.0325003428
From that output I want to calculate the entropy of each column and write the results to an output file, so I wrote the following code:
import math
entropy = - sum([ p * math.log(p) / math.log(2.0) for p in df ])
But I get the following error message:
TypeError: a float is required
Any help is much appreciated. Thank you all
To compute the entropy of a string message, first find how often each character occurs. The probability of each character is then its frequency divided by the length of the message.
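As a rough illustration of the character-frequency approach just described, here is a minimal sketch. It assumes the message is an ordinary Python string; the function name string_entropy is made up for this example and is not part of any library.

```python
import math
from collections import Counter

def string_entropy(message):
    """Shannon entropy, in bits per character, of the characters in `message`."""
    if not message:
        return 0.0
    counts = Counter(message)   # frequency of each character
    n = len(message)
    # probability of a character = its count / message length,
    # and log2(1/p) = log2(n / count)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(string_entropy("aabb"))   # two equally likely characters
```

An empty string is treated as having zero entropy here, which is a choice of this sketch rather than something the text above specifies.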
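This is the quantity that Shannon called entropy, and it is represented by H in the following formula: H = p_1 log_s(1/p_1) + p_2 log_s(1/p_2) + ⋯ + p_k log_s(1/p_k), where p_i is the probability of the i-th outcome and s is the base of the logarithm (s = 2 gives the entropy in bits).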
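The formula can be written almost literally in Python. This is only a sketch: the function name entropy and its base parameter are illustrative assumptions, and zero probabilities are skipped because their term is conventionally taken to be 0.

```python
import math

def entropy(probs, base=2.0):
    """H = sum of p_i * log_base(1 / p_i), skipping zero-probability terms."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # four equally likely outcomes
print(entropy(uniform))              # entropy in bits (base 2)
print(entropy(uniform, base=4))      # same distribution, base-4 units
```

Changing the base only rescales the result, so the choice of unit does not affect which of two distributions has the higher entropy.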
Meaning of entropy: at a conceptual level, Shannon's entropy is simply the "amount of information" in a variable. More mundanely, that translates to the amount of storage (e.g. number of bits) required to store the variable, which can intuitively be understood to correspond to the amount of information in that variable.
Your problem is with this line
entropy = - sum([ p * math.log(p) / math.log(2.0) for p in df ])
If you think about (or print out) what p for p in df is giving you (e.g. run print([p for p in df])), you can see that p contains only the headings of the columns.  So you are passing a text label into the math functions that expect a float.  Hence the error.
apply might work well for you here:
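import math
def shannon(col):
    # col holds the probabilities of one column; dividing by log(2) gives bits
    entropy = - sum([ p * math.log(p) / math.log(2.0) for p in col])
    return entropy
# apply calls shannon once per column (axis=0) and returns one value per column
sh_df = df.loc[:,'val1':'val3'].apply(shannon,axis=0)
print(sh_df)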
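If you prefer to avoid the Python-level loop, the same calculation can probably be done with NumPy on the whole frame at once. This is a sketch, not the only way to do it: the probabilities are copied by hand from the question's normalised output, and it assumes every probability is strictly positive (a zero would give NaN for that term).

```python
import numpy as np
import pandas as pd

# Probabilities copied from the question's normalised output.
df = pd.DataFrame(
    {
        "val1": [0.0232736716, 0.0474828153, 0.6306113983, 0.2230904597, 0.0755416551],
        "val2": [0.0328321936, 0.0325003428, 0.8693349271, 0.0328321936, 0.0325003428],
        "val3": [0.0328321936, 0.0325003428, 0.8693349271, 0.0328321936, 0.0325003428],
    },
    index=["aaa", "bbb", "ccc", "ddd", "eeee"],
)

# Element-wise p * log2(p), summed down each column and negated.
sh = -(df * np.log2(df)).sum(axis=0)
print(sh)
```

The result is a Series with one entropy value (in bits) per column, the same shape that apply(shannon, axis=0) returns.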
As others have pointed out, you might want to tidy up your dataframe by making column 0 the index; then you won't have to use
df.loc[:,'val1':'val3']
So you could import your data using:
df = pd.read_csv(f, sep='\t', header=None, index_col=0, names=['val1', 'val2', 'val3'])
and avoid the cumbersome loc[:,'val1':'val3'] syntax altogether.
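Putting the pieces together, a minimal end-to-end sketch might look like the following. Several things here are assumptions for the sake of a self-contained example: the data is read from an in-memory string instead of the real file, the first column is given the made-up name label, and the output file name entropy.tsv in the temporary directory is hypothetical.

```python
import io
import math
import os
import tempfile

import pandas as pd

# Stand-in for the tab-separated input file from the question.
raw = (
    "aaa\t0.0520852296\t0.1648703511\t0.1648703511\n"
    "bbb\t0.1062639955\t0.1632039268\t0.1632039268\n"
    "ccc\t1.4112745088\t4.3654577641\t4.3654577641\n"
    "ddd\t0.4992644913\t0.1648703511\t0.1648703511\n"
    "eeee\t0.169058175\t0.1632039268\t0.1632039268\n"
)

def shannon(col):
    # Entropy in bits of one column of probabilities.
    return -sum(p * math.log(p) / math.log(2.0) for p in col)

# Name all four columns and use the first one ("label", hypothetical) as index.
cols = ["label", "val1", "val2", "val3"]
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=cols, index_col="label")

probs = df.div(df.sum(axis=0), axis=1)   # each column now sums to 1
entropy = probs.apply(shannon, axis=0)   # one value per column

# Hypothetical output location; adjust to wherever the results should go.
out_path = os.path.join(tempfile.gettempdir(), "entropy.tsv")
entropy.to_csv(out_path, sep="\t", header=False)
print(entropy)
```

Because the labels are the index, the whole frame can be normalised and passed to apply without slicing out the value columns first.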
Why not fix your data file instead of working around it in Python code and hurting readability? It's as simple as
sed 's/ \+/,/g' mydata > my_fixed_data
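Just run this on the command line if you are using Linux. It replaces every run of spaces with a single comma. If the fields are separated by tabs rather than spaces, the pattern has to match tabs as well, for example [[:space:]]\+ in GNU sed.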
mydata
aaa 0.0520852296    0.1648703511    0.1648703511
bbb 0.1062639955    0.1632039268    0.1632039268
ccc 1.4112745088    4.3654577641    4.3654577641
ddd 0.4992644913    0.1648703511    0.1648703511
eeee    0.169058175 0.1632039268    0.1632039268
my_fixed_data
aaa,0.0520852296,0.1648703511,0.1648703511
bbb,0.1062639955,0.1632039268,0.1632039268
ccc,1.4112745088,4.3654577641,4.3654577641
ddd,0.4992644913,0.1648703511,0.1648703511
eeee,0.169058175,0.1632039268,0.1632039268
Then you can simply use the read_csv function like
df = pd.read_csv('my_fixed_data', header=None, index_col=0, names=['val1', 'val2', 'val3'])
Here's what the dataframe now looks like:
          val1      val2      val3
aaa   0.052085  0.164870  0.164870
bbb   0.106264  0.163204  0.163204
ccc   1.411275  4.365458  4.365458
ddd   0.499264  0.164870  0.164870
eeee  0.169058  0.163204  0.163204
I'm sure there must be equivalents for Windows too. Just google it.
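One platform-independent alternative, sketched here under a few assumptions, is to let pandas split on any run of whitespace so the file does not need to be rewritten at all. The in-memory string stands in for mydata, and the column name label is made up for the example.

```python
import io

import pandas as pd

# Stand-in for `mydata`; fields are separated by a mix of spaces and tabs.
raw = (
    "aaa 0.0520852296    0.1648703511    0.1648703511\n"
    "bbb 0.1062639955\t0.1632039268\t0.1632039268\n"
    "eeee    0.169058175 0.1632039268    0.1632039268\n"
)

# sep=r"\s+" treats any run of spaces or tabs as a single delimiter.
df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",
    header=None,
    names=["label", "val1", "val2", "val3"],
    index_col="label",
)
print(df)
```

This works the same way on Windows, Linux and macOS, at the cost of not being able to handle fields that themselves contain spaces.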
You get the TypeError: a float is required error because for p in df gives you the column names, not float values. Iterate over the values of a column instead (for example, for p in df['val1']).
>>> for p in df:
...     print(p)
...
val1
val2
val3
>>>
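To loop over the values rather than the labels, one option is DataFrame.items(), which yields each column label together with the column itself. The sketch below uses small toy probabilities (not the question's data) and a results dictionary that exists only for this illustration.

```python
import math

import pandas as pd

# Toy probabilities for illustration only; each column sums to 1.
df = pd.DataFrame({"val1": [0.5, 0.5], "val2": [0.25, 0.75]})

results = {}
for name, col in df.items():   # (column label, Series of that column's values)
    # Iterating over a Series yields its float values, so math.log2 is happy.
    results[name] = -sum(p * math.log2(p) for p in col)

for name, h in results.items():
    print(name, h)
```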