What is an efficient method for determining the skew/kurtosis of a bar graph in Python? Since bar graphs are not binned (unlike histograms), this question may not seem to make a lot of sense, but what I am trying to do is determine the symmetry of a graph's height vs distance (rather than frequency vs bins). In other words, given values of height(y) measured along distance(x), i.e.
y = [6.18, 10.23, 33.15, 55.25, 84.19, 91.09, 106.6, 105.63, 114.26, 134.24, 137.44, 144.61, 143.14, 150.73, 156.44, 155.71, 145.88, 120.77, 99.81, 85.81, 55.81, 49.81, 37.81, 25.81, 5.81]
x = [0.03, 0.08, 0.14, 0.2, 0.25, 0.31, 0.36, 0.42, 0.48, 0.53, 0.59, 0.64, 0.7, 0.76, 0.81, 0.87, 0.92, 0.98, 1.04, 1.09, 1.15, 1.2, 1.26, 1.32, 1.37]
What is the symmetry (skewness) and peakedness (kurtosis) of that height(y) distribution as measured over distance(x)? Are skewness/kurtosis appropriate measurements for assessing how close real-valued data is to a normal distribution? Or does scipy/numpy offer something similar for that type of measurement?
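I can achieve a skew/kurtosis estimate of height(y) frequency values binned along distance(x) with the following:
from itertools import chain
from matplotlib.pyplot import hist, xlabel, ylabel
from scipy import stats
# repeat each distance x_v round(y_v) times to turn heights into frequencies
freq = list(chain(*[[x_v] * int(round(y_v)) for x_v, y_v in zip(x, y)]))
x.append(x[-1] + x[0])          # add one extra bin edge
hist(freq, bins=x)
ylabel("Height Frequency")
xlabel("Distance(km) Bins")
print("Skewness, Kurtosis:", stats.describe(freq)[4:])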
Skewness, Kurtosis: (-0.019354300509997705, -0.7447085398785758)

In this case the height distribution is nearly symmetrical (skew of about -0.02) around the midpoint distance and is platykurtic (excess kurtosis of about -0.74, i.e. broad).
Since I repeat each x value according to its height y to create a frequency, the resulting list can sometimes get very large. Is there a better method to approach this problem? I suppose I could always normalize dataset y to a range of perhaps 0 - 100 without losing too much information about the dataset's skew/kurtosis.
This isn't a Python question, nor is it really a programming question, but the answer is simple nonetheless. Instead of skew and kurtosis, let's first consider the easier values based on the lower moments: the mean and standard deviation. To make it concrete, and to fit with your question, let's assume your data looks like:
X = 3, 3, 5, 5, 5, 7 = x1, x2, x3 ....
Which would give a "bar graph" that looks like:
{3:2, 5:3, 7:1} = {k1:p1, k2:p2, k3:p3}
The mean, u, is given by
E[X] = (1/N) * (x1 + x2 + x3 + ...) = (1/N) * (3 + 3 + 5 + ...)
Our data, however, has repeated values, so this can be rewritten as
E[X] = (1/N) * (p1*k1 + p2*k2 + ...) = (1/N) * (3*2 + 5*3 + 7*1)
The next term, the standard dev., s, is simply
sqrt(E[(X-u)^2]) = sqrt((1/N)*( (x1-u)^2 + (x2-u)^2 + ...))
But we can apply the same reduction to the E[(X-u)^2] term and write it as
E[(X-u)^2] = (1/N)*( p1*(k1-u)^2 + p2*(k2-u)^2 + ... )
           = (1/6)*( 2*(3-u)^2 + 3*(5-u)^2 + 1*(7-u)^2 )
Which means we don't have to have a multiple copy of each data item to do the sum as you indicated in your question.
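As a quick check of that reduction, here is a minimal sketch using numpy (my choice of library, not something the derivation depends on). The names k and p follow the toy "bar graph" above; the expanded array exists only to confirm that the weighted sums give the same mean and variance as the repeated data.

```python
import numpy as np

# Toy "bar graph" from above: the values 3, 5, 7 occur 2, 3 and 1 times.
k = np.array([3.0, 5.0, 7.0])   # distinct values (bar positions)
p = np.array([2.0, 3.0, 1.0])   # counts (bar heights)

N = p.sum()                               # total count, 6 here
mean = (p * k).sum() / N                  # E[X]
var = (p * (k - mean) ** 2).sum() / N     # E[(X-u)^2]
std = np.sqrt(var)

# Cross-check against the explicitly repeated data 3, 3, 5, 5, 5, 7.
expanded = np.repeat(k, p.astype(int))
print(mean, expanded.mean())
print(std, expanded.std())
```

Both lines should print matching pairs, since np.std and np.var divide by N by default, which is the same convention used in the formulas above.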
The skew and kurtosis are quite simple at this point:
skew     = E[(X-u)^3] / (E[(X-u)^2])^(3/2)
kurtosis = ( E[(X-u)^4] / (E[(X-u)^2])^2 ) - 3
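Putting the formulas together, a sketch in numpy might look like the following. The function name weighted_skew_kurtosis is hypothetical (it is not a scipy or numpy API); it treats the heights y as weights on the distances x, so nothing has to be repeated or rounded.

```python
import numpy as np

def weighted_skew_kurtosis(values, weights):
    """Skew and excess kurtosis of `values`, each weighted by `weights`."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()                       # plays the role of N
    mean = (weights * values).sum() / total     # u = E[X]
    dev = values - mean
    m2 = (weights * dev ** 2).sum() / total     # E[(X-u)^2]
    m3 = (weights * dev ** 3).sum() / total     # E[(X-u)^3]
    m4 = (weights * dev ** 4).sum() / total     # E[(X-u)^4]
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Data from the question: heights y measured at distances x.
y = [6.18, 10.23, 33.15, 55.25, 84.19, 91.09, 106.6, 105.63, 114.26,
     134.24, 137.44, 144.61, 143.14, 150.73, 156.44, 155.71, 145.88,
     120.77, 99.81, 85.81, 55.81, 49.81, 37.81, 25.81, 5.81]
x = [0.03, 0.08, 0.14, 0.2, 0.25, 0.31, 0.36, 0.42, 0.48, 0.53, 0.59,
     0.64, 0.7, 0.76, 0.81, 0.87, 0.92, 0.98, 1.04, 1.09, 1.15, 1.2,
     1.26, 1.32, 1.37]

skew, kurt = weighted_skew_kurtosis(x, y)
print("Skewness, Kurtosis:", (skew, kurt))
```

Because the heights are used unrounded, the result may differ slightly from the expanded-list numbers in the question; passing np.round(y) as the weights should reproduce that approach, assuming the default (biased) estimators of stats.describe.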