... help and online documentation say the function scipy.stats.pareto.fit takes as variables the dataset to be fitted, and optionally b (the exponent), loc, scale. the result comes as triplet (exponent, loc, scale)
generating data from the same distribution should result in the fit finding the parameters used for generating the data, e.g. (using the python 3 colsole)
$  python
Python 3.3.0 (default, Dec 12 2012, 07:43:02) 
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
(in code lines below leaving out the python console prompt ">>>")
dataset=scipy.stats.pareto.rvs(1.5,size=10000)  #generating data
scipy.stats.pareto.fit(dataset)
however this results in
(1.0, nan, 0.0)
(exponent 1, should be 1.5) and
dataset=scipy.stats.pareto.rvs(1.1,size=10000)  #generating data
scipy.stats.pareto.fit(dataset)
results in
(1.0, nan, 0.0)
(exponent 1, should be 1.1) and
dataset=scipy.stats.pareto.rvs(4,loc=2.0,scale=0.4,size=10000)    #generating data
scipy.stats.pareto.fit(dataset)
(exponent should be 4, loc should be 2, scale should be 0.4) in
(1.0, nan, 0.0)
etc. giving another exponent when calling the fit function
scipy.stats.pareto.fit(dataset,1.4)
returns always exactly this exponent
(1.3999999999999999, nan, 0.0)
The obvious question would be: do I misunderstand the purpose of this fit function completely, is it used somehow differently, or is it simply broken?
a remark: before someone mentions that dedicated functions like those given on Aaron Clauset's web pages (http://tuvalu.santafe.edu/~aaronc/powerlaws/) are more reliable than the scipy.stats methods and should be used instead: that may be true, but they are also very very very very time consuming and do for datasets of 10000 points take many many hours (maybe days, weeks, years) on a normal PC.
edit: oh: the parameter of the fit function is not the exponent of the distribution but exponent minus 1 (but this does not change the above issue)
It looks like you must supply a guess for the loc and scale:
In [78]: import scipy.stats as stats
In [79]: b, loc, scale = 1.5, 0, 1
In [80]: data = stats.pareto.rvs(b, size=10000)
In [81]: stats.pareto.fit(data, 1, loc=0, scale=1)
Out[81]: (1.5237427002368424, -2.8457847787917788e-05, 1.0000329980475393)
and the guess has to be pretty accurate for the fit to succeed:
In [82]: stats.pareto.fit(data, 1, loc=0, scale=1.01)
Out[82]: (1.5254113096223709, -0.0015898489208676779, 1.0015943893384001)
In [83]: stats.pareto.fit(data, 1, loc=0, scale=1.05)
Out[83]: (1.5234726749064218, 0.00025804526532994751, 0.99974649559141171)
In [84]: stats.pareto.fit(data, 1, loc=0.05, scale=1.05)
Out[84]: (1.0, 0.050000000000000003, 1.05)
Hopefully the context of the problem will inform you what an appropriate guess for the loc and scale should be. Most likely, loc=0 and scale=1.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With