 

Cutting outliers in Histogram (Python)

I would like to know whether there is a method that tells me how long my x-axis should be. I have a dataset with several outliers. I can simply cut them off with plt.xlim(), but is there a statistical method to compute a sensible x-axis limit? In the attached picture, a logical cut would be after 150 km of driven distance. Computing the threshold for the cut automatically would be perfect.

(Picture: logical manual cut after 150 km)

The DataFrame that the function receives is a standard pandas DataFrame.

Code:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Assumption: the original does not show where gev comes from
from scipy.stats import genextreme as gev
# GaussianMixture replaces the GMM class removed from scikit-learn
from sklearn.mixture import GaussianMixture

def yearly_distribution(dataframe):
    df_distr = dataframe
    h = sorted(df_distr['Distance'])

    fig, ax = plt.subplots(figsize=(16, 9))

    # Bin edges from 0 to 500.5 in steps of 0.5
    binwidth = np.arange(0, 501, 0.5)

    # density=True replaces the removed normed=1 argument
    n, bins, patches = plt.hist(h, bins=binwidth, density=True, facecolor='#023d6b', alpha=0.5, histtype='bar')

    # Grid on which the fitted densities are evaluated
    lnspc = np.arange(0, 500.5, 0.5)

    gevfit = gev.fit(h)
    pdf_gev = gev.pdf(lnspc, *gevfit)
    plt.plot(lnspc, pdf_gev, label="GEV")

    logfit = stats.lognorm.fit(h)
    pdf_lognorm = stats.lognorm.pdf(lnspc, *logfit)
    plt.plot(lnspc, pdf_lognorm, label="LogNormal")

    weibfit = stats.weibull_min.fit(h)
    pdf_weib = stats.weibull_min.pdf(lnspc, *weibfit)
    plt.plot(lnspc, pdf_weib, label="Weibull")

    burrfit = stats.burr.fit(h)
    pdf_burr = stats.burr.pdf(lnspc, *burrfit)
    plt.plot(lnspc, pdf_burr, label="Burr Distribution")

    genparetofit = stats.genpareto.fit(h)
    pdf_genpareto = stats.genpareto.pdf(lnspc, *genparetofit)
    plt.plot(lnspc, pdf_genpareto, label="Generalized Pareto")

    # scikit-learn expects a 2-D array of shape (n_samples, 1)
    myarray = np.array(h).reshape(-1, 1)
    clf = GaussianMixture(n_components=8, max_iter=500, random_state=3)
    clf = clf.fit(myarray)
    # score_samples returns the log-density of each grid point
    pdf_gmm = np.exp(clf.score_samples(lnspc.reshape(-1, 1)))
    plt.plot(lnspc, pdf_gmm, label="GMM")

    plt.xlim(0, 500)
    plt.xlabel('Distance')
    plt.ylabel('Probability')
    plt.title('Histogram')
    plt.ylim(0, 0.05)
    plt.legend()  # show the labels passed to plt.plot
Oguz Cebeci asked Jan 20 '26 23:01


1 Answer

You should remove the outliers from your data before any plotting or fitting:

h=sorted(df_distr['Distance'])

out_threshold= 150.0
h=[i for i in h if i<out_threshold]
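
The same filter can also be written as a boolean mask, which avoids a Python-level loop on large frames. This is only a minimal sketch: the sample values and the name `distances` are made up for illustration and stand in for df_distr['Distance'].

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for df_distr['Distance']
distances = pd.Series([3.2, 12.0, 48.5, 149.9, 150.0, 310.0, 495.0])

out_threshold = 150.0  # manual cut, as in the snippet above

# Boolean mask: keep only the values strictly below the threshold
h = np.sort(distances[distances < out_threshold].to_numpy())
```

Here `h` is a sorted NumPy array instead of a list, which plt.hist and the scipy fit methods accept as well.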

EDIT: this may not be the fastest way, but the threshold can also be computed with numpy.std():

out_threshold= 2.0*np.std(h+[-a for a in h])
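
Mirroring the data around zero gives a sample with mean 0, so this standard deviation equals the root mean square of the distances. A sketch of the same rule with NumPy arrays follows; the helper name `mirrored_std_threshold`, the factor `k` and the sample values are assumptions for illustration, not part of the original code.

```python
import numpy as np

def mirrored_std_threshold(values, k=2.0):
    """Return k times the std of the data mirrored around zero."""
    x = np.asarray(values, dtype=float)
    mirrored = np.concatenate([x, -x])  # same idea as h + [-a for a in h]
    return k * np.std(mirrored)

# Hypothetical distances with two large outliers
h = [5.0, 10.0, 20.0, 30.0, 40.0, 300.0, 450.0]
out_threshold = mirrored_std_threshold(h)

# Keep the values below the computed threshold
h_cut = [d for d in h if d < out_threshold]
```

The result can be passed to plt.xlim(0, out_threshold). Note that very large outliers inflate the standard deviation and therefore the threshold itself, so in this sample only the 450 value is cut.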
Dadep answered Jan 23 '26 13:01


