scipy.optimize.curve_fit does not fit properly

The fit of my dataset with scipy.optimize.curve_fit is not good enough, as can be seen between the first two peaks and at the third peak.

My code is the following:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
    
x = [58.1]
for j in x:  # the list is extended while being iterated, stopping once 71.9 is reached
    if j < 71.9:
        x.append(0.05 + j)
x = np.array(x)  # convert to an array so the model functions and plotting below work element-wise
    
y = np.array([12,10,16,18,13,14,15,21,26,16,16,16,15,15,31,22,23,22,23,31,31,36,29,30,44,43,45,39,67,67,62,83,79,92,93,128,177,187,251,289,334,414,509,549,623,680,751,769,816,742,701,721,599,540,498,425,339,293,212,172,157,142,115,95,84,86,81,78,58,47,52,62,47,44,39,44,50,34,28,36,36,28,34,24,28,28,37,40,33,45,44,32,32,47,41,46,41,56,46,50,59,71,72,63,62,77,89,62,93,96,114,97,123,140,137,146,138,187,213,219,239,255,287,314,336,386,401,434,467,499,534,635,597,588,597,671,725,693,751,759,727,758,763,720,706,655,650,595,608,565,522,511,433,499,448,387,371,365,324,312,326,306,292,266,282,238,265,278,242,209,226,254,236,233,232,225,220,205,209,234,214,234,234,205,194,223,201,226,214,234,241,226,269,272,303,261,272,277,284,308,324,286,307,329,281,280,257,227,239,234,221,181,195,190,164,158,131,118,118,106,108,70,104,74,76,70,52,57,56,36,44,40,48,40,33,23,28,25,24,20,20,24,31,19,23,24,15,18,12,14,11,14,14,21,20,14,12,10,11,13,5,14,11,10,11,8,8,6,10,5,11,13,8,14,6,8,3,7])

def _1_PseudoVoigt(x, alpha, amp, cen, sigma, b):
    # single pseudo-Voigt peak: (1-alpha)*Gaussian + alpha*Lorentzian, plus a constant offset b
    return (1-alpha)*amp*(1/(sigma*(np.sqrt(2*np.pi))))*(np.exp((-1.0/2.0)*(((x-cen)/sigma)**2))) + \
            alpha*amp/np.pi*(sigma/((x-cen)**2+sigma**2)) + b

def _PseudoVoigt(x, alpha1,amp1,cen1,sigma1,alpha2,amp2,cen2,sigma2,alpha3,amp3,cen3,sigma3,b):
    # sum of three peaks; note that b is added once per component, so the effective offset is 3*b
    return _1_PseudoVoigt(x,alpha1,amp1,cen1,sigma1,b) + _1_PseudoVoigt(x,alpha2,amp2,cen2,sigma2,b) + \
            _1_PseudoVoigt(x,alpha3,amp3,cen3,sigma3,b)

bounds_min = [0,0,60,0, 0,0,64,0, 0.5,0,68,0, 1]
bounds_max = [1,np.inf,62,1, 1,np.inf,66,1, 1,np.inf,69,1, np.inf]

alpha1,amp1,cen1,sigma1 = 0.5,600,60.4,1
alpha2,amp2,cen2,sigma2 = 0.5,1000,64.9,1
alpha3,amp3,cen3,sigma3 = 0.5,350,68.3,0.7
b = 1
    
popt_PseudoVoigt, pcov_PseudoVoigt = curve_fit(
    _PseudoVoigt, x, y,
    p0=[alpha1,amp1,cen1,sigma1,alpha2,amp2,cen2,sigma2,alpha3,amp3,cen3,sigma3,b],
    bounds=(bounds_min, bounds_max))

pars_1 = np.append(popt_PseudoVoigt[0:4], popt_PseudoVoigt[12])   # peak 1 parameters + baseline b
pars_2 = np.append(popt_PseudoVoigt[4:8], popt_PseudoVoigt[12])   # peak 2 parameters + baseline b
pars_3 = np.append(popt_PseudoVoigt[8:12], popt_PseudoVoigt[12])  # peak 3 parameters + baseline b

PseudoVoigt_peak_1 = _1_PseudoVoigt(x, *pars_1)
PseudoVoigt_peak_2 = _1_PseudoVoigt(x, *pars_2)
PseudoVoigt_peak_3 = _1_PseudoVoigt(x, *pars_3)

plt.figure(figsize=[10,6])
plt.plot(x, y, marker='.', markersize=5, linestyle='none')
plt.plot(x, PseudoVoigt_peak_1)
plt.plot(x, PseudoVoigt_peak_2)
plt.plot(x, PseudoVoigt_peak_3)
plt.plot(x,_PseudoVoigt(x, *popt_PseudoVoigt), 'k--')
plt.show()

which leads to this graph:

Graph of the unsatisfactory fit

As can be seen between the first two peaks and at the third peak, the fit is not very good. Is there a way to improve it? Or should I try another package like lmfit?

Thank you very much for your help!

asked Dec 21 '25 by Wechsler

1 Answer

TL;DR

The problem lies more with the data than with your fitting procedure. Mainly, the third peak does not have enough data on its left side (too much coelution).

Additionally, the fit can be improved a bit by relaxing the constraints on the third peak (its center and alpha parameters).

Finally, it is possible that the third peak does not follow the Voigt model at all, in which case it will not fit well even if properly eluted.

MCVE

Let's first show that your procedure is able to fit a synthetic dataset:

import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from sklearn.metrics import r2_score

x = np.arange(58.1, 71.96, 0.05)
y = np.array([12, 10, 16, 18, 13, 14, 15, 21, 26, 16, 16, 16, 15, 15, 31, 22, 23, 22, 23, 31, 31, 36, 29, 30, 44, 43, 45, 39, 67, 67, 62, 83, 79, 92, 93, 128, 177, 187, 251, 289, 334, 414, 509, 549, 623, 680, 751, 769, 816, 742, 701, 721, 599, 540, 498, 425, 339, 293, 212, 172, 157, 142, 115, 95, 84, 86, 81, 78, 58, 47, 52, 62, 47, 44, 39, 44, 50, 34, 28, 36, 36, 28, 34, 24, 28, 28, 37, 40, 33, 45, 44, 32, 32, 47, 41, 46, 41, 56, 46, 50, 59, 71, 72, 63, 62, 77, 89, 62, 93, 96, 114, 97, 123, 140, 137, 146, 138, 187, 213, 219, 239, 255, 287, 314, 336, 386, 401, 434, 467, 499, 534, 635, 597, 588, 597, 671, 725, 693, 751, 759, 727, 758, 763, 720, 706, 655, 650, 595, 608, 565, 522, 511, 433, 499, 448, 387, 371, 365, 324, 312, 326, 306, 292, 266, 282, 238, 265, 278, 242, 209, 226, 254, 236, 233, 232, 225, 220, 205, 209, 234, 214, 234, 234, 205, 194, 223, 201, 226, 214, 234, 241, 226, 269, 272, 303, 261, 272, 277, 284, 308, 324, 286, 307, 329, 281, 280, 257, 227, 239, 234, 221, 181, 195, 190, 164, 158, 131, 118, 118, 106, 108, 70, 104, 74, 76, 70, 52, 57, 56, 36, 44, 40, 48, 40, 33, 23, 28, 25, 24, 20, 20, 24, 31, 19, 23, 24, 15, 18, 12, 14, 11, 14, 14, 21, 20, 14, 12, 10, 11, 13, 5, 14, 11, 10, 11, 8, 8, 6, 10, 5, 11, 13, 8, 14, 6, 8, 3, 7])

def peak(x, alpha, A, x0, s, b):
    return A * (
        (1 - alpha)*(1 / (s * (np.sqrt(2 * np.pi)))) * np.exp(-0.5 * np.power((x - x0) / s, 2)) + \
        alpha / np.pi * (s / (np.power((x - x0), 2) + np.power(s, 2)))
    ) + b

def model(x, alpha1, A1, x01, s1, alpha2, A2, x02, s2, alpha3, A3, x03, s3, b):
    return (
          peak(x, alpha1, A1, x01, s1, b)
        + peak(x, alpha2, A2, x02, s2, b)
        + peak(x, alpha3, A3, x03, s3, b)
    )

bounds_min = [
    0, 0, 60, 0,
    0, 0, 64, 0,
    0, 0, 67, 0,  # relax the alpha3 lower bound and widen the x03 window
    -10,
]
bounds_max = [
    1, np.inf, 62, 2,
    1, np.inf, 66, 2,
    1, np.inf, 70, 2,
    10,
]

We use some parameters that can mimic your data:

p0 = [
    5.47454610e-01, 7.11445399e+02, 6.04956408e+01, 3.22783279e-01,
    6.07553375e-01, 1.48345124e+03, 6.51134457e+01, 7.05897429e-01,
    2.90916120e-02, 5.87380342e+02, 6.78878208e+01, 9.19652025e-01,
    1.40132572e-01,
]

We create a synthetic dataset with light noise:

np.random.seed(12345)
ys = model(x, *p0)
ys += 0.1 * np.random.normal(size=y.size)

The procedure can regress all parameters properly:

popt, pcov = optimize.curve_fit(
    model, x, ys,
    bounds=(bounds_min, bounds_max)
)
# array([5.47458889e-01, 7.11360396e+02, 6.04956820e+01, 3.22733587e-01,
#        6.07783229e-01, 1.48357521e+03, 6.51134710e+01, 7.05874792e-01,
#        2.77973904e-02, 5.87092002e+02, 6.78878046e+01, 9.19619386e-01,
#        1.42916997e-01])

spopt = np.sqrt(np.diag(pcov))  # 1-sigma standard errors from the covariance matrix
rsd = spopt / popt              # relative standard deviation of each parameter
rsd

# array([5.61128239e-04, 1.32138508e-04, 2.96521977e-07, 7.28917004e-05,
#        4.33596797e-04, 1.21671442e-04, 5.40000252e-07, 6.52427558e-05,
#        3.84556129e-02, 4.05365365e-04, 1.70150662e-06, 1.43308769e-04,
#        4.98309017e-02])
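
For readability, the RSDs can be paired with the parameter names. This is just a small sketch; the name list simply mirrors the model signature defined above:

names = ["alpha1", "A1", "x01", "s1",
         "alpha2", "A2", "x02", "s2",
         "alpha3", "A3", "x03", "s3", "b"]
for name, value, rel in zip(names, popt, rsd):
    print(f"{name:>7s} = {value:12.5e}   RSD = {rel:.2e}")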

But you can already see that alpha3 and b have a noticeably larger RSD than the other parameters.

Figure: fit of the low-noise synthetic dataset

If we add noise at a level comparable to your dataset, both parameters become meaningless:

np.random.seed(12345)
ys = model(x, *p0)
ys += 10 * np.random.normal(size=y.size)

popt
# array([ 5.58589007e-01,  7.07259660e+02,  6.04996493e+01,  3.18390870e-01,
#         6.36346065e-01,  1.49709562e+03,  6.51149588e+01,  7.03686126e-01,
#         1.65684280e-11,  5.79736506e+02,  6.78855558e+01,  9.21699598e-01,
#        -5.14052072e-02])

Figure: fit of the high-noise synthetic dataset

This is a strong hint that the third peak is too close to the second and lacks too much data on its left side to be regressed properly, and the effect worsens as the noise increases. In this synthetic case, the third peak does follow the Voigt model by construction.

If we perform the same procedure on your dataset, we get:

popt, pcov = optimize.curve_fit(
    model, x, y,
    bounds=(bounds_min, bounds_max)
)
# array([5.47454610e-01, 7.11445399e+02, 6.04956408e+01, 3.22783279e-01,
#        6.07553375e-01, 1.48345124e+03, 6.51134457e+01, 7.05897429e-01,
#        2.90916120e-02, 5.87380342e+02, 6.78878208e+01, 9.19652025e-01,
#        1.40132572e-01])

rsd
# array([1.10523592e-01, 2.60282421e-02, 5.84082859e-05, 1.43564134e-02,
#        8.54478232e-02, 2.39650775e-02, 1.06361209e-04, 1.28483646e-02,
#        7.23149551e+00, 7.98235431e-02, 3.35089472e-04, 2.82361968e-02,
#        1.00113097e+01])

We can see the above-mentioned effect on alpha3. We also see that alpha3 and x03 fall slightly outside the initial bounds you provided, so in your fit they hit the constraint and stopped being optimized. It is also possible that the third peak does not follow the Voigt model at all.
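
A quick way to spot this in your own fit is to check which entries of popt sit exactly on a bound. This is only a sketch; popt, bounds_min and bounds_max refer to whichever fit you just ran:

popt = np.asarray(popt)
pinned = np.isclose(popt, bounds_min) | np.isclose(popt, bounds_max)
print(np.flatnonzero(pinned))  # indices of parameters pinned against a constraint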

score = r2_score(y, model(x, *popt))
# 0.9918069181541809

The score is fairly acceptable even with the lack of fit for the third peak.

Figure: fit of the real dataset with the relaxed bounds

Misc

Voigt and pseudo-Voigt functions can be written as follows:

from scipy import stats, special

def voigt(x, sigma, gamma, x0, A):
    # Voigt profile via the Faddeeva function: erfcx(-1j*z) == wofz(z)
    z = (x - x0 + 1j * gamma) / (sigma * np.sqrt(2))
    y = special.erfcx(-1j * z) / (sigma * np.sqrt(2 * np.pi))
    return A * np.real(y)

def pseudo_voigt(x, eta, sigma, gamma, x0, A):
    G = stats.norm.pdf(x, scale=sigma, loc=x0)
    L = stats.cauchy.pdf(x, scale=2. * gamma, loc=x0) 
    return A*((1. - eta) * G + eta * L)
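
For reference, a single peak of each flavor can be evaluated and compared on a grid. This is a quick sketch only; the parameter values here are arbitrary:

xs = np.linspace(58, 72, 500)
plt.plot(xs, voigt(xs, sigma=0.3, gamma=0.2, x0=65.0, A=1000.0), label="Voigt")
plt.plot(xs, pseudo_voigt(xs, eta=0.5, sigma=0.3, gamma=0.2, x0=65.0, A=1000.0), label="pseudo-Voigt")
plt.legend()
plt.show()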

Using Voigt

Using Voigt peaks in place of pseudo Voigt peaks does not seem to improve the fit:

def model(x, sigma1, gamma1, x01, A1, sigma2, gamma2, x02, A2, sigma3, gamma3, x03, A3):
    return (
          voigt(x, sigma1, gamma1, x01, A1)
        + voigt(x, sigma2, gamma2, x02, A2)
        + voigt(x, sigma3, gamma3, x03, A3)
    )

bounds_min = [
    0, 0, 60, 0,
    0, 0, 64, 0,
    0, 0, 67, 0,
]
bounds_max = [
    1, 1, 62, np.inf,
    1, 1, 66, np.inf,
    1, 1, 70, np.inf,
]

popt
# array([2.21959138e-01, 1.65320386e-01, 6.04956801e+01, 7.12019620e+02,
#        4.37122515e-01, 4.20861584e-01, 6.51117879e+01, 1.49291511e+03,
#        9.32188697e-01, 4.90643611e-14, 6.78851635e+01, 5.82535170e+02])

rsd
# array([5.67079888e-02, 9.69470290e-02, 5.97135777e-05, 1.76340619e-02,
#        5.81668589e-02, 8.09311962e-02, 1.05311176e-04, 2.13988766e-02,
#        7.35697878e-02, 2.65030908e+12, 3.38815648e-04, 5.69179225e-02])

score
# 0.9914383339892325

Figure: fit of the real dataset with Voigt peaks

This also shows a high RSD on the parameter controlling the shape of the third peak (gamma3). It seems the optimizer really wants this peak to be Gaussian (gamma3 is driven to essentially zero) because of the dominance of the right tail.
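
As a quick numeric cross-check (a sketch, assuming popt now holds the twelve Voigt parameters listed above), the fitted third peak is indeed indistinguishable from a plain Gaussian when gamma3 is this small:

sigma3, gamma3, x03, A3 = popt[8:12]  # third-peak parameters from the Voigt fit
xs = np.linspace(66.0, 71.0, 500)
gauss = A3 * stats.norm.pdf(xs, loc=x03, scale=sigma3)
print(np.max(np.abs(voigt(xs, sigma3, gamma3, x03, A3) - gauss)))  # expected to be ~0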

Conclusions

Fitting a mixed model with 13 parameters is not a trivial task, and you already got it largely working.

We showed that alpha3 (and gamma3 with the Voigt model) is the most sensitive parameter, most probably because of too strong a coelution with the second peak.

At this point we should not question the fitting procedure but rather try to improve the dataset. You have a few options:

  • Reduce the noise in your dataset in order to bring its effect on alpha3 down to an acceptable level (if re-measuring is not an option, see the smoothing sketch after this list);
  • Improve the peak resolution to reduce the coelution (get more points on the left side of the third peak);
  • Change the model for the third peak.
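
If cleaner data cannot be re-acquired, mild smoothing is one way to damp the noise before fitting. This is only a sketch using scipy.signal.savgol_filter: the window length and polynomial order are arbitrary choices that should be tuned (over-smoothing will bias the peak shapes), and model, bounds_min and bounds_max are whichever variant you last defined:

from scipy.signal import savgol_filter

y_smooth = savgol_filter(y, window_length=11, polyorder=3)  # hypothetical smoothing parameters
popt_s, pcov_s = optimize.curve_fit(
    model, x, y_smooth,
    bounds=(bounds_min, bounds_max)
)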
answered Dec 23 '25 by jlandercy