I have spent some time on a problem recently. I have a set of data:
y = f(t), where y is a very small concentration (on the order of 10^-7) and t is in seconds. t varies from 0 to around 12000.
The measurements follow an established model:
y = Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)
And I need to find Vs, Vi, and k. So I used curve_fit, which returns the best fitting parameters, and I plotted the curve.
And then I used a similar model:
y = (Vs * t/3600 - ((Vs - Vi) * (1 - np.exp(-k * t/3600)) / k)) * 10**7
By doing that, t is expressed in hours, and y is a number between 0 and about 10. The parameters returned are of course different. But when I plot each curve, here is what I get:
http://i.imgur.com/XLa4LtL.png
The green fit is the first model, and the blue one is the "normalized" model. The red dots are the experimental values.
The fitted curves are different. I did not expect that, and I don't understand why. Are the calculations more accurate if the numbers are "reasonable"?
The docstring for optimize.curve_fit says,
p0 : None, scalar, or M-length sequence
    Initial guess for the parameters.  If None, then the initial
    values will all be 1 (if the number of parameters for the function
    can be determined using introspection, otherwise a ValueError
    is raised).
Thus, to begin with, the initial guess for the parameters is by default 1.
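As a minimal sketch of overriding that default, the snippet below passes an explicit p0 to curve_fit. The synthetic data, the "true" parameter values, and the rough_guess helper are all made up for illustration; only the model comes from the question.

```python
import numpy as np
from scipy.optimize import curve_fit


def model(t, Vs, Vi, k):
    """Model from the question, with t in seconds."""
    return Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)


def rough_guess(t, y, n=5):
    """Hypothetical helper: crude starting values read off the data."""
    Vi0 = np.polyfit(t[:n], y[:n], 1)[0]    # early slope approximates Vi
    Vs0 = np.polyfit(t[-n:], y[-n:], 1)[0]  # late slope approximates Vs
    k0 = 1.0 / (0.2 * (t[-1] - t[0]))       # assumes the transient lasts ~1/5 of the run
    return Vs0, Vi0, k0


# Synthetic data with y around 1e-7 (values invented for this example).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 12000.0, 60)
true_params = (1e-11, 3e-11, 5e-4)
y = model(t, *true_params) + rng.normal(scale=5e-10, size=t.size)

p0 = rough_guess(t, y)
popt, pcov = curve_fit(model, t, y, p0=p0)


def sse(params):
    """Sum of squared residuals for a given parameter set."""
    return np.sum((model(t, *params) - y) ** 2)


print("initial guess:", p0, "SSE:", sse(p0))
print("fitted values:", popt, "SSE:", sse(popt))
```

Any starting point of roughly the right magnitude should serve the same purpose; the slope-based guess is just one convenient way to get there.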
Moreover, curve fitting algorithms have to sample the function for various values of the parameters. The "various values" are initially chosen with an initial step size on the order of 1. The algorithm will work better if your data varies somewhat smoothly with changes in the parameter values that are on the order of 1.
If the function varies wildly with parameter changes on the order of 1, then the algorithm may tend to miss the optimum parameter values.
Note that even if the algorithm uses an adaptive step size when it tweaks the parameter values, if the initial tweak is so far off the mark as to produce a big residual, and if tweaking in some other direction happens to produce a smaller residual, then the algorithm may wander off in the wrong direction and miss the local minimum. It may find some other (undesired) local minimum, or simply fail to converge. So using an algorithm with an adaptive step size won't necessarily save you.
The moral of the story is that scaling your data can improve the algorithm's chances of finding the desired minimum.
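If you fit the rescaled model, you will probably want the parameters back in the original units. The sketch below assumes my reading of the two formulas in the question: the rescaled model is the original one multiplied by 10**7, with Vs, Vi and k expressed per hour instead of per second. The helper name to_per_second and the sample parameter values are hypothetical.

```python
import numpy as np


def model(t, Vs, Vi, k):
    """Original model: t in seconds, rates per second."""
    return Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)


def model_scaled(t, Vs, Vi, k):
    """Rescaled model from the question: rates per hour, output times 10**7."""
    return (Vs * t / 3600 - ((Vs - Vi) * (1 - np.exp(-k * t / 3600)) / k)) * 10**7


def to_per_second(Vs_h, Vi_h, k_h):
    """Convert per-hour parameters to per-second ones (hypothetical helper)."""
    return Vs_h / 3600, Vi_h / 3600, k_h / 3600


t = np.linspace(0.0, 12000.0, 50)
scaled_params = (3.6e-8, 1.08e-7, 1.8)  # illustrative per-hour values
orig_params = to_per_second(*scaled_params)

# With matching parameters the two models differ only by the factor 10**7.
same_curve = np.allclose(model_scaled(t, *scaled_params),
                         1e7 * model(t, *orig_params))
print(orig_params, same_curve)
```

If the two fits still describe different curves after this conversion, at least one of them did not reach the minimum.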
Numerical algorithms in general all tend to work better when applied to data whose magnitude is on the order of 1. This bias enters into the algorithm in numerous ways. For instance, optimize.curve_fit relies on optimize.leastsq, and the call signature for optimize.leastsq is:
def leastsq(func, x0, args=(), Dfun=None, full_output=0,
            col_deriv=0, ftol=1.49012e-8, xtol=1.49012e-8,
            gtol=0.0, maxfev=0, epsfcn=None, factor=100, diag=None):
Thus, by default, the tolerances ftol and xtol are on the order of 1e-8. If finding the optimum parameter values requires much smaller tolerances, then these default values will cause optimize.curve_fit to miss the optimum parameter values.
To make this more concrete, suppose you were trying to minimize f(x) = 1e-100*x**2. The factor of 1e-100 squashes the y-values so much that a wide range of x-values (the parameter values mentioned above) will fit within the tolerance of 1e-8. So, with poor scaling, leastsq will not do a good job of finding the minimum.
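Here is a small sketch of that effect. Note that it swaps in scipy.optimize.minimize with method="BFGS" instead of leastsq: BFGS stops when the gradient norm drops below an absolute threshold (gtol, 1e-5 by default), which makes the problem easy to reproduce. leastsq documents ftol and xtol as relative errors, so it may not behave the same way on this particular function.

```python
import numpy as np
from scipy.optimize import minimize


def quadratic(x, scale):
    """f(x) = scale * x**2, with its minimum at x = 0."""
    return scale * x[0] ** 2


def quadratic_grad(x, scale):
    """Analytic gradient of quadratic()."""
    return np.array([2.0 * scale * x[0]])


x0 = np.array([1.0])

# Badly scaled: the gradient at x0 is 2e-100, already below gtol.
res_tiny = minimize(quadratic, x0, args=(1e-100,),
                    jac=quadratic_grad, method="BFGS")

# Well scaled: same problem without the 1e-100 factor.
res_unit = minimize(quadratic, x0, args=(1.0,),
                    jac=quadratic_grad, method="BFGS")

print("scale 1e-100:", res_tiny.x, "iterations:", res_tiny.nit)
print("scale 1:     ", res_unit.x, "iterations:", res_unit.nit)
```

In the badly scaled case the optimizer should report success without moving from the starting point, because every x already satisfies the stopping test.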
Another reason to use floats on the order of 1 is that there are many more (IEEE754) floats in the interval [-1,1] than there are in an interval of similar or even much greater width far away from 1. For example,
import struct
def floats_between(x, y):
    """
    http://stackoverflow.com/a/3587987/190597 (jsbueno)
    """
    a = struct.pack("<dd", x, y)
    b = struct.unpack("<qq", a)
    return b[1] - b[0]
In [26]: floats_between(0,1) / float(floats_between(1e6,1e7))
Out[26]: 311.4397707054894
This shows there are over 300 times as many floats representing numbers between 0 and 1 as there are in the interval [1e6, 1e7]. Thus, all else being equal, you'll typically get a more accurate answer when working with small numbers than with very large numbers.
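Another way to look at the same thing, as a rough sketch, is to ask numpy for the gap between a float and the next representable one. This uses np.spacing, which is not part of the code above.

```python
import numpy as np

# Distance from x to the next representable double, for a few magnitudes.
gaps = {x: float(np.spacing(x)) for x in (1.0, 1e6, 1e7)}

for x, gap in gaps.items():
    print(f"x = {x:>10.1f}   gap = {gap:.3e}   gap / x = {gap / x:.3e}")
```

The absolute gap grows with the magnitude of x, while the gap relative to x stays around 1e-16. So the benefit of numbers near 1 is mostly about absolute resolution and about matching the optimizer's default tolerances and step sizes.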