
Remove outliers in data, keep the original trend

[figure: the data, showing a few isolated spikes]

In my plot there are just a few noise spikes, as you can see. I tried scipy.signal.savgol_filter, but it changed the trend. I just want to remove these noise points and make them fit the curve. Thank you.

Harvey Xie asked Dec 12 '25 16:12


2 Answers

I think you are confusing noise with outliers; please refer to: the-basic-difference-between-noise-and-outliers. There are many different approaches to removing outliers, e.g. using the z-score:

import numpy as np
from scipy import stats

df = df.mask(np.abs(stats.zscore(df)) > 2)  # replace values with |z-score| above 2 by NaN - experiment with the threshold best suited to your data

Important: do this after removing the trend from your data; otherwise the trend itself inflates the spread and the z-scores become meaningless.
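A minimal end-to-end sketch of that workflow (detrend first, then score the residuals); the synthetic linear trend, the injected outlier, and the threshold of 2 are illustrative assumptions, not part of the original answer:

```python
import numpy as np
import pandas as pd
from scipy import stats, signal

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 3.0 * x + rng.normal(scale=0.2, size=x.size)  # linear trend + mild noise
y[50] += 10.0                                     # inject one outlier

df = pd.DataFrame({"y": y})

# Detrend first: the z-score is only meaningful on the residuals.
residuals = signal.detrend(df["y"].to_numpy())
z = stats.zscore(residuals)

# Keep only points whose residual z-score is within the threshold;
# the injected outlier at index 50 is dropped.
clean = df[np.abs(z) < 2]
```

With the trend removed, the outlier sits many standard deviations away from the residual distribution, so the threshold separates it cleanly from the ordinary noise.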

Matmozaur answered Dec 14 '25 06:12


Let's recreate a dataset:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats, signal, optimize

np.random.seed(12345)

def model(x, a, b, c):
    return a*np.exp(-b*x) + c

x = np.linspace(0, 350, 200)
y = model(x, 100, 0.01, 75)
n = np.random.normal(size=x.size)
yn = y + n

yn[20] *= 0.75
yn[21] *= 0.5
yn[22] *= 1.75
yn[23] *= 0.25
yn[24] *= 0.20
yn[25] *= 0.75
yn[100] *= 0.5
yn[101] *= 1.75

If the outliers are not too strong or too numerous, we can estimate the trend by fitting the curve with the outliers still included:

popt1, pcov1 = optimize.curve_fit(model, x, yn)
yhat1 = model(x, *popt1)

# (array([9.27557251e+01, 1.02647524e-02, 7.64660389e+01]),
#  array([[ 1.94284082e+01,  7.21272130e-04, -3.70396525e+00],
#         [ 7.21272130e-04,  1.80489353e-06,  3.75303063e-03],
#         [-3.70396525e+00,  3.75303063e-03,  1.05002199e+01]]))

This is already close to the optimal parameters, but the fit is degraded by the outliers (see the covariance).

Alternatively, we can smooth the curve with a filter, as you suggested:

yhat1 = signal.savgol_filter(yn, 151, 3)  # use an odd window length (required by older SciPy releases)

Then, as @Matmozaur suggested, the z-score of the residuals is a good criterion for flagging outliers:

zs = stats.zscore(yhat1 - yn)
mask = np.abs(zs) < 2.

Now that we have identified the outliers, we can fit the function without them:

popt2, pcov2 = optimize.curve_fit(model, x[mask], yn[mask])
yhat2 = model(x, *popt2)

# (array([9.90714297e+01, 1.01604158e-02, 7.54550734e+01]),
# array([[ 5.81279449e-01,  1.70129801e-05, -1.13880755e-01],
#        [ 1.70129801e-05,  4.43252922e-08,  1.00312909e-04],
#        [-1.13880755e-01,  1.00312909e-04,  3.04817515e-01]]))

This is a fairly acceptable fit for such a setup.

[figure: data with fitted trends, before and after removing the outliers]

As suggested by @mikuszefski, an alternative option is to estimate the trend with least_squares, using a loss that penalizes outliers, such as the Cauchy loss:

def residuals(args, x, y):
    return model(x, *args) - y

result = optimize.least_squares(residuals, x0=(10, 0.1, 10), args=(x, yn), loss="cauchy")
popt1b = result.x
# array([1.00306426e+02, 1.00068743e-02, 7.49353910e+01])

This returns a proper set of parameters without the need to remove outliers, so it can be used directly or combined with the outlier-removal procedure above.

[figure: comparison of the naive curve_fit trend and the least_squares fit with Cauchy loss]

The figure shows the impact of the outliers on the naive trend estimate from curve_fit, while least_squares equipped with the Cauchy loss is not dominated by them.

jlandercy answered Dec 14 '25 07:12


