
Remove outliers in data, keep the original trend

[figure: the data, showing a few isolated spikes]

In my plot there are just a few noise spikes, as you can see. I tried scipy.signal.savgol_filter, but it changed the trend. I just want to remove these noise points and make them fit the curve. Thank you.

Harvey Xie asked Dec 12 '25 16:12


2 Answers

I think you are confusing noise with outliers; please refer to: the-basic-difference-between-noise-and-outliers. There are many different approaches to removing outliers, e.g. using the z-score:

import numpy as np
from scipy import stats

df = df.mask(np.abs(stats.zscore(df)) > 2)  # replace values with |z-score| above 2 by NaN - experiment with the threshold best suited to your data

Important: do this after removing the trend from your data; otherwise the trend itself inflates the spread and the z-scores become meaningless.
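A minimal end-to-end sketch of that workflow (detrend first, then score the residuals); the synthetic linear trend, the injected outlier, and the threshold of 2 are illustrative assumptions, not part of the original answer:

```python
import numpy as np
import pandas as pd
from scipy import stats, signal

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 3.0 * x + rng.normal(scale=0.2, size=x.size)  # linear trend + mild noise
y[50] += 10.0                                     # inject one outlier

df = pd.DataFrame({"y": y})

# Detrend first: the z-score is only meaningful on the residuals.
residuals = signal.detrend(df["y"].to_numpy())
z = stats.zscore(residuals)

# Keep only points whose residual z-score is within the threshold;
# the injected outlier at index 50 is dropped.
clean = df[np.abs(z) < 2]
```

With the trend removed, the outlier sits many standard deviations away from the residual distribution, so the threshold separates it cleanly from the ordinary noise.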

Matmozaur answered Dec 14 '25 06:12


Let's recreate a dataset:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats, signal, optimize

np.random.seed(12345)

def model(x, a, b, c):
    return a*np.exp(-b*x) + c

x = np.linspace(0, 350, 200)
y = model(x, 100, 0.01, 75)
n = np.random.normal(size=x.size)
yn = y + n

yn[20] *= 0.75
yn[21] *= 0.5
yn[22] *= 1.75
yn[23] *= 0.25
yn[24] *= 0.20
yn[25] *= 0.75
yn[100] *= 0.5
yn[101] *= 1.75

If the outliers are not too strong or too numerous, we can estimate the trend by fitting the curve with the outliers still included:

popt1, pcov1 = optimize.curve_fit(model, x, yn)
yhat1 = model(x, *popt1)

# (array([9.27557251e+01, 1.02647524e-02, 7.64660389e+01]),
#  array([[ 1.94284082e+01,  7.21272130e-04, -3.70396525e+00],
#         [ 7.21272130e-04,  1.80489353e-06,  3.75303063e-03],
#         [-3.70396525e+00,  3.75303063e-03,  1.05002199e+01]]))

This is already close to the optimal parameters, but the fit is degraded by the outliers (see the covariance).

Alternatively, we can smooth the curve with a filter, as you suggested:

yhat1 = signal.savgol_filter(yn, 151, 3)  # use an odd window length (required by older SciPy releases)

Then, as @Matmozaur suggested, the z-score of the residuals is a good criterion for flagging outliers:

zs = stats.zscore(yhat1 - yn)
mask = np.abs(zs) < 2.

Now that we have identified the outliers, we can fit the function without them:

popt2, pcov2 = optimize.curve_fit(model, x[mask], yn[mask])
yhat2 = model(x, *popt2)

# (array([9.90714297e+01, 1.01604158e-02, 7.54550734e+01]),
# array([[ 5.81279449e-01,  1.70129801e-05, -1.13880755e-01],
#        [ 1.70129801e-05,  4.43252922e-08,  1.00312909e-04],
#        [-1.13880755e-01,  1.00312909e-04,  3.04817515e-01]]))

This is a fairly acceptable fit for such a setup.

[figure: data with fitted trends, before and after removing the outliers]

As suggested by @mikuszefski, an alternative option is to estimate the trend with least_squares, using a loss that penalizes outliers, such as the Cauchy loss:

def residuals(args, x, y):
    return model(x, *args) - y

result = optimize.least_squares(residuals, x0=(10, 0.1, 10), args=(x, yn), loss="cauchy")
popt1b = result.x
# array([1.00306426e+02, 1.00068743e-02, 7.49353910e+01])

This returns a proper set of parameters without the need to remove outliers, so it can be used directly or combined with the outlier-removal procedure above.

[figure: comparison of the naive curve_fit trend and the least_squares fit with Cauchy loss]

The figure shows the impact of the outliers on the naive trend estimate from curve_fit, while least_squares equipped with the Cauchy loss is not dominated by them.

jlandercy answered Dec 14 '25 07:12


