I have a set of x and y data. When plotted, the data is linear except for one section where it deviates from the straight line, something like a hump. My goal is to develop a procedure that identifies the points in the hump (i.e. the points that deviate from the straight line). The attached image shows exactly what I want to achieve.

I have tried fitting a linear trendline to the entire data, calculating the residuals and excluding the section with the larger residuals; however, my approach hasn't been very successful: the trendline doesn't seem to pass through the right points. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Data provided
x = np.array([134, 147, 161, 175, 190, 206, 222, 237, 251, 263, 275, 291, 300, 312, 324, 337, 349, 360, 372, 382]).reshape(-1, 1)
y = np.array([0.788875116, 0.692846919, 0.605305046, 0.738780558, 0.826074803, 0.871572936, 0.776701184, 
          0.646403726, 0.677606953, 0.615950052, 0.357934847, 0.267171728, 0.217483944, 0.155336037, 
          0.071882007, 0.029383778, -0.008773924, -0.050609993, -0.102372909, -0.148741651])
# Fit linear regression
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x)
# Calculate residuals
residuals = y - y_pred
std_dev = np.std(residuals)
# Identify hump points by adjusting alpha
alpha = 2 / 3
threshold = std_dev * alpha
hump_points = np.where(np.abs(residuals) > threshold)[0]
# Visualize
plt.figure(figsize=(12, 6))
plt.plot(x, y, 'o-', label='Data', markersize=6)
plt.plot(x, y_pred, 'r--', label='Fitted Line', linewidth=2)
plt.axhline(0, color='gray', linestyle='--', label='Zero Residual')
plt.scatter(x[hump_points], y[hump_points], color='orange', label='Hump Point', s=100)
plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Detecting Hump in Data')
plt.grid(True)
plt.show()
# Print hump points
hump_info = [(x[point][0], y[point]) for point in hump_points]
print(hump_info)

Your approach is the natural one, and it would work with a longer line (and a hump of the same size). The problem here is that the "hump" is too big (in terms of the number of points involved), so the linear regression finds a compromise between fitting the line points well and the hump points badly, and the other way around.
But there is an algorithm that does exactly what you need: fit a line (or, more generally, a model; here, a line) to your data while ignoring outliers: RANSAC.
from sklearn.linear_model import LinearRegression, RANSACRegressor
import numpy as np
import matplotlib.pyplot as plt
x = np.array([134, 147, 161, 175, 190, 206, 222, 237, 251, 263, 275, 291, 300, 312, 324, 337, 349, 360, 372, 382]).reshape(-1, 1)
y = np.array([0.788875116, 0.692846919, 0.605305046, 0.738780558, 0.826074803, 0.871572936, 0.776701184, 0.646403726, 0.677606953, 0.615950052, 0.357934847, 0.267171728, 0.217483944, 0.155336037, 0.071882007, 0.029383778, -0.008773924, -0.050609993, -0.102372909, -0.148741651])
# RANSAC: repeatedly fit the line on random subsets and keep the largest consensus set
reg = RANSACRegressor(max_trials=10000, residual_threshold=0.1)
reg.fit(x, y)
a = reg.estimator_.coef_        # slope of the line fitted on the inliers
b = reg.estimator_.intercept_   # intercept of that line
inliers = reg.inlier_mask_      # boolean mask: True for points classified as inliers
plt.scatter(x[inliers], y[inliers], marker='o')
plt.scatter(x[~inliers], y[~inliers], marker='x')
plt.plot(x, a*x + b)
plt.show()
What RANSAC does is select a random subset of your points and try to fit a line on that subset. It then removes the points that are too far from the regressed line and adds (among those that were not selected) those that are, on the contrary, close to the regressed line. Then it performs a regression, this time not on a random subset but on the subset of points that do not appear to be outliers.
Then it starts again (because of the random selection, you need several trials to find a line that fits a subset well, preferably the biggest one, of course).
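To make that description concrete, here is a minimal sketch of the idea in plain numpy (not sklearn's actual implementation; ransac_line, n_trials and threshold are illustrative names, and x and y are assumed 1-D here, e.g. x.ravel() on the data above):
import numpy as np
def ransac_line(x, y, n_trials=1000, threshold=0.1, rng=None):
    # Simplified RANSAC sketch for a line fit on 1-D x and y
    rng = np.random.default_rng(rng)
    best_inliers, best_line = None, None
    for _ in range(n_trials):
        # 1. pick a random minimal subset (two points define a line)
        subset = rng.choice(len(x), size=2, replace=False)
        a, b = np.polyfit(x[subset], y[subset], 1)
        # 2. consensus set: every point close enough to that candidate line
        inliers = np.abs(y - (a * x + b)) < threshold
        # 3. keep the trial with the largest consensus set, refitted on all its inliers
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            a, b = np.polyfit(x[inliers], y[inliers], 1)
            best_inliers, best_line = inliers, (a, b)
    return best_line, best_inliers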
Once that is done, you can easily compute the distance between all points and your computed line to find what you want.
Or, even simpler, since RANSAC in its internal logic also needs to discriminate outliers from inliers, you can just hijack RANSAC's inlier list.
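For instance, building on the variables from the snippet above (a sketch; the 0.1 just mirrors the residual_threshold used there):
# Outliers of the RANSAC fit are exactly the hump points you are after
hump_idx = np.where(~inliers)[0]
hump_info = [(x[i][0], y[i]) for i in hump_idx]
print(hump_info)
# Equivalently, by distance of every point to the fitted line
residuals = np.abs(y - (a * x.ravel() + b))
hump_idx = np.where(residuals > 0.1)[0]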

Note that this can be seen as a generalization of what lastchance invented. His algorithm assumes that the first and last points have to be on the line, whereas RANSAC makes no assumption at all. So lastchance's method is a bit like RANSAC with a single trial, with the initial subset being just the first and last points (then, as in RANSAC, the inliers are selected from the line obtained from this subset).
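As a sketch of that single-trial idea (my paraphrase of lastchance's method; the 0.1 tolerance is chosen purely for illustration):
# Line through the first and last points only, then classify by distance to it
x1d = x.ravel()
slope = (y[-1] - y[0]) / (x1d[-1] - x1d[0])
intercept = y[0] - slope * x1d[0]
on_line = np.abs(y - (slope * x1d + intercept)) < 0.1   # inliers of that single "trial"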
To be honest, when I run it several times, I sometimes find solutions that wouldn't satisfy you. But that is because, sadly, those solutions are perfectly valid.
See this one, for example:

It is hard to say that it is invalid, or that there exists an objective criterion to rule this solution out. It has found a very good line once it removed some (few) outliers. It is not like your line, which passes through the middle of the cloud of points, preferring two half-wrong fits to one good and one clearly wrong.
So, unless lastchance's criterion happens to be correct (the first and last points are on the line no matter what, or at least they are not outliers), there is no objective way to tell why the good solution (according to you; I am sure that from an applicative standpoint that is true, but not by any mathematical objective criterion, at least none you told us about) is the good one and the other the bad one.
Note that if lastchance's criterion is correct, you can do that with RANSAC too.
from sklearn.linear_model import LinearRegression, RANSACRegressor
import numpy as np
import matplotlib.pyplot as plt
x = np.array([134, 147, 161, 175, 190, 206, 222, 237, 251, 263, 275, 291, 300, 312, 324, 337, 349, 360, 372, 382]).reshape(-1, 1)
y = np.array([0.788875116, 0.692846919, 0.605305046, 0.738780558, 0.826074803, 0.871572936, 0.776701184, 0.646403726, 0.677606953, 0.615950052, 0.357934847, 0.267171728, 0.217483944, 0.155336037, 0.071882007, 0.029383778, -0.008773924, -0.050609993, -0.102372909, -0.148741651])
def isDataValid(XX, YY):
    # Reject any random subset that does not contain both the first and the last point
    return (x[0] in XX) and (x[-1] in XX)

reg = RANSACRegressor(max_trials=10000, residual_threshold=0.1, is_data_valid=isDataValid)
reg.fit(x, y)
a = reg.estimator_.coef_
b = reg.estimator_.intercept_
inliers = reg.inlier_mask_
plt.scatter(x[inliers], y[inliers], marker='o')
plt.scatter(x[~inliers], y[~inliers], marker='x')
plt.plot(x, a*x + b)
plt.show()
What's the point, you may ask; I mean, why not just use lastchance's strategy then? The difference is that the first and last points are then forced to be inliers (any random subselection that does not include the first and last points is invalid), but not to lie right on the line: the algorithm still keeps the liberty to adjust the line, so that the first and last points are not exactly on it (while still being "almost" on it), to better fit the other points.
With this is_data_valid you are guaranteed that, whatever the solution is, it will include the first and last points as inliers (not necessarily exactly on the line, but that is a good thing, since it leaves the algorithm a degree of freedom to better fit the other inliers), and you will never end up with my second plot.
You may play with the is_data_valid argument to implement other criteria you may think of from an applicative standpoint. Or (but that costs more) with is_model_valid, which is the same thing but applied after the linear regression, to reject some solutions a posteriori.
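As a hedged sketch of the is_model_valid variant (the slope-sign check is a made-up criterion, only to show where such a test plugs in; it is not something from your problem statement):
def isModelValid(model, XX, YY):
    # Called after the candidate line has been fitted on the random subset;
    # here we reject, a posteriori, any candidate whose slope is not negative
    return model.coef_[0] < 0
reg = RANSACRegressor(max_trials=10000, residual_threshold=0.1, is_model_valid=isModelValid)
reg.fit(x, y)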
Note: the residual_threshold parameter is, as you may guess, exactly the same thing as what you call threshold in your code: the tolerated residual error of the linear model. So you could still use it with your alpha and std_dev, with the same meaning.
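For example, a small sketch reusing your alpha and std_dev (computed from a plain least-squares fit, as in your original code) to set residual_threshold:
# Tolerance taken from the residuals of an ordinary fit, scaled by your alpha
ols = LinearRegression().fit(x, y)
std_dev = np.std(y - ols.predict(x))
alpha = 2 / 3
reg = RANSACRegressor(max_trials=10000, residual_threshold=alpha * std_dev)
reg.fit(x, y)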