Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to detect outlier data points on my database

I currently have a database that has some has mislabeled prices on it.

Example Data:

Product - Price - SalesDate
ProdA  - 10  - 1/1/2016
ProdB  - 20  - 1/2/2016
ProdA  - 100 - 1/3/2016
ProdB  - 20  - 1/4/2016
ProdB  - 21  - 1/5/2016
ProdA  - 11  - 1/6/2016

On this data set the record "ProdA - 100 - 1/3/2016" is the one with an error. Must probably the person who entered the price made a typing mistake. Furthermore, ProdA on different days can changes it price and that makes this problem interesting.

What kind of tool can help my identify this type of records? Can SQL help me detect Outlier data points? Should I start looking into Machine Learning for this?

like image 362
user1261620 Avatar asked Sep 03 '25 06:09

user1261620


1 Answers

This is a bit subjective, but you can identify the rows whose values are furthest from the average. I would do this by calculating the z-score and looking at the largest/smallest z-scores.

The z-score is the value minus the average divided by the standard deviation. Here is an example of the calculation:

select t.*,
       (price - avg_price) / nullif(std_price, 0) as z_price
from t join
     (select product, avg(price) as avg_price, stdev(price) as std_price
      from t
      group by product
     ) tt
     on t.product = tt.product
order by abs(z_price) desc;

The function for the standard deviation might vary, depending on the database you are using, but most databases support such a function.

like image 58
Gordon Linoff Avatar answered Sep 04 '25 23:09

Gordon Linoff