I have data like the sample below, but some of the mileage readings are wrong. Mileage should only increase over time, yet occasionally a value is far too low or far too high. Is it possible to clean this data in R? Do you have any ideas? To fix a bad value I could use the average of the records above and below it, but how do I catch the error in the sequence in the first place?
CarID FuelTransactionDate Mileage
AAA555 05.01.2019 5060
AAA555 30.01.2019 7800
AAA555 14.02.2019 9100
AAA555 24.02.2019 9900
AAA555 07.04.2019 101110 <- mistake
AAA555 12.04.2019 12500
AAA555 15.05.2019 13000
AAA555 09.06.2019 13422
BBB788 15.05.2018 15000
BBB788 04.06.2018 15200
BBB788 19.06.2018 16150
BBB788 16.07.2018 100 <- mistake
BBB788 27.08.2018 17500
BBB788 10.09.2018 17999
BBB788 13.10.2018 18200
BBB788 02.11.2018 18555
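If you want to identify where the mistakes occur, here is one option using ave + cummax + cummin in base R:
within(
  df,
  err <- ave(
    Mileage,
    CarID,
    FUN = function(x) {
      # too high: TRUE where the value is the car's overall maximum;
      # the last row is reset to 0, since a correct final reading is the maximum
      hi <- replace(cummax(x) == rev(cummax(rev(x))), length(x), 0)
      # too low: TRUE where the value is the car's overall minimum;
      # the first row is reset to 0, since a correct first reading is the minimum
      lo <- replace(cummin(x) == rev(cummin(rev(x))), 1, 0)
      hi + lo
    }
  )
)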
which gives
CarID FuelTransactionDate Mileage err
1 AAA555 05.01.2019 5060 0
2 AAA555 30.01.2019 7800 0
3 AAA555 14.02.2019 9100 0
4 AAA555 24.02.2019 9900 0
5 AAA555 07.04.2019 101110 1
6 AAA555 12.04.2019 12500 0
7 AAA555 15.05.2019 13000 0
8 AAA555 09.06.2019 13422 0
9 BBB788 15.05.2018 15000 0
10 BBB788 04.06.2018 15200 0
11 BBB788 19.06.2018 16150 0
12 BBB788 16.07.2018 100 1
13 BBB788 27.08.2018 17500 0
14 BBB788 10.09.2018 17999 0
15 BBB788 13.10.2018 18200 0
16 BBB788 02.11.2018 18555 0
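Once the bad rows are flagged, the asker's idea of averaging the records above and below can be applied per car. The following is only a minimal sketch: the helper name fix_with_neighbors and the column MileageFixed are made up for illustration, the err values are typed in by hand to match the output above, and it assumes a flagged row is never the first or last record of a car and that two flagged rows never sit next to each other.

```r
# Sample data from the question, with the err flags shown in the output above
df <- data.frame(
  CarID   = rep(c("AAA555", "BBB788"), each = 8),
  Mileage = c(5060, 7800, 9100, 9900, 101110, 12500, 13000, 13422,
              15000, 15200, 16150, 100, 17500, 17999, 18200, 18555)
)
df$err <- c(0, 0, 0, 0, 1, 0, 0, 0,
            0, 0, 0, 1, 0, 0, 0, 0)

# Hypothetical helper: replace each flagged value with the mean of the
# readings directly before and after it.
# Assumes flagged rows are isolated and are never the first or last row.
fix_with_neighbors <- function(x, err) {
  i <- which(err == 1)
  x[i] <- (x[i - 1] + x[i + 1]) / 2
  x
}

# Apply the fix separately for each car so neighbors never cross car boundaries
df$MileageFixed <- df$Mileage
for (id in unique(df$CarID)) {
  rows <- df$CarID == id
  df$MileageFixed[rows] <- fix_with_neighbors(df$Mileage[rows], df$err[rows])
}
```

With the sample data, row 5 becomes 11200 (the mean of 9900 and 12500) and row 12 becomes 16825 (the mean of 16150 and 17500); all other rows are left unchanged.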
Here's a method showing how to identify outliers and then fill them in using approx. I start by looking for decreases in mileage - you can put whatever additional conditions you want to check in the if_else to identify outliers:
library(dplyr)
dd %>%
  group_by(CarID) %>%
  dplyr::mutate(
    # replace mistakes with NA
    MileageNA = if_else(Mileage < lag(Mileage, 1, default = 0), NA_integer_, Mileage),
    # fill in missing values with approx
    # approx is nicely robust in case you have multiple mistakes in a row
    # See the help page and the rule argument to control behavior
    # in case you have mistakes as the first or last observations
    MileageCorrected = approx(MileageNA, xout = 1:n())$y
  )
# # A tibble: 16 x 5
# # Groups: CarID [2]
# CarID FuelTransactionDate Mileage MileageNA MileageCorrected
# <chr> <chr> <int> <int> <dbl>
# 1 AAA555 05.01.2019 5060 5060 5060
# 2 AAA555 30.01.2019 7800 7800 7800
# 3 AAA555 14.02.2019 9100 9100 9100
# 4 AAA555 24.02.2019 9900 9900 9900
# 5 AAA555 07.04.2019 101110 101110 101110
# 6 AAA555 12.04.2019 12500 NA 57055
# 7 AAA555 15.05.2019 13000 13000 13000
# 8 AAA555 09.06.2019 13422 13422 13422
# 9 BBB788 15.05.2018 15000 15000 15000
# 10 BBB788 04.06.2018 15200 15200 15200
# 11 BBB788 19.06.2018 16150 16150 16150
# 12 BBB788 16.07.2018 100 NA 16825
# 13 BBB788 27.08.2018 17500 17500 17500
# 14 BBB788 10.09.2018 17999 17999 17999
# 15 BBB788 13.10.2018 18200 18200 18200
# 16 BBB788 02.11.2018 18555 18555 18555
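In the output above, the decrease-only check leaves the too-high reading in row 5 (101110) in place and blanks the correct reading in row 6 instead, so the interpolated value comes out as 57055. One possible additional condition, sketched here in base R, is to flag a reading only when its two neighbors are in order with each other but the reading itself falls outside the range they span. The function name flag_isolated is hypothetical, and the sketch assumes mistakes are isolated (never two in a row) and never the first or last record of a car.

```r
# Mileage for car AAA555 from the question; element 5 is the too-high mistake
mileage <- c(5060L, 7800L, 9100L, 9900L, 101110L, 12500L, 13000L, 13422L)

# Hypothetical helper: TRUE where the previous and next readings agree with
# each other (previous <= next) but the current reading lies outside them.
flag_isolated <- function(x) {
  n <- length(x)
  prev <- c(NA, x[-n])   # previous reading (NA for the first row)
  nxt  <- c(x[-1], NA)   # next reading (NA for the last row)
  bad <- prev <= nxt & (x < prev | x > nxt)
  # the first and last rows have only one neighbor, so they are never flagged
  bad[is.na(bad)] <- FALSE
  bad
}

bad <- flag_isolated(mileage)

# Blank the flagged readings, then interpolate linearly by row position
mileage_na <- replace(mileage, bad, NA)
mileage_fixed <- approx(seq_along(mileage_na), mileage_na,
                        xout = seq_along(mileage_na))$y
```

For this car only element 5 is flagged, and it is interpolated to 11200, midway between 9900 and 12500. The same helper flags the too-low value 100 in the BBB788 series. To use it in the pipeline above, the call could replace the condition inside if_else within the grouped mutate.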