R data cleaning in sequence

Question

I have data like the following, but some of the mileage readings are wrong. Mileage should only increase, yet occasionally a value is far too low or far too high. Is it possible to clean this data in R? For the bad values I can use the average of the records above and below, but how do I detect the errors in the sequence?

CarID   FuelTransactionDate Mileage  
AAA555  05.01.2019      5060     
AAA555  30.01.2019      7800     
AAA555  14.02.2019      9100     
AAA555  24.02.2019      9900     
AAA555  07.04.2019      101110  <- mistake
AAA555  12.04.2019      12500    
AAA555  15.05.2019      13000    
AAA555  09.06.2019      13422    
BBB788  15.05.2018      15000    
BBB788  04.06.2018      15200    
BBB788  19.06.2018      16150    
BBB788  16.07.2018      100    <- mistake
BBB788  27.08.2018      17500    
BBB788  10.09.2018      17999    
BBB788  13.10.2018      18200    
BBB788  02.11.2018      18555
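
To make the answers below reproducible, the sample table can be rebuilt as a data frame. This is only a sketch: the name df follows the first answer (the second answer calls the same data dd), and storing Mileage as integer is an assumption based on the <int> column type shown in the second answer's output.

```r
# Sample data from the question, rebuilt as a data frame.
df <- data.frame(
  CarID = rep(c("AAA555", "BBB788"), each = 8),
  FuelTransactionDate = c(
    "05.01.2019", "30.01.2019", "14.02.2019", "24.02.2019",
    "07.04.2019", "12.04.2019", "15.05.2019", "09.06.2019",
    "15.05.2018", "04.06.2018", "19.06.2018", "16.07.2018",
    "27.08.2018", "10.09.2018", "13.10.2018", "02.11.2018"
  ),
  Mileage = c(
    5060L, 7800L, 9100L, 9900L, 101110L, 12500L, 13000L, 13422L,
    15000L, 15200L, 16150L, 100L, 17500L, 17999L, 18200L, 18555L
  ),
  stringsAsFactors = FALSE
)

# The sequence checks only make sense if rows are in date order within
# each car, so sort on the parsed day.month.year strings first.
parsed <- as.Date(df$FuelTransactionDate, format = "%d.%m.%Y")
df <- df[order(df$CarID, parsed), ]

# Alias used by the dplyr answer.
dd <- df
```

The sample is already sorted, so the ordering step leaves it unchanged here; it matters for real data where transactions may arrive out of order.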

ThomasIsCoding · Accepted Answer

If you want to identify where the mistakes occur, here is one option using ave + cummax + cummin in base R:

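within(
  df,
  err <- ave(
    Mileage,
    CarID,
    FUN = function(x) {
      # Forward running max equals backward running max only at the car's
      # largest reading; the last record is excluded, since the max belongs there.
      high <- replace(cummax(x) == rev(cummax(rev(x))), length(x), 0)
      # Same idea with the running min: marks the car's smallest reading,
      # excluding the first record, where the min belongs.
      low <- replace(cummin(x) == rev(cummin(rev(x))), 1, 0)
      high + low
    }
  )
)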

which gives

    CarID FuelTransactionDate Mileage err
1  AAA555          05.01.2019    5060   0
2  AAA555          30.01.2019    7800   0
3  AAA555          14.02.2019    9100   0
4  AAA555          24.02.2019    9900   0
5  AAA555          07.04.2019  101110   1
6  AAA555          12.04.2019   12500   0
7  AAA555          15.05.2019   13000   0
8  AAA555          09.06.2019   13422   0
9  BBB788          15.05.2018   15000   0
10 BBB788          04.06.2018   15200   0
11 BBB788          19.06.2018   16150   0
12 BBB788          16.07.2018     100   1
13 BBB788          27.08.2018   17500   0
14 BBB788          10.09.2018   17999   0
15 BBB788          13.10.2018   18200   0
16 BBB788          02.11.2018   18555   0
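
Once the bad rows are flagged, the question's idea of using the average of the records above and below could be applied along these lines. This is a minimal sketch, not part of the answer: fill_flagged is a hypothetical helper, and the mileage and err vectors are copied from the BBB788 rows of the output above.

```r
# Mileage and err for car BBB788, copied from the output above.
mileage <- c(15000, 15200, 16150, 100, 17500, 17999, 18200, 18555)
err     <- c(0, 0, 0, 1, 0, 0, 0, 0)

# Hypothetical helper: replace each flagged reading with the mean of the
# nearest unflagged readings before and after it.
fill_flagged <- function(x, err) {
  good <- which(err == 0)
  for (i in which(err != 0)) {
    before <- good[good < i]
    after  <- good[good > i]
    # Only fill when there is a trusted reading on both sides;
    # otherwise leave the value missing rather than guess.
    x[i] <- if (length(before) > 0 && length(after) > 0) {
      mean(c(x[max(before)], x[min(after)]))
    } else {
      NA
    }
  }
  x
}

cleaned <- fill_flagged(mileage, err)
```

Here the 100 reading becomes the mean of 16150 and 17500. For the full data frame the helper would need to be applied per CarID, for example by splitting on that column first.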

Gregor Thomas · Answer

Here's a method showing how to identify outliers and then fill them in using approx. I start by looking for decreases in mileage - you can put whatever additional conditions you want to check in the if_else to identify outliers:

library(dplyr)

dd %>%
  group_by(CarID) %>%
  dplyr::mutate(
    # replace mistakes with NA
    MileageNA = if_else(Mileage < lag(Mileage, 1, default = 0), NA_integer_, Mileage),
    # fill in missing values with approx
    # approx is nicely robust in case you have multiple mistakes in a row
    #   See the help page and the rule argument to control behavior
    #   in case you have mistakes as the first or last observations
    MileageCorrected = approx(MileageNA, xout = 1:n())$y
  )
# # A tibble: 16 x 5
# # Groups:   CarID [2]
#    CarID  FuelTransactionDate Mileage MileageNA MileageCorrected
#    <chr>  <chr>                 <int>     <int>            <dbl>
#  1 AAA555 05.01.2019             5060      5060             5060
#  2 AAA555 30.01.2019             7800      7800             7800
#  3 AAA555 14.02.2019             9100      9100             9100
#  4 AAA555 24.02.2019             9900      9900             9900
#  5 AAA555 07.04.2019           101110    101110           101110
#  6 AAA555 12.04.2019            12500        NA            57055
#  7 AAA555 15.05.2019            13000     13000            13000
#  8 AAA555 09.06.2019            13422     13422            13422
#  9 BBB788 15.05.2018            15000     15000            15000
# 10 BBB788 04.06.2018            15200     15200            15200
# 11 BBB788 19.06.2018            16150     16150            16150
# 12 BBB788 16.07.2018              100        NA            16825
# 13 BBB788 27.08.2018            17500     17500            17500
# 14 BBB788 10.09.2018            17999     17999            17999
# 15 BBB788 13.10.2018            18200     18200            18200
# 16 BBB788 02.11.2018            18555     18555            18555
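
Note that in the output above the too-high reading 101110 (row 5) is kept, and the valid reading after it (12500, row 6) is the one replaced, because a decrease check flags the record that follows a spike. One possible variant, sketched here in base R, compares each reading with both neighbours instead. It assumes errors are isolated (never two in a row) and that the first and last readings are correct; flag_isolated and fix_isolated are hypothetical helper names, not from the answer.

```r
# Mileage for car AAA555, copied from the question (row 5 is the typo).
mileage <- c(5060, 7800, 9100, 9900, 101110, 12500, 13000, 13422)

# Hypothetical helper: TRUE where a reading falls outside the range spanned
# by its two neighbours, provided the neighbours are themselves in order.
flag_isolated <- function(x) {
  n <- length(x)
  prev <- c(NA, x[-n])   # previous reading (NA for the first record)
  nxt  <- c(x[-1], NA)   # next reading (NA for the last record)
  bad <- prev <= nxt & (x < prev | x > nxt)
  bad[is.na(bad)] <- FALSE  # first and last records are never flagged here
  bad
}

# Hypothetical helper: blank out flagged readings, then interpolate
# linearly by row position with approx, as in the answer above.
fix_isolated <- function(x) {
  y <- replace(x, flag_isolated(x), NA)
  approx(seq_along(y), y, xout = seq_along(y))$y
}

flagged <- flag_isolated(mileage)
fixed   <- fix_isolated(mileage)
```

With this check, row 5 is flagged and row 6 is left alone, since row 6's neighbours (101110 and 13000) are not in increasing order. To apply it per car, something like ave(Mileage, CarID, FUN = fix_isolated) should work. Interpolating by row position ignores the uneven gaps between dates, so interpolating against the parsed transaction date may be a better fit for real data.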

Tags: r, data-cleaning

NatR
