Clean R data frame so that in a column no row value is bigger than 2 times next row value

Question

I have a data frame exemplified by the following

dist <- c(1.1,1.0,10.0,5.0,2.1,12.2,3.3,3.4)
id <- rep("A",length(dist))
df<-cbind.data.frame(id,dist)

df

  id dist
1  A  1.1
2  A  1.0
3  A 10.0
4  A  5.0
5  A  2.1
6  A 12.2
7  A  3.3
8  A  3.4

I need to clean it up so no row values in the dist column is bigger than 2 times the next row value at any time. A cleaned up data frame would look like this:

  id dist
1  A  1.1
2  A  1.0
5  A  2.1
7  A  3.3
8  A  3.4

I have tried making a function with a for loop and if statement to clean it

cleaner <-  function (df,dist,times_larger) {

              for (i in 1:(nrow(df)-1)) {

                  if (df$dist[i] > df$dist[i+1]*times_larger){
                    df<-df[-i,]
                    break       
                  }
              }
              df
            }

Obviously if I dont break the loop it will create an error because the number of rows in df will change in the process. If I manually run the loop on df several times:

df<-cleaner(df,"dist",2)

it will clean up as I want.

I have also tried different function constructions and applying it to the data frame with apply, but without any luck.

Do any have a good suggestion of either how to repeat the function on the data frame until it does not change anymore, a better function structure or maybe a better way of cleaning?

Any suggestions are most appreciated

Marat Talipov · Accepted Answer

You can shift your dist column one element left, multiply it by two, and compare with the original dist:

subset(df,dist < c(2*dist[-1],Inf))
#  id dist
#1  A  1.1
#2  A  1.0
#5  A  2.1
#7  A  3.3
#8  A  3.4

akrun · Answer

You could try lead from dplyr

library(dplyr) #dplyr_0.4.0
filter(df, dist < 2 * lead(dist, default = Inf)) 
#    id dist
#1  A  1.1
#2  A  1.0
#3  A  2.1
#4  A  3.3
#5  A  3.4

Or using the similar method in data.table. A new function shift is introduced in the devel version of data.table. We can specify the type to lead. By default, it is lag and fill is NA. Modify the fill to 'Inf' (inspired from @Marat Talipov's post).

library(data.table) #data.table_1.9.5
setDT(df)[dist <2 *shift(dist,type='lead', fill=Inf)]
#   id dist
#1:  A  1.1
#2:  A  1.0
#3:  A  2.1
#4:  A  3.3
#5:  A  3.4

Update

If the value of 'dist' is equal to '2' times the next value, the above solutions removes that row. In such cases,

setDT(df)[dist <2 *(shift(dist,type='lead',
             fill=Inf)+.Machine$double.eps)]
#    id dist
#1:  A  1.1
#2:  A  1.0
#3:  A  2.1
#4:  A  3.3
#5:  A  3.4

Using a different example as commented by @Henrik.

df1 <- data.frame(dist= as.numeric(3:1))
setDT(df1)[dist <2 *(shift(dist,type='lead', 
            fill=Inf)+.Machine$double.eps)]
#    dist
#1:    3
#2:    2
#3:    1

Benchmarks

set.seed(49)
df <- data.frame(id='A', dist=rnorm(1e7,20))
df1 <- copy(df)
akrun1 <- function() {filter(df, dist < 2 * lead(dist,
                                 default = Inf)) }
akrun2 <- function() {setDT(df1)[dist <2 *shift(dist,type='lead',
                                     fill=Inf)]}
marat <- function() {subset(df,dist < c(2*dist[-1],Inf))}
Colonel <- function() {df[with(df, dist<2*c(dist[-1], tail(dist,1))),]}

library(microbenchmark)
microbenchmark(akrun1(), akrun2(), marat(), Colonel(), 
                                unit='relative', times=20L)
#Unit: relative
#    expr      min       lq     mean   median       uq      max neval  cld
# akrun1() 2.029087 1.990739 1.864697 1.965247 1.773722 1.727474    20  b  
# akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20  a   
# marat() 8.032147 8.137982 7.359821 7.937062 7.134686 5.837623     20  d
#Colonel() 7.094465 7.045000 6.473552 6.903460 6.197737 5.359575    20  c

Clean R data frame so that in a column no row value is bigger than 2 times next row value

Tags:

dataframe

r

data-manipulation

data-cleaning

Kristian

2 Answers

Marat Talipov

Update

Benchmarks

akrun

Recent Activity

Donate For Us

Clean R data frame so that in a column no row value is bigger than 2 times next row value

Tags:

dataframe

r

data-manipulation

data-cleaning

Kristian

2 Answers

Marat Talipov

Update

Benchmarks

akrun

Related questions

Recent Activity

Donate For Us