Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to specify and clean/delete the rows?

Tags:

r

For example, here is the dataframe. I want to delete the rows if the age difference in each ProjectNumber is less than 5. Within ProjectNumber = 1, the age difference are 6(56-50) so rows in ProjectNumber 1 is maintained. In ProjectNumber2, the age difference are 2(37-35), so 3 rows in ProjectNumber 2 will be deleted. How can I do that? There are thousands of Project Number in real data set so can't really do it mannually. The key problem is how to specify these rows and delete them. Thank you.

ProjectNumber Age
1 50
1 52
1 53
1 55
1 56
2 35
2 36
2 37
3 40
3 41
3 42
3 43
3 45
3 46
like image 424
Couch Tomato Avatar asked Dec 10 '22 23:12

Couch Tomato


2 Answers

Base R variants:

  1. aggregate then merge. This version is good if you have more than one "key" (ProjectNumber here).

    keep <- subset(aggregate(Age ~ ProjectNumber, data = df1,
                             FUN = function(z) diff(range((z))) >= 5), Age)
    keep
    #   ProjectNumber  Age
    # 1             1 TRUE
    # 3             3 TRUE
    
    merge(df1, keep[,1,drop=FALSE], by = "ProjectNumber")
    #    ProjectNumber Age
    # 1              1  50
    # 2              1  52
    # 3              1  53
    # 4              1  55
    # 5              1  56
    # 9              3  40
    # 10             3  41
    # 11             3  42
    # 12             3  43
    # 13             3  45
    # 14             3  46
    
  2. aggregate then %in%:

    ### 'keep' from above
    subset(df1, ProjectNumber %in% keep$ProjectNumber)
    
  3. ave then simple subset:

    keep <- ave(df1$Age, df1$ProjectNumber, 
                FUN = function(z) diff(range(z)) >= 5) > 0
    keep
    #  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    df1[keep,]
    #    ProjectNumber Age
    # 1              1  50
    # 2              1  52
    # 3              1  53
    # 4              1  55
    # 5              1  56
    # 9              3  40
    # 10             3  41
    # 11             3  42
    # 12             3  43
    # 13             3  45
    # 14             3  46
    
like image 122
r2evans Avatar answered Dec 12 '22 12:12

r2evans


Akrun's answer is top. Here: cumbersome but leads to the goal:

df %>% 
  group_by(ProjectNumber) %>% 
  mutate(diff_Age = max(Age) - min(Age)) %>% 
  subset(!diff_Age < 5) %>% 
  select(-diff_Age)

Output:

   ProjectNumber   Age
           <int> <int>
 1             1    50
 2             1    52
 3             1    53
 4             1    55
 5             1    56
 6             3    40
 7             3    41
 8             3    42
 9             3    43
10             3    45
11             3    46
like image 32
TarJae Avatar answered Dec 12 '22 11:12

TarJae