For example, here is the dataframe. I want to delete the rows if the age difference in each ProjectNumber is less than 5. Within ProjectNumber = 1, the age difference are 6(56-50) so rows in ProjectNumber 1 is maintained. In ProjectNumber2, the age difference are 2(37-35), so 3 rows in ProjectNumber 2 will be deleted. How can I do that? There are thousands of Project Number in real data set so can't really do it mannually. The key problem is how to specify these rows and delete them. Thank you.
| ProjectNumber | Age |
|---|---|
| 1 | 50 |
| 1 | 52 |
| 1 | 53 |
| 1 | 55 |
| 1 | 56 |
| 2 | 35 |
| 2 | 36 |
| 2 | 37 |
| 3 | 40 |
| 3 | 41 |
| 3 | 42 |
| 3 | 43 |
| 3 | 45 |
| 3 | 46 |
Base R variants:
aggregate then merge. This version is good if you have more than one "key" (ProjectNumber here).
keep <- subset(aggregate(Age ~ ProjectNumber, data = df1,
FUN = function(z) diff(range((z))) >= 5), Age)
keep
# ProjectNumber Age
# 1 1 TRUE
# 3 3 TRUE
merge(df1, keep[,1,drop=FALSE], by = "ProjectNumber")
# ProjectNumber Age
# 1 1 50
# 2 1 52
# 3 1 53
# 4 1 55
# 5 1 56
# 9 3 40
# 10 3 41
# 11 3 42
# 12 3 43
# 13 3 45
# 14 3 46
aggregate then %in%:
### 'keep' from above
subset(df1, ProjectNumber %in% keep$ProjectNumber)
ave then simple subset:
keep <- ave(df1$Age, df1$ProjectNumber,
FUN = function(z) diff(range(z)) >= 5) > 0
keep
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
df1[keep,]
# ProjectNumber Age
# 1 1 50
# 2 1 52
# 3 1 53
# 4 1 55
# 5 1 56
# 9 3 40
# 10 3 41
# 11 3 42
# 12 3 43
# 13 3 45
# 14 3 46
Akrun's answer is top. Here: cumbersome but leads to the goal:
df %>%
group_by(ProjectNumber) %>%
mutate(diff_Age = max(Age) - min(Age)) %>%
subset(!diff_Age < 5) %>%
select(-diff_Age)
Output:
ProjectNumber Age
<int> <int>
1 1 50
2 1 52
3 1 53
4 1 55
5 1 56
6 3 40
7 3 41
8 3 42
9 3 43
10 3 45
11 3 46
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With