For example, here is the dataframe. I want to delete the rows if the age difference in each ProjectNumber is less than 5. Within ProjectNumber = 1, the age difference are 6(56-50) so rows in ProjectNumber 1 is maintained. In ProjectNumber2, the age difference are 2(37-35), so 3 rows in ProjectNumber 2 will be deleted. How can I do that? There are thousands of Project Number in real data set so can't really do it mannually. The key problem is how to specify these rows and delete them. Thank you.
ProjectNumber | Age |
---|---|
1 | 50 |
1 | 52 |
1 | 53 |
1 | 55 |
1 | 56 |
2 | 35 |
2 | 36 |
2 | 37 |
3 | 40 |
3 | 41 |
3 | 42 |
3 | 43 |
3 | 45 |
3 | 46 |
Base R variants:
aggregate
then merge
. This version is good if you have more than one "key" (ProjectNumber
here).
keep <- subset(aggregate(Age ~ ProjectNumber, data = df1,
FUN = function(z) diff(range((z))) >= 5), Age)
keep
# ProjectNumber Age
# 1 1 TRUE
# 3 3 TRUE
merge(df1, keep[,1,drop=FALSE], by = "ProjectNumber")
# ProjectNumber Age
# 1 1 50
# 2 1 52
# 3 1 53
# 4 1 55
# 5 1 56
# 9 3 40
# 10 3 41
# 11 3 42
# 12 3 43
# 13 3 45
# 14 3 46
aggregate
then %in%
:
### 'keep' from above
subset(df1, ProjectNumber %in% keep$ProjectNumber)
ave
then simple subset:
keep <- ave(df1$Age, df1$ProjectNumber,
FUN = function(z) diff(range(z)) >= 5) > 0
keep
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
df1[keep,]
# ProjectNumber Age
# 1 1 50
# 2 1 52
# 3 1 53
# 4 1 55
# 5 1 56
# 9 3 40
# 10 3 41
# 11 3 42
# 12 3 43
# 13 3 45
# 14 3 46
Akrun's answer is top. Here: cumbersome but leads to the goal:
df %>%
group_by(ProjectNumber) %>%
mutate(diff_Age = max(Age) - min(Age)) %>%
subset(!diff_Age < 5) %>%
select(-diff_Age)
Output:
ProjectNumber Age
<int> <int>
1 1 50
2 1 52
3 1 53
4 1 55
5 1 56
6 3 40
7 3 41
8 3 42
9 3 43
10 3 45
11 3 46
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With