I have a data frame ordered by an id variable ("city"), and I want to keep the second observation for cities that have more than one observation (and the single observation for cities that have only one).
For example, here's a sample data set:
city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)
Then we have:
   city value
1     1     3
2     1     5
3     2     7
4     3     8
5     3     2
6     4     5
7     5     4
8     6     2
9     7     3
10    7     2
11    8     3
The ideal outcome would be:
   city value
2     1     5
3     2     7
5     3     2
6     4     5
7     5     4
8     6     2
10    7     2
11    8     3
Any help is appreciated!
library(dplyr)
mydata %>%
  group_by(city) %>%
  filter(n() == 1L | row_number() == 2L) %>%
  ungroup()
# # A tibble: 8 x 2
#    city value
#   <dbl> <dbl>
# 1     1     5
# 2     2     7
# 3     3     2
# 4     4     5
# 5     5     4
# 6     6     2
# 7     7     2
# 8     8     3
Or, slightly differently:
mydata %>%
  group_by(city) %>%
  slice(min(n(), 2)) %>%
  ungroup()
A base R alternative builds a logical index with ave():
ind <- ave(rep(TRUE, nrow(mydata)), mydata$city,
           FUN = function(z) length(z) == 1L | seq_along(z) == 2L)
ind
#  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
mydata[ind,]
#    city value
# 2     1     5
# 3     2     7
# 5     3     2
# 6     4     5
# 7     5     4
# 8     6     2
# 10    7     2
# 11    8     3
Since you mentioned "is way bigger", you might consider data.table at some point for its speed and referential semantics. (And it doesn't hurt that this code is much more terse :-)
library(data.table)
DT <- as.data.table(mydata) # normally one might use setDT(mydata) instead ...
DT[, .SD[min(.N, 2),], by = city]
#     city value
#    <num> <num>
# 1:     1     5
# 2:     2     7
# 3:     3     2
# 4:     4     5
# 5:     5     4
# 6:     6     2
# 7:     7     2
# 8:     8     3
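A side note not in the original answer: if `.SD[...]` turns out to be slow on a very large table, a commonly used data.table idiom is to compute the row numbers with `.I` first and subset once. A sketch, assuming the same `DT` as above:

```r
library(data.table)
DT <- data.table(city  = c(1,1,2,3,3,4,5,6,7,7,8),
                 value = c(3,5,7,8,2,5,4,2,3,2,3))
# .I holds the row numbers within DT; take the 2nd row per city,
# or the 1st when the group has only one row
idx <- DT[, .I[min(.N, 2)], by = city]$V1
DT[idx]
```

This avoids materializing `.SD` for every group and returns the same eight rows.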
Here is logic that uses pmin() to choose either the second or the first element, depending on the length of each vector of values:
aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x))] )
  city value
1    1     5
2    2     7
3    3     2
4    4     5
5    5     4
6    6     2
7    7     2
8    8     3
The aggregate() function splits value into per-city vectors and applies the supplied function to each one.
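To see exactly what that anonymous function receives, you can reproduce the splitting step by hand with base R's split():

```r
city  <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
# split() returns the per-city vectors that aggregate() passes to FUN
split(value, city)
# $`1` is c(3, 5), $`2` is 7, $`3` is c(8, 2), and so on
```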
You may try
library(dplyr)
mydata %>%
  group_by(city) %>%
  filter(case_when(n() > 1 ~ row_number() == 2,
                   TRUE ~ row_number() == 1))
   city value
  <dbl> <dbl>
1     1     5
2     2     7
3     3     2
4     4     5
5     5     4
6     6     2
7     7     2
8     8     3
Another dplyr solution:
mydata %>% group_by(city) %>%    
  summarize(value=value[pmin(2, n())])
Or:
mydata %>% group_by(city) %>%
  summarize(value=ifelse(n() >= 2, value[2], value[1]))
Both output:
   city value
  <dbl> <dbl>
1     1     5
2     2     7
3     3     2
4     4     5
5     5     4
6     6     2
7     7     2
8     8     3
If base R is OK, try this.
EDIT (since performance really seems to be important): using `if` as a function instead of ifelse() should give up to a 100-fold speed-up in some cases.
aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]),x[2],x[1]) )
  city value
1    1     5
2    2     7
3    3     2
4    4     5
5    5     4
6    6     2
7    7     2
8    8     3
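A likely reason for the gap (my reading, not stated in the original answer): ifelse() is vectorized, so it tests, allocates, and fills result vectors even for a length-one condition, whereas `if`() evaluates a single condition and returns one branch as-is. A minimal illustration:

```r
x <- c(3, 5)
# ifelse() builds its result via vectorized machinery
ifelse(!is.na(x[2]), x[2], x[1])   # 5, with vectorization overhead
# `if`() is a plain scalar branch: test once, return one value
`if`(!is.na(x[2]), x[2], x[1])     # 5, no extra allocation
```

For scalar conditions inside a per-group function, the scalar form is the better fit.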
Here're some benchmarks because I was curious. I gathered all solutions and let them run through microbenchmark.
Bottom line: `if`(cond, T, F) is fastest (22.3% faster than ifelse and 17 times faster than the slowest), followed by ifelse and aggregate(pmin). Keep in mind that the data.table solution ran on only one core here, so all of that package's usual speed-up from parallelization is absent. No real shocker, but interesting nonetheless.
library(microbenchmark)
lengths( mydata )
 city value 
20000 20000
c( class(mydata$city), class(mydata$value) )
[1] "integer" "integer"
microbenchmark(
  "aggr_if_function" = { res <- aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]), x[2], x[1]) ) },
  "aggr_ifelse" = { res <- aggregate( value ~ city, mydata, function(x) ifelse(!is.na(x[2]), x[2], x[1]) ) },
  "dplyr_filter" = { res <- mydata %>% group_by(city) %>% filter(n() == 1L | row_number() == 2L) %>% ungroup() },
  "dplyr_slice" = { res <- mydata %>% group_by(city) %>% slice(min(n(), 2)) %>% ungroup() },
  "data.table_single_core" = { res <- DT[, .SD[min(.N, 2),], by = city] },
  "aggr_pmin" = { res <- aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x)) ] ) },
  "dplyr_filter_case_when" = { res <- mydata %>% group_by(city) %>% filter(case_when(n() > 1 ~ row_number() == 2, TRUE ~ row_number() == 1)) },
  "group_split_purrr" = { res <- group_split(mydata, city) %>% map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows() },
  times = 50)
Unit: milliseconds
                   expr       min        lq      mean    median        uq       max neval cld
       aggr_if_function  175.5104  179.3273  184.5157  182.1778  186.8963  212.6006    50 a
            aggr_ifelse  214.5846  220.7074  229.2062  228.0688  234.1087  253.0433    50 a
           dplyr_filter  585.5275  607.7011  643.6320  632.0794  660.8184 1066.6018    50 c
            dplyr_slice  713.4047  762.9887  792.7491  780.8475  803.7191 1304.4045    50 d
 data.table_single_core 2080.3869 2164.3829 2240.8578 2229.5310 2298.9002 2702.4201    50 f
              aggr_pmin  321.5265  330.5491  343.2752  341.7866  352.2880  457.3435    50 b
 dplyr_filter_case_when 3171.4859 3337.1669 3492.6915 3500.7783 3608.1809 4195.0774    50 g
      group_split_purrr 1466.4527 1543.2597 1590.9994 1588.0186 1630.5590 1786.5310    50 e
Combining group_split and map_if:
library(tidyverse)
city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)
group_split(mydata, city) %>% 
    map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows()
#> # A tibble: 8 × 2
#>    city value
#>   <dbl> <dbl>
#> 1     1     5
#> 2     2     7
#> 3     3     2
#> 4     4     5
#> 5     5     4
#> 6     6     2
#> 7     7     2
#> 8     8     3
Created on 2021-11-30 by the reprex package (v2.0.1)