
R: Fill in Implicit Missing Values and Groups to the Entire Time Span of the Data

Objective: Fill in implicit missing values for each ID and each group over the entire time span of the data, including groups that are entirely missing from the dataset.

Background: Each ID has several groups, and each group should have a value for each day. However, some groups are missing certain dates (such as group b of ID 1 on 2019-01-01), and some groups are entirely missing from the dataset (such as group b of ID 2).

# A tibble: 8 x 4
     ID group date       value
  <dbl> <chr> <chr>      <dbl>
1     1 a     2019-01-01     1
2     1 a     2019-01-02     3
3     1 a     2019-01-03     4
4     1 b     2019-01-02     4
5     1 b     2019-01-03     5
6     2 a     2019-01-01     8
7     2 a     2019-01-02     9
8     2 a     2019-01-03     1

Code to generate the dataframe

library(tibble)

df <- tibble(ID = c(1, 1, 1, 1, 1, 2, 2, 2),
             group = c('a', 'a', 'a', 'b', 'b', 'a', 'a', 'a'),
             date = c('2019-01-01', '2019-01-02','2019-01-03', '2019-01-02', '2019-01-03', '2019-01-01', '2019-01-02', '2019-01-03'),
             value = c(1, 3, 4, 4, 5, 8, 9, 1))

Attempt 1:

library(dplyr)
library(tsibble)
df %>% 
  # the index must be a Date, not a character string
  mutate(date = as.Date(date)) %>%
  # tsibble format
  as_tsibble(key = c(ID, group), index = date) %>%
  # group by
  group_by(ID, group) %>%
  # fill gaps
  fill_gaps(.full = TRUE)

Attempt 2:

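library(tidyverse)
# full_seq() needs a Date (or numeric) vector, so convert the character dates first
df_dates <- mutate(df, date = as.Date(date))
complete(df_dates, expand(df_dates, nesting(ID, group), date = full_seq(date, 1)))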

Both yield the same result. Note that group b of ID 2 is still missing:

# A tsibble: 9 x 4 [1D]
# Key:       ID, group [3]
# Groups:    ID, group [3]
     ID group date       value
  <dbl> <chr> <date>     <dbl>
1     1 a     2019-01-01     1
2     1 a     2019-01-02     3
3     1 a     2019-01-03     4
4     1 b     2019-01-01    NA
5     1 b     2019-01-02     4
6     1 b     2019-01-03     5
7     2 a     2019-01-01     8
8     2 a     2019-01-02     9
9     2 a     2019-01-03     1

I expect the output to be

# A tibble: 12 x 4
      ID group date       value
   <dbl> <chr> <chr>      <dbl>
 1     1 a     2019-01-01     1
 2     1 a     2019-01-02     3
 3     1 a     2019-01-03     4
 4     1 b     2019-01-01    NA
 5     1 b     2019-01-02     4
 6     1 b     2019-01-03     5
 7     2 a     2019-01-01     8
 8     2 a     2019-01-02     9
 9     2 a     2019-01-03     1
10     2 b     2019-01-01    NA
11     2 b     2019-01-02    NA
12     2 b     2019-01-03    NA

where group b of ID 2, which is entirely absent from the dataset, appears in the result.

Note that my dataset has eight million rows and keeps growing every day. The code will run daily, so I am looking for a fast and convenient way to perform the task. But any idea or answer is welcome!

June asked Feb 01 '26 03:02


2 Answers

Actually, your second attempt is very close to what you expect; you just need to replace nesting() with crossing(). nesting() only uses the combinations that appear in the data, whereas crossing() generates all possible combinations.

library(tidyr)
df <- tibble(ID = c(1, 1, 1, 1, 1, 2, 2, 2),
             group = c('a', 'a', 'a', 'b', 'b', 'a', 'a', 'a'),
             date = as.Date(c('2019-01-01', '2019-01-02','2019-01-03', '2019-01-02', '2019-01-03', '2019-01-01', '2019-01-02', '2019-01-03')),
             value = c(1, 3, 4, 4, 5, 8, 9, 1))
complete(df, expand(df, crossing(ID, group), date = full_seq(date, 1)))
#> # A tibble: 12 x 4
#>       ID group date       value
#>    <dbl> <chr> <date>     <dbl>
#>  1     1 a     2019-01-01     1
#>  2     1 a     2019-01-02     3
#>  3     1 a     2019-01-03     4
#>  4     1 b     2019-01-01    NA
#>  5     1 b     2019-01-02     4
#>  6     1 b     2019-01-03     5
#>  7     2 a     2019-01-01     8
#>  8     2 a     2019-01-02     9
#>  9     2 a     2019-01-03     1
#> 10     2 b     2019-01-01    NA
#> 11     2 b     2019-01-02    NA
#> 12     2 b     2019-01-03    NA

Created on 2019-10-13 by the reprex package (v0.3.0)
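
To make the difference concrete, here is a minimal sketch comparing the two on the same data. It relies on the fact that bare column names passed to complete() are crossed with each other, so crossing() does not have to be spelled out; the object names nested and crossed are only illustrative.

```r
library(tidyr)
library(tibble)

df <- tibble(
  ID    = c(1, 1, 1, 1, 1, 2, 2, 2),
  group = c("a", "a", "a", "b", "b", "a", "a", "a"),
  date  = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03",
                    "2019-01-02", "2019-01-03",
                    "2019-01-01", "2019-01-02", "2019-01-03")),
  value = c(1, 3, 4, 4, 5, 8, 9, 1)
)

# nesting() keeps only the ID/group pairs present in the data (3 pairs)
nested <- complete(df, nesting(ID, group), date = full_seq(date, 1))

# Bare column names are crossed: every ID is paired with every group (4 pairs)
crossed <- complete(df, ID, group, date = full_seq(date, 1))

nrow(nested)   # 9  (3 pairs x 3 days)
nrow(crossed)  # 12 (4 pairs x 3 days)
```

Note that crossing grows multiplicatively with the number of IDs and groups, which matters for a table of several million rows.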

Earo Wang answered Feb 02 '26 15:02


You could create a dataframe containing the Cartesian product of the unique values of ID, group and date, then merge that with the original dataframe to obtain the desired result.

zz <- merge(merge(unique(df$ID), unique(df$group), by = NULL, all = TRUE), 
            unique(df$date), by = NULL, all = TRUE)
merge(df, zz, by.x = c('ID','group', 'date'), by.y = c('x','y.x', 'y.y'), all = TRUE)

It's expensive, of course, if you have a large number of components and dates. In that case you could first merge the full set of ID/group combinations against your original dataframe to obtain just the missing combinations, then merge those against the set of dates, again picking up only the missing rows, if you catch my drift.
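
One possible reading of that staged idea, sketched in base R. This is an interpretation rather than a tested recipe for large data: the helper columns seen and present and the intermediate names (all_pairs, missing_pairs, missing_dates, filled) are made up for illustration, and date is assumed to already be a Date.

```r
df <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 2, 2, 2),
  group = c("a", "a", "a", "b", "b", "a", "a", "a"),
  date  = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03",
                    "2019-01-02", "2019-01-03",
                    "2019-01-01", "2019-01-02", "2019-01-03")),
  value = c(1, 3, 4, 4, 5, 8, 9, 1),
  stringsAsFactors = FALSE
)

# Every day in the span covered by the data
all_dates <- data.frame(date = seq(min(df$date), max(df$date), by = "day"))

# Step 1: ID/group pairs that never occur in the data
all_pairs <- merge(data.frame(ID = unique(df$ID)),
                   data.frame(group = unique(df$group)),
                   by = NULL)                      # by = NULL gives a cross join
seen_pairs <- unique(df[c("ID", "group")])
seen_pairs$seen <- TRUE                            # marker column (hypothetical name)
flagged_pairs <- merge(all_pairs, seen_pairs, all.x = TRUE)
missing_pairs <- flagged_pairs[is.na(flagged_pairs$seen), c("ID", "group")]

# Step 2: dates missing within the pairs that do occur
seen_grid <- merge(seen_pairs[c("ID", "group")], all_dates, by = NULL)
present <- df[c("ID", "group", "date")]
present$present <- TRUE                            # marker column (hypothetical name)
flagged_dates <- merge(seen_grid, present, all.x = TRUE)
missing_dates <- flagged_dates[is.na(flagged_dates$present),
                               c("ID", "group", "date")]

# Step 3: build only the missing rows and append them to the original data
new_rows <- rbind(merge(missing_pairs, all_dates, by = NULL), missing_dates)
new_rows$value <- NA_real_
filled <- rbind(df, new_rows)
filled <- filled[order(filled$ID, filled$group, filled$date), ]
```

On the example data this appends four rows: three for group b of ID 2 and one for group b of ID 1 on 2019-01-01.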

user2474226 answered Feb 02 '26 17:02


