
create dataframe of rows of sequence of years from rows with start/end dates

I'm a relatively new user of R and of coding in general. I have searched and have not been able to solve this. I have the following data:

groupid  start.date   end.date    Status
1        2014-01-01   2017-01-01  A
1        2018-01-01   2020-01-01  D
2        2014-01-01   2017-01-01  B

How do I generate a dataframe where every observation is a year instead of a combination of groupid and a time period?

The output I'm looking for is:

groupid  year   status
1        2014  A
1        2015  A
1        2016  A
1        2017  A
1        2018  D
1        2019  D
1        2020  D
2        2014  B
2        2015  B
2        2016  B
2        2017  B

I've tried a number of approaches, but I think my best attempt was:

df <- df %>% 
  group_by(rn = row_number()) %>% 
  mutate(d = list(seq(start.date, end.date, by = '1 year'))) %>%
  unnest()

But I got

Error: Each column must either be a list of vectors or a list of data frames 
[d]

Searching for the error hasn't brought me any closer to figuring out what is wrong. Both start.date and end.date are stored as dates. In case it matters, they were generated from two columns containing just four-digit year numbers, so I applied the following code to convert them to a usable date format:

df$start.date <- as.Date(ISOdate(df$from, 1, 1)) 
df$end.date <- as.Date(ISOdate(df$to, 1, 1))  

Ella Wind asked Sep 06 '25 14:09


2 Answers

The answer by camille implicitly assumes that each combination of groupid and Status occurs only once. This cannot be guaranteed, and the OP herself chose to group by row number to be on the safe side.
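Whether that assumption holds for a given data set can be checked before choosing the grouping. The sketch below uses base R's anyDuplicated(); the data frame periods and its fourth row are hypothetical, added only to show what a repeated combination looks like.

```r
# Hypothetical data: the fourth row repeats the combination groupid = 1, Status = "A".
periods <- data.frame(
  groupid = c(1L, 1L, 2L, 1L),
  from    = c(2014L, 2018L, 2014L, 2021L),
  to      = c(2017L, 2020L, 2017L, 2022L),
  Status  = c("A", "D", "B", "A")
)

# anyDuplicated() returns 0 if every groupid/Status combination is unique,
# otherwise the index of the first row that repeats an earlier combination.
first_repeat <- anyDuplicated(periods[c("groupid", "Status")])
first_repeat
```

A non-zero result means that grouping by groupid and Status would put more than one row into a group, so grouping by row number is the safer choice.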

Grouping by row is required because seq() and the colon operator : expect a single start value and a single end value, so each row's sequence has to be generated in its own call.
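As a small illustration of the one-call-per-row idea outside of any data frame, the sketch below uses base R's Map() to call seq() once for each pair of endpoints. The vectors from and to are made-up stand-ins for the year columns.

```r
# Made-up endpoints, one pair per row of the example data.
from <- c(2014, 2018, 2014)
to   <- c(2017, 2020, 2017)

# Map() calls seq() separately for each (from, to) pair and returns a list
# with one sequence of years per pair.
years <- Map(seq, from, to)

# Number of years generated for each pair.
lengths(years)
```

Grouping by row number in the pipelines below plays the same role as Map() here: every call sees exactly one start and one end value.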

dplyr/tidyr solution

This approach groups by row number and extracts the year before creating the sequences. df1 is the tibble as given by the OP (see Data section below).

library(dplyr)
library(tidyr)
library(lubridate)
df1 %>% 
  group_by(rn = row_number()) %>% 
  mutate(year = list(year(start.date):year(end.date))) %>% 
  unnest(year) %>% 
  ungroup() %>% 
  select(groupid, year, Status)
# A tibble: 11 x 3
   groupid  year Status
     <int> <int> <chr> 
 1       1  2014 A     
 2       1  2015 A     
 3       1  2016 A     
 4       1  2017 A     
 5       1  2018 D     
 6       1  2019 D     
 7       1  2020 D     
 8       2  2014 B     
 9       2  2015 B     
10       2  2016 B     
11       2  2017 B

data.table approach

data.table achieves the same result with more concise code:

library(data.table)
setDT(df1)[, .(groupid, year = year(start.date):year(end.date), Status), 
  by = .(rn = 1:nrow(df1))][
    , rn := NULL][] 
    groupid year Status
 1:       1 2014      A
 2:       1 2015      A
 3:       1 2016      A
 4:       1 2017      A
 5:       1 2018      D
 6:       1 2019      D
 7:       1 2020      D
 8:       2 2014      B
 9:       2 2015      B
10:       2 2016      B
11:       2 2017      B

OP's original problem

The OP has disclosed that start.date and end.date were generated from two columns containing just four-digit year numbers.

It is not necessary to convert these year numbers to dates beforehand. They can be used directly to create the sequences of years:

library(dplyr)
library(tidyr)
df2 %>% 
  group_by(rn = row_number()) %>% 
  mutate(year = list(from:to)) %>% 
  unnest(year) %>% 
  ungroup() %>% 
  select(groupid, year, Status)
# A tibble: 11 x 3
   groupid  year Status
     <int> <int> <chr> 
 1       1  2014 A     
 2       1  2015 A     
 3       1  2016 A     
 4       1  2017 A     
 5       1  2018 D     
 6       1  2019 D     
 7       1  2020 D     
 8       2  2014 B     
 9       2  2015 B     
10       2  2016 B     
11       2  2017 B

Or, in data.table syntax:

library(data.table)
setDT(df2)[, .(groupid, year = from:to, Status), by = .(rn = 1:nrow(df2))][
    , rn := NULL][] 

According to help(":"), character arguments are coerced to numeric, so no explicit coercion is required.
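The same expansion can also be sketched in base R without any grouping. This is an alternative to the dplyr and data.table code above, not part of either answer; the data frame re-creates df2 by hand, and n, idx and result are names chosen for the illustration.

```r
# df2 re-created by hand so the example is self-contained.
df2 <- data.frame(
  groupid = c(1L, 1L, 2L),
  from    = c(2014L, 2018L, 2014L),
  to      = c(2017L, 2020L, 2017L),
  Status  = c("A", "D", "B")
)

# Number of years covered by each row, both endpoints included.
n <- df2$to - df2$from + 1L

# Repeat each row index once per year it covers.
idx <- rep(seq_len(nrow(df2)), times = n)

result <- data.frame(
  groupid = df2$groupid[idx],
  year    = unlist(Map(seq, df2$from, df2$to)),  # one sequence per row, concatenated
  Status  = df2$Status[idx]
)
result
```

Because the rows are repeated by index, this does not depend on groupid and Status being a unique combination.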

Data

df1 <- readr::read_table(
  "groupid  start.date   end.date    Status
1        2014-01-01   2017-01-01  A
1        2018-01-01   2020-01-01  D
2        2014-01-01   2017-01-01  B"
)

df2 <- readr::read_table(
  "groupid  from   to    Status
1        2014   2017  A
1        2018   2020  D
2        2014   2017  B"
)

Uwe answered Sep 08 '25 11:09


You're almost there! Group the data by ID and status, since the combination of those two variables is where the start and end dates come from.

library(tidyverse)

df <- "groupid  start.date   end.date    Status
1        2014-01-01   2017-01-01  A
1        2018-01-01   2020-01-01  D
2        2014-01-01   2017-01-01  B" %>% read_table()

df %>%
  group_by(groupid, Status) %>%
  mutate(dates = list(seq(from = start.date, to = end.date, by = "1 year"))) %>%
  unnest(dates)
#> # A tibble: 11 x 5
#> # Groups:   groupid, Status [3]
#>    groupid start.date end.date   Status dates     
#>      <int> <date>     <date>     <chr>  <date>    
#>  1       1 2014-01-01 2017-01-01 A      2014-01-01
#>  2       1 2014-01-01 2017-01-01 A      2015-01-01
#>  3       1 2014-01-01 2017-01-01 A      2016-01-01
#>  4       1 2014-01-01 2017-01-01 A      2017-01-01
#>  5       1 2018-01-01 2020-01-01 D      2018-01-01
#>  6       1 2018-01-01 2020-01-01 D      2019-01-01
#>  7       1 2018-01-01 2020-01-01 D      2020-01-01
#>  8       2 2014-01-01 2017-01-01 B      2014-01-01
#>  9       2 2014-01-01 2017-01-01 B      2015-01-01
#> 10       2 2014-01-01 2017-01-01 B      2016-01-01
#> 11       2 2014-01-01 2017-01-01 B      2017-01-01

To get the format you're looking for, you can then extract the year from the date sequence and drop the extra columns:

df %>%
  group_by(groupid, Status) %>%
  mutate(dates = list(seq(from = start.date, to = end.date, by = "1 year"))) %>%
  unnest(dates) %>%
  mutate(year = lubridate::year(dates)) %>%
  select(groupid, year, Status)
#> # A tibble: 11 x 3
#> # Groups:   groupid, Status [3]
#>    groupid  year Status
#>      <int> <dbl> <chr> 
#>  1       1  2014 A     
#>  2       1  2015 A     
#>  3       1  2016 A     
#>  4       1  2017 A     
#>  5       1  2018 D     
#>  6       1  2019 D     
#>  7       1  2020 D     
#>  8       2  2014 B     
#>  9       2  2015 B     
#> 10       2  2016 B     
#> 11       2  2017 B

Created on 2018-06-22 by the reprex package (v0.2.0).

camille answered Sep 08 '25 11:09