I'm a relative new user to R and coding in general and I have searched and not been able to solve this. I have the following data:
groupid start.date end.date Status
1 2014-01-01 2017-01-01 A
1 2018-01-01 2020-01-01 D
2 2014-01-01 2017-01-01 B
How do I generate a dataframe where every observation is a year instead of combination of groupid and a time period.
The output I'm looking for is:
groupid year status
1 2014 A
1 2015 A
1 2016 A
1 2017 A
1 2018 D
1 2019 D
1 2020 D
2 2014 B
2 2015 B
2 2016 B
2 2017 B
I've tried a number of approaches, but I think my best attempt was:
df <- df %>%
group_by(rn=row_number()) %>%
mutate(d = list(seq(start.date, end.date, by='1 year')))
%>%
unnest()
But I got
Error: Each column must either be a list of vectors or a list of data frames
[d]
Searching the error hasn't helped me get closer to figuring out what is wrong. Start.date and end.date are stored as dates. In case it matters, they were generated from two column vectors with just four digit year numbers, so I applied the following code to change them to a useable date format:
df$start.date <- as.Date(ISOdate(df$from, 1, 1))
df$end.date <- as.Date(ISOdate(df$to, 1, 1))
The answer by camille is based on the implicit assumption that the combinations of groupid and Status are unique. However, this can not be guaranteed. Furthermore, the OP herself has chosen to group by row number to be on the safe side.
Grouping is required because seq() and the single colon operator : do not accept vectors as input.
dplyr/tidyr solutionThis approach groups by row number and extracts the year before creating the sequences.
df1 is the tibble as given by the OP (see Data section below).
library(dplyr)
library(tidyr)
library(lubridate)
df1 %>%
group_by(rn = row_number()) %>%
mutate(year = list(year(start.date):year(end.date))) %>%
unnest() %>%
ungroup() %>%
select(groupid, year, Status)
# A tibble: 11 x 3 groupid year Status <int> <int> <chr> 1 1 2014 A 2 1 2015 A 3 1 2016 A 4 1 2017 A 5 1 2018 D 6 1 2019 D 7 1 2020 D 8 2 2014 B 9 2 2015 B 10 2 2016 B 11 2 2017 B
data.table approachdata.table allows to achieve the same result with a more concise code:
library(data.table)
setDT(df1)[, .(groupid, year = year(start.date):year(end.date), Status),
by = .(rn = 1:nrow(df1))][
, rn := NULL][]
groupid year Status 1: 1 2014 A 2: 1 2015 A 3: 1 2016 A 4: 1 2017 A 5: 1 2018 D 6: 1 2019 D 7: 1 2020 D 8: 2 2014 B 9: 2 2015 B 10: 2 2016 B 11: 2 2017 B
The OP has disclosed that start.date and end.date were generated from two column vectors with just four digit year numbers.
It is not necessary to convert these year numbers to date beforehand. They can be used directly to create the sequences of years:
library(dplyr)
library(tidyr)
df2 %>%
group_by(rn = row_number()) %>%
mutate(year = list(from:to)) %>%
unnest() %>%
ungroup() %>%
select(groupid, year, Status)
# A tibble: 11 x 3 groupid year Status <int> <int> <chr> 1 1 2014 A 2 1 2015 A 3 1 2016 A 4 1 2017 A 5 1 2018 D 6 1 2019 D 7 1 2020 D 8 2 2014 B 9 2 2015 B 10 2 2016 B 11 2 2017 B
Or, in data.table syntax:
library(data.table)
setDT(df2)[, .(groupid, year = from:to, Status), by = .(rn = 1:nrow(df2))][
, rn := NULL][]
According to help(":"), character arguments are coerced to numeric so no explicit coersion is required.
df1 <- readr::read_table(
"groupid start.date end.date Status
1 2014-01-01 2017-01-01 A
1 2018-01-01 2020-01-01 D
2 2014-01-01 2017-01-01 B"
)
df2 <- readr::read_table(
"groupid from to Status
1 2014 2017 A
1 2018 2020 D
2 2014 2017 B"
)
You're almost there! Group the data by ID and status, since the combination of those two variables is where the start & end dates comes from.
library(tidyverse)
df <- "groupid start.date end.date Status
1 2014-01-01 2017-01-01 A
1 2018-01-01 2020-01-01 D
2 2014-01-01 2017-01-01 B" %>% read_table2()
df %>%
group_by(groupid, Status) %>%
mutate(dates = list(seq(from = start.date, to = end.date, by = "1 year"))) %>%
unnest()
#> # A tibble: 11 x 5
#> # Groups: groupid, Status [3]
#> groupid start.date end.date Status dates
#> <int> <date> <date> <chr> <date>
#> 1 1 2014-01-01 2017-01-01 A 2014-01-01
#> 2 1 2014-01-01 2017-01-01 A 2015-01-01
#> 3 1 2014-01-01 2017-01-01 A 2016-01-01
#> 4 1 2014-01-01 2017-01-01 A 2017-01-01
#> 5 1 2018-01-01 2020-01-01 D 2018-01-01
#> 6 1 2018-01-01 2020-01-01 D 2019-01-01
#> 7 1 2018-01-01 2020-01-01 D 2020-01-01
#> 8 2 2014-01-01 2017-01-01 B 2014-01-01
#> 9 2 2014-01-01 2017-01-01 B 2015-01-01
#> 10 2 2014-01-01 2017-01-01 B 2016-01-01
#> 11 2 2014-01-01 2017-01-01 B 2017-01-01
To get the format you're looking for, you can then extract the year from the date sequence and drop the extra columns:
df %>%
group_by(groupid, Status) %>%
mutate(dates = list(seq(from = start.date, to = end.date, by = "1 year"))) %>%
unnest() %>%
mutate(year = lubridate::year(dates)) %>%
select(groupid, year, Status)
#> # A tibble: 11 x 3
#> # Groups: groupid, Status [3]
#> groupid year Status
#> <int> <dbl> <chr>
#> 1 1 2014 A
#> 2 1 2015 A
#> 3 1 2016 A
#> 4 1 2017 A
#> 5 1 2018 D
#> 6 1 2019 D
#> 7 1 2020 D
#> 8 2 2014 B
#> 9 2 2015 B
#> 10 2 2016 B
#> 11 2 2017 B
Created on 2018-06-22 by the reprex package (v0.2.0).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With