I'm a relative new user to R and coding in general and I have searched and not been able to solve this. I have the following data:
groupid start.date end.date Status
1 2014-01-01 2017-01-01 A
1 2018-01-01 2020-01-01 D
2 2014-01-01 2017-01-01 B
How do I generate a dataframe where every observation is a year instead of combination of groupid and a time period.
The output I'm looking for is:
groupid year status
1 2014 A
1 2015 A
1 2016 A
1 2017 A
1 2018 D
1 2019 D
1 2020 D
2 2014 B
2 2015 B
2 2016 B
2 2017 B
I've tried a number of approaches, but I think my best attempt was:
df <- df %>%
group_by(rn=row_number()) %>%
mutate(d = list(seq(start.date, end.date, by='1 year')))
%>%
unnest()
But I got
Error: Each column must either be a list of vectors or a list of data frames
[d]
Searching the error hasn't helped me get closer to figuring out what is wrong. Start.date and end.date are stored as dates. In case it matters, they were generated from two column vectors with just four digit year numbers, so I applied the following code to change them to a useable date format:
df$start.date <- as.Date(ISOdate(df$from, 1, 1))
df$end.date <- as.Date(ISOdate(df$to, 1, 1))
The answer by camille is based on the implicit assumption that the combinations of groupid
and Status
are unique. However, this can not be guaranteed. Furthermore, the OP herself has chosen to group by row number to be on the safe side.
Grouping is required because seq()
and the single colon operator :
do not accept vectors as input.
dplyr
/tidyr
solutionThis approach groups by row number and extracts the year before creating the sequences.
df1
is the tibble as given by the OP (see Data
section below).
library(dplyr)
library(tidyr)
library(lubridate)
df1 %>%
group_by(rn = row_number()) %>%
mutate(year = list(year(start.date):year(end.date))) %>%
unnest() %>%
ungroup() %>%
select(groupid, year, Status)
# A tibble: 11 x 3 groupid year Status <int> <int> <chr> 1 1 2014 A 2 1 2015 A 3 1 2016 A 4 1 2017 A 5 1 2018 D 6 1 2019 D 7 1 2020 D 8 2 2014 B 9 2 2015 B 10 2 2016 B 11 2 2017 B
data.table
approachdata.table
allows to achieve the same result with a more concise code:
library(data.table)
setDT(df1)[, .(groupid, year = year(start.date):year(end.date), Status),
by = .(rn = 1:nrow(df1))][
, rn := NULL][]
groupid year Status 1: 1 2014 A 2: 1 2015 A 3: 1 2016 A 4: 1 2017 A 5: 1 2018 D 6: 1 2019 D 7: 1 2020 D 8: 2 2014 B 9: 2 2015 B 10: 2 2016 B 11: 2 2017 B
The OP has disclosed that start.date
and end.date
were generated from two column vectors with just four digit year numbers.
It is not necessary to convert these year numbers to date beforehand. They can be used directly to create the sequences of years:
library(dplyr)
library(tidyr)
df2 %>%
group_by(rn = row_number()) %>%
mutate(year = list(from:to)) %>%
unnest() %>%
ungroup() %>%
select(groupid, year, Status)
# A tibble: 11 x 3 groupid year Status <int> <int> <chr> 1 1 2014 A 2 1 2015 A 3 1 2016 A 4 1 2017 A 5 1 2018 D 6 1 2019 D 7 1 2020 D 8 2 2014 B 9 2 2015 B 10 2 2016 B 11 2 2017 B
Or, in data.table
syntax:
library(data.table)
setDT(df2)[, .(groupid, year = from:to, Status), by = .(rn = 1:nrow(df2))][
, rn := NULL][]
According to help(":")
, character arguments are coerced to numeric so no explicit coersion is required.
df1 <- readr::read_table(
"groupid start.date end.date Status
1 2014-01-01 2017-01-01 A
1 2018-01-01 2020-01-01 D
2 2014-01-01 2017-01-01 B"
)
df2 <- readr::read_table(
"groupid from to Status
1 2014 2017 A
1 2018 2020 D
2 2014 2017 B"
)
You're almost there! Group the data by ID and status, since the combination of those two variables is where the start & end dates comes from.
library(tidyverse)
df <- "groupid start.date end.date Status
1 2014-01-01 2017-01-01 A
1 2018-01-01 2020-01-01 D
2 2014-01-01 2017-01-01 B" %>% read_table2()
df %>%
group_by(groupid, Status) %>%
mutate(dates = list(seq(from = start.date, to = end.date, by = "1 year"))) %>%
unnest()
#> # A tibble: 11 x 5
#> # Groups: groupid, Status [3]
#> groupid start.date end.date Status dates
#> <int> <date> <date> <chr> <date>
#> 1 1 2014-01-01 2017-01-01 A 2014-01-01
#> 2 1 2014-01-01 2017-01-01 A 2015-01-01
#> 3 1 2014-01-01 2017-01-01 A 2016-01-01
#> 4 1 2014-01-01 2017-01-01 A 2017-01-01
#> 5 1 2018-01-01 2020-01-01 D 2018-01-01
#> 6 1 2018-01-01 2020-01-01 D 2019-01-01
#> 7 1 2018-01-01 2020-01-01 D 2020-01-01
#> 8 2 2014-01-01 2017-01-01 B 2014-01-01
#> 9 2 2014-01-01 2017-01-01 B 2015-01-01
#> 10 2 2014-01-01 2017-01-01 B 2016-01-01
#> 11 2 2014-01-01 2017-01-01 B 2017-01-01
To get the format you're looking for, you can then extract the year from the date sequence and drop the extra columns:
df %>%
group_by(groupid, Status) %>%
mutate(dates = list(seq(from = start.date, to = end.date, by = "1 year"))) %>%
unnest() %>%
mutate(year = lubridate::year(dates)) %>%
select(groupid, year, Status)
#> # A tibble: 11 x 3
#> # Groups: groupid, Status [3]
#> groupid year Status
#> <int> <dbl> <chr>
#> 1 1 2014 A
#> 2 1 2015 A
#> 3 1 2016 A
#> 4 1 2017 A
#> 5 1 2018 D
#> 6 1 2019 D
#> 7 1 2020 D
#> 8 2 2014 B
#> 9 2 2015 B
#> 10 2 2016 B
#> 11 2 2017 B
Created on 2018-06-22 by the reprex package (v0.2.0).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With