I am trying to create a column that increases in value if a condition is met, and resets to 1 if the condition is not met.
df <- data.frame(
occ = c("11","11","12","13","13","13","13","13","11"))
df$occPrev <- lag(df$occ)
df
occ occPrev
1 11 <NA>
2 11 11
3 12 11
4 13 12
5 13 13
6 13 13
7 13 13
8 13 13
9 11 13
I want to create a tenure variable ten such that, starting from 1, it increases by 1 if occ == occPrev and resets to 1 if occ != occPrev. The output I want to get is
occ occPrev ten
1 11 <NA> 1
2 11 11 2
3 12 11 1
4 13 12 1
5 13 13 2
6 13 13 3
7 13 13 4
8 13 13 5
9 11 13 1
Ideally, I want to achieve this result without relying on for loops. My attempts so far have not worked because I cannot find a way to reference the previous row dynamically.
Edit: While @Edward and @whuber's solutions are neat and correct, I also wanted to share my solution, which uses the base function rle. It also generalizes to several workers, identified by id, with rows ordered by year.
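library(dplyr)
# Assumes df has id (worker) and year columns in addition to occ
df <- df %>%
  group_by(id) %>%
  arrange(year) %>%
  mutate(ten = sequence(rle(occ)$lengths)) %>%  # counts 1, 2, ... within each run of equal occ
  ungroup()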
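To see the rle step in isolation, here is a minimal base-R sketch. The two-worker vectors id2 and occ2 are made-up illustration data, and the per-id version uses ave instead of group_by, so treat it as an alternative phrasing of the same idea rather than the code above.

```r
# Run-length encoding of the question's occ vector
occ  <- c("11","11","12","13","13","13","13","13","11")
runs <- rle(occ)                 # run lengths: 2, 1, 5, 1
ten  <- sequence(runs$lengths)   # counts 1..n inside every run

# Hypothetical two-worker data: a run must not continue across ids
id2  <- c(1, 1, 1, 2, 2, 2)
occ2 <- c("11", "11", "12", "12", "12", "13")
ten2 <- ave(seq_along(occ2), id2,
            FUN = function(i) sequence(rle(occ2[i])$lengths))
```

Without the grouping on id2, the three consecutive "12" rows would be counted as one run of length 3, even though they belong to two workers.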
First, the NA indicates you are using dplyr::lag rather than the base function stats::lag.
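A small sketch of why the NA is a giveaway, using only functions that ship with R. stats::lag leaves the values where they are and only shifts the time index attribute, so it never introduces an NA; a row-wise "previous value" has to be built by shifting the vector itself. The vector x is made up for illustration.

```r
x <- c(11, 11, 12, 13)

# stats::lag shifts the "tsp" time attribute, not the values
shifted <- stats::lag(x, k = 1)
values_unchanged <- identical(as.vector(shifted), x)

# A previous-row vector in the dplyr::lag sense: drop the last
# element and pad the front with NA
prev <- c(NA, head(x, -1))
```

Here values_unchanged is TRUE, while prev starts with NA, which is the pattern seen in occPrev.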
Second, your example does not conform with your description. Compare it to the output of this solution (which uses only base R functions):
df$ten <- with(df, Reduce(\(a,b) ifelse(b, a+1, 1), occ %==% occPrev, accumulate = TRUE))
Here's df afterwards:
occ occPrev ten
1 11 <NA> 1
2 11 11 2
3 12 11 1
4 13 12 1
5 13 13 2
6 13 13 3
7 13 13 4
8 13 13 5
9 11 13 1
Scanning row by row, we see df$ten incremented every time occ==occPrev or reset to 1 otherwise, as requested.
Oh, yes, I nearly forgot: to handle that NA as suggested by the example in the question, I first modified the == operator:
`%==%` <- function(a, b) is.na(a) | is.na(b) | a == b
This helped with the readability and helped to test the solution.
If you're interested in "referencing the previous row dynamically," look into constructs like head(x, -1) and tail(x, -1), which strip the last or the first component of x, respectively. Comparing these two expressions with a vectorized relational operator like == compares each component to the previous one and does so very efficiently. You will have to pad the result at the beginning to specify the initial value (where no previous component exists). Although this can be faster than Reduce, the Reduce solution is short and readable.
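A rough sketch of that head/tail idea, assuming the first row always starts a new run. The cummax step is just one vectorized way to locate the most recent reset; the names same, idx and last_reset are chosen here for illustration.

```r
occ <- c("11","11","12","13","13","13","13","13","11")

# TRUE where a row has the same occ as the row before it;
# the first row has no predecessor, so pad with FALSE
same <- c(FALSE, tail(occ, -1) == head(occ, -1))

# Position of the most recent reset (a row where same is FALSE)
idx        <- seq_along(occ)
last_reset <- cummax(ifelse(same, 0L, idx))

# Tenure = distance from the most recent reset, starting at 1
ten <- idx - last_reset + 1L
# ten is 1 2 1 1 2 3 4 5 1
```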
If these 9 rows all belong to the same person, then you can use cumsum, grouping on the occ column, together with dplyr::lag, whose default argument replaces the leading NA with a value you specify (stats::lag has no such argument). The following code may give what you want in that case.
library(dplyr)
mutate(df, ten = 1 + cumsum(occ==lag(occ, default="0")), .by=occ)
occ ten
1 11 1
2 11 2
3 12 1
4 13 1
5 13 2
6 13 3
7 13 4
8 13 5
9 11 3 # Not reset to 1 because 11 already occurred in rows 1 and 2.
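If instead each run of consecutive identical occ values should be counted separately (for example, because the rows belong to different people), then we can create an id using dplyr::consecutive_id, which starts a new id whenever occ changes, and group on this.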
mutate(df, id=consecutive_id(occ)) |>
mutate(ten = 1 + cumsum(occ==lag(occ, default="0")), .by=id)
occ id ten
1 11 1 1
2 11 1 2
3 12 2 1 # Reset - new id
4 13 3 1 # Reset - new id
5 13 3 2
6 13 3 3
7 13 3 4
8 13 3 5
9 11 4 1 # Reset - new id.
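Roughly the same run id can be built in base R, which may help if consecutive_id is not available in your dplyr version. This is a sketch that mimics what consecutive_id does for a single vector, not its actual implementation; run_id is a name chosen here.

```r
occ <- c("11","11","12","13","13","13","13","13","11")

# New id every time occ differs from the previous element
r      <- rle(occ)
run_id <- rep(seq_along(r$lengths), r$lengths)

# Count rows within each run
ten <- ave(seq_along(occ), run_id, FUN = seq_along)
```

With this input, run_id is 1 1 2 3 3 3 3 3 4 and ten is 1 2 1 1 2 3 4 5 1, matching the id and ten columns above.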
If the data frame already has an id, then we group on both id and occ:
df <- data.frame(
id=c(rep(1,8), 2),
occ = c("11","11","12","13","13","13","13","13","11"))
mutate(df, ten = 1 + cumsum(occ==lag(occ, default="0")), .by=c(id, occ))
id occ ten
1 1 11 1
2 1 11 2
3 1 12 1 # Reset - same id but new occ
4 1 13 1 # Reset - same id but new occ
5 1 13 2
6 1 13 3
7 1 13 4
8 1 13 5
9 2 11 1 # Reset - new id