I'm working with a dataset where each row represents where an individual used services. This implicitly also tracks whether someone uses services, because if they did not, the column value for the month is NA
. I want to identify cases where there are breaks (periods of absence followed by return) in an individual's service based on transitions between consecutive month columns.
In other words, I am specifically interested in identifying when someone goes from using services (there are values in a month's column) to not using services (the value of 1+ subsequent month is NA
) and then back to using services (values in 1+ subsequent month columns, preceded by NA
s). My hope is to have a binary TRUE/FALSE
column ("Breaks_in_Service"). When someone begins services does not matter (meaning, NA
s only matter after the first month-column with a value in it).
Here's a simplified version of my dataset:
# Sample Data
simp_2021 <- data.frame(
ID = c(1, 2, 3, 4, 5),
jan21_ORG_NAME = c("Org A", NA, NA, "Org B", "Org B"),
feb21_ORG_NAME = c(NA, "Org A", "Org B", NA, "Org B"),
mar21_ORG_NAME = c(NA, NA, "Org B", "Org D", NA),
apr21_ORG_NAME = c("Org B", NA, "Org C", NA, "Org E")
)
# Initialize Breaks_in_Service column as FALSE
simp_2021$Breaks_in_Service <- FALSE
# View
print(simp_2021)
Expected output: In the sample data, Breaks_in_Service should be TRUE
for IDs 1, 4, and 5, and FALSE
for IDs 2 and 3.
I've tried building a for
loop, but it's gotten messy and isn't working:
# Loop over each row to check for breaks in service
for (i in 1:nrow(simp_2021)) {
row_values <- simp_2021[i, 2:ncol(simp_2021)] # Extract service columns for the current row
# Initialize flags to track service usage
in_service <- FALSE
found_break <- FALSE
# Check transitions within the row
for (j in 1:(length(row_values) - 1)) {
current_value <- row_values[[j]]
next_value <- row_values[[j + 1]]
if (is.na(current_value) && !is.na(next_value)) {
# Transition from not using service to using service
in_service <- TRUE
} else if (!is.na(current_value) && is.na(next_value)) {
# Transition from using service to not using service
if (in_service) {
found_break <- TRUE
break # Found a break, no need to check further
}
}
}
# Set Breaks_in_Service based on found breaks
if (found_break) {
simp_2021$Breaks_in_Service[i] <- TRUE
}
}
# View the updated dataframe with the new 'Breaks_in_Service' column
print(simp_2021)
For some vector x
x <- c("Org A", NA, NA, "Org B")
One can calculate a 'run-length encoding' of non-NA values
> rle(!is.na(x))
Run Length Encoding
lengths: int [1:3] 1 2 1
values : logi [1:3] TRUE FALSE TRUE
and if there is a break in service, there will be more than 1 TRUE value. So here's a function that tests for break in service
break_in_service <- function(x)
sum(rle(!is.na(x))$values) > 1
You'd like to do this for each ID. One way is to use apply()
on each row, excluding the first column
> apply(simp_2021[,-1], 1, break_in_service)
[1] TRUE FALSE FALSE TRUE TRUE
I like using the 'tidy' approach with dplyr / tidyr
library(tidyr); library(dplyr)
simp_2021 |>
## convert to 'long' format, where each row is an ID, name, value tuple
pivot_longer(ends_with("ORG_NAME")) |>
## identify the groups in your data
group_by(ID) |>
## summarize each group
summarize(has_break_in_service = break_in_service(value))
The result is
> simp_2021 |>
+ ## convert to 'long' format
+ pivot_longer(ends_with("ORG_NAME")) |>
+ ## identify the groups in your data
+ group_by(ID) |>
+ ## summarize each group
+ summarize(has_break_in_service = break_in_service(value))
# A tibble: 5 × 2
ID has_break_in_service
<dbl> <lgl>
1 1 TRUE
2 2 FALSE
3 3 FALSE
4 4 TRUE
5 5 TRUE
You could, first, equal the month columns to binary !NA
using +!is.na()
and paste
to binary strings, next sub
away all starting 0
s, finally grepl
for the pattern 01
.
> (tmp <- apply(+!is.na(simp_2021[-1]), 1, paste, collapse=''))
[1] "1001" "0100" "0111" "1010" "1101"
> (tmp <- sub(tmp, pat='^0+', rep=''))
[1] "1001" "100" "111" "1010" "1101"
> (tmp <- grepl(tmp, pat='01'))
[1] TRUE FALSE FALSE TRUE TRUE
Altogether in a nice pipe:
> transform(simp_2021,
+ Breaks_in_Service=apply(+!is.na(simp_2021[-1]), 1, paste, collapse='') |>
+ sub(pat='^0+', rep='') |>
+ grepl(pat='01'))
ID jan21_ORG_NAME feb21_ORG_NAME mar21_ORG_NAME apr21_ORG_NAME Breaks_in_Service
1 1 Org A <NA> <NA> Org B TRUE
2 2 <NA> Org A <NA> <NA> FALSE
3 3 <NA> Org B Org B Org C FALSE
4 4 Org B <NA> Org D <NA> TRUE
5 5 Org B Org B <NA> Org E TRUE
Note: this approach also uses row-wise apply
, but on a "matrix"
for which it is designed for and is therefore efficient.
> is.matrix(+!is.na(simp_2021[-1]))
[1] TRUE
Data:
> dput(simp_2021)
structure(list(ID = c(1, 2, 3, 4, 5), jan21_ORG_NAME = c("Org A",
NA, NA, "Org B", "Org B"), feb21_ORG_NAME = c(NA, "Org A", "Org B",
NA, "Org B"), mar21_ORG_NAME = c(NA, NA, "Org B", "Org D", NA
), apr21_ORG_NAME = c("Org B", NA, "Org C", NA, "Org E")), class = "data.frame", row.names = c(NA,
-5L))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With