Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Looking for break patterns in row-wise sequences of values and NA's

I'm working with a dataset where each row represents where an individual used services. This implicitly also tracks whether someone uses services, because if they did not, the column value for the month is NA. I want to identify cases where there are breaks (periods of absence followed by return) in an individual's service based on transitions between consecutive month columns.

In other words, I am specifically interested in identifying when someone goes from using services (there are values in a month's column) to not using services (the value of 1+ subsequent month is NA) and then back to using services (values in 1+ subsequent month columns, preceded by NAs). My hope is to have a binary TRUE/FALSE column ("Breaks_in_Service"). When someone begins services does not matter (meaning, NAs only matter after the first month-column with a value in it).

Here's a simplified version of my dataset:

# Sample Data
simp_2021 <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  jan21_ORG_NAME = c("Org A", NA, NA, "Org B", "Org B"),
  feb21_ORG_NAME = c(NA, "Org A", "Org B", NA, "Org B"),
  mar21_ORG_NAME = c(NA, NA, "Org B", "Org D", NA),
  apr21_ORG_NAME = c("Org B", NA, "Org C", NA, "Org E")
)

# Initialize Breaks_in_Service column as FALSE
simp_2021$Breaks_in_Service <- FALSE

# View
print(simp_2021)

Expected output: In the sample data, Breaks_in_Service should be TRUE for IDs 1, 4, and 5, and FALSE for IDs 2 and 3.

I've tried building a for loop, but it's gotten messy and isn't working:

# Loop over each row to check for breaks in service
for (i in 1:nrow(simp_2021)) {
  row_values <- simp_2021[i, 2:ncol(simp_2021)]  # Extract service columns for the current row
  
  # Initialize flags to track service usage
  in_service <- FALSE
  found_break <- FALSE
  
  # Check transitions within the row
  for (j in 1:(length(row_values) - 1)) {
    current_value <- row_values[[j]]
    next_value <- row_values[[j + 1]]
    
    if (is.na(current_value) && !is.na(next_value)) {
      # Transition from not using service to using service
      in_service <- TRUE
    } else if (!is.na(current_value) && is.na(next_value)) {
      # Transition from using service to not using service
      if (in_service) {
        found_break <- TRUE
        break  # Found a break, no need to check further
      }
    }
  }
  
  # Set Breaks_in_Service based on found breaks
  if (found_break) {
    simp_2021$Breaks_in_Service[i] <- TRUE
  }
}

# View the updated dataframe with the new 'Breaks_in_Service' column
print(simp_2021)
like image 542
Violet Brooks Avatar asked Oct 16 '25 00:10

Violet Brooks


2 Answers

For some vector x

x <- c("Org A", NA, NA, "Org B")

One can calculate a 'run-length encoding' of non-NA values

> rle(!is.na(x))
Run Length Encoding
  lengths: int [1:3] 1 2 1
  values : logi [1:3] TRUE FALSE TRUE

and if there is a break in service, there will be more than 1 TRUE value. So here's a function that tests for break in service

break_in_service <- function(x)
    sum(rle(!is.na(x))$values) > 1

You'd like to do this for each ID. One way is to use apply() on each row, excluding the first column

> apply(simp_2021[,-1], 1, break_in_service)
[1]  TRUE FALSE FALSE  TRUE  TRUE

I like using the 'tidy' approach with dplyr / tidyr

library(tidyr); library(dplyr)
simp_2021 |>
    ## convert to 'long' format, where each row is an ID, name, value tuple
    pivot_longer(ends_with("ORG_NAME")) |>
    ## identify the groups in your data
    group_by(ID) |>
    ## summarize each group
    summarize(has_break_in_service = break_in_service(value))

The result is

> simp_2021 |>
+     ## convert to 'long' format
+     pivot_longer(ends_with("ORG_NAME")) |>
+     ## identify the groups in your data
+     group_by(ID) |>
+     ## summarize each group
+     summarize(has_break_in_service = break_in_service(value))
# A tibble: 5 × 2
     ID has_break_in_service
  <dbl> <lgl>
1     1 TRUE
2     2 FALSE
3     3 FALSE
4     4 TRUE
5     5 TRUE
like image 142
Martin Morgan Avatar answered Oct 17 '25 15:10

Martin Morgan


You could, first, equal the month columns to binary !NA using +!is.na() and paste to binary strings, next sub away all starting 0s, finally grepl for the pattern 01.

> (tmp <- apply(+!is.na(simp_2021[-1]), 1, paste, collapse=''))
[1] "1001" "0100" "0111" "1010" "1101"
> (tmp <- sub(tmp, pat='^0+', rep=''))
[1] "1001" "100"  "111"  "1010" "1101"
> (tmp <- grepl(tmp, pat='01'))
[1]  TRUE FALSE FALSE  TRUE  TRUE

Altogether in a nice pipe:

> transform(simp_2021, 
+           Breaks_in_Service=apply(+!is.na(simp_2021[-1]), 1, paste, collapse='') |> 
+             sub(pat='^0+', rep='') |> 
+             grepl(pat='01'))
  ID jan21_ORG_NAME feb21_ORG_NAME mar21_ORG_NAME apr21_ORG_NAME Breaks_in_Service
1  1          Org A           <NA>           <NA>          Org B              TRUE
2  2           <NA>          Org A           <NA>           <NA>             FALSE
3  3           <NA>          Org B          Org B          Org C             FALSE
4  4          Org B           <NA>          Org D           <NA>              TRUE
5  5          Org B          Org B           <NA>          Org E              TRUE

Note: this approach also uses row-wise apply, but on a "matrix" for which it is designed for and is therefore efficient.

> is.matrix(+!is.na(simp_2021[-1]))
[1] TRUE

Data:

> dput(simp_2021)
structure(list(ID = c(1, 2, 3, 4, 5), jan21_ORG_NAME = c("Org A", 
NA, NA, "Org B", "Org B"), feb21_ORG_NAME = c(NA, "Org A", "Org B", 
NA, "Org B"), mar21_ORG_NAME = c(NA, NA, "Org B", "Org D", NA
), apr21_ORG_NAME = c("Org B", NA, "Org C", NA, "Org E")), class = "data.frame", row.names = c(NA, 
-5L))
like image 27
jay.sf Avatar answered Oct 17 '25 15:10

jay.sf