Firstly, I apologize for the vagueness of the title. I have a dataset which contains dichotomous values coded 0 and 1 for a certain variable X. v001 is the subject identifier and the values from v1pc10le8 to v9pc10le8 are the values for X at each of the nine visits. In addition, firstpc10 and lastpc10 signify the first (baseline) and last measurements for X respectively.
v001 firstpc10 lastpc10 v1pc10le8 v2pc10le8 v3pc10le8 v4pc10le8 v5pc10le8 v6pc10le8 v7pc10le8 v8pc10le8 v9pc10le8
1473 28084 0 0 0 <NA> 0 <NA> <NA> 0 0 <NA> <NA>
1474 28089 0 0 <NA> <NA> <NA> 0 <NA> 0 <NA> <NA> <NA>
1475 28102 0 1 <NA> <NA> 0 0 0 0 1 <NA> <NA>
1476 28103 0 1 <NA> <NA> <NA> 0 0 0 0 1 1
1477 28119 0 0 <NA> <NA> <NA> 0 <NA> 0 0 0 <NA>
1478 28184 0 1 <NA> <NA> 0 <NA> <NA> 0 <NA> <NA> 1
1479 28202 1 1 <NA> <NA> 1 <NA> 0 0 0 1 1
1480 28211 0 0 0 <NA> 0 0 <NA> <NA> <NA> <NA> <NA>
1481 28212 0 1 0 <NA> <NA> 1 <NA> <NA> <NA> <NA> <NA>
1482 28213 0 0 <NA> <NA> 0 <NA> <NA> 0 <NA> <NA> <NA>
1483 28214 0 0 <NA> <NA> <NA> 0 0 0 <NA> 1 0
1484 28215 0 0 <NA> <NA> <NA> 0 <NA> 0 0 0 0
1485 28232 0 1 <NA> <NA> 0 <NA> 0 1 <NA> <NA> <NA>
1486 28244 1 1 1 <NA> <NA> <NA> 0 0 0 0 1
1487 28258 0 1 <NA> <NA> <NA> 0 <NA> 0 1 <NA> 1
1488 28281 0 1 <NA> <NA> <NA> 0 0 0 1 <NA> <NA>
1489 28303 0 0 0 <NA> <NA> <NA> <NA> 0 0 0 <NA>
1490 28337 0 1 <NA> <NA> 0 <NA> <NA> 0 <NA> 1 <NA>
1491 28355 1 1 <NA> <NA> 1 <NA> 0 <NA> 0 1 <NA>
1492 29983 0 0 <NA> <NA> <NA> 0 0 <NA> 0 0 0
I want to ignore all the NA and compute a new variable called "change" which has the following values:
1 - if subjects were 0 at baseline and remained 0 throughout
2 - if subjects were 1 at baseline and remained 1 throughout
3 - if subjects were 1 at baseline and changed to 0 (and remained 0 throughout)
4 - if subjects were 0 at baseline and changed to 1 (and remained 1 throughout)
5 - if subjects fluctuated between values of 0 and 1 without a trend (e.g. subject #28214) - these are subjects who don't fit in the above 4 categories
This is the output I expect to see:
v001 change
1473 28084 1
1474 28089 1
1475 28102 4
1476 28103 4
1477 28119 1
1478 28184 4
1479 28202 5
1480 28211 1
1481 28212 4
1482 28213 1
1483 28214 5
1484 28215 1
1485 28232 4
1486 28244 5
1487 28258 4
1488 28281 4
1489 28303 1
1490 28337 4
1491 28355 5
1492 29983 1
I tried to do this with SPSS and R but I am having huge difficulties and I will greatly appreciate any help. (I have included the dput output from R below).
Thank you!
structure(list(v001 = c(28084, 28089, 28102, 28103, 28119, 28184,
28202, 28211, 28212, 28213, 28214, 28215, 28232, 28244, 28258,
28281, 28303, 28337, 28355, 29983), firstpc10 = c(0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0), lastpc10 = c(0,
0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0), v1pc10le8 = c(0,
NA, NA, NA, NA, NA, NA, 0, 0, NA, NA, NA, NA, 1, NA, NA, 0, NA,
NA, NA), v2pc10le8 = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), v3pc10le8 = c(0, NA, 0, NA, NA, 0, 1, 0,
NA, 0, NA, NA, 0, NA, NA, NA, NA, 0, 1, NA), v4pc10le8 = c(NA,
0, 0, 0, 0, NA, NA, 0, 1, NA, 0, 0, NA, NA, 0, 0, NA, NA, NA,
0), v5pc10le8 = c(NA, NA, 0, 0, NA, NA, 0, NA, NA, NA, 0, NA,
0, 0, NA, 0, NA, NA, 0, 0), v6pc10le8 = c(0, 0, 0, 0, 0, 0, 0,
NA, NA, 0, 0, 0, 1, 0, 0, 0, 0, 0, NA, NA), v7pc10le8 = c(0,
NA, 1, 0, 0, NA, 0, NA, NA, NA, NA, 0, NA, 0, 1, 1, 0, NA, 0,
0), v8pc10le8 = c(NA, NA, NA, 1, 0, NA, 1, NA, NA, NA, 1, 0,
NA, 0, NA, NA, 0, 1, 1, 0), v9pc10le8 = c(NA, NA, NA, 1, NA,
1, 1, NA, NA, NA, 0, 0, NA, 1, 1, NA, NA, NA, NA, 0)), .Names = c("v001",
"firstpc10", "lastpc10", "v1pc10le8", "v2pc10le8", "v3pc10le8",
"v4pc10le8", "v5pc10le8", "v6pc10le8", "v7pc10le8", "v8pc10le8",
"v9pc10le8"), row.names = 1473:1492, class = "data.frame")
@qdread's solution is great in terms of compactness and neatness. Building on that approach, I would like to post a solution that demonstrates how one can approach such problems in a functional way.
The first step is identifying the columns that serve as the baseline and the visits, which is straightforward:
library(magrittr)
# Define the columns to be used
col.visits = colnames(df)[4:ncol(df)] # Visits are represented from column 4 on
col.baseline = "firstpc10"
col.final = "lastpc10"
A second step is thinking about how you would define "changed to 0/1 and remained so throughout", which here means exactly one change in one direction and none in the other:
# Define unit functions
single_change_to_1 = function(numeric_array){
  positive_change = (diff(numeric_array) == 1) # TRUE where a 0 -> 1 change occurred
  return(sum(positive_change, na.rm = TRUE) == 1) # TRUE if exactly one such change occurred
}
single_change_to_0 = function(numeric_array){
  negative_change = (diff(numeric_array) == -1) # TRUE where a 1 -> 0 change occurred
  return(sum(negative_change, na.rm = TRUE) == 1) # TRUE if exactly one such change occurred
}
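To make the idea behind these two helpers concrete, here is a minimal sketch of what diff() yields on two made-up visit sequences (the vectors are illustrative and not taken from the dataset; NAs are assumed to have been removed already):

```r
# Hypothetical visit sequences with NAs already dropped
settled     <- c(0, 0, 1, 1)     # changes to 1 once and stays there
fluctuating <- c(0, 0, 0, 1, 0)  # changes to 1, then back to 0

# diff() gives +1 for a 0 -> 1 step, -1 for a 1 -> 0 step, 0 for no change
up_settled   <- sum(diff(settled) == 1)       # number of 0 -> 1 steps
down_settled <- sum(diff(settled) == -1)      # number of 1 -> 0 steps
up_fluct     <- sum(diff(fluctuating) == 1)
down_fluct   <- sum(diff(fluctuating) == -1)

# A lasting change to 1 has one upward step and no downward step
settled_is_lasting <- (up_settled == 1) && (down_settled == 0)
fluct_is_lasting   <- (up_fluct == 1) && (down_fluct == 0)
```

The second sequence has one step in each direction, so it fails the "lasting change" test and would fall through to category 5.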
A third step is putting together your conditions in a function:
calculate_change = function(patientInfo){
  # Extract data
  patient.base = patientInfo[[col.baseline]]
  patient.visits = patientInfo[col.visits] %>% as.numeric %>% .[!is.na(.)] # Turn into a vector and discard NAs
  # Apply if-else
  if(patient.base == 0 && all(patient.visits == 0)) return(1)
  if(patient.base == 1 && all(patient.visits == 1)) return(2)
  if(patient.base == 1 && single_change_to_0(patient.visits) && !single_change_to_1(patient.visits)) return(3)
  if(patient.base == 0 && single_change_to_1(patient.visits) && !single_change_to_0(patient.visits)) return(4)
  # If the entry didn't match any of the previous conditions, return 5
  return(5)
}
And finally, apply the change function to each row:
df[["change"]] = apply(df, 1, calculate_change)
df[["change"]]
# [1] 1 1 4 4 1 4 5 1 4 1 5 1 4 5 4 4 1 4 5 1
I defined a function that outputs 1-5 depending on the baseline status and the number of times the status changes between visits. I used the rowwise() function from the dplyr package to apply that function to each row of the data frame, which I called dat. The function uses diff() to count how many times the status "flips" (in either direction) across the non-missing visits; depending on that count and the baseline status, it returns 1, 2, 3, 4, or 5. Since the baseline is the first measurement, exactly one flip means a single lasting change away from the baseline value.
classify_change <- function(x) {
  baseline <- x$firstpc10
  visits <- na.omit(as.numeric(x[grepl('le8', names(x))]))
  # Count the number of times the status flips (0 to 1 or 1 to 0) between consecutive visits
  n_flips <- sum(diff(visits) != 0)
  answer <- 5
  if (baseline == 0 && n_flips == 0) answer <- 1
  if (baseline == 1 && n_flips == 0) answer <- 2
  if (baseline == 1 && n_flips == 1) answer <- 3
  if (baseline == 0 && n_flips == 1) answer <- 4
  return(data.frame(change = answer))
}
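As a small illustration of why the NAs are dropped before calling diff(), here is a sketch on a made-up visit vector (the values are hypothetical, not a row from the data):

```r
# Hypothetical raw visits for one subject, with gaps between measurements
raw_visits <- c(0, NA, 0, 1, NA, 1)

# Without removing NAs, every difference that touches an NA is NA,
# so the flip count itself becomes NA
n_flips_raw <- sum(diff(raw_visits) != 0)

# After na.omit(), consecutive observed visits are compared directly
observed <- na.omit(raw_visits)
n_flips  <- sum(diff(observed) != 0)
```

Here n_flips_raw is NA, while n_flips counts the single 0 to 1 change among the observed visits.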
library(dplyr)
dat %>%
  rowwise() %>%
  do(classify_change(.))
I notice your expected output contains zeroes but the description of the categories only has 1-5 as possible outcomes. This function returns 1 for those rows.
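If you want the category next to the subject identifier, one possible base R sketch is below. It restates the same flip-counting rule in a hypothetical helper, classify_visits(), and runs it on a made-up three-subject data frame, dat_toy, that only borrows the column naming from the question; treat it as an illustration under those assumptions rather than a drop-in replacement.

```r
# Toy data: three hypothetical subjects, same column naming as the question
dat_toy <- data.frame(
  v001      = c(101, 102, 103),
  firstpc10 = c(0, 0, 1),
  v1pc10le8 = c(0, 0, 1),
  v2pc10le8 = c(NA, 1, 0),
  v3pc10le8 = c(0, 1, 1)
)

# Hypothetical helper: same rule as above, for one subject's values
classify_visits <- function(baseline, visits) {
  visits  <- visits[!is.na(visits)]        # ignore missing visits
  n_flips <- sum(diff(visits) != 0)        # flips in either direction
  if (n_flips == 0) return(if (baseline == 0) 1 else 2)
  if (n_flips == 1) return(if (baseline == 0) 4 else 3)
  5                                        # more than one flip: no trend
}

# Visit columns are assumed to be the ones whose names end in "le8"
visit_cols <- grep("le8$", names(dat_toy), value = TRUE)

# Classify each row and store the result alongside the identifier
dat_toy$change <- vapply(
  seq_len(nrow(dat_toy)),
  function(i) classify_visits(dat_toy$firstpc10[i], unlist(dat_toy[i, visit_cols])),
  numeric(1)
)

dat_toy[, c("v001", "change")]
```

In this toy example subject 101 stays at 0 (category 1), subject 102 changes to 1 and stays there (category 4), and subject 103 flips twice (category 5).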