Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert a single data frame into list of data frames (parsing column names into prefixes and suffixes)

Tags:

list

dataframe

r

I am hoping to determine an efficient way to convert a single data frame into a list of data frames. Below is my reproducible MWE:

set.seed(1)
ABAge = runif(100)
ABPoints = rnorm(100)
ACAge = runif(100)
ACPoints = rnorm(100)
BCAge = runif(100)
BCPoints = rnorm(100)

A_B <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = ABAge, Points = ABPoints)
A_C <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = ACAge, Points = ACPoints)
B_C <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = BCAge, Points = BCPoints)
A_B$ID <- as.character(A_B$ID)
A_C$ID <- as.character(A_C$ID)
B_C$ID <- as.character(B_C$ID)

listFormat <- list("A_B" = A_B, "A_C" = A_C, "B_C" = B_C)

dfFormat <- data.frame(ID = as.character(paste0("ID", 1:100)), A_B.Age = ABAge, A_B.Points = ABPoints, A_C.Age = ACAge, A_C.Points = ACPoints, B_C.Age = BCAge, B_C.Points = BCPoints)
dfFormat$ID = as.character(dfFormat$ID)

This results in a data frame format (dfFormat) that looks like this:

'data.frame':   100 obs. of  7 variables:
 $ ID        : chr  "ID1" "ID2" "ID3" "ID4" ...
 $ A_B.Age   : num  0.266 0.372 0.573 0.908 0.202 ...
 $ A_B.Points: num  0.398 -0.612 0.341 -1.129 1.433 ...
 $ A_C.Age   : num  0.6737 0.0949 0.4926 0.4616 0.3752 ...
 $ A_C.Points: num  0.409 1.689 1.587 -0.331 -2.285 ...
 $ B_C.Age   : num  0.814 0.929 0.147 0.75 0.976 ...
 $ B_C.Points: num  1.474 0.677 0.38 -0.193 1.578 ...

and a list of data frames listFormat that looks like this:

List of 3
 $ A_B:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
  ..$ Points: num [1:100] 0.398 -0.612 0.341 -1.129 1.433 ...
 $ A_C:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.6737 0.0949 0.4926 0.4616 0.3752 ...
  ..$ Points: num [1:100] 0.409 1.689 1.587 -0.331 -2.285 ...
 $ B_C:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.814 0.929 0.147 0.75 0.976 ...
  ..$ Points: num [1:100] 1.474 0.677 0.38 -0.193 1.578 ...

I am hoping to come up with an automated way to convert the dfFormat to listFormat. As can be seen in the above objects there are two main conditions:

  1. The column ID is always the first column in dfFormat and it is always the first column in each sublist of listFormat.

  2. The number of sublists is equal to the number of unique column names in dfFormat before the underscore ('_'). In this case, that is three prefixes (ex. "A_B", "A_C", and "B_C"). These prefixes are also the names of the three sublists.

  3. Within each sublist, it contains the number of columns that had the associated prefix ("A_B"). For each sublist, this was two ("Age" and "Points"). These suffixes are the names of the columns.

I asked the reverse question here (i.e. how to go from listFormat to dfFormat) and got some helpful answers that I am learning from. I need to have code to reverse both directions and it seems the reverse direction may need new types of code. I put my attempt below to show how I am stuck!

conUnd <- which(sapply(colnames(dfFormat), function(x) grepl("_", x)))
listName <- sapply(colnames(dfFormat[,conUnd]), function(x) strsplit(x, "[.]")[[1]][1])
uListName <- unique(sapply(colnames(dfFormat[,conUnd]), function(x) strsplit(x, "[.]")[[1]][1]))
listCol <- sapply(colnames(dfFormat[,conUnd]), function(x) strsplit(x, "[.]")[[1]][2])

listFormat = list()
for (i in 1:length(uListName)){
   [Gets messy here trying to define column names based on string variables]
}

Any advice would be greatly appreciated. I know my code is not efficient.

like image 446
lavenderGem Avatar asked Nov 21 '25 20:11

lavenderGem


1 Answers

You can use split.default in base R -

output <- lapply(split.default(dfFormat[-1], sub("\\..*", "",names(dfFormat[-1]))), 
          function(x) cbind(dfFormat[1], setNames(x, sub(".*\\.", "", names(x)))))
str(output)

#List of 3
# $ A_B:'data.frame':   100 obs. of  3 variables:
#  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
#  ..$ Age   : num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
#  ..$ Points: num [1:100] 0.398 -0.612 0.341 -1.129 1.433 ...
# $ A_C:'data.frame':   100 obs. of  3 variables:
#  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
#  ..$ Age   : num [1:100] 0.6737 0.0949 0.4926 0.4616 0.3752 ...
#  ..$ Points: num [1:100] 0.409 1.689 1.587 -0.331 -2.285 ...
# $ B_C:'data.frame':   100 obs. of  3 variables:
#  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
#  ..$ Age   : num [1:100] 0.814 0.929 0.147 0.75 0.976 ...
#  ..$ Points: num [1:100] 1.474 0.677 0.38 -0.193 1.578 ...
like image 63
Ronak Shah Avatar answered Nov 24 '25 12:11

Ronak Shah



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!