I have data from multiple participants and I am trying to use information from one set of files to aggregate data in another set of files.
I have one data file (videodata.csv) with four columns: Event.Type (video name), Name (start or end), Row (row number), and Source (the participant id code). Set up like this:
videodata <- data.frame(
Event.Type = c("Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3",
"Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3",
"Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3"),
Name = c("start", "end", "start", "end","start", "end",
"start", "end", "start", "end","start", "end",
"start", "end", "start", "end","start", "end"),
Row = c(1, 10, 11, 20, 21, 30,
8, 16, 19, 28, 1, 7,
2, 8, 21, 26, 9, 20),
Source = c("012", "012", "012", "012", "012", "012",
"013", "013", "013", "013", "013", "013",
"014", "014", "014", "014", "014", "014")
)
I have like 80 participants, so each of them have another data file in csv format (id_data.csv), with two columns: time and score. Each row corresponds to one second, so the time column is simply numbered from 1 to somewhere around 1,800 or whatever it goes to. score is a binary code of 0 or 1.
For participants 012, 013, and 014 data would look like this:
`012_data` <- data.frame(
time = c(1:30),
score = c(0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
)
`013_data` <- data.frame(
time = c(1:30),
score = c(0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
)
`014_data` <- data.frame(
time = c(1:30),
score = c(0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
)
I want to use the information from videodata.csv to slice up id_data.csv and save them into individual files corresponding to Event.Type.
Specifically, I would look in the videodata.csv file to find the id number. For a given id number, I would start with the first Event.Type (eg Movie1) and find the start row and end row. I would use this information to go to the data file for that id number and select those rows of score and save them to a file called movie1.csv. I would repeat for each participant until movie1.csv contained data from all participants.
For example, for participant 012, I would like to slice up the id_data.csv file into three chunks corresponding to Movie1, Movie2, and Movie3. To do this I would:
Row which indicates the Start and End) by rounding up to the nearest row numberscore column to a file called Movie1 with the column header 012 from the Source column013 and append their score column next to the previous row.The resulting file for Movie1 might look like this:

It includes values of score from rows 1-10 (for 012) 8-16 (for 013) and 2-8 (for 014).
I would then have another file corresponding to Movie2 (by selecting rows 11-20, 19-28, and 21-26), and so on for each participant id. Each participant might have a slightly different number of rows.
This is just beyond my skill and I would appreciate any help as to how I can accomplish this. Thanks!
As I assume that the numer of rows in each of the _data.csv files might differ I would suppose to combine the data in the following way:
data_list <- list()
data_list[["012_data"]] <- `012_data`
data_list[["013_data"]] <- `013_data`
data_list[["014_data"]] <- `014_data`
I attract your attention that here is a list not a dataframe because of possible variations of data structure and row count. The same effect can be achieved while reading files in a lapply function applied to list.files.
Then we create the "long" data.frame with data from such files.
rm(dl)
library(dplyr)
library(tidyr)
for (filename in names(data_list)) {
if (exists("dl")) {
dl <- dl %>%
union_all(
data_list[[filename]] %>%
mutate(fileName = strsplit(filename, "_")[[1]][1]) # Here we use base R function
)
} else {
dl <- data_list[[filename]] %>%
mutate(fileName = strsplit(filename, "_")[[1]][1])
}
}
All these steps can be done at file upload.
dl is long and thus handy for further manipulations:
> dl
time score fileName
1 1 0 012
2 2 0 012
3 3 0 012
4 4 0 012
5 5 1 012
...
All the rest is done in a single pipe:
videodata %>%
pivot_wider(names_from = Name, values_from = Row) %>%
left_join(dl, by = c("Source" = "fileName")) %>%
filter(time >= start& time <= end) %>%
select(-c(start, end, time)) %>%
group_by(Event.Type, Source) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = Source, values_from = score)
You can see the intermediate results executing the pipe stepwise. The pipe returns the following output:
# A tibble: 32 × 5
# Groups: Event.Type [3]
Event.Type id `012` `013` `014`
<chr> <int> <dbl> <dbl> <dbl>
1 Movie1 1 0 0 1
2 Movie1 2 0 1 0
3 Movie1 3 0 0 0
4 Movie1 4 0 0 0
5 Movie1 5 1 1 1
6 Movie1 6 0 0 0
7 Movie1 7 0 0 1
8 Movie1 8 0 0 NA
9 Movie1 9 1 1 NA
10 Movie1 10 0 NA NA
11 Movie2 1 0 0 0
12 Movie2 2 0 0 0
13 Movie2 3 0 0 0
14 Movie2 4 0 0 0
15 Movie2 5 0 0 0
16 Movie2 6 0 0 0
17 Movie2 7 0 0 NA
18 Movie2 8 0 0 NA
19 Movie2 9 0 0 NA
20 Movie2 10 0 0 NA
21 Movie3 1 0 0 0
22 Movie3 2 0 0 0
23 Movie3 3 0 0 0
24 Movie3 4 0 0 0
25 Movie3 5 0 0 0
26 Movie3 6 0 0 0
27 Movie3 7 0 0 0
28 Movie3 8 0 NA 0
29 Movie3 9 0 NA 0
30 Movie3 10 0 NA 0
31 Movie3 11 NA NA 0
32 Movie3 12 NA NA 0
The other advantage of this code is that it does not contain any references to movie names and thus scalable without modifications as in other answer. It is robust to differences in number of items selected (see number of elements in Movie1 == 10 and Movie3 == 12). It will also work propwerly if the number of elements for 012 is less then for rightwards columns 013, 014 etc
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With