How to define calculations in a data frame depending on values from another data frame?

I have to calculate a coefficient based on a dataset stored in a data frame (A) of size 4936 obs. x 1025 var.

In the first row [1,], time in seconds is given; each subsequent row is a sample collected from a different place. A sample of data frame A:

#        V1   V2   V3   V4
# [1,] 26.4 26.5 26.6 26.7
# [2,]  -15   -5    2    3
# [3,]    6   -7    5    8
# [4,]    9    4    4   -2

In another data frame (B) I stored the time from which I should start my calculations for each row in A. An example of the data frame B:

#      time
# [1,] 26.4
# [2,] 26.6
# [3,] 26.5

Let's simplify: the coefficient is the sum of the data collected at one place (a row of data frame A), starting from the time of collection given in data frame B. For the example above, the calculation should work like this:

sum1 = -15 + (-5) + 2 + 3
sum2 = 5 + 8
sum3 = 4 + 4 + (-2)

The results of the calculations I would like to store in a new data frame, which would look like this:

#       Sum
# [1,]  -15
# [2,]   13
# [3,]    6

How can I link the calculations between the two data frames based on the values stored in the second data frame?

barbrka asked Nov 29 '25 20:11


2 Answers

A solution using sapply to iterate over the sample rows, selecting columns depending on the time of collection:

# df1 is the original data frame A, still including its time row
# Time row from the original table
foo <- df1[1, ]
# Starting times from table B
time <- c(26.4, 26.6, 26.5)

# Remove the time row from the original table
df1 <- df1[-1, ]

# For each sample row, sum the columns where the collection time
# foo is at or after the starting time
sapply(seq_along(time), function(x)
    sum(df1[x, which(foo >= time[x])])
)

# [1] -15  13   6
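For completeness, here is a self-contained version of the same approach, rebuilding the example data from the question (the names `A`, `foo`, and `time` follow the post; nothing else is assumed):

```r
# Example data from the question: first row holds the times,
# the remaining rows are the samples
A <- rbind(c(26.4, 26.5, 26.6, 26.7),
           c(-15,  -5,    2,    3),
           c(  6,  -7,    5,    8),
           c(  9,   4,    4,   -2))

foo  <- A[1, ]                 # collection times
time <- c(26.4, 26.6, 26.5)    # starting times from B
A    <- A[-1, , drop = FALSE]  # drop the time row

# For each sample, keep only the columns collected at or after its start
sums <- sapply(seq_along(time), function(x)
    sum(A[x, foo >= time[x]])
)
sums
# [1] -15  13   6
```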
pogibas answered Dec 01 '25 10:12


I came across this already answered question and felt compelled to propose an alternative solution.

  • Reading the title immediately made me think of join or merge.
  • The OP says data frames, but the printed output looks like it comes from matrices.
  • The data are stored transposed: the time series run row-wise, and the first row contains not observations but the time in seconds. This layout is considered untidy.

None of the other answers questioned these oddities, although they made the proposed solutions more complex.

Reshaping the data

As a wild guess, the data seem to have been collected in an Excel sheet. However, for efficient processing we need the data stored column-wise, preferably in long format:

library(data.table)
long <- as.data.table(t(A))[
  , setnames(.SD, "V1", "time")][
    , melt(.SD, id.vars = "time", variable.name = "site_id")][
      , site_id := as.integer(site_id)][]

long
    time site_id value
 1: 26.4       1   -15
 2: 26.5       1    -5
 3: 26.6       1     2
 4: 26.7       1     3
 5: 26.4       2     6
 6: 26.5       2    -7
 7: 26.6       2     5
 8: 26.7       2     8
 9: 26.4       3     9
10: 26.5       3     4
11: 26.6       3     4
12: 26.7       3    -2
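For readers not using data.table, an equivalent long table can be built in base R; a minimal sketch using the example values from the question:

```r
# Example matrix from the question; first row holds the times
A <- rbind(c(26.4, 26.5, 26.6, 26.7),
           c(-15,  -5,    2,    3),
           c(  6,  -7,    5,    8),
           c(  9,   4,    4,   -2))

times  <- A[1, ]
values <- A[-1, , drop = FALSE]

# One row per (site, time) observation, analogous to `long` above
long_base <- data.frame(
    time    = rep(times, times = nrow(values)),
    site_id = rep(seq_len(nrow(values)), each = length(times)),
    value   = as.vector(t(values))
)
```

The `data.table` chain above does the same in one pipeline; the base R version merely makes the transpose-then-stack steps explicit.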

Aggregating in a non-equi join

Now, the OP has requested to aggregate the observations for each site, but only observations at or after a specific starting time are to be included. A data frame B with the starting time for each site is supplied.

The observations in long can be combined with the starting times in B as follows:

B <- data.table(
  site_id = 1:3,
  time = c(26.4, 26.6, 26.5))

B
   site_id time
1:       1 26.4
2:       2 26.6
3:       3 26.5
# aggregating in a non-equi join grouped by the join conditions
long[B, on = .(site_id, time >= time), by = .EACHI, sum(value)] 
   site_id time  V1
1:       1 26.4 -15
2:       2 26.6  13
3:       3 26.5   6

Edit: Limit the number of observations in the aggregation

The OP has asked in a comment and in another question how to limit the number of observations to be aggregated after the starting time. This can be achieved by a slight modification:

max_values <- 2L
long[B, on = .(site_id, time >= time), by = .EACHI, sum(value[1:max_values])]  
   site_id time  V1
1:       1 26.4 -20
2:       2 26.6  13
3:       3 26.5   8

Note that max_values is set to 2L here for illustration.
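One caveat: `value[1:max_values]` produces `NA` entries whenever a group holds fewer than `max_values` observations, which turns the whole sum into `NA`; `head(value, max_values)` is a safer choice. A self-contained sketch, rebuilding `long` and `B` from the example above:

```r
library(data.table)

# Rebuild the long table and B from the question's example data
long <- data.table(
    time    = rep(c(26.4, 26.5, 26.6, 26.7), times = 3),
    site_id = rep(1:3, each = 4),
    value   = c(-15, -5, 2, 3,  6, -7, 5, 8,  9, 4, 4, -2)
)
B <- data.table(site_id = 1:3, time = c(26.4, 26.6, 26.5))

max_values <- 2L
# head() returns at most max_values elements, so groups shorter
# than max_values are summed as-is instead of producing NA
res <- long[B, on = .(site_id, time >= time), by = .EACHI,
            sum(head(value, max_values))]
res
#    site_id time  V1
# 1:       1 26.4 -20
# 2:       2 26.6  13
# 3:       3 26.5   8
```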

Uwe answered Dec 01 '25 12:12