I have two sets of samples that were taken at independent times. I would like to merge them and calculate the missing values for the times where I do not have values for both. Simplified example:
A <- cbind(time=c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
           Avalue=c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
B <- cbind(time=c(15, 30, 45, 60), Bvalue=c(100, 200, 300, 400))
C <- merge(A,B, all=TRUE)
   time Avalue Bvalue
1    10      1     NA
2    15     NA    100
3    20      2     NA
4    30      3    200
5    40      2     NA
6    45     NA    300
7    50      1     NA
8    60      2    400
9    70      3     NA
10   80      2     NA
11   90      1     NA
12  100      2     NA
Assuming linear change between consecutive samples, it is possible to calculate the missing NA values. Intuitively, it is easy to see that the A value at times 15 and 45 should be 1.5. A proper calculation for B at time 20, for instance, would be
100 + (20 - 15) * (200 - 100) / (30 - 15)
which equals 133.33333. The first parenthesis is the time between the estimate time and the last available sample. The second is the difference in value between the nearest samples. The third is the time between the nearest samples.
How can I use R to calculate the NA values?
To fill in the missing values in Excel, we can highlight the range starting before and ending after the missing values, then click Home > Editing > Fill > Series. If we select the Type as Growth and check the box next to Trend, Excel automatically identifies the growth trend in the data and fills in the missing values.
You can interpolate missing values (NaN) in a pandas DataFrame or Series with interpolate(). Use dropna() and fillna() to remove missing values or to fill them with a specific value.
The linear interpolation formula is y = y1 + ((x - x1) / (x2 - x1)) * (y2 - y1), where x is the known value, y is the unknown value, (x1, y1) is the known point just below x, and (x2, y2) is the known point just above x.
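As a minimal sketch, the formula above can be written as a small R function. The name `lerp` and its argument names are made up for illustration; the numbers being checked come from the example in the question.

```r
# Linear interpolation between (x1, y1) and (x2, y2), evaluated at x.
# `lerp` is a hypothetical helper name, not a base R function.
lerp <- function(x, x1, y1, x2, y2) {
  y1 + (x - x1) / (x2 - x1) * (y2 - y1)
}

# B at time 20, between the samples at time 15 (100) and time 30 (200)
b20 <- lerp(20, x1 = 15, y1 = 100, x2 = 30, y2 = 200)

# A at time 15, between the samples at time 10 (1) and time 20 (2)
a15 <- lerp(15, x1 = 10, y1 = 1, x2 = 20, y2 = 2)

print(c(b20, a15))
```

This only handles one point at a time and leaves finding the neighbouring samples to the caller.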
Using the zoo package:
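library(zoo)
# Convert the merged data frame to a zoo object
Cz <- zoo(C)
# Use the time column as the index so interpolation follows actual time, not row position
index(Cz) <- Cz[,1]
# Linearly interpolate the NA values along the index
Cz_approx <- na.approx(Cz)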
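If adding a package is not an option, base R's `approx()` can do the same kind of linear interpolation one column at a time. The following is a sketch that rebuilds `C` from the question's data; it is an alternative to the zoo call, not part of the original answer. With the default `rule = 1`, times outside the sampled range of a column stay `NA`.

```r
# Rebuild the example data from the question
A <- cbind(time = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
           Avalue = c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
B <- cbind(time = c(15, 30, 45, 60), Bvalue = c(100, 200, 300, 400))
C <- merge(A, B, all = TRUE)

# Interpolate each value column against time, using only its non-NA samples
filled <- C
for (col in c("Avalue", "Bvalue")) {
  known <- !is.na(C[[col]])
  filled[[col]] <- approx(x = C$time[known], y = C[[col]][known],
                          xout = C$time)$y
}

print(filled)
```

Passing `rule = 2` to `approx()` would instead carry the nearest sample value outward past the ends, which may or may not be appropriate for the data.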
The statistically proper way to do this and still get valid confidence intervals is to use multiple imputation. See Rubin's classic book; there is also an excellent R package for this (mi).