Is anyone familiar with imputing missing values in a distance matrix? For ordinary data (tables with continuous and nominal variables) there are many imputation techniques, e.g. hot deck and cold deck imputation, prediction models and so on. However, there is almost no information about how to deal with distance matrices.
Example:
distance <- dist(rnorm(20))
distance[c(10, 20, 30, 40, 50, 60)] <- NA
How can the missing values be imputed in this case?
Two procedures can complete a partial distance matrix: one is based on the ultrametric inequality, and the other is an additive procedure based on the four-point condition (both algorithms are described in detail in Makarenkov & Lapointe, 2004). Both are implemented in the ape package for R.
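The choice of method depends on the properties of the distances. The ultrametric procedure suits distances that are expected to satisfy the ultrametric inequality, d(i,j) <= max(d(i,k), d(j,k)) for every triple of objects (as dendrogram distances do). The additive procedure suits distances that are expected to satisfy the weaker four-point condition, i.e. distances that can be represented as path lengths on a tree.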
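If a complete reference matrix of the same kind of data is available, the two properties can be checked directly. The following is only a rough base-R sketch: the helper names is_ultrametric and is_additive and the tolerance tol are my own and are not part of ape. The additive check assumes the input is already a metric and needs at least four objects to be meaningful.

```r
# Hypothetical helpers (not from ape): brute-force property checks
# on a complete distance matrix without NAs.

# TRUE if d(i,j) <= max(d(i,k), d(j,k)) holds for every triple.
is_ultrametric <- function(d, tol = 1e-8) {
  m <- as.matrix(d)
  n <- nrow(m)
  for (i in seq_len(n)) for (j in seq_len(n)) for (k in seq_len(n)) {
    if (m[i, j] > max(m[i, k], m[j, k]) + tol) return(FALSE)
  }
  TRUE
}

# TRUE if the four-point condition holds for every set of four
# distinct objects: of the three pairwise sums, the two largest
# must be equal. Assumes d is already a metric.
is_additive <- function(d, tol = 1e-8) {
  m <- as.matrix(d)
  n <- nrow(m)
  if (n < 4) return(TRUE)  # no quadruple to test
  quads <- combn(n, 4)
  for (q in seq_len(ncol(quads))) {
    p <- quads[, q]
    s <- sort(c(m[p[1], p[2]] + m[p[3], p[4]],
                m[p[1], p[3]] + m[p[2], p[4]],
                m[p[1], p[4]] + m[p[2], p[3]]))
    if (s[3] - s[2] > tol) return(FALSE)
  }
  TRUE
}

# Points on a line are additive (a path is a tree) but not ultrametric.
is_additive(dist(1:10))
is_ultrametric(dist(1:10))
```

Both loops are brute force (cubic and quartic in the number of objects), so this is only practical for small matrices.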
# Generate a distance matrix with five missing values
set.seed(111)
dd <- dist(1:10)
dd[sample(x = 1:length(dd), size = 5)] <- NA
dd
    1  2  3  4  5  6  7  8  9
2   1
3   2  1
4   3  2  1
5   4  3  2  1
6   5  4  3  2  1
7   6  5  4 NA NA  1
8   7  6 NA  4  3  2  1
9   8 NA  6  5  4  3  2 NA
10  9  8  7  6  5  4  3  2  1
# Replace missing data
library(ape)
as.dist(additive(dd))     # additive procedure
as.dist(ultrametric(dd))  # ultrametric procedure
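If it is unclear which procedure fits your data better, one option is to hide a few known entries, impute them, and compare the result with the true values. This is a minimal sketch and not something from the paper: imputation_rmse is a hypothetical helper of mine, and it assumes that the imputation function takes a dist object with NAs and returns something as.dist() accepts (as additive() and ultrametric() are used above).

```r
# Hypothetical helper: mask the entries idx of a complete dist object,
# impute them with impute_fun, and return the root-mean-square error
# on the masked entries.
imputation_rmse <- function(d_full, idx, impute_fun) {
  d_masked <- d_full
  d_masked[idx] <- NA                       # hide known distances
  d_hat <- as.dist(impute_fun(d_masked))    # completed matrix as dist
  sqrt(mean((d_hat[idx] - d_full[idx])^2))  # error on hidden entries only
}

set.seed(1)
d_full <- dist(1:10)
idx <- sample(seq_along(d_full), 5)

# Compare both procedures if ape is installed.
if (requireNamespace("ape", quietly = TRUE)) {
  print(c(additive    = imputation_rmse(d_full, idx, ape::additive),
          ultrametric = imputation_rmse(d_full, idx, ape::ultrametric)))
}
```

The procedure with the smaller error on the hidden entries is the more plausible choice for that kind of data. Repeating this over several random masks gives a more stable picture.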
Makarenkov V, Lapointe FJ (2004). A weighted least-squares approach for inferring phylogenies from incomplete distance matrices. Bioinformatics, 20(13), 2113-2121, DOI: 10.1093/bioinformatics/bth211.