
Imputation of missing data in distance matrix

Tags: r, missing-data

Is anyone familiar with imputing missing values in a distance matrix? For ordinary data (tables with continuous and nominal variables) there are many imputation techniques, e.g. hot deck and cold deck imputation, prediction models and so on. However, there is almost no information about how to deal with distance matrices.

Example:

distance <- dist(rnorm(20))
distance[c(10, 20, 30, 40, 50, 60)] <- NA

How can the missing values be imputed in this case?

Felix Bristow asked Aug 30 '25 18:08


1 Answer

Two procedures can complete a partial distance matrix: one is based on the ultrametric inequality, and the other is an additive procedure based on the four-point condition (both algorithms are described in detail in Makarenkov & Lapointe, 2004). Both methods are implemented in the ape package for R, as the functions ultrametric() and additive().

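Which method to choose depends on the properties of the distances: the ultrametric procedure suits distances that satisfy the ultrametric inequality, d(i,j) <= max(d(i,k), d(j,k)) for every triple of objects, while the additive procedure suits tree-like distances that satisfy the four-point condition.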
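
To make the two properties concrete, here is a minimal base R sketch that checks them on the known entries of a distance matrix. The helper names `two_largest_equal`, `is_ultrametric` and `is_additive` are hypothetical (they are not part of ape), and the tolerance is an arbitrary choice. It relies on the standard equivalent formulations: distances are ultrametric if, in every triple of objects, the two largest distances are equal; they are additive (four-point condition) if, in every quadruple, the two largest of the three pairwise sums are equal.

```r
# Hypothetical helpers, not part of ape: check the known (non-NA) entries only.
two_largest_equal <- function(x, tol = 1e-8) {
  s <- sort(x, decreasing = TRUE)
  abs(s[1] - s[2]) <= tol
}

# Ultrametric inequality: in every triple, the two largest distances are equal.
is_ultrametric <- function(d, tol = 1e-8) {
  m <- as.matrix(d)
  ok <- apply(combn(nrow(m), 3), 2, function(p) {
    x <- c(m[p[1], p[2]], m[p[1], p[3]], m[p[2], p[3]])
    anyNA(x) || two_largest_equal(x, tol)  # triples with missing entries are skipped
  })
  all(ok)
}

# Four-point condition: in every quadruple, the two largest pairwise sums are equal.
is_additive <- function(d, tol = 1e-8) {
  m <- as.matrix(d)
  ok <- apply(combn(nrow(m), 4), 2, function(p) {
    s <- c(m[p[1], p[2]] + m[p[3], p[4]],
           m[p[1], p[3]] + m[p[2], p[4]],
           m[p[1], p[4]] + m[p[2], p[3]])
    anyNA(s) || two_largest_equal(s, tol)  # quadruples with missing entries are skipped
  })
  all(ok)
}

dd_full <- dist(1:10)     # ten points on a line
is_ultrametric(dd_full)   # FALSE
is_additive(dd_full)      # TRUE
```

For the toy example used below (points on a line), the distances are additive but not ultrametric, so the additive procedure is presumably the better fit there. The check enumerates all triples and quadruples, so it is only practical for small matrices.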

# Generate a distance matrix with five missing values
set.seed(111)
dd <- dist(1:10)
dd[sample(seq_along(dd), size = 5)] <- NA
dd

    1  2  3  4  5  6  7  8  9
2   1                        
3   2  1                     
4   3  2  1                  
5   4  3  2  1               
6   5  4  3  2  1            
7   6  5  4 NA NA  1         
8   7  6 NA  4  3  2  1      
9   8 NA  6  5  4  3  2 NA   
10  9  8  7  6  5  4  3  2  1

# Replace missing data
library(ape)
as.dist(additive(dd))      # additive procedure
as.dist(ultrametric(dd))   # ultrametric procedure
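
One way to get a feel for the two procedures is to mask entries whose true values are known and compare the imputed values against them. The following is a rough sketch that rebuilds the example above; `truth`, `miss` and `cmp` are names introduced here for illustration, and no particular result is implied.

```r
library(ape)

set.seed(111)
truth <- dist(1:10)                       # complete distances, kept for comparison
dd <- truth
miss <- sample(seq_along(dd), size = 5)   # positions of the masked entries
dd[miss] <- NA

filled_add <- as.dist(additive(dd))       # additive procedure
filled_ult <- as.dist(ultrametric(dd))    # ultrametric procedure

# Side-by-side view of the true and imputed values at the masked positions
cmp <- data.frame(
  truth       = truth[miss],
  additive    = filled_add[miss],
  ultrametric = filled_ult[miss]
)
cmp
```

On real data the true values are unknown, so the same masking idea would have to be applied to entries that were actually observed.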

Makarenkov V, Lapointe FJ (2004). A weighted least-squares approach for inferring phylogenies from incomplete distance matrices. Bioinformatics, 20(13), 2113-2121, DOI: 10.1093/bioinformatics/bth211.

Vladimir Mikryukov answered Sep 02 '25 12:09