
create distance matrix for strings

Tags:

r

I would like to speed up the following code. Could someone please suggest some improvements?

library(dplyr)
library(fuzzywuzzyR)

set.seed(42)
rm(list = ls())
options(scipen = 999)

init = FuzzMatcher$new()

data <- data.frame(string = c("hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"))
data$string <- as.character(data$string)

distance_function <- function(string_1, string_2) {
    init$Token_set_ratio(string1 = string_1, string2 = string_2)
}

combinations <- combn(nrow(data), 2)
distances <- matrix(, nrow = 1, ncol = ncol(combinations))

distance_matrix <- matrix(NA, nrow = nrow(data), ncol = nrow(data), dimnames = list(data$string, data$string))

for (i in 1:ncol(combinations)) {

    distance <- distance_function(data[combinations[1, i], 1], data[combinations[2, i], 1])

    #print(data[combinations[1, i], 1])
    #print(data[combinations[2, i], 1])
    #print(distance)

    distance_matrix[combinations[1, i], combinations[2, i]] <- distance
    distance_matrix[combinations[2, i], combinations[1, i]] <- distance

}

distance_matrix

By the way, I tried proxy::dist and various other approaches without success. I also do not think the string distance function works as expected, but that's another story.

Ultimately, I want to use the distance matrix to perform some clustering that groups similar strings (independent of word order).

asked May 08 '26 13:05 by cs0815


1 Answer

If you want a matrix, you can use the stringdist package. As far as I could tell, the package you were using calculates a Levenshtein-based score, so I used method = "lv" (you could try other methods too). Let me know if you run into issues, or if you would prefer a format other than a matrix. You may also want to consider a method other than plain Levenshtein distance, because it is not scaled by string length: for example, a change of two characters in a four-letter word looks the same as a change of two characters in a 20-word sentence. Good luck!

library(stringdist)

# Example data; stringsAsFactors = FALSE keeps the column as character
data <- data.frame(
    string = c("hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"),
    stringsAsFactors = FALSE
)

# Pairwise Levenshtein distances ("lv") between all strings
dist_mat <- stringdist::stringdistmatrix(data$string, data$string, method = "lv")

# Label rows and columns with the strings themselves
rownames(dist_mat) <- data$string
colnames(dist_mat) <- data$string

dist_mat
                        hello world hello vorld hello world 1 hello world hello world hello world
hello world                       0           1             2           0                      12
hello vorld                       1           0             3           1                      13
hello world 1                     2           3             0           2                      11
hello world                       0           1             2           0                      12
hello world hello world          12          13            11          12                       0
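Since the end goal is clustering independent of word order, here is a rough sketch of one way that might work. It is not what fuzzywuzzyR's Token_set_ratio does internally. The helper normalize_tokens and the cut height of 0.05 are assumptions made up for illustration, as is the choice to lower-case the strings and to divide the Levenshtein distance by the longer string's length. Calling stringdistmatrix with a single vector returns a dist object (lower triangle only), which hclust accepts directly.

```r
library(stringdist)

strings <- c("hello world", "hello vorld", "hello world 1",
             "hello world", "hello world hello world")

# Hypothetical helper (not part of any package): lower-case each string,
# split it on whitespace, drop repeated tokens, and sort the tokens so that
# word order and repetition no longer affect the comparison.
normalize_tokens <- function(x) {
  tokens <- strsplit(tolower(trimws(x)), "\\s+")
  vapply(tokens,
         function(tok) paste(sort(unique(tok)), collapse = " "),
         character(1))
}

keys <- normalize_tokens(strings)

# One-argument form: only the lower triangle is computed and the result
# is a "dist" object rather than a full square matrix.
d_lv <- stringdistmatrix(keys, method = "lv")

# Scale each distance by the length of the longer of the two keys, so that
# two edits in a short string count for more than two edits in a long one.
max_len <- outer(nchar(keys), nchar(keys), pmax)
d_norm  <- as.dist(as.matrix(d_lv) / max_len)

# Hierarchical clustering on the scaled distances. The linkage method and
# the cut height are arbitrary choices and would need tuning on real data.
hc     <- hclust(d_norm, method = "average")
groups <- cutree(hc, h = 0.05)

# Strings whose normalized keys are identical end up in the same group.
split(strings, groups)
```

With this normalization, "hello world" and "hello world hello world" reduce to the same key, so their distance is zero and they fall into the same group at any cut height. Whether dropping repeated words is desirable depends on the data; removing unique() from the helper keeps repetitions while still ignoring word order.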

answered May 10 '26 03:05 by Andrew


