I would like to speed up the following code. Could someone please make some suggestions?
library(dplyr)
library(fuzzywuzzyR)
set.seed(42)
rm(list = ls())
options(scipen = 999)
init = FuzzMatcher$new()
data <- data.frame(string = c("hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"))
data$string <- as.character(data$string)
distance_function <- function(string_1, string_2) {
init$Token_set_ratio(string1 = string_1, string2 = string_2)
}
combinations <- combn(nrow(data), 2)
distances <- matrix(, nrow = 1, ncol = ncol(combinations))
distance_matrix <- matrix(NA, nrow = nrow(data), ncol = nrow(data), dimnames = list(data$string, data$string))
for (i in 1:ncol(combinations)) {
distance <- distance_function(data[combinations[1, i], 1], data[combinations[2, i], 1])
#print(data[combinations[1, i], 1])
#print(data[combinations[2, i], 1])
#print(distance)
distance_matrix[combinations[1, i], combinations[2, i]] <- distance
distance_matrix[combinations[2, i], combinations[1, i]] <- distance
}
distance_matrix
By the way, I tried proxy::dist and various other approaches without success. I also do not think that the string distance function works as expected, but that's another story.
Ultimately, I want to use the distance matrix to perform some clustering to group similar strings (independent of word order).
If you want a matrix, you can use the stringdist package. From what I could tell, the package you were using calculates Levenshtein distance, so I used method = "lv" (you could try other methods too). Let me know if you have issues, or if a format other than a matrix would be preferred. Also, you may want to consider a method other than raw Levenshtein distance, because it ignores string length: for example, a change of two characters in a four-letter word counts the same as a change of two characters in a 20-word sentence. Good luck!
library(dplyr)
library(stringdist)
set.seed(42)
rm(list = ls())
options(scipen = 999)
data <- data.frame(string = c("hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"))
data$string <- as.character(data$string)
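# Compute all pairwise Levenshtein distances in one vectorized call
dist_mat <- stringdist::stringdistmatrix(data$string, data$string, method = "lv")
# Label rows and columns with the original strings
rownames(dist_mat) <- data$string
colnames(dist_mat) <- data$string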
dist_mat
                        hello world hello vorld hello world 1 hello world hello world hello world
hello world                       0           1             2           0                      12
hello vorld                       1           0             3           1                      13
hello world 1                     2           3             0           2                      11
hello world                       0           1             2           0                      12
hello world hello world          12          13            11          12                       0
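Since the goal is to cluster strings regardless of word order, here is a minimal sketch of one way to go from the distance matrix to clusters. It rests on several assumptions that are not in the question: `sort_tokens` is a hypothetical helper that sorts the words of each string before comparing, the distances are divided by the length of the longer string to address the length issue mentioned above, and the cutoff of 0.25 and the extra example strings are arbitrary. It is a stand-in for, not an equivalent of, `Token_set_ratio`.

```r
library(stringdist)

strings <- c("hello world", "hello vorld", "hello world 1",
             "hello world", "world hello", "goodbye moon")

# Hypothetical helper: sort the words in each string so that
# "world hello" and "hello world" compare as identical.
sort_tokens <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE),
         function(words) paste(sort(words), collapse = " "),
         character(1))
}

normalized <- sort_tokens(strings)

# Full pairwise Levenshtein matrix on the word-sorted strings
full <- stringdistmatrix(normalized, normalized, method = "lv")

# Scale each distance by the length of the longer string in the pair,
# so the values fall between 0 and 1 regardless of string length
len <- nchar(normalized)
scaled <- full / outer(len, len, pmax)

# Hierarchical clustering; the 0.25 cutoff is an arbitrary assumption
tree <- hclust(as.dist(scaled), method = "average")
clusters <- cutree(tree, h = 0.25)

data.frame(string = strings, cluster = clusters)
```

The linkage method and the cutoff both need tuning on real data. If only the clustering is needed, calling `stringdistmatrix()` with a single vector returns a `dist` object directly, which avoids computing each pair twice.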