The %in% operator in R checks whether something is contained in something else, obviously enough. But I'm curious about the performance. In Python, searching for an item in a set or among a dict's keys is O(1), because (I think) the set is a hash table. Searching for an item in a Python list, on the other hand, can be O(n) for a list of length n, because it searches element by element. So how does %in% work behind the scenes for different data types in R? It seems to take 5 times longer to search for something in a factor than in a vector, and it looks like %in% searches a vector linearly. At first I thought a factor might be like a set in Python, since they both reduce something to its unique values, but not at all: https://www.tutorialspoint.com/r/r_data_types.htm. Here is some example code so you can see what I mean with the runtimes:
library(microbenchmark)
s <- seq(5000)
microbenchmark(1 %in% s, times = 100000)
# searching for a term further along the vector takes longer
microbenchmark(4999 %in% s, times = 100000)
s <- as.factor(s)
# searching for something in a factor takes way longer than a vector
# I think because everything is converted to a character dtype
microbenchmark(4999 %in% s, times = 100000)
My main question is this: is there a way to make %in% O(1) in R? A related question: is there an equivalent (in R) of the set() datatype in Python?
As we discussed in the comments, there is an inherent set-like mechanism in R, though it is admittedly a little hackish and perhaps not quite what was intended. (Some of the limitations of this hack are documented in the hashmap package.)
Environments in R are internally hashed. This can be used to store arbitrary objects with random access (both read and write).
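As a minimal illustration of that read/write access (a sketch of mine, not part of the original answer; the key names are arbitrary):
kv <- new.env(parent = emptyenv())              # new.env() hashes by default
assign("apple", 1, envir = kv)                  # write
kv[["banana"]] <- 2                             # write, list-like syntax
get("apple", envir = kv)                        # read
# [1] 1
exists("cherry", envir = kv, inherits = FALSE)  # membership test
# [1] FALSE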
To run some benchmarks that substantiate your initial concern and show the improvement that using environments can make, we'll first generate some similar data, ordered in various fashions to highlight the position effect you brought up:
library(microbenchmark)
set.seed(2)
s1 <- seq(5000)
s2 <- rev(s1)          # reversed, to show the position effect you mentioned (s1 is sorted)
s3 <- sample(s1)       # to shake things up a little
s4 <- as.character(s3) # for comparison with the character-based names in 'l' and 'e'
l <- list()
e <- new.env(parent = emptyenv())
for (i in s4) {
  assign(i, TRUE, envir = e)
  l[[i]] <- TRUE
}
head(names(l)) # unordered
# [1] "925" "3512" "2866" "840" "4716" "4713"
The list keeps its elements in insertion order, and finding a name means scanning through them, which supports the assumption that its names are not hashed:
which(names(l) == "1")
# [1] 2291
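A quick check of that claim (my own, not part of the original answer): since the names are not hashed, looking up a name stored near the front of the list should be faster than looking up one stored deeper in:
microbenchmark(
  front = l[["925"]],  # first name inserted (see head(names(l)) above)
  deep  = l[["1"]],    # stored at position 2291
  times = 10000
)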
Environments have no such positional indexing:
e[[1]]
# Error in e[[1]] : wrong arguments for subsetting an environment
Some quick membership testing: I used a logical for the value, though that is completely arbitrary. Anything other than NULL would suffice for our needs here. We'll use a simple !is.null(e[[...]]) to test for specific membership:
!is.null(e[["1"]])
# [1] TRUE
!is.null(e[["10000"]])
# [1] FALSE
!is.null(l[["1"]])
# [1] TRUE
!is.null(l[["10000"]])
# [1] FALSE
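As an aside (not part of the original answer), base R's exists() can do the same membership check directly, without relying on the stored value:
exists("1", envir = e, inherits = FALSE)
# [1] TRUE
exists("10000", envir = e, inherits = FALSE)
# [1] FALSE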
microbenchmark(
  vec1 = 1 %in% s1,
  vec2 = 1 %in% s2,
  vec3 = 1 %in% s3,
  vec4 = "1" %in% s4,
  lst = is.null(l[["1"]]),
  env = is.null(e[["1"]]),
  times = 1000
)
# Warning in microbenchmark(vec1 = 1 %in% s1, vec2 = 1 %in% s2, vec3 = 1 %in% :
# Could not measure a positive execution time for 6 evaluations.
# Unit: nanoseconds
# expr min lq mean median uq max neval
# vec1 5835 6929 12493.25 7294 9482 3214588 1000
# vec2 9117 9847 16660.73 10212 12764 4081050 1000
# vec3 7294 8388 19983.63 8752 10576 3274759 1000
# vec4 11670 12400 15423.03 12764 14223 74394 1000
# lst 20787 21517 24561.72 21881 22975 143317 1000
# env 0 1 461.25 365 366 18235 1000
Not surprisingly, the list does not perform well, though it does seem to beat the vectors in the max case (which is relatively meaningless). Also not surprisingly, given our claim that the environment uses an internal hash, it performs quite well. Is it O(1)?
microbenchmark(
  samp5 = sapply(as.character(sample(5000, size = 5)), function(a) is.null(e[[a]])),
  samp50 = sapply(as.character(sample(5000, size = 50)), function(a) is.null(e[[a]])),
  samp500 = sapply(as.character(sample(5000, size = 500)), function(a) is.null(e[[a]])),
  samp5000 = sapply(as.character(sample(5000, size = 5000)), function(a) is.null(e[[a]]))
)
# Unit: microseconds
# expr min lq mean median uq max neval
# samp5 25.893 32.4565 49.58154 40.4795 58.3485 169.573 100
# samp50 108.309 119.4310 156.45244 135.8410 167.3850 681.938 100
# samp500 935.750 1023.2715 1265.29732 1073.9610 1172.6055 6841.985 100
# samp5000 9410.008 10337.5520 11137.82968 10650.0765 11280.0485 15455.548 100
The first one, samp5, appears to take a bit longer per lookup. That isn't too surprising, since there is fixed overhead from sapply, sampling, and other things. The remaining rows, however, scale roughly linearly with the number of lookups, so each individual lookup costs about the same no matter how many we do. This suggests the membership test is indeed O(1) for these basic set operations.
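As a further check (a sketch of mine, not part of the original benchmarks): O(1) should also mean that a single lookup does not get slower as the environment itself grows, so we can grow the container and always look up the last key inserted, which is the worst case for a linear scan over the vector:
for (n in c(1e3, 1e4, 1e5)) {
  v   <- as.character(seq_len(n))
  env <- new.env(parent = emptyenv())
  for (k in v) assign(k, TRUE, envir = env)
  key <- v[length(v)]                # worst case for scanning 'v'
  print(microbenchmark(
    vec = key %in% v,
    env = !is.null(env[[key]]),
    times = 1000
  ))
}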
NB: I had to use the whole sapply(...) trick because, unlike vectors and lists, R's environments do not allow subsetting with a vector.
e[[c("1")]]
# [1] TRUE
e[[c("1","10")]]
# Error in e[[c("1", "10")]] :
# wrong arguments for subsetting an environment
This is one of the claims made by (and fixed by) hashmap.
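If you want to stay in base R, one workaround (my suggestion, not something the original answer uses) is mget(), which looks up several names at once and lets you supply a placeholder for the ones that are missing:
found <- mget(c("1", "10000"), envir = e, ifnotfound = list(NULL))
!sapply(found, is.null)
#     1 10000
#  TRUE FALSE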
Extra credit: in order to facilitate using an environment as a set, you could use simple adders and deleters:
newset <- function() new.env(parent = emptyenv())
setadd <- function(set, n) set[[n]] <- TRUE
# note: assigning NULL into an environment does not remove the name, so use rm()
setdel <- function(set, n) if (!is.null(set[[n]])) rm(list = n, envir = set)
setcontains <- function(set, n) !is.null(set[[n]])
setmembers <- function(set) names(set)
e <- newset()
setcontains(e, "a")
# [1] FALSE
setadd(e, "a")
setcontains(e, "a")
# [1] TRUE
setmembers(e)
# [1] "a"
setdel(e, "a")
setcontains(e, "a")
# [1] FALSE
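If you wanted to take the extra credit further, you could layer simple set algebra on top of these helpers (a sketch of mine, not part of the original answer):
setunion <- function(a, b) {
  out <- newset()
  for (n in c(setmembers(a), setmembers(b))) setadd(out, n)
  out
}
setintersect <- function(a, b) {
  out <- newset()
  for (n in setmembers(a)) if (setcontains(b, n)) setadd(out, n)
  out
}
a <- newset(); setadd(a, "x"); setadd(a, "y")
b <- newset(); setadd(b, "y"); setadd(b, "z")
sort(setmembers(setunion(a, b)))
# [1] "x" "y" "z"
setmembers(setintersect(a, b))
# [1] "y"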
(There is a similar but much more extensive blog post on this by Jeffrey Horner here: http://jeffreyhorner.tumblr.com/post/114524915928/hash-table-performance-in-r-part-i.)