
Time complexity of %in% in R; is there a way to make it O(1) like with sets in Python?

The %in% operator in R checks if something is in something else, kind of obviously. But I'm curious about the performance. In Python, searching for an item in a set or in dict keys is O(1), because the set is (I think) a hash table. Searching for an item in a Python list, on the other hand, can be O(n) for a list of length n, because it searches element by element. So how does %in% work behind the scenes for the different data types in R? It seems to take about 5 times longer to search for something in a factor than in a vector, and %in% appears to search a vector linearly. At first I thought a factor might be like a set in Python, since both reduce something to its unique values, but apparently not at all: https://www.tutorialspoint.com/r/r_data_types.htm. Here is some example code so you can see what I mean with the runtimes:

library(microbenchmark)
s <- seq(5000)
microbenchmark(1 %in% s, times = 100000)
# searching for a term further in the list takes longer
microbenchmark(4999 %in% s, times = 100000)
s <- as.factor(s)
# searching for something in a factor takes way longer than a vector
# I think because everything is converted to a character dtype
microbenchmark(4999 %in% s, times = 100000)
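As a sanity check on my character-conversion guess: the docs for match() (which %in% is built on) say factors are converted to character vectors before matching, so timing the explicit coercion should look similar:

# match() converts factors to character vectors, so this should be
# in the same ballpark as the factor benchmark above
microbenchmark("4999" %in% as.character(s), times = 100000)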

My main question is this: is there a way to make %in% O(1) in R? A related question: is there an equivalent (in R) of the set() datatype in Python?

asked by wordsforthewise

1 Answer

As we discussed in the comments, there is an inherent set-like mechanism in R, though it is admittedly a little hackish and perhaps not quite what was intended. (Some of the limitations of this hack are documented in the hashmap package.)

Environments in R are internally hashed. This can be used to store arbitrary objects with random access (both read and write). To run some benchmarks, I'll generate a few types of vectors to substantiate your initial concern and show the improvement that using environments can make.
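(Minor aside: new.env() hashes by default, and you can pre-size the hash table if you know roughly how many keys you'll store. This is optional and not used below.)

# hash = TRUE is already the default; size pre-allocates the table
e_sized <- new.env(hash = TRUE, parent = emptyenv(), size = 5000L)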

We'll first generate some similar data, ordered in various fashions to highlight an issue you brought up:

library(microbenchmark)
set.seed(2)
s1 <- seq(5000)
s2 <- rev(s1)          # reversed, to expose the position effect you noticed
s3 <- sample(s1)       # shuffled, to shake things up a little
s4 <- as.character(s3) # for comparison with the character-based names in 'l' and 'e'

l <- list()                       # named list, for comparison
e <- new.env(parent = emptyenv()) # hashed environment, our 'set'
for (i in s4) {
  assign(i, TRUE, envir = e) # key is the string; the value is arbitrary
  l[[i]] <- TRUE
}
head(names(l)) # unordered
# [1] "925"  "3512" "2866" "840"  "4716" "4713"

The list does preserve the order of its elements, which supports the assumption that its names are not hashed (lookup by name is a linear scan):

which(names(l) == "1")
# [1] 2291

Environments have no such ordering and cannot be subset by position:

e[[1]]
# Error in e[[1]] : wrong arguments for subsetting an environment

Some quick membership testing: I used a logical for the value, though that is completely arbitrary; anything other than NULL would suffice for our needs here. We'll use a simple !is.null(e[[...]]) to test for membership:

!is.null(e[["1"]])
# [1] TRUE
!is.null(e[["10000"]])
# [1] FALSE
!is.null(l[["1"]])
# [1] TRUE
!is.null(l[["10000"]])
# [1] FALSE
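(Base R's exists() is an equivalent test; inherits = FALSE stops it from searching parent environments. Just another option, not used in the benchmarks below.)

exists("1", envir = e, inherits = FALSE)
# [1] TRUE
exists("10000", envir = e, inherits = FALSE)
# [1] FALSE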

microbenchmark(
  vec1 =  1  %in% s1,
  vec2 =  1  %in% s2,
  vec3 =  1  %in% s3,
  vec4 = "1" %in% s4,
  lst = !is.null(l[["1"]]),
  env = !is.null(e[["1"]]),
  times = 1000
)
# Warning in microbenchmark(vec1 = 1 %in% s1, vec2 = 1 %in% s2, vec3 = 1 %in%  :
#   Could not measure a positive execution time for 6 evaluations.
# Unit: nanoseconds
#  expr   min    lq     mean median    uq     max neval
#  vec1  5835  6929 12493.25   7294  9482 3214588  1000
#  vec2  9117  9847 16660.73  10212 12764 4081050  1000
#  vec3  7294  8388 19983.63   8752 10576 3274759  1000
#  vec4 11670 12400 15423.03  12764 14223   74394  1000
#   lst 20787 21517 24561.72  21881 22975  143317  1000
#   env     0     1   461.25    365   366   18235  1000

Not surprisingly, the list does not perform well; it only beats the vectors in the max column, which is relatively meaningless. Also not surprisingly, given our claim that the environment uses an internal hash, it performs quite well. But is it O(1)?

microbenchmark(
     samp5 = sapply(as.character(sample(5000, size =    5)), function(a) is.null(e[[a]])),
    samp50 = sapply(as.character(sample(5000, size =   50)), function(a) is.null(e[[a]])),
   samp500 = sapply(as.character(sample(5000, size =  500)), function(a) is.null(e[[a]])),
  samp5000 = sapply(as.character(sample(5000, size = 5000)), function(a) is.null(e[[a]]))
)
# Unit: microseconds
#      expr      min         lq        mean     median         uq       max neval
#     samp5   25.893    32.4565    49.58154    40.4795    58.3485   169.573   100
#    samp50  108.309   119.4310   156.45244   135.8410   167.3850   681.938   100
#   samp500  935.750  1023.2715  1265.29732  1073.9610  1172.6055  6841.985   100
#  samp5000 9410.008 10337.5520 11137.82968 10650.0765 11280.0485 15455.548   100

The first one, samp5, appears to take disproportionately long. That isn't too surprising, since there is fixed overhead from sapply, sampling, and so on. The remaining rows, however, scale roughly linearly with the number of samples: the median works out to about 2 microseconds per lookup whether we do 50, 500, or 5000 of them. Constant per-lookup cost suggests membership testing in an environment is indeed O(1).

NB: I had to use the whole sapply(...) trick because, unlike vectors and lists, R's environments do not allow subsetting with a vector.

e[[c("1")]]
# [1] TRUE
e[[c("1","10")]]
# Error in e[[c("1", "10")]] : 
#   wrong arguments for subsetting an environment

This is one of the limitations documented (and fixed) by the hashmap package.
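That said, base R's mget() can look up several names in an environment at once, which gives a serviceable vectorized membership test; ifnotfound supplies a placeholder for missing keys. A sketch:

# look up several keys at once; missing keys come back as NULL
found <- mget(c("1", "10", "10000"), envir = e, ifnotfound = list(NULL))
!sapply(found, is.null)
#     1    10 10000
#  TRUE  TRUE FALSE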

Extra credit: in order to facilitate using an environment as a set, you could use simple adders and deleters:

newset <- function() new.env(parent = emptyenv()) # a fresh, empty 'set'
setadd <- function(set, n) set[[n]] <- TRUE       # the value is arbitrary, just not NULL
# note: assigning NULL into an environment stores NULL rather than
# removing the binding, so use rm() to actually delete the key
setdel <- function(set, n) rm(list = n, envir = set)
setcontains <- function(set, n) !is.null(set[[n]])
setmembers <- function(set) names(set)

e <- newset()
setcontains(e, "a")
# [1] FALSE
setadd(e, "a")
setcontains(e, "a")
# [1] TRUE
setmembers(e)
# [1] "a"
setdel(e, "a")
setcontains(e, "a")
# [1] FALSE
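Finally, if you'd rather keep working with plain vectors, the CRAN package fastmatch (assuming it suits your use case) provides fmatch() and %fin% as drop-in replacements for match() and %in%; they build a hash table over the lookup table on the first call and reuse it afterwards, so repeated membership tests against the same vector are roughly O(1):

# install.packages("fastmatch")
library(fastmatch)
s <- sample(5000)
4999 %fin% s # first call hashes 's'; later lookups reuse the table
# [1] TRUE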

(There is a similar but much more extensive blog post on this by Jeffrey Horner here: http://jeffreyhorner.tumblr.com/post/114524915928/hash-table-performance-in-r-part-i.)

answered by r2evans


Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!