 

Factor from numeric vector drops every 100,000th element from its levels

Tags:

r

r-factor

Consider a vector of type numeric with over 100,000 elements. In the example below, it is simply the range 1:500001 converted to numeric.

n <- 500001
arr <- as.numeric(1:n)

The following sequence of factor() calls causes odd behaviour.

First, call factor() with the levels argument set to the same range that arr was built from. Predictably, the result reports exactly n levels:

> tmp <- factor(arr, levels=1:n)
> nlevels(tmp)
[1] 500001

Now call factor() again on that result. The new value, tmp2, is missing some of the levels:

> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 499996 

Checking which levels are missing, we find it is every 100,000th element (here, each element's value equals its index):

> which(!levels(tmp) %in% levels(tmp2))
[1] 100000 200000 300000 400000 500000 

Decreasing n below 100,000 eliminates this unexpected behaviour. However, it occurs for any n of 100,000 or more.

> n <- 99999
> arr <- as.numeric(1:n)
> tmp <- factor(arr, levels=1:n)
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 99999
> which(!levels(tmp) %in% levels(tmp2))
integer(0)

It also does not happen when arr is an integer vector rather than a numeric (double) one:

> n <- 500001
> arr <- as.integer(1:n)
> tmp <- factor(arr, levels=1:n)
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 500001

Finally, the problem does not occur when the levels argument is left unspecified in the first call to factor().

What could be causing this behaviour? Tested in R 4.3.2

stelioslogothetis asked Sep 08 '25 07:09


2 Answers

Building on ThomasIsCoding's answer: the cause is that R's scientific notation rule applies to real (double) numbers when they are converted to character, but not to integers.

For example, in the console:

options(scipen = 0) #uses scientific notation if fewer characters than normal

500000L
[1] 500000   #integer displayed in normal notation

500000
[1] 5e+05    #numeric displayed in shorter scientific notation

So, for numeric values, the character form of each multiple of 100000 (for example 5e+05) does not match the corresponding factor level (500000), which was generated from the integer range 1:n.
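The mismatch can be reproduced without factor() at all. The following is a minimal sketch, assuming the default options(scipen = 0); the variable names are only illustrative, and the remark about string matching reflects my reading of how factor() compares values with levels rather than a documented guarantee.

```r
# Assumes default options (scipen = 0); names below are illustrative only.
x_dbl <- 500000     # double ("numeric")
x_int <- 500000L    # integer

chr_dbl <- as.character(x_dbl)   # scientific form is shorter, so it is chosen
chr_int <- as.character(x_int)   # integers are not written in scientific form

print(chr_dbl)
print(chr_int)

# factor() effectively compares values and levels as strings,
# so a lookup of the double's string among the integer's strings fails.
print(match(chr_dbl, chr_int))
```

The last line prints NA, which is the same failed lookup that turns those elements into NA inside the factor.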

The problem can be solved by increasing scipen.

I thought scipen was primarily to control displayed values, so it is odd that it is being used for factor levels.
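An alternative that avoids touching a global option is to make the data and the levels the same type. This is a sketch under that assumption, using a smaller n than the question for speed; f_a and f_b are hypothetical names.

```r
n <- 100001                     # smaller than the question's n, same effect
arr <- as.numeric(1:n)

# Option A: pass levels of the same type (double) as the data.
# Note the level for 100000 is then labelled "1e+05".
f_a <- factor(arr, levels = as.numeric(1:n))

# Option B: convert the data to integer first, as in the question's
# integer example, so labels stay in plain notation.
f_b <- factor(as.integer(arr), levels = 1:n)

# Neither factor contains NA, and re-factoring keeps every level.
print(any(is.na(f_a)))
print(any(is.na(f_b)))
print(nlevels(factor(f_a)))
print(nlevels(factor(f_b)))
```

Option B is probably the better fit when the values really are whole numbers, since the level labels then read 100000 rather than 1e+05.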

Andrew Gustar answered Sep 10 '25 23:09


In your second call to factor(), NA values are not recorded as levels. For example:

> factor(c(NA, 1))
[1] <NA> 1
Levels: 1

In your case, you can see that the values at those positions have become NA in tmp:

> tail(tmp)
[1] 499996 499997 499998 499999 <NA>   500001
500001 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 500001

> setdiff(levels(tmp), levels(tmp2))
[1] "100000" "200000" "300000" "400000" "500000"

So those 5 NAs in tmp (corresponding to 100000, 200000, 300000, 400000, 500000) are not carried over into the levels of tmp2.
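To locate the affected positions directly, something like the following sketch can be used. It assumes a smaller n than the question (so only one multiple of 100000 is present) and introduces na_pos and dropped as illustrative names.

```r
n <- 100001                        # smaller than the question's n
arr <- as.numeric(1:n)
tmp <- factor(arr, levels = 1:n)

# Positions whose value failed to match any level and became NA
na_pos <- which(is.na(tmp))
print(na_pos)

# The level still exists in tmp, but no element uses it ...
print("100000" %in% levels(tmp))

# ... so re-factoring, which keeps only the levels in use, drops it.
tmp2 <- factor(tmp)
dropped <- setdiff(levels(tmp), levels(tmp2))
print(dropped)
```

With the question's n of 500001, the same check would be expected to report all five multiples of 100000.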


If you don't specify levels = 1:n when generating tmp, you will see the following:

> tail(tmp)
[1] 499996 499997 499998 499999 5e+05  500001
500001 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 500001

> tail(levels(tmp))
[1] "499996" "499997" "499998" "499999" "5e+05"  "500001"

Here tmp holds 5e+05 instead of NA, which confirms that the NAs are the troublemakers.

ThomasIsCoding answered Sep 10 '25 22:09