Consider a vector of type numeric with over 100,000 elements. In the example below, it's simply the range 1:500001.
n <- 500001
arr <- as.numeric(1:n)
The following sequence of factor calls causes odd behaviour:
First, call factor with the levels argument specified as the exact same range that arr was defined with. Predictably, the resulting variable has exactly n levels:
> tmp <- factor(arr, levels=1:n)
> nlevels(tmp)
[1] 500001
Now call factor again on the result from before. The outcome is that the new value, tmp2, is missing some values from its levels:
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 499996
Checking to see which items are missing, we find it's every 100,000th element (each of which, in this case, has a value equal to its index):
> which(!levels(tmp) %in% levels(tmp2))
[1] 100000 200000 300000 400000 500000
Decreasing n below 100,000 eliminates this unexpected behaviour. However, it occurs for any n of 100,000 or more.
> n <- 99999
> arr <- as.numeric(1:n)
> tmp <- factor(arr, levels=1:n)
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 99999
> which(!levels(tmp) %in% levels(tmp2))
integer(0)
This also does not happen when the arr vector has a type other than numeric, for example integer:
> n <- 500001
> arr <- as.integer(1:n)
> tmp <- factor(arr, levels=1:n)
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 500001
Finally, the problem does not occur when the levels argument is left unspecified in the first call to factor().
What could be causing this behaviour? Tested in R 4.3.2.
Building on ThomasIsCoding's answer, it is due to the scientific notation rule applying to real numbers, but not applying to integers...
For example, in the console...
options(scipen = 0) #uses scientific notation if fewer characters than normal
500000L
[1] 500000 #integer displayed in normal notation
500000
[1] 5e+05 #numeric displayed in shorter scientific notation
So the character labels differ ("5e+05" for the numeric value versus "500000" for the integer level), which causes a mismatch with the factor levels at each multiple of 100000 when the values are numeric.
The problem can be solved by increasing scipen.
I thought scipen was primarily to control displayed values, so it is odd that it is being used for factor levels.
In your second call to factor, NAs are not recorded as levels. For example:
> factor(c(NA, 1))
[1] <NA> 1
Levels: 1
In your case, you can see that the affected values have become NA in tmp:
> tail(tmp)
[1] 499996 499997 499998 499999 <NA> 500001
500001 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 500001
> setdiff(levels(tmp), levels(tmp2))
[1] "100000" "200000" "300000" "400000" "500000"
so in tmp2, those 5 NAs in tmp (corresponding to 100000, 200000, 300000, 400000, 500000) leave their levels unused, and those levels are not added to the levels of tmp2.
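To check this on your own data, something like the following sketch could be used. It reuses the n, arr and tmp from the question; na_pos and unused are hypothetical helper names, and droplevels() is used here as a stand-in for the second factor() call, since both drop levels that no element uses.

```r
n <- 500001
arr <- as.numeric(1:n)
tmp <- factor(arr, levels = 1:n)

# Positions whose value found no matching level and therefore became NA.
na_pos <- which(is.na(tmp))
print(na_pos)

# Levels that exist in tmp but are used by no element, so they get dropped.
unused <- setdiff(levels(tmp), levels(droplevels(tmp)))
print(unused)
```

The NA positions and the unused levels should line up one to one.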
If you don't specify levels = 1:n when generating tmp, you will see the following:
> tail(tmp)
[1] 499996 499997 499998 499999 5e+05 500001
500001 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 500001
> tail(levels(tmp))
[1] "499996" "499997" "499998" "499999" "5e+05" "500001"
where you have 5e+05 instead of NA in tmp, which confirms that those NAs are the trouble-makers.
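One possible way around it, based on the question's own observation that integer input behaves, is to make the values and the levels the same type before calling factor. This is just a sketch under the assumption that arr only holds whole numbers (as.integer would truncate anything else); tmp_ok and tmp2_ok are made-up names.

```r
n <- 500001
arr <- as.numeric(1:n)

# Integer values against integer levels: both sides get the same labels.
tmp_ok <- factor(as.integer(arr), levels = 1:n)

# No value is lost to NA, so re-factoring keeps every level.
tmp2_ok <- factor(tmp_ok)
print(anyNA(tmp_ok))
print(nlevels(tmp2_ok) == n)
```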