Factor from numeric vector drops every 100.000th element from its levels

Question

Consider a vector of type numeric with over 100,000 elements. In the example below, it's simply the range 1:500001.

n <- 500001
arr <- as.numeric(1:n)

The following sequence of factor calls causes odd behaviour:

First call factor with the levels argument specified as the exact same range that arr was defined with. Predictably, the resulting variable has exactly n levels:

> tmp <- factor(arr, levels=1:n)
> nlevels(tmp)
[1] 500001

Now call factor again on the result from before. The outcome is that the new value, tmp2, is missing some values from its levels:

> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 499996

Checking to see which items are missing, we find it's every 100,000th element (which, in this case, has a value equal to its index):

> which(!levels(tmp) %in% levels(tmp2))
[1] 100000 200000 300000 400000 500000

Decreasing n below 100,000 eliminates this unexpected behaviour. However, it occurs for any n >= 100,000.

> n <- 99999
> arr <- as.numeric(1:n)
> tmp <- factor(arr, levels=1:n)
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 99999
> which(!levels(tmp) %in% levels(tmp2))
integer(0)

This also does not happen when arr is an integer vector rather than a double:

> n <- 500001
> arr <- as.integer(1:n)
> tmp <- factor(arr, levels=1:n)
> tmp2 <- factor(tmp)
> nlevels(tmp2)
[1] 500001

Finally, the problem does not occur when the levels argument is left unspecified in the first call to factor().

What could be causing this behaviour? Tested in R 4.3.2

Andrew Gustar · Accepted Answer

Building on ThomasIsCoding's answer: the behaviour is due to the scientific-notation rule, which applies when doubles are converted to character but not when integers are.

For example, in the console...

> options(scipen = 0)  # default: scientific notation is used when it is shorter than fixed notation
> 500000L
[1] 500000   # integer displayed in fixed notation
> 500000
[1] 5e+05    # double displayed in the shorter scientific notation

So, for double values, the character form of each multiple of 100000 (e.g. "1e+05") does not match the corresponding factor level ("100000"), which was built from the integer range 1:n.
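To see the mismatch in isolation, here is a minimal sketch. The variable names are made up for illustration, and it assumes the default scipen = 0, which is set explicitly and restored afterwards:

```r
# Assumption: default scipen = 0 (set explicitly here, restored at the end)
op <- options(scipen = 0)

x_dbl <- 500000    # double
x_int <- 500000L   # integer

lab_dbl <- as.character(x_dbl)   # "5e+05"
lab_int <- as.character(x_int)   # "500000"

# factor() converts x to character and then matches against levels;
# the integer levels are compared as "499999", "500000", "500001"
hit <- match(lab_dbl, 499999:500001)   # NA: "5e+05" is not among them

options(op)
```

The NA returned by match() is what ends up as a missing value in the factor.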

The problem can be solved by increasing scipen.

I thought scipen was primarily to control displayed values, so it is odd that it is being used for factor levels.
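If you would rather not rely on a global option, a possible workaround is to make sure the values and the levels are converted to character in the same way. This is only a sketch: `f_int`, `f_chr` and `labs` are names invented for this example, n is reduced to keep it quick, and option 1 assumes the doubles are whole numbers that fit in an integer.

```r
n <- 200001L
arr <- as.numeric(1:n)

# Option 1: convert whole-number doubles to integer before building the factor
f_int <- factor(as.integer(arr), levels = 1:n)

# Option 2: build fixed-notation labels explicitly on both sides
labs <- format(arr, scientific = FALSE, trim = TRUE)
f_chr <- factor(labs, levels = as.character(1:n))

# Neither factor contains NAs, so re-factoring keeps all n levels
any(is.na(f_int))       # FALSE
any(is.na(f_chr))       # FALSE
nlevels(factor(f_int))  # 200001
```

Option 1 mirrors the integer case from the question, which did not show the problem.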

ThomasIsCoding · Answer

In your second call to factor, NAs are not recorded as levels. For example:

> factor(c(NA, 1))
[1] <NA> 1
Levels: 1

In your case, you can see that the affected values are recorded as NA in tmp:

> tail(tmp)
[1] 499996 499997 499998 499999 <NA>   500001
500001 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 500001

> setdiff(levels(tmp), levels(tmp2))
[1] "100000" "200000" "300000" "400000" "500000"

So in tmp2, the levels of those 5 NAs in tmp (corresponding to 100000, 200000, 300000, 400000 and 500000) are no longer used and are therefore dropped.
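A quick way to confirm this yourself is to look for the NAs directly. This is a sketch on a smaller vector (n is reduced to 200001 only to keep it fast; `na_pos` and `dropped` are illustrative names) and it assumes the default scipen = 0:

```r
n <- 200001L
arr <- as.numeric(1:n)
tmp <- factor(arr, levels = 1:n)

# Positions where the value could not be matched to a level
na_pos <- which(is.na(tmp))   # 100000 200000

# Re-factoring drops the levels that no remaining value uses
tmp2 <- factor(tmp)
dropped <- setdiff(levels(tmp), levels(tmp2))   # "100000" "200000"
```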

If you don't specify levels = 1:n when generating tmp, you will see the following:

> tail(tmp)
[1] 499996 499997 499998 499999 5e+05  500001
500001 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 500001

> tail(levels(tmp))
[1] "499996" "499997" "499998" "499999" "5e+05"  "500001"

Here tmp contains 5e+05 instead of NA, which shows that those NAs are the troublemakers.

Tags: r, r-factor

stelioslogothetis
