I have a big data.frame called "mat" of 49952 obs. of 7597 variables and I'm trying to replace NAs with zeros. Here is and example how my data.frame looks like:
    A   B   C   E   F   D   Q   Z   . . .
1   1   1   0   NA  NA  0   NA  NA
2   0   0   1   NA  NA  0   NA  NA
3   0   0   0   NA  NA  1   NA  NA
4   NA  NA  NA  NA  NA  NA  NA  NA
5   0   1   0   1   NA  0   NA  NA 
6   1   1   1   0   NA  0   NA  NA
7   0   0   1   0   NA  1   NA  NA 
.
.
.
I need realy fast tool to replace them. The result should look like:
    A   B   C   E   F   D   Q   Z   . . .
1   1   1   0   0   0   0   0   0
2   0   0   1   0   0   0   0   0 
3   0   0   0   0   0   1   0   0
4   0   0   0   0   0   0   0   0
5   0   1   0   1   0   0   0   0 
6   1   1   1   0   0   0   0   0
7   0   0   1   0   0   1   0   0 
.
.
.
I already tried lapply(mat, function(x){replace(x, is.na(x),0)}) - didn't work - mat[is.na(mat)] <- 0 - error and and maybe too slow - and also link - didn't work too.
@Sotos already advised me plyr::rbind.fill(lapply(L, as.data.frame)) but it didn't work, because it makes data.frame of 379485344 observations and 1 variable (which is 49952x7597) so I have to also trafnsform it back. Is there any better way to do this?
The real structure of my data.frame:
> str(mat)
'data.frame':   49952 obs. of  7597 variables:
 $ 6794602   : num  1 NA NA NA NA 0 0 0 0 0 ...
 $ 1008667   : num  NA 1 0 NA NA 0 0 0 0 0 ...
 $ 8009082   : num  NA 0 1 NA NA NA NA NA NA NA ...
 $ 6740421   : num  NA NA NA 1 NA 0 0 0 0 0 ...
 $ 6777805   : num  NA NA NA NA 1 NA NA NA NA NA ...
 $ 1001682   : num  NA NA NA NA NA 0 0 0 0 0 ...
 $ 1001990   : num  NA NA NA NA NA 0 0 0 0 0 ...
 $ 1002541   : num  NA NA NA NA NA 0 0 0 0 0 ...
 $ 1002790   : num  NA NA NA NA NA 0 0 0 0 0 ...
Note:
when I tried mat[is.na(mat)] <- 0 there was a warning:
> mat[is.na(mat)] <- 0
Warning messages:
1: In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
  invalid factor level, NA generated
> nlevels(mat)
[1] 0
Data.frame mat after using mat[is.na(mat)] <- 0:
> str(mat)
'data.frame':   49952 obs. of  7597 variables:
 $ 6794602   : num  1 0 0 0 0 0 0 0 0 0 ...
 $ 1008667   : num  0 1 0 0 0 0 0 0 0 0 ...
 $ 8009082   : num  0 0 1 0 0 0 0 0 0 0 ...
 $ 6740421   : num  0 0 0 1 0 0 0 0 0 0 ...
 $ 6777805   : num  0 0 0 0 1 0 0 0 0 0 ...
 $ 1001682   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1001990   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1002541   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1002790   : num  0 0 0 0 0 0 0 0 0 0 ...
So the questions are:
mat[is.na(mat)] <- 0 looks like what I want, but there are too many values, so I can't check if they are all right.The classic way to replace NA's in R is by using the IS.NA() function. The IS.NA() function takes a vector or data frame as input and returns a logical object that indicates whether a value is missing (TRUE or VALUE). Next, you can use this logical object to create a subset of the missing values and assign them a zero.
locf() function from the zoo package to carry the last observation forward to replace your NA values.
replace_na() will not work if the variable is a factor, and the replacement is not already a level for your factor. If this is the issue, you can add another level to your factor variable for 0 before running replace_na(), or you can convert the variable to numeric or character first.
Try the following:
mat %>% replace(is.na(.), 0)
If suspect that some of your columns are factor, you can use the following code to detect and change them to numeric.
inx <- sapply(mat, inherits, "factor")
mat[inx] <- lapply(mat[inx], function(x) as.numeric(as.character(x)))
Then try the following.
mat[] <- lapply(mat, function(x) {x[is.na(x)] <- 0; x})
mat
And here's the data.
mat <-
structure(list(A = c(1L, 0L, 0L, NA, 0L, 1L, 0L), B = c(1L, 0L, 
0L, NA, 1L, 1L, 0L), C = c(0L, 1L, 0L, NA, 0L, 1L, 1L), E = c(NA, 
NA, NA, NA, 1L, 0L, 0L), F = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_), D = c(0L, 0L, 1L, NA, 
0L, 0L, 1L), Q = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_), Z = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_)), .Names = c("A", "B", "C", "E", 
"F", "D", "Q", "Z"), row.names = c("1", "2", "3", "4", "5", "6", 
"7"), class = "data.frame")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With