I use fread to import very big .csv files. Some columns have trailing whitespace after the text that I need to remove, and this takes too much time (hours).
The following code works, but the command inside system.time() is very slow (about 12 seconds on my machine, and the real files are much bigger).
library(data.table)
library(stringr)

# Create example data: a character matrix with trailing whitespace in "Tx"
df.1 <- rbind(c("Text1 ", 1, 2), c("Text2 ", 3, 4), c("Text99 ", 5, 6))
colnames(df.1) <- c("Tx", "Nr1", "Nr2")
dt.1 <- data.table(df.1)

# Double the table 15 times (~98,000 rows)
for (i in 1:15) {
  dt.1 <- rbind(dt.1, dt.1)
}

# Trim the "Tx" column, row by row (slow)
dt.1[, rowid := 1:nrow(dt.1)]
setkey(dt.1, rowid)
system.time(dt.1[, Tx2 := str_trim(Tx), by = rowid])

# Replace the original column with the trimmed one
dt.1[, rowid := NULL]
dt.1[, Tx := NULL]
setnames(dt.1, "Tx2", "Tx")
Is there a faster way to trim whitespace in data.tables?
Note that base R provides trimws(), which strips leading whitespace, trailing whitespace, or both (controlled by its which argument) without any extra packages, so it is a convenient alternative to str_trim().
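A minimal sketch with trimws(), shown self-contained here but assuming a table shaped like the dt.1 built above:

library(data.table)
dt <- data.table(Tx = c("Text1 ", "Text2 ", "Text99 "), Nr1 = c(1, 3, 5))
dt[, Tx := trimws(Tx, which = "both")]  # "both" is the default; "left"/"right" also work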
You can operate on only the unique values of "Tx" by grouping on the column itself (presuming you actually have some repetition, as in your example):
dt.1[, Tx2 := str_trim(Tx), by = 1:nrow(dt.1)]  # slow: one call per row
dt.1[, Tx3 := str_trim(Tx), by = Tx]            # fast: one call per unique value
dt.1[, all.equal(Tx2, Tx3)]                     # TRUE
Using gsub instead of str_trim, as in @DWin's answer, also speeds things up, whether or not you have duplicated "Tx" values.
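@DWin's answer isn't quoted here, so the exact regex below is an assumption; one common pattern that strips leading and trailing whitespace in a single vectorized pass is:

dt.1[, Tx2 := gsub("^\\s+|\\s+$", "", Tx)]  # remove leading and trailing whitespace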
EDIT: As @DWin pointed out, there's no reason to do it by row to begin with: str_trim is already vectorized, so it can be applied to the whole column at once. So, I've changed my answer.
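In other words, the by clause can be dropped entirely. A sketch of the plain vectorized call, using the example dt.1 from the question:

dt.1[, Tx := str_trim(Tx)]  # one vectorized call over the entire column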