
Trim whitespace from data.table column

Tags: r, data.table

I use fread to import very large CSV files. Some columns have trailing whitespace after the text that I need to remove. This takes too much time (hours).

The following code works, but the call inside system.time is very slow (about 12 seconds on my computer, and the real files are much bigger).

library(data.table)
library(stringr)

# Create example-data
df.1 <- rbind(c("Text1        ", 1, 2), c("Text2        ", 3, 4), c("Text99       ", 5, 6))

colnames(df.1) <- c("Tx", "Nr1", "Nr2")
dt.1 <- data.table(df.1)
for (i in 1:15) {
  dt.1 <- rbind(dt.1, dt.1)
}

# Trim the "Tx"-column
dt.1[, rowid := 1:nrow(dt.1)]
setkey(dt.1, rowid)
system.time( dt.1[, Tx2 :={ str_trim(Tx) }, by=rowid] )
dt.1[, rowid:=NULL]
dt.1[, Tx:=NULL]
setnames(dt.1, "Tx2", "Tx")

Is there a faster way to trim whitespace in data.tables?

Chris asked Oct 08 '13 19:10



1 Answer

You can operate only on unique values of "Tx" (presuming you actually have some repetition, as in your example):

dt.1[, Tx2:=str_trim(Tx),     by=1:nrow(dt.1)]
dt.1[, Tx3:=str_trim(Tx),     by=Tx]

dt.1[, all.equal(Tx2,Tx3)]    # TRUE
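The same idea can also be sketched without grouping, by trimming each distinct value once and mapping the results back with match(). This is an illustrative variant, not the code from the answer: it uses base R's trimws() in place of str_trim, and the table dt and the helper names u and u_trimmed are made up for the example.

```r
library(data.table)

# Illustrative data: three distinct values, each repeated many times
dt <- data.table(
  Tx = rep(c("Text1        ", "Text2        ", "Text99       "), times = 1000)
)

# Trim each distinct value only once
u <- unique(dt$Tx)
u_trimmed <- trimws(u, which = "right")

# Look up the trimmed value for every row and add it by reference
dt[, Tx2 := u_trimmed[match(Tx, u)]]
```

How much this helps depends on how many repeated values the column really has; with mostly unique values the lookup adds little.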

Using gsub instead of str_trim as in @DWin's answer also speeds things up, whether or not you have duplicated "Tx" values.
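A minimal sketch of a gsub-based trim, assuming only trailing whitespace has to go. The regular expression and the small table dt are illustrative and are not copied from @DWin's answer.

```r
library(data.table)

# Small illustrative table shaped like the question's example
dt <- data.table(
  Tx  = c("Text1        ", "Text2        ", "Text99       "),
  Nr1 = c("1", "3", "5"),
  Nr2 = c("2", "4", "6")
)

# Strip trailing whitespace from the whole column in one vectorized call
dt[, Tx := gsub("\\s+$", "", Tx)]
```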

EDIT: As @DWin pointed out, there's no reason to do it by row to begin with: str_trim already works on a whole character vector at once. So, I've changed my answer.
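Dropping by altogether might look like the sketch below. It assumes the column is character and that only surrounding whitespace should be removed; base R's trimws() stands in for str_trim so the snippet needs no packages beyond data.table, and the table dt is made up for the example.

```r
library(data.table)

# Small illustrative table
dt <- data.table(
  Tx  = c("Text1        ", "Text2        ", "Text99       "),
  Nr1 = c("1", "3", "5")
)

# One vectorized call over the whole column, updated by reference.
# With stringr loaded, str_trim(Tx) could be used in the same position.
dt[, Tx := trimws(Tx)]
```

Because the column is overwritten in place, the rowid key, the temporary Tx2 column, and the setnames step from the question are not needed.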

Frank answered Nov 08 '22 05:11