Trim whitespace from data.table column

Tags: r, data.table

I use fread to import very large CSV files. Some columns have trailing whitespace after the text, which I need to remove. Removing it takes too long (hours).

The following code works, but the command inside system.time is very slow (about 12 seconds on my computer, and the real files are much bigger).

library(data.table)
library(stringr)

# Create example-data
df.1 <- rbind(c("Text1        ", 1, 2), c("Text2        ", 3, 4), c("Text99       ", 5, 6))

colnames(df.1) <- c("Tx", "Nr1", "Nr2")
dt.1 <- data.table(df.1)
for (i in 1:15) {
  dt.1 <- rbind(dt.1, dt.1)
}

# Trim the "Tx"-column
dt.1[, rowid := 1:nrow(dt.1)]
setkey(dt.1, rowid)
system.time( dt.1[, Tx2 :={ str_trim(Tx) }, by=rowid] )
dt.1[, rowid:=NULL]
dt.1[, Tx:=NULL]
setnames(dt.1, "Tx2", "Tx")

Is there a faster way to trim whitespace in data.tables?

asked Oct 08 '13 19:10

Chris

1 Answer

You can operate only on unique values of "Tx" (presuming you actually have some repetition, as in your example):

dt.1[, Tx2:=str_trim(Tx),     by=1:nrow(dt.1)]
dt.1[, Tx3:=str_trim(Tx),     by=Tx]

dt.1[, all.equal(Tx2,Tx3)]    # TRUE
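
The same "trim each distinct value only once" idea can also be written without by, by trimming the unique values and mapping them back with match. This is only a sketch: the small table dt, the helper vectors u and trimmed, and the column name Tx_trimmed are made up for illustration, and base R's trimws is swapped in for str_trim.

```r
library(data.table)

# Small stand-in for dt.1: three distinct padded strings, repeated many times
dt <- data.table(Tx = rep(c("Text1   ", "Text2   ", "Text99  "), times = 1000))

# Trim each distinct value once ...
u <- unique(dt$Tx)
trimmed <- trimws(u, which = "right")   # base R; removes trailing whitespace only

# ... then look up the trimmed version for every row
dt[, Tx_trimmed := trimmed[match(Tx, u)]]
```

This only pays off when the column has few distinct values relative to its length; with mostly unique strings it does the same amount of trimming plus the lookup.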

Using gsub instead of str_trim as in @DWin's answer also speeds things up, whether or not you have duplicated "Tx" values.
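
A minimal sketch of the gsub variant, assuming only trailing whitespace needs to go. @DWin's answer is not reproduced here, so the pattern below is an assumption rather than that answer's exact code, and dt is a made-up example table.

```r
library(data.table)

dt <- data.table(Tx = c("Text1   ", "Text2   ", "Text99  "), Nr1 = c(1, 3, 5))

# "\\s+$" matches one or more whitespace characters at the end of each string.
# The pattern is an assumption for illustration, not @DWin's exact code.
dt[, Tx := gsub("\\s+$", "", Tx)]
```

To strip leading whitespace as well, a pattern such as "^\\s+|\\s+$" should do it.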

EDIT: As @DWin pointed out, there's no reason to do it by row to begin with: str_trim is already vectorized, so it can be applied to the whole column in one call. So, I've changed my answer.
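
For reference, dropping the per-row grouping altogether might look like the sketch below. The table dt is a made-up stand-in for dt.1, and no timings are quoted because they depend on the machine.

```r
library(data.table)
library(stringr)

# Stand-in for dt.1 from the question: 3 * 2^15 rows of padded strings
dt <- data.table(Tx = rep(c("Text1   ", "Text2   ", "Text99  "), times = 2^15))

# One vectorized call over the whole column, assigned by reference.
# No rowid column, no by, and no temporary Tx2 column to rename afterwards.
system.time(dt[, Tx := str_trim(Tx)])
```

Because := overwrites Tx in place, the rowid, Tx2 and setnames steps from the question are no longer needed.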

answered Nov 08 '22 05:11

Frank
