Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance problem transforming JSON data

Tags:

json

r

I've got some data in JSON format that I want to do some visualization on. The data (approximately 10MB of JSON) loads pretty fast, but reshaping it into a usable form takes a couple of minutes for just under 100,000 rows. I have something that works, but I think it can be done much better.

It may be easiest to understand by starting with my sample data.

Assuming you run the following command in /tmp:

curl http://public.west.spy.net/so/time-series.json.gz \
    | gzip -dc - > time-series.json

You should be able to see my desired output (after a while) here:

require(rjson)

trades <- fromJSON(file="/tmp/time-series.json")$rows


data <- do.call(rbind,
                lapply(trades,
                       function(row)
                           data.frame(date=strptime(unlist(row$key)[2], "%FT%X"),
                                      price=unlist(row$value)[1],
                                      volume=unlist(row$value)[2])))

someColors <- colorRampPalette(c("#000099", "blue", "orange", "red"),
                               space="Lab")
smoothScatter(data, colramp=someColors, xaxt="n")

days <- seq(min(data$date), max(data$date), by = 'month')
smoothScatter(data, colramp=someColors, xaxt="n")
axis(1, at=days,
     labels=strftime(days, "%F"),
     tick=FALSE)
like image 665
Dustin Avatar asked May 25 '26 04:05

Dustin


2 Answers

You can get a 40x speedup by using plyr. Here is the code and the benchmarking comparison. The conversion to date can be done once you have the data frame and hence I have removed it from the code to facilitate apples-to-apples comparison. I am sure a faster solution exists.

f_ramnath = function(n) plyr::ldply(trades[1:n], unlist)[,-c(1, 2)]
f_dustin  = function(n) do.call(rbind, lapply(trades[1:n], 
                function(row) data.frame(
                    date   = unlist(row$key)[2],
                    price  = unlist(row$value)[1],
                    volume = unlist(row$value)[2]))
                )
f_mrflick = function(n) as.data.frame(do.call(rbind, lapply(trades[1:n], 
               function(x){
                   list(date=x$key[2], price=x$value[1], volume=x$value[2])})))

f_mbq = function(n) data.frame(
          t(sapply(trades[1:n],'[[','key')),    
          t(sapply(trades[1:n],'[[','value')))

rbenchmark::benchmark(f_ramnath(100), f_dustin(100), f_mrflick(100), f_mbq(100),
    replications = 50)

test            elapsed   relative 
f_ramnath(100)  0.144       3.692308     
f_dustin(100)   6.244     160.102564     
f_mrflick(100)  0.039       1.000000    
f_mbq(100)      0.074       1.897436   

EDIT. MrFlick's solution leads to an additional 3.5x speedup. I have updated my tests.

like image 158
Ramnath Avatar answered May 27 '26 18:05

Ramnath


I received another transformation by MrFlick in irc that was significantly faster and worth mentioning here:

data <- as.data.frame(do.call(rbind,
                              lapply(trades,
                                     function(x) {list(date=x$key[2],
                                                   price=x$value[1],
                                                   volume=x$value[2])})))

It seems to be made significantly faster by not building the inner frames.

like image 24
Dustin Avatar answered May 27 '26 17:05

Dustin