I am working with a large 8 GB dataset (the HIGGS dataset). While reading the vignette for the dbplyr package (see vignette('dbplyr')), I came across this line:
(If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating.)
The HIGGS dataset does fit in memory on my machine. My questions are:
Edit: after looking at the link provided by @Waldi (RAM is roughly 100x faster than an HDD), an additional question is: how does this change for an SSD?
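For reference, this is the kind of workflow the vignette is describing. A minimal sketch, assuming the HIGGS data has already been loaded into a local SQLite file; the file, table, and column names here are hypothetical:

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Connect to a local SQLite database file (hypothetical path/table).
con <- dbConnect(RSQLite::SQLite(), "higgs.sqlite")

higgs_db <- tbl(con, "higgs")   # lazy reference; no data is loaded yet

higgs_db %>%
  filter(label == 1) %>%       # translated to SQL and run inside SQLite
  summarise(n = n()) %>%
  collect()                    # only the small result comes back into R

dbDisconnect(con)
```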
R is memory intensive, so it's best to get as much RAM as possible. The amount of RAM you have can limit the size of the datasets you can analyse.
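As a rough illustration of that limit, using only base R functions, you can check how much RAM an object occupies:

```r
# ~1e7 doubles at 8 bytes each, so roughly 76 MB in memory.
x <- matrix(rnorm(1e7), ncol = 10)
print(object.size(x), units = "MB")

# An 8 GB dataset needs at least that much free RAM, plus headroom
# for the intermediate copies R makes during analysis.
```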
Adding a solid-state drive (SSD) typically won't have much impact on the speed of your R code, since R loads objects into RAM. However, the reduction in boot time and the boost to your overall productivity from much faster I/O make an SSD a worthwhile purchase.
The benchmarkme package can be used to assess your CPU's number-crunching ability. The number of CPU cores is another area worth exploring for big-data performance: the more cores, the better, provided your workload can actually use them.
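A minimal sketch using benchmarkme (assumes the package is installed from CRAN; see its documentation for the full set of benchmarks):

```r
library(benchmarkme)

get_cpu()   # CPU model and number of cores
get_ram()   # total RAM detected on this machine

cpu_res <- benchmark_std(runs = 3)   # standard number-crunching benchmarks
io_res  <- benchmark_io(runs = 1)    # read/write benchmarks (disk speed)

plot(cpu_res)   # compare your machine against previously uploaded results
```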
The multidplyr package is a backend for dplyr that partitions a data frame across multiple cores. This minimizes the time spent moving data around and maximizes parallel performance; see the sketch below.
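A minimal sketch of that workflow, following the multidplyr README; the nycflights13::flights data is used purely for illustration:

```r
library(dplyr)
library(multidplyr)
library(nycflights13)

cluster <- new_cluster(4)            # start 4 worker processes
cluster_library(cluster, "dplyr")    # load dplyr on each worker

flights_part <- flights %>%
  group_by(carrier) %>%
  partition(cluster)                 # spread the groups across the workers

flights_part %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()                          # gather per-group results back into R
```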