
Logistic regression on huge dataset

I need to run a logistic regression on a huge dataset (many GBs of data). I am currently using Julia's GLM package for this. Although my regression works on subsets of the data, I run out of memory when I try to run it on the full dataset.

Is there a way to compute logistic regressions on huge, non-sparse datasets without using a prohibitive amount of memory? I thought about separating the data into chunks, calculating regressions on each of these and aggregating them somehow, but I'm not sure this would give valid results.

Malper asked Oct 19 '25 05:10



2 Answers

Vowpal Wabbit is designed for that: linear models when the data (or even the model) does not fit in memory.

You can do the same thing by hand with stochastic gradient descent (SGD): write the loss function of your logistic regression (the negative log-likelihood), minimize it just a bit on one chunk of the data (perform a single gradient descent step), do the same on the next chunk, and continue. After several passes over the data, you should have a good solution. It works better if the data arrives in random order.
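To make the chunk-by-chunk SGD idea concrete, here is a minimal sketch in plain Python (standard library only), not a tuned implementation. The names `sgd_step`, `fit_streaming` and `make_chunks`, the learning rate, the number of passes and the synthetic data are all assumptions made for illustration. In a real setting `make_chunks` would read successive blocks of rows from disk instead of generating them.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_step(w, chunk, lr):
    """One gradient step on the mean negative log-likelihood of one chunk."""
    grad = [0.0] * len(w)
    for x, y in chunk:
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    n = len(chunk)
    return [wj - lr * g / n for wj, g in zip(w, grad)]


def fit_streaming(chunks, n_features, lr=0.5, passes=20):
    """`chunks` is a callable returning a fresh iterator over chunks,
    so only one chunk has to be held in memory at any time."""
    w = [0.0] * n_features
    for _ in range(passes):
        for chunk in chunks():
            w = sgd_step(w, chunk, lr)
    return w


def make_chunks(seed=0, n_chunks=50, chunk_size=200):
    """Hypothetical data source: yields chunks of (features, label) pairs.
    Synthetic data drawn from a known model stands in for a file reader."""
    rng = random.Random(seed)
    true_w = [-0.5, 2.0, -1.0]  # intercept plus two slopes (assumed)
    for _ in range(n_chunks):
        chunk = []
        for _ in range(chunk_size):
            x = [1.0, rng.gauss(0, 1), rng.gauss(0, 1)]
            p = sigmoid(sum(a * b for a, b in zip(true_w, x)))
            chunk.append((x, 1.0 if rng.random() < p else 0.0))
        yield chunk


w = fit_streaming(make_chunks, n_features=3)
print(w)  # should land near the generating weights, up to sampling noise
```

Memory use is bounded by the chunk size rather than the dataset size, because the weights are the only state carried from one chunk to the next.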

Another idea (ADMM, I think), similar to what you suggest, would be to split the data into chunks, and minimize the loss function on each chunk. Of course, the solutions on the different chunks are not the same. To address this problem, we can change the objective functions by adding a small penalty for the difference between the solution on a chunk of data and the average solution, and re-optimize everything. After a few iterations, the solutions become closer and closer and eventually converge. This has the added advantage of being parallelizable.
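The split-and-penalize scheme described above can be sketched as consensus ADMM in its scaled form. This is an illustrative sketch under assumptions, not a reference implementation: the penalty weight `rho`, the inner gradient steps used to solve each chunk's subproblem approximately, the function names and the synthetic data are all choices made here for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def local_update(w, chunk, z, u, rho, lr=0.5, steps=10):
    """Approximately minimize
        mean_nll(chunk, w) + (rho / 2) * ||w - z + u||^2
    with a few gradient steps, warm-started from the previous local w."""
    n = len(chunk)
    for _ in range(steps):
        # Gradient of the penalty that pulls w toward the consensus z.
        grad = [rho * (wj - zj + uj) for wj, zj, uj in zip(w, z, u)]
        for x, y in chunk:
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / n
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w


def consensus_admm(chunks, n_features, rho=0.5, iters=60):
    k = len(chunks)
    ws = [[0.0] * n_features for _ in range(k)]  # per-chunk solutions
    us = [[0.0] * n_features for _ in range(k)]  # scaled dual variables
    z = [0.0] * n_features                       # consensus solution
    for _ in range(iters):
        # Each local update touches only its own chunk, so this loop
        # could run in parallel, one worker per chunk.
        ws = [local_update(w, c, z, u, rho) for w, c, u in zip(ws, chunks, us)]
        # Consensus step: average the local solutions plus their duals.
        z = [sum(w[j] + u[j] for w, u in zip(ws, us)) / k
             for j in range(n_features)]
        # Dual step: accumulate each chunk's disagreement with the consensus.
        us = [[uj + wj - zj for uj, wj, zj in zip(u, w, z)]
              for u, w in zip(us, ws)]
    return z, ws


def make_chunks(n_chunks=4, chunk_size=250, seed=0):
    """Hypothetical data: synthetic chunks drawn from a known model."""
    rng = random.Random(seed)
    true_w = [-0.5, 2.0, -1.0]  # intercept plus two slopes (assumed)
    chunks = []
    for _ in range(n_chunks):
        chunk = []
        for _ in range(chunk_size):
            x = [1.0, rng.gauss(0, 1), rng.gauss(0, 1)]
            p = sigmoid(sum(a * b for a, b in zip(true_w, x)))
            chunk.append((x, 1.0 if rng.random() < p else 0.0))
        chunks.append(chunk)
    return chunks


z, ws = consensus_admm(make_chunks(), n_features=3)
print(z)
```

The chunks are held in a list here only to keep the demo short. In an out-of-core or distributed setting each chunk would stay on disk or on its own worker, and only the small vectors `z`, `w` and `u` would be exchanged between iterations.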

Vincent Zoonekynd answered Oct 22 '25 04:10



I have not personally used it, but the StreamStats.jl package is designed for this use case. It supports linear and logistic regression, as well as other streaming statistic functions.

IainDunning answered Oct 22 '25 03:10



