I have a data set that is several gigabytes (GB) in size and want to estimate the parameters of a model for the missing values in it.
There is a method called maximum-likelihood estimation (MLE) in machine learning/statistics that can be used for this.
Since R might not work on such a large data set, which library would be best to use for it?
From the Wikipedia article on MLE:
In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model's parameters.
Generally you need two steps before you can apply MLE:

1. Choose a statistical model for your data, i.e. the probability distribution (pdf) you assume generated it.
2. Write down the likelihood (usually the log-likelihood) of the parameters given your observed data; this is the function you will maximize.
If you can then obtain an analytic (closed-form) solution for the MLE, just stream your data through the estimate calculation. For example, for a Gaussian distribution the MLE of the mean is the sample mean, so you only need to accumulate a running sum and a count; dividing the sum by the count gives your estimate.
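Here is a minimal sketch of that streaming calculation, assuming the values sit one per line in a plain-text file (the file name and format below are placeholders, not something specified in the question):

```python
# Streaming MLE of a Gaussian mean: keep only a running sum and a count,
# so the multi-GB file never has to fit in memory.
def streaming_gaussian_mean(path):
    total = 0.0
    count = 0
    with open(path) as f:
        for line in f:
            value = line.strip()
            if not value:          # skip blank lines / missing entries
                continue
            total += float(value)
            count += 1
    return total / count           # sample mean = MLE of the mean

# Hypothetical usage:
# mu_hat = streaming_gaussian_mean("values.txt")
```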
However, when the model involves many parameters and its pdf is highly non-linear, the MLE must be sought numerically using a nonlinear optimization algorithm. If your data set is huge, try stochastic gradient descent (SGD): the true gradient is approximated by the gradient at a single example, and as the algorithm sweeps through the training set it applies the update rule to each example in turn. You can therefore still stream your data one example at a time into the update program over multiple sweeps (epochs), so the memory constraint should not be a problem at all.
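As a rough illustration of this streaming SGD idea (not code from the original answer), the sketch below fits the mean and variance of a Gaussian by descending the per-example negative log-likelihood; the function name, learning rate, and the synthetic data in the usage comment are all assumptions:

```python
import math

def sgd_gaussian_mle(stream_factory, epochs=5, lr=0.01):
    """Estimate a Gaussian's mean and variance by per-example SGD on the
    negative log-likelihood; the data is streamed one value at a time."""
    mu = 0.0   # mean estimate
    s = 0.0    # log-variance estimate (keeps the variance positive)
    for _ in range(epochs):
        for x in stream_factory():                  # re-open the stream each sweep
            inv_var = math.exp(-s)
            grad_mu = -(x - mu) * inv_var                   # d NLL / d mu
            grad_s = 0.5 - 0.5 * (x - mu) ** 2 * inv_var    # d NLL / d log-variance
            mu -= lr * grad_mu
            s -= lr * grad_s
    return mu, math.exp(s)

# Hypothetical usage with synthetic data; a real run would read from disk:
# import random
# data = [random.gauss(3.0, 2.0) for _ in range(100_000)]
# mu_hat, var_hat = sgd_gaussian_mle(lambda: iter(data))
```

Because each update touches only one example, the same loop works whether the stream comes from an in-memory list or from a file read line by line.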