Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Determining distribution so I can generate test data

Tags:

r

I've got about 100M value/count pairs in a text file on my Linux machine. I'd like to figure out what sort of formula I would use to generate more pairs that follow the same distribution.

From a casual inspection, it looks power law-ish, but I need to be a bit more rigorous than that. Can R do this easily? If so, how? Is there something else that works better?

like image 481
twk Avatar asked Jan 29 '26 08:01

twk


2 Answers

While a bit costly, you can mimic your sample's distribution exactly (without needing any hypothesis on underlying population distribution) as follows.

You need a file structure that's rapidly searchable for "highest entry with key <= X" -- Sleepycat's Berkeley database has a btree structure for that, for example; SQLite is even easier though maybe not quite as fast (but with an index on the key it should be OK).

Put your data in the form of pairs where the key is the cumulative count up to that point (sorted by increasing value). Call K the highest key.

To generate a random pair that follows exactly the same distribution as the sample, generate a random integer X between 0 and K and look it up in that file structure with the mentioned "highest that's <=" and use the corresponding value.

Not sure how to do all this in R -- in your shoes I'd try a Python/R bridge, do the logic and control in Python and only the statistics in R itself, but, that's a personal choice!

like image 197
Alex Martelli Avatar answered Jan 30 '26 22:01

Alex Martelli


To see whether you have a real power law distribution, make a log-log plot of frequencies and see whether they line up roughly on a straight line. If you do have a straight line, you might want to read this article on the Pareto distribution for more on how to describe your data.

like image 23
John D. Cook Avatar answered Jan 30 '26 22:01

John D. Cook



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!