I need a simple class to count the distribution (histogram) of IP addresses from a network monitoring system. There might be anywhere from 1 to 1010 packets, with anywhere from 1 to 232 addresses (or more, if we have an IPv6 interface). What I'm ideally looking for is a C++ class that will automatically create the histogram and then, when a limit is reached, start combining the less popular nodes through some kind of prefix routing.
Does anybody know of something like this, or do I need to write it?
Thanks!
What you're describing sounds like a perfect use case for the Count-Min sketch data structure. This data structure is used to approximate the frequency of various elements from a data stream and can be tuned to precisely use up a certain amount of memory. Moreover, given a fixed memory limit, you can adjust how accurate it is and how close to the exact answer you'd like it to be. My understanding is that Google uses this data structure to identify frequent searches without having to use a ridiculous amount of disk space.
As an added plus, the data structure never underestimates the true frequency of a given value. That is, if you want to query how frequently you've seen a given IP address, the Count-Min sketch will always give you a value that is no smaller than the true number.
The Count-Min sketch is extremely easy to implement - you just need a bunch of different hash functions and a 2D array. You can also find a variety of different implementations of the Count-Min sketch at Google's page on the data structure.
Hope this helps!
+1 to @templatetypedef, for the approximate solution.
For completeness, if one needs to store exact counts, there's no way around storing the exact numbers. However, depending on your requirements, you might be able to reduce the space needed significantly (for instance, 10.*.*.* and 192.68.*.* ips can never be publically routed; and many others, such as 25.*.*.*, currently are not being publically routed). You may also (again depending on your requirements) be able to count large groups of less-important ips together.
If you could lower down the space requirements far enough, you could store the counts in memory as compactly as possible using a bitset.  If there is no simple way to map ip-address to bitset-address, you'll need to use something like a succinct trie to map them.  A succinct trie would require one byte (amoritized) per ip-group.
And, if you can't lower it far enough, you'll likely need to use a database, and accept the performance hit.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With