I want to store a list of the following tuples in a compressed format and I was wondering which algorithm gives me
My data looks like this:
(<int>, <int>, <double>), 
(<int>, <int>, <double>), 
...
(<int>, <int>, <double>)
One of the two ints refers to a point in time and it's very likely that the numbers ending up in one list are close to each other. The other int represents an abstract id and the values are less likely to be close, although they aren't going to be completely random, either. The double is representing a sensor reading and while there is some correlation between the values, it's probably not of much use.
Since the "time" ints can be close to each other, try to only store the first and after that save the difference to the int before (delta-coding). You can try the same for the second int, too.
Another thing you can try is to reorganize the data from [int1, int2, double], [int1, int2, double]... to [int1, int1...], [int2, int2...], [double, double...].
To find out the compression range your result will be in, you can write your data into a file and download the compressor CCM from Christian Martelock here. I found out that it performs very well for such data collections. It uses a quite fast context mixing algorithm. You can also compare it to other compressors like WinZIP or use a compression library like zLib to see if it is worth the effort.
Here is a common scheme used in most search engines: store deltas of values and encode the delta using a variable byte encoding scheme, i.e. if the delta is less than 128, it can be encoded with only 1 byte. See vint in Lucene and Protocol buffers for details.
This will not give you the best compression ratio but usually the fastest for encoding/decoding throughput.
If I'm reading the question correctly, you simply want to store the data efficiently. Obviously simple options like compressed xml are simple, but there are more direct binary serialization methods. One the leaps to mind is Google's protocol buffers.
For example, in C# with protobuf-net, you can simply create a class to hold the data:
[ProtoContract]
public class Foo {
    [ProtoMember(1)]
    public int Value1 {get;set;}
    [ProtoMember(2)]
    public int Value2 {get;set;}
    [ProtoMember(3)]
    public double Value3 {get;set;}
}
Then just [de]serialize a List or Foo[] etc, via the ProtoBuf.Serializer class.
I'm not claiming it will be quite as space-efficient as rolling your own, but it'll be pretty darned close. The protocol buffer spec makes fairly good use of space (for example, using base-128 for integers, such that small numbers take less space). But it would be simple to try it out, without having to write all the serialization code yourself.
This approach, as well as being simple to implement, also has the advantage of being simple to use from other architectures, since there are protocol buffers implementations for various languages. It also uses much less CPU than regular [de]compression (GZip/DEFLATE/etc), and/or xml-based serialization.
Sort as already proposed, then store
(first ints) (second ints) (doubles)
transposed matrix. Then compressed
Most compression algorithms will work equally bad on such data. However, there are a few things ("preprocessing") that you can do to increase the compressibility of the data before feeding it to a gzip or deflate like algorithm. Try the following:
First, if possible, sort the tuples in ascending order. Use the abstract ID first, then the timestamp. Assuming you have many readings from the same sensor, similar ids will be placed close together.
Next, if the measures are taken at regular intervals, replace the timestamp with the difference from the previous timestamp (except for the very first tuple for a sensor, of course.) For example, if all measures are taken at 5 minutes intervals, the delta between two timestamps will usually be close to 300 seconds. The timestamp field will therefore be much more compressible, as most values are equal.
Then, assuming that the measured values are stable in time, replace all readings with a delta from the previous reading for the same sensor. Again, most values will be close to zero, and thus more compressible.
Also, floating point values are very bad candidates for compression, due to their internal representation. Try to convert them to an integer. For example, temperature readings most likely do not require more than two decimal digits. Multiply values by 100 and round to the nearest integer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With