
Compression using prime numbers

I think I've found a way to do lossless compression using prime numbers, or at least to transform data so that other compression methods can be applied to it again.

There are 54 prime numbers in the range 0-255. When a byte in the array is prime, we can store the index of that prime instead of the value itself, using 6 bits instead of 8, right?

We also need a map with one bit per byte, recording which bytes were compressed this way, and append it to the data.

At first this seems to work: it increases the file size slightly, but it transforms the file into a structure that algorithms like LZ4 or Huffman coding can compress further.

The indices 0-53 correspond to the primes 2-251, which leaves the 6-bit codes 54-63 unused. These 10 codes are an opportunity to store the 10 most frequent non-prime byte values in 6 bits as well, so we can squeeze in more values. If the fraction of encodable bytes exceeds 50%, we will be able to perform successful compression.
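To make the idea concrete, here is a minimal Python sketch of my reading of the scheme (the names `encode`/`decode` are illustrative, and the bitmap and code list are left unpacked; a real implementation would bit-pack the 6-bit codes, and each byte then costs one bitmap bit while each prime byte saves two code bits, which is the source of the 50% break-even point):

```python
def primes_below(n):
    """Sieve of Eratosthenes: list of primes < n."""
    sieve = bytearray([1]) * n
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = bytearray(len(range(p * p, n, p)))
    return [i for i in range(n) if sieve[i]]

PRIMES = primes_below(256)                 # the 54 primes 2..251
INDEX = {p: i for i, p in enumerate(PRIMES)}

def encode(data):
    """Return (bitmap, codes): bitmap marks prime bytes; codes hold
    a 6-bit prime index for primes, the raw 8-bit value otherwise."""
    bitmap = [1 if b in INDEX else 0 for b in data]
    codes = [INDEX.get(b, b) for b in data]
    return bitmap, codes

def decode(bitmap, codes):
    """Invert encode(): look prime indices back up in PRIMES."""
    return bytes(PRIMES[c] if flag else c
                 for flag, c in zip(bitmap, codes))
```

For example, `encode(bytes([2, 4, 251, 100]))` marks the first and third bytes as primes (indices 0 and 53) and leaves 4 and 100 as raw values.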

In addition, this method could make already-compressed data compressible again. If we apply the same idea to uint and ulong values, i.e. 4- or 8-byte blocks instead of single bytes, wouldn't the map of compressed positions also shrink?

There are 203,280,221 prime numbers less than 2^32. This means we can store each prime in 28 bits instead of 32.
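As a quick sanity check on that bit count (a one-liner using only the standard library):

```python
import math

# pi(2**32) = 203,280,221 primes below 2**32; an index into that
# list fits in ceil(log2(203280221)) bits, since
# 2**27 = 134,217,728 < 203,280,221 <= 268,435,456 = 2**28.
bits = math.ceil(math.log2(203280221))
print(bits)  # 28
```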

By looking at the ratio of prime numbers in the data array, we can determine whether we can perform efficient compression.

I think this method is not an effective compression method in itself, but rather a data transformation that may allow re-compression of already-compressed files.

I would like to ask whether you know of other methods we could build on prime numbers. Obtaining primes is very easy, especially for numbers below 2^32, so it seems worth adding prime-based methods to our toolbox.

asked Oct 20 '25 12:10 by Ayhan ARICAN


1 Answer

One efficient and practical compression method for 32-bit primes is storing the halved difference between consecutive primes using simple 8-bit bytes (and pulling the prime 2 out of thin air when needed). With a bit of indexing magic this gives superfast 'decompression' - i.e. it is an ideal form of representation when iteration over prime ranges is required since it uses less memory than plain primes.
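A minimal sketch of this representation (illustrative Python, not the optimised form described above): since all gaps between odd primes are even, halving them keeps each value in a single byte over the stated range.

```python
def encode_gaps(primes):
    """primes: ascending primes starting [2, 3, ...]. Drop the
    prime 2, then store halved gaps between consecutive odd primes;
    every such gap is even, so halving loses nothing."""
    odd = primes[1:]
    return bytes((b - a) // 2 for a, b in zip(odd, odd[1:]))

def decode_gaps(gaps):
    """Rebuild the prime list, pulling 2 (and the start 3)
    'out of thin air'."""
    primes, p = [2, 3], 3
    for g in gaps:
        p += 2 * g
        primes.append(p)
    return primes
```

For example, `[2, 3, 5, 7, 11, 13]` encodes to the gap bytes `1, 1, 2, 1`, and decoding walks forward from 3 in steps of twice each gap.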

The 203,280,220 odd primes below 2^32 need a corresponding number of bytes, i.e. a bit less than 194 MB. The usability of byte-sized halved differences extends to 304,599,508,537 (roughly 2^38) but from that point on it is necessary to invent a scheme for coding occasional gaps greater than 510. However, when storing gap indices instead of halved gap widths then the scheme can pretty much go on forever (see Prime gap article @ Wikipedia).

These blocks of consecutive differences can be further compressed using the usual zlib-style compression or similar, in which case you get better compression than when compressing a packed odds-only bitmap or a mod 30 wheel. This could be used as a compact on-disk representation.
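A hedged illustration of that step (the exact compressed size depends on the range and the zlib level, so no figure is claimed here): the halved-gap stream is dominated by small repeated values like 1, 2 and 3, which deflate handles well.

```python
import zlib

def primes_below(n):
    """Sieve of Eratosthenes: list of primes < n."""
    sieve = bytearray([1]) * n
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = bytearray(len(range(p * p, n, p)))
    return [i for i in range(n) if sieve[i]]

ps = primes_below(1_000_000)
# One byte per halved gap between consecutive odd primes...
gaps = bytes((b - a) // 2 for a, b in zip(ps[1:], ps[2:]))
# ...then deflate the highly repetitive gap stream.
packed = zlib.compress(gaps, 9)
print(len(gaps), len(packed))
```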

Note, however, that a mod 30 wheel takes only 136 MB and it is directly indexable - like for satisfying primality queries. Extracting (decoding) ranges of primes is quite a bit slower than for delta-coded primes but still faster than sieving the primes afresh. Hence storing primes using the mod 30 wheel is one of the most efficient ways for 'compressing' primes in a way that keeps them directly available/usable.
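A toy version of such a directly indexable mod 30 wheel might look like this (the function names and the naive sieve are mine, not a reference implementation; at 2^32 the bitmap is 2^32/30 bytes, which is the 136 MB figure above). Each block of 30 numbers maps to one byte, with one bit per residue coprime to 30; the primes 2, 3 and 5 never land on a spoke and are handled as special cases.

```python
RESIDUES = (1, 7, 11, 13, 17, 19, 23, 29)   # spokes of the mod 30 wheel
BIT = {r: i for i, r in enumerate(RESIDUES)}

def is_prime(wheel, n):
    """Direct primality query against the wheel bitmap."""
    if n in (2, 3, 5):
        return True
    b = BIT.get(n % 30)                      # off-spoke numbers are composite
    return b is not None and bool(wheel[n // 30] >> b & 1)

def build_wheel(limit):
    """One byte per 30 numbers; sieve out composite spoke positions."""
    wheel = bytearray([0xFF]) * (limit // 30 + 1)
    wheel[0] &= ~1                           # 1 sits on spoke 0 but isn't prime
    for n in range(7, int(limit ** 0.5) + 1):
        if is_prime(wheel, n):
            for m in range(n * n, limit + 1, n):
                b = BIT.get(m % 30)
                if b is not None:
                    wheel[m // 30] &= ~(1 << b)
    return wheel
```

The point of the representation is the query path: `n // 30` indexes the byte and `n % 30` selects the bit, with no decoding pass.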

Mixed forms can also be quite successful. For example, when decoding primes from a mod 30 wheel you can store them directly in delta-coded form without materialising them as vectors of numbers. This could be viewed as a form of re-compression. The mod 30 wheel would be the on-disk and main memory format, the delta coding would be the interchange format for the results of certain queries.

Delta coding and wheeled storage share one important advantage: their storage requirements are basically independent of the size of the represented primes. By contrast, if you store the primes as numbers then you get logarithmically increasing storage requirements, i.e. a 64-bit prime takes four times as much space as a 16-bit prime.

An explanation of prime wheels can be found in the Wikipedia article Wheel factorization.

The following table shows the space reduction that you get when adding more and more primes to the wheel, both relative to an unwheeled representation ('ratio') and relative to the preceding step ('delta'). The column 'spokes' gives the number of potentially prime-bearing spokes (residues) per wheel rotation, which gives you an idea of the complexity of an optimised, fully unrolled implementation à la Kim Walisch's primesieve. As you can see the gains are diminishing rapidly while complexity is exploding.

[Image: wheel efficiency table — spokes, ratio and delta per wheel modulus]
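The spoke counts and ratios can be recomputed from first principles: a wheel mod W, where W is a primorial (product of the first k primes), has phi(W) spokes, and its bitmap is phi(W)/W the size of an unwheeled one.

```python
from math import prod

def phi_primorial(ps):
    """Euler's totient of a product of distinct primes ps."""
    return prod(p - 1 for p in ps)

FIRST_PRIMES = [2, 3, 5, 7, 11]
for k in range(1, len(FIRST_PRIMES) + 1):
    ps = FIRST_PRIMES[:k]
    W = prod(ps)                      # wheel modulus (primorial)
    spokes = phi_primorial(ps)        # prime-bearing residues per rotation
    print(f"mod {W:5d}: {spokes:4d} spokes, ratio {spokes / W:.4f}")
```

This reproduces the figures discussed in the text: 8 spokes and ratio 4/15 ≈ 0.267 for mod 30, 48 spokes for mod 210, 480 for mod 2310, with clearly diminishing returns.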

The wheel mod 2 - a.k.a. odds-only sieve - is important because it gives you most of the bang at virtually no cost in code complexity. Most of the speed benefits of the mod 3 wheel can be reaped with almost no effort by sleight-of-hand (indexing tricks) when operating an odds-only sieve. The wheel mod 30 is important because it offers a good balance between code complexity and speedup, which makes it an ideal target for optimised implementations. The wheels up to mod 30030 have been used in practical implementations but the gain compared to the mod 30 wheel is actually negligible while the increased complexity and lost performance are not. Even higher-order wheels have been seen on university campuses, in the context of special projects.

One of the reasons why Kim Walisch's sieve program is the fastest that you can find anywhere on the Net is that he implemented unrolled versions of each of the 8 possible striding sequences of the mod 30 wheel, with specific loop entry points for each of the 8 possible phases of each sequence. All that in the only language (C/C++) for which you can get optimising compilers that are actually worthy of the name.

Doing the same for the 48 residues of the wheel mod 210 would mean 36 times as much code (!) while the reduction in bitmap size is negligible. However, mod 210 and mod 2310 might be promising targets for a code generator (C/C++ code or .NET IL), if you have a couple months of spare time to burn.

answered Oct 23 '25 07:10 by DarthGizka


