Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What encoding/compression algorithm is this?

I'm trying to reverse engineer a binary file format, but it has no magic bytes, and no specific extension. I can influence only a single aspect of the file: a short string. By trying different strings, I was able to figure out how data is stored in the file. It seems that the whole file uses some sort of simple encoding. I hope that finding the exact encoding allows me to narrow down my search for the file format. I know the file is generated by a Windows program written in C++.

Now, after much trial-and-error, I found that some sections of the file are encoded in runs. Each run starts with a byte that indicates how many bytes will follow and where the data is retrieved.

  • 000ddddd (1 byte)
    Take the following (ddddd)+1 bytes from the encoded data.
  • 111····· ···ddddd ···bbbbb (3 bytes)
    Go back (bbbbb)+1 bytes in the decoded data, and take the next (ddddd)+9 bytes from it.
  • ddd····· ··bbbbbb (2 bytes)
    Go back (bbbbbb)+1 bytes in the decoded data, and take the next (ddd)+2 bytes from it.

Here's an example:

This is the start of the file, with the UTF-16 string abracadabra encoded in it:

First 224 bytes of the file

  .     .  .  a  .  b  .  r     .  .  c     .  .  d     .  €  .
  0C 20 03 04 61 00 62 00 72 20 05 00 63 20 03 00 64 20 03 80 0D

To decode the string:

  0C                      number of Unicode chars: 12 (11 chars + \0)
  20 03       . . .       ??
  04                      next 5
  61 00       a .
  62 00       b .
  72          r
  20 05       . a .       back 6, take 3
  00                      next 1
  63          c
  20 03       . a .       back 4, take 3
  00                      next 1
  64          d
  20 03       . a .       back 4, take 3
  80 0D       b . r . a . back 14, take 6

This results in (UTF-16):

  a  .  b  .  r  .  a  .  c  .  a  .  d  .  a  .  b  .  r  .  a  .
  61 00 62 00 72 00 61 00 63 00 61 00 64 00 61 00 62 00 72 00 61 00

However, I have no clue as to what encoding/compression algorithm this might be. It looks like some variant of LZ that doesn't use a dictionary (like LZ77), but so far I haven't been able to find any algorithm that matches this description. I'm also not sure whether the entire file is encoded like this, or only portions of it.

Do you know this encoding? Or do you have any hints for things I might look for in the file to identify the encoding?

like image 714
Daniel A.A. Pelsmaeker Avatar asked Mar 26 '26 06:03

Daniel A.A. Pelsmaeker


1 Answers

After your edit I think it's LZF with the following differences to your observations:

  • The magic header and indication of compressed vs uncompressed has been removed in your example (not too surprising if it's embedded in a file).
  • You took the block length as one byte, but it is two bytes and big-endian, so the prior 0x00 is part of the length, which still works.
like image 109
Ben Jackson Avatar answered Mar 28 '26 00:03

Ben Jackson



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!