
How is the gzip file size encoded?

Tags: c++, c, encoding, gzip

The gzip file format stores the uncompressed (original) file size in the last 4 bytes of the compressed file. The "gzip -l" command reports the compressed and uncompressed sizes, the compression ratio, and the original filename.

Looking around Stack Overflow, I found a couple of mentions of decoding the size stored in the last 4 bytes.

What is the encoding of the size? Is it big-endian (most significant byte first) or little-endian (least significant byte first), and is the value signed or unsigned?

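This code snippet seems to work for me:

#include <stdio.h>
#include <sys/stat.h>

FILE* fh; // assume the file is already open in binary mode ("rb")
unsigned char szbuf[4];
struct stat statbuf;
fstat(fileno(fh), &statbuf); // fstat() takes a file descriptor, not a FILE*
unsigned long clen = statbuf.st_size;
fseek(fh, clen - 4, SEEK_SET); // the size is stored in the last 4 bytes
size_t count = fread(szbuf, 1, 4, fh); // count should be 4 on success
// assemble as little-endian; cast first so the shifts are done in unsigned arithmetic
unsigned long ulen = ((((((unsigned long)szbuf[3] << 8) | szbuf[2]) << 8) | szbuf[1]) << 8) | szbuf[0];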

Here are a few related posts, which seem to imply a little-endian, unsigned 32-bit value (0 to 4 GiB - 1):

Determine uncompressed size of GZIP file

GZIPOutputStream not updating Gzip size bytes

Determine size of file in gzip

gzip.org has more information about gzip.

ChuckCottrill asked Oct 28 '25 16:10


1 Answer

The RFC says the stored size is the original size modulo 2^32, which means an unsigned 32-bit value (uint32_t), and experimentation using a .NET GZipStream shows that it is stored little-endian.

RFC 1952
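
As a rough illustration of the above, here is a minimal C sketch that decodes the trailer as a little-endian uint32_t. RFC 1952 also states that multi-byte numbers in the format are stored least significant byte first, which matches the experiment. The function names (gzip_isize_decode, gzip_isize_read, gzip_isize_expected) are made up for this example and are not part of any library. It assumes the stream was opened in binary mode ("rb"), that the platform supports seeking relative to SEEK_END on such a stream, and that the file holds a single gzip member.

```c
#include <stdint.h>
#include <stdio.h>

/* Decode the 4-byte size trailer: little-endian, unsigned 32-bit.
   (Hypothetical helper name, chosen for this example.) */
uint32_t gzip_isize_decode(const unsigned char b[4])
{
    /* Cast before shifting so the arithmetic is done in uint32_t, not int. */
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Read the size field from the last 4 bytes of an open stream.
   Returns 0 on success, -1 if the seek or the read fails. */
int gzip_isize_read(FILE *fh, uint32_t *out)
{
    unsigned char b[4];

    if (fseek(fh, -4L, SEEK_END) != 0)
        return -1;
    if (fread(b, 1, sizeof b, fh) != sizeof b)
        return -1;

    *out = gzip_isize_decode(b);
    return 0;
}

/* The value the trailer would hold for a given original size:
   the size modulo 2^32, so inputs of 4 GiB or more wrap around. */
uint32_t gzip_isize_expected(uint64_t original_size)
{
    return (uint32_t)(original_size & 0xFFFFFFFFu);
}
```

Because the field is only the size modulo 2^32, the decoded value cannot be trusted for originals of 4 GiB or more: a 5 GiB input would be reported as 1 GiB.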

Medinoc answered Oct 31 '25 06:10