
How to parse a binary PDF stream of unknown length?

From the PDF docs: "The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone. The sequence of bytes that make up a stream lie between the end-of-line marker following the stream keyword and the endstream keyword; the stream dictionary specifies the exact number of bytes."

Because the contents may be binary, an occurrence of the bytes endstream does not necessarily mark the end of the stream. Consider this file:

%PDF-1.4
%307쏢
5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>
stream
x234+T03203T0^@A(235234˥^_d256220^314^U310^E^@[364^F!endstream
endobj
6 0 obj
30
endobj

The Length is an indirect object that follows the stream. Obviously that length can only be read after the stream has been parsed.

I think allowing Length to be an indirect object that can only be resolved after the stream is a design defect. While it may help PDF writers to output PDFs sequentially, it makes parsing quite difficult for PDF readers. Considering that a PDF file is read more often than it is written, I don't understand this choice.

So how can such a stream be parsed correctly?

U. Windl asked Aug 31 '25 02:08


1 Answer

"The Length is an indirect object that follows the stream. Obviously that length can only be read after the stream has been parsed."

This is an understandable conclusion if one assumes that the file is to be read sequentially beginning to end.

This assumption is incorrect, though, because parsing a PDF from the front and determining the PDF objects on the fly is not the recommended way of parsing a PDF.

While ISO 32000-1 is a bit vague here and merely says

"Conforming readers should read a PDF file from its end."

(ISO 32000-1, section 7.5.5 File Trailer)

ISO 32000-2 clearly specifies:

"With the exception of linearized PDF files, all PDF files should be read using the trailer and cross-reference table as described in the following subclauses. Reading a non-linearized file in a serial manner is not reliable because of the way objects are to be processed after an incremental update. (See 6.3.2, "Conformance of PDF processors".)"

(ISO 32000-2, section 7.5 File structure)
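To make "reading using the cross-reference table" concrete, here is a minimal sketch of parsing one classic cross-reference section into a map from object number to byte offset. The function name `parse_xref_table` is hypothetical, and the sketch assumes a well-formed classic table; it does not handle cross-reference streams (PDF 1.5 and later) or follow the chain of earlier sections written by incremental updates.

```python
import re

# One subsection header: "<first object number> <entry count>".
_SUBSECTION = re.compile(rb"\s*(\d+) (\d+)\s*")
# One entry: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free).
_ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])\s*")


def parse_xref_table(pdf_bytes: bytes, xref_offset: int) -> dict:
    """Return {object number: byte offset} for the in-use entries of the
    classic xref section that starts at xref_offset (hypothetical helper)."""
    if pdf_bytes[xref_offset:xref_offset + 4] != b"xref":
        raise ValueError("offset does not point at an 'xref' keyword")
    pos = xref_offset + 4
    offsets = {}
    while True:
        header = _SUBSECTION.match(pdf_bytes, pos)
        if header is None:
            break  # no further subsection, e.g. the 'trailer' keyword follows
        first, count = int(header.group(1)), int(header.group(2))
        pos = header.end()
        for obj_num in range(first, first + count):
            entry = _ENTRY.match(pdf_bytes, pos)
            if entry is None:
                raise ValueError("malformed xref entry for object %d" % obj_num)
            if entry.group(3) == b"n":  # skip free entries
                offsets[obj_num] = int(entry.group(1))
            pos = entry.end()
    return offsets
```

A real reader would also parse the trailer dictionary that follows the table and, if it has a Prev entry, merge in the older sections so that newer entries take precedence.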

Thus, for your PDF excerpt, a PDF processor trying to read object 5 0

  • looks up object 5 0 in the cross references and gets its offset in the file,
  • goes to that offset and starts reading the object, first parsing the stream dictionary,
  • at the stream keyword recognizes that the object is a stream and retrieves its Length value which happens to be an indirect reference to 6 0,
  • looks up object 6 0 in the cross references and gets its offset in the file,
  • goes to that offset and reads the object, the number 30,
  • reads the stream content of the stream object 5 0 knowing its length is 30.
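The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: `offsets` is a hypothetical dict mapping object numbers to byte offsets that has already been built from the cross references, the function names are made up for this example, and the regular expressions stand in for a real tokenizer (they ignore comments, object streams, and other legal syntax).

```python
import re

# "<num> <gen> obj << ...dictionary... >> stream" followed by CRLF or LF.
_STREAM_HEAD = re.compile(
    rb"\s*(\d+)\s+(\d+)\s+obj\s*<<(.*?)>>\s*stream(?:\r\n|\n)", re.DOTALL)
# /Length as a direct integer, or as an indirect reference "<num> <gen> R".
_LENGTH = re.compile(rb"/Length\s+(\d+)(?:\s+(\d+)\s+R)?")
# An indirect object whose value is a plain integer, e.g. "6 0 obj 30 endobj".
_INT_OBJ = re.compile(rb"\s*(\d+)\s+(\d+)\s+obj\s+(\d+)\s+endobj")
_END = re.compile(rb"(?:\r\n|\r|\n)?endstream")


def read_integer_object(pdf_bytes: bytes, offsets: dict, obj_num: int) -> int:
    """Jump to obj_num via the offsets map and read its integer value."""
    m = _INT_OBJ.match(pdf_bytes, offsets[obj_num])
    if m is None or int(m.group(1)) != obj_num:
        raise ValueError("object %d is not a plain integer object" % obj_num)
    return int(m.group(3))


def read_stream_data(pdf_bytes: bytes, offsets: dict, obj_num: int) -> bytes:
    """Return the raw (still encoded) bytes of stream object obj_num."""
    head = _STREAM_HEAD.match(pdf_bytes, offsets[obj_num])
    if head is None or int(head.group(1)) != obj_num:
        raise ValueError("object %d is not a stream object" % obj_num)
    length = _LENGTH.search(head.group(3))
    if length is None:
        raise ValueError("stream dictionary has no /Length")
    if length.group(2) is not None:
        # Indirect reference: resolve it through the cross references.
        n = read_integer_object(pdf_bytes, offsets, int(length.group(1)))
    else:
        n = int(length.group(1))
    start = head.end()
    data = pdf_bytes[start:start + n]
    # Sanity check: the endstream keyword should follow the counted bytes.
    if len(data) != n or _END.match(pdf_bytes, start + n) is None:
        raise ValueError("no endstream keyword after %d stream bytes" % n)
    return data
```

Because the data is sliced by length, a payload that happens to contain the bytes `endstream` is returned intact; the keyword is used only as a check after the counted bytes.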

An approach like yours, reading the file serially from the front, is explicitly considered "not reliable".


"I think allowing Length to be an indirect object that can only be resolved after the stream is a design defect."

If there were no cross references, you'd be correct. That is also why the FDF format (which does not have mandatory cross references) specifies:

"FDF is based on PDF; it uses the same syntax and has essentially the same file structure (7.5, "File structure"). However, it differs from PDF in the following ways:

[...]

  • The length of a stream shall not be specified by an indirect object."

(ISO 32000-2, section 12.7.8 Forms data format)


Concerning the comments:

"So I'm correct that PDF cannot be parsed sequentially,"

While the original design of PDF was probably meant for sequential parsing, the format has since been developed with access via cross references only in mind. PDF simply is not meant to be parsed sequentially anymore. That was already the case when I started dealing with PDFs in the late 90s.

"and the only reason is that the required length of binary streams may be defined after the stream."

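That is by far not the only reason; there are more situations that require a cross reference lookup to parse correctly. For example, after an incremental update a file can contain several versions of the same object, and only the cross references tell which one is current.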

"As @mkl indicated, a parser has to read somewhere before the end of the PDF file to get startxref, hoping that it does not start parsing in the middle of a binary stream."

That's not correct. The PDF must end with "%%EOF" plus optionally an end-of-line. Before that there must be an end-of-line, before that a number, before that an end-of-line, before that startxref.

This is already expressed clearly in ISO 32000-1:

"The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section."

(ISO 32000-1, section 7.5.5 File Trailer)

Thus, no danger of being "in the middle of a binary stream" if the PDF is valid.
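Since the layout of the file's end is fixed, the offset can be read by matching backwards from the last bytes rather than by guessing where to start. A minimal sketch follows; the function name `find_startxref` and the 1024-byte tail size are assumptions made for illustration, and the sketch expects a valid file whose last lines are exactly startxref, the number, and %%EOF.

```python
import re

# startxref EOL <digits> EOL %%EOF [EOL], anchored at the very end of the data.
_TRAILER_END = re.compile(rb"startxref\s+(\d+)\s+%%EOF\s*$")


def find_startxref(pdf_bytes: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset written after the final startxref keyword."""
    tail = pdf_bytes[-tail_size:]  # only the end of the file is inspected
    match = _TRAILER_END.search(tail)
    if match is None:
        raise ValueError("file does not end with startxref, offset, %%EOF")
    return int(match.group(1))
```

The pattern is anchored at the end of the data, so an earlier startxref left behind by a previous revision, or stray bytes inside a stream, cannot be picked up by accident.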

"The other thing I dislike about the format of PDF is this: When developing a parser, you usually create test files with some elements you are working on. This approach seems to work with everything but streams. The absolute file positions of syntax elements and the requirement for multiple random accesses makes this task harder."

You seem to be under the misconception that PDF is a tagged text format like HTML. This is not the case. Even though many syntactical elements are defined using ASCII keywords and there are "lines", PDF is a binary format. The cross reference tables are not a gimmick but the central access hub to the objects, and the optimization for random access is by design.

mkl answered Sep 03 '25 00:09