Going through the PDF spec, it says that the trailer precedes the startxref. To me, that says the xref can appear anywhere in the document, but the trailer still appears before the startxref. This makes sense until you have to parse it: because you have to parse in reverse, you can't take comments or strings into account. Let's get a little more wacky, then.
trailer<< %\
  /Size 4 %\
  /Root 1 0 R %\
  /Info 4 0 R %\
  /Key (\
trailer<< %\
  /Size 4 %\
  /Root 2 0 R %\
  /Info 3 0 R %\
>>%)
>>%)
% test test )
startxref
 15
%%EOF
Which is a perfectly valid trailer. The first one is the real trailer, but the second one is inside a "string". In this case, reverse parsing is going to fail to catch the comments, and looking for the keyword trailer is going to fail if it's part of a comment or string. What is the best way of finding out where the trailer starts?
Update - This trailer seems to open in Acrobat Reader
%PDF-1.3
%âãÏÓ
xref
0 4
00000000 65535 f
00000110 00000 n
00000250 00000 n
00000315 00000 n
00000576 00000 n
1 0 obj <<
  /Type /Catalog
  /Pages 2 0 R
  /OpenAction [ 3 0 R /XYZ null null null ]
  /PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
  /Type /Pages
  /Kids [ 3 0 R ]
  /Count 1
>>
endobj
3 0 obj <<
  /Type /Page
  /Parent 2 0 R
  /Resources << >>
  /MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
  /Producer (Me)
  /CreationDate (D:20110626000000Z)
>>
endobj
trailer<< %\
  /Size 4 %\
  /Root 1 0 R %\
  /Info 4 0 R %\
  /Key (\
trailer<< %\
  /Size 4 %\
  /Root 2 0 R %\
  /Info 3 0 R %\
>>%)
>>%)
% test test )
startxref
 15
%%EOF
As far as syntax goes, this conforms to the spec. Somehow readers seem to know whether they are in a comment or a string. Parsing left to right, the second trailer is in a string with a % tacked on, with a comment after the trailer. But parsing right to left, you have no idea whether the first ) is part of a comment or the end of a string definition.
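To make the left-to-right view concrete, here is a minimal sketch of a forward scanner that tracks comment and literal-string state and reports only the trailer keywords found outside both. The function name find_trailer_keywords is made up for illustration, and the sketch assumes the bytes it is given contain no stream data or hex strings, which a real parser would have to skip as well.

```python
def find_trailer_keywords(data: bytes) -> list[int]:
    """Return offsets of `trailer` keywords outside comments and literal strings.

    Assumption: `data` holds no stream bodies or hex strings.
    """
    hits = []
    i, n = 0, len(data)
    depth = 0  # nesting depth of literal-string parentheses
    while i < n:
        c = data[i:i + 1]
        if depth > 0:
            if c == b"\\":
                # Skip the escaped byte: covers \( \) \\ and line continuation.
                i += 2
                continue
            if c == b"(":
                depth += 1
            elif c == b")":
                depth -= 1
            i += 1
        elif c == b"%":
            # Outside a string, a comment runs to the end of the line.
            while i < n and data[i:i + 1] not in (b"\r", b"\n"):
                i += 1
        elif c == b"(":
            depth = 1
            i += 1
        elif data.startswith(b"trailer", i):
            hits.append(i)
            i += len(b"trailer")
        else:
            i += 1
    return hits
```

Run over the trailer from the question, this reports only the first trailer keyword; the second one is swallowed by the /Key string. There is no equivalent state to carry when scanning from the right, which is the whole problem.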
Another Example:
%PDF-1.3
%âãÏÓ
xref
0 8
0000000000 65535 f
0000000210 00000 n
0000000357 00000 n
0000000428 00000 n
0000000533 00000 n
0000000612 00000 n
0000000759 00000 n
0000000830 00000 n
0000000935 00000 n
1 0 obj <<
  /Type /Catalog
  /Pages 2 0 R
  /OpenAction [ 3 0 R /XYZ null null null ]
  /PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
  /Type /Pages
  /Kids [ 3 0 R ]
  /Count 1
>>
endobj
3 0 obj <<
  /Type /Page
  /Parent 2 0 R
  /Resources << >>
  /MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
  /Producer (Me)
  /CreationDate (D:20110626000000Z)
>>
endobj
5 0 obj <<
  /Type /Catalog
  /Pages 6 0 R
  /OpenAction [ 7 0 R /XYZ null null null ]
  /PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
6 0 obj <<
  /Type /Pages
  /Kids [ 7 0 R ]
  /Count 1
>>
endobj
7 0 obj <<
  /Type /Page
  /Parent 6 0 R
  /Resources << >>
  /MediaBox [ 0 0 100 100 ]
>>
endobj
8 0 obj <<
  /Producer (Me)
  /CreationDate (D:20110626000000Z)
>>
endobj
trailer<< %\
  /Size 8 %\
  /Root 1 0 R %\
  /Info 4 0 R %\
  /Key (\
trailer<< %\
  /Size 8 %\
  /Root 5 0 R %\
  /Info 8 0 R %\
>>%)
>>%)
% test test )
startxref
 17
%%EOF
This example is displayed correctly in Adobe. In my last case, you claimed it would fail because the "root" node is invalid, but in this new sample the root is valid; it's just never actually used. So shouldn't it display a 100x100 window instead of the 8.5"x11"?
In regard to Resources, the spec says:
  (Required; inheritable) A dictionary containing any resources required by the page 
(see Section 3.7.2, “Resource Dictionaries”). If the page requires no resources, the 
value of this entry should be an empty dictionary. Omitting the entry entirely
indicates that the resources are to be inherited from an ancestor node in the page 
tree.
The startxref statement usually is at the end of the file, with the trailer preceding it.
Update: The introductory sentence above was not formulated clearly enough, as Jeremy Walton correctly observed (though later comments in my answer hinted at the exceptions). It should have read: "The startxref statement usually appears at the end of the file as a single instance, with the trailer preceding it (unless your file has undergone incremental updates, in which case you may have several cross-reference sections, each with its own trailer)."
If there are comments sprinkled into the PDF, they count the same as "real" PDF page description code when it comes to byte counting for the xref table's byte offsets. Therefore, it is not a problem to parse it correctly.
To quote straight "from the horse's mouth" (PDF specification ISO 32000-1, Section 7.5.5):
"The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets [...]"
The key expression to take into account here is "LAST cross-reference section".
If you have updated trailers in mind, then have a look at Section 7.5.6.
Yes, you have to parse in reverse. The first cross-reference section to read is the last one appearing in the file -- and it will have a preceding last trailer. The second one to read is the last-but-one appearing in the file -- with a preceding last-but-one trailer. And so on. If you have to read more than one trailer/xref section, each one you read has to contain a reference to the next one to read (the /Prev entry in its trailer).
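As a rough illustration of "reading from the end", the sketch below pulls the offset that follows the last startxref keyword. The function name last_startxref and the 1024-byte search window are assumptions made for this example, not something the spec prescribes.

```python
import re


def last_startxref(data: bytes, tail: int = 1024) -> int:
    """Return the byte offset stored after the last `startxref` keyword.

    Assumption: the keyword sits within the last `tail` bytes of the file.
    """
    window = data[-tail:]
    pos = window.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword near the end of the file")
    # The keyword must be followed by the offset and the end-of-file marker.
    m = re.match(rb"startxref\s+(\d+)\s+%%EOF", window[pos:])
    if m is None:
        raise ValueError("startxref is not followed by an offset and the EOF marker")
    return int(m.group(1))
```

The returned number is where the last cross-reference section starts; its trailer then tells you, via /Prev, where the previous section lives.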
Should you think of "comments" as something you can freely insert into a PDF without corrupting its structure, then think again. Once you have inserted comments, you have to update at least the xref table (and maybe the /Length keys of objects).
Update 2: The trailer<<...>> dictionary Jeremy constructed is probably not even a valid dictionary at all, therefore it's also not a valid trailer dictionary...
Anyway, according to the spec, the trailer dictionary must consist of "a series of key-value pairs". The 'legal' keys in the trailer dictionary are limited to a quite narrow set, some of which are even optional (see Table 15 in Section 7.5.5).
Jeremy seems to have constructed his example so that this snippet could be (mis-)understood as a potentially valid trailer dictionary:
trailer<<%) >>
% test test )
Which of course isn't a dictionary at all, since we don't see any key-value pair here.
His full example isn't valid either, because the "key" called /Key isn't amongst the valid key names for the trailer (which are, according to Table 15: /Size, /Prev, /Root, /Encrypt, /Info, /ID, /XRefStm).
So in his PDF parsing code Jeremy should do what all sane and even most insane PDF processing libraries do: give up on obviously invalid constructs instead of searching for sense in them, and tell the user that "your damn PDF is corrupt because we cannot identify valid keys in the supposed trailer section of the file".
Q: Doc, it hurts when I do this.
A: Don't do that.
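The correct way to parse the end of a PDF goes something like this: back up from the end of the file to find startxref, seek to the byte offset it gives, and read the xref subsections there until you reach the trailer keyword.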
You don't really have to parse out the object numbers and byte offsets and so forth if you're just trying to find the trailer. All you need to do is look to see how many entries are in a given subsection of the xref, skip 20*N bytes, and check for another subsection (or "trailer"). When you finally hit "trailer" instead of numbers, you're there.
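The subsection-skipping idea can be sketched roughly as follows. The function name trailer_offset is hypothetical, and the sketch assumes every xref entry is exactly 20 bytes, as the spec requires; a file with sloppier entries would need a more forgiving reader.

```python
import re

# A subsection header is "first_object_number count" on its own line.
_HEADER = re.compile(rb"(\d+) (\d+)[ \t]*(?:\r\n|\r|\n)")
_XREF = re.compile(rb"\s*xref\s+")
_TRAILER = re.compile(rb"\s*trailer")


def trailer_offset(data: bytes, xref_pos: int) -> int:
    """Return the offset of the `trailer` keyword after the xref at xref_pos.

    Assumption: every xref entry is exactly 20 bytes long.
    """
    m = _XREF.match(data, xref_pos)
    if m is None:
        raise ValueError("no xref keyword at the given offset")
    pos = m.end()
    while True:
        h = _HEADER.match(data, pos)
        if h is None:
            break  # not a subsection header, so it should be the trailer
        count = int(h.group(2))
        pos = h.end() + 20 * count  # skip the entries without parsing them
    t = _TRAILER.match(data, pos)
    if t is None:
        raise ValueError("expected a subsection header or the trailer keyword")
    return t.end() - len(b"trailer")
```

Because this walks forward from a known offset, it never has to guess whether a given trailer sits inside a comment or a string.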
So why on Earth do you just want the trailer?
When I went hunting through the PDF Reference, I expected to find some line of text stating that the header/body/xref/trailer had to be in that order. I did not.
What I DID find, was this:
A basic conforming PDF file shall be constructed of following four elements (see Figure 2):
- A one-line header...
- A body...
- A cross-reference table...
- A trailer...
There are bullets in front of these sections, not numbers.
So that hints that a conforming PDF can get away with swapping the order of the body and xref. On the other hand, the header is required to be first, the trailer is required to be last, and all the sections of a PDF are listed in that order. This implies order, but won't hold up in court.
But if you look at Figure 2 (of chapter 7, section 5.1), entitled "Initial Structure of a PDF file", you'll see the order defined visually. That's a tad thin, but I'll cling to it anyway.
I wouldn't be at all surprised to find that a PDF that put its body after the xref table broke some PDF viewers (particularly a malformed PDF where the program tried to fix it).
I've been working with PDF files for well over a decade. In all that time, I have never seen a PDF where the xref came before the body. And I've seen some REALLY screwed up PDFs.
So while my "correct way to parse a PDF" may not be Iron Clad, it's still pretty durable.
And if you absolutely insist on backing up to find the keyword "trailer", then you can look for "close an array or dictionary" tokens after you parse out the trailer you found. If it were wrapped in a string, all the name slashes would have to be escaped, leading to Bad Parsing. You can't have spaces in a Name... so that leaves just array and dictionary.
But the odds of you ever encountering this problem in Real Life are astronomically small, unless you set out to break PDF software and create these PDFs yourself. That would bring your motives into question.