Hy
I am using ITextSharp to parse a pdf file to text output. I want to know if I can catch if the pdf contains subscript or superscript, does anyone knows how to make the difference between a normal character and a superscript in a pdf using ITextSharp, or other library ?
Thanks
Disclaimer: I don't actually have any evidence for this but...
I would expect super/subscript to be identical to normal text. It's the same font, just smaller. If it happens to be on the same line as other text, super/sub scripts are raised and lowered - but you won't be able to detect that with some explicit meta-tag in a layout-oriented format such as PDF.
In other words, I'd guess that you need to identify super/subscripts by heuristics: finding text that's smaller and vertically displaced compared to other text on the "same" line. Whether that's easy to do or not depends on the PDF creator and the details of ITextSharp, since even identifying a "line" is not necessarily straightforward.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With