
ITextSharp 4.1.6 extract PDF content as text

The company wants to use iTextSharp 4.1.6 specifically and does not want to buy a license for version 5 or 7. We had already implemented text extraction from PDFs with iTextSharp 5, but after the downgrade that method is not available in the 4.1.6 LGPL version.

I searched Stack Overflow and other sites for an answer, but found no custom implementation. The only approach I found is the code below, which exists only in the AGPL version.

PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy())

Also, byte[] pageContent = reader.GetPageContent(i); returns the page's raw content bytes, but converting them to a string does not give the actual text of the file.

Since we do not wish to buy a license for the AGPL version but still need to implement PDF text extraction, does any other tool support this, or does anybody have an implementation of a text extractor?

Any suggestions would be greatly appreciated.

Edit: Reference for @jgoday's answer: (image)

Lak asked Nov 01 '25 16:11


1 Answer

With iText 4.1 you can use PdfContentParser (https://github.com/schourode/iTextSharp-LGPL/blob/f75cdad88236d502af42458a420d48be2a47008f/src/core/iTextSharp/text/pdf/PdfContentParser.cs) to parse the contents of each page.

using System;
using System.Text;
using iTextSharp.text.pdf;

namespace PdfExtractor
{
    class Program
    {
        static void Main(string[] args)
        {
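            var reader = new PdfReader(@"D:\Tmp\sample.pdf");

            try
            {
                // Page numbers are 1-based; this reads the content stream of page 2.
                var parser = new PdfContentParser(new PRTokeniser(reader.GetPageContent(2)));

                var sb = new StringBuilder();

                // Walk the content stream token by token and keep only string tokens,
                // which hold the text shown on the page.
                while (parser.Tokeniser.NextToken())
                {
                    if (parser.Tokeniser.TokenType == PRTokeniser.TK_STRING)
                    {
                        string str = parser.Tokeniser.StringValue;
                        sb.Append(str);
                    }
                }

                Console.WriteLine(sb.ToString());
            }
            finally
            {
                // Release the file handle even if parsing fails.
                reader.Close();
            }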
        }
    }
}
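
The example above reads only page 2. As a rough sketch, the same token loop can be run over every page by iterating from 1 to reader.NumberOfPages. The class name AllPagesExtractor and the helper ExtractAllPages are made up for illustration, and the input path is taken from the command line as an assumption. This is not a replacement for PdfTextExtractor: string tokens are appended in content-stream order with no spacing logic, and text drawn with custom font encodings may come out unreadable.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;

namespace PdfExtractor
{
    // Hypothetical standalone program; build it separately from the example above.
    static class AllPagesExtractor
    {
        // Hypothetical helper: concatenates the string tokens of every page.
        static string ExtractAllPages(string path)
        {
            var reader = new PdfReader(path);
            var sb = new StringBuilder();

            try
            {
                // Pages are numbered from 1 to NumberOfPages inclusive.
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    var tokeniser = new PRTokeniser(reader.GetPageContent(page));

                    // Same filter as above: keep only string tokens.
                    while (tokeniser.NextToken())
                    {
                        if (tokeniser.TokenType == PRTokeniser.TK_STRING)
                        {
                            sb.Append(tokeniser.StringValue);
                        }
                    }

                    // Separate pages with a line break.
                    sb.AppendLine();
                }
            }
            finally
            {
                reader.Close();
            }

            return sb.ToString();
        }

        static void Main(string[] args)
        {
            if (args.Length != 1)
            {
                Console.Error.WriteLine("Usage: AllPagesExtractor <file.pdf>");
                return;
            }

            Console.WriteLine(ExtractAllPages(args[0]));
        }
    }
}
```

If the output runs words together, one possible refinement is to append a space or line break after each string token, at the cost of splitting words that the PDF stores in several fragments.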

jgoday answered Nov 03 '25 04:11