
How do I read Japanese characters from a PDF?

I'm using iText 7 in C# to parse a PDF file that contains Japanese characters, like so:

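    using System.Text;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;
    using iText.Kernel.Pdf.Canvas.Parser.Listener;

    public static string ExtractTextFromPDF(string filePath)
    {
        var sb = new StringBuilder();
        // The using blocks close the document and reader even if extraction throws.
        using (var pdfReader = new PdfReader(filePath))
        using (var pdfDoc = new PdfDocument(pdfReader))
        {
            for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
            {
                // A fresh strategy per page keeps text from accumulating across pages.
                var strategy = new SimpleTextExtractionStrategy();
                sb.Append(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
            }
        }
        return sb.ToString();
    }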

But I run into the exception:

iText.IO.IOException: 'The CMap iText.IO.Font.Cmap.UniJIS-UTF16-H was not found.'

I've searched for how to add this CMap, but I haven't found anything that works for Japanese characters. If another library is better suited to this, that would also be OK. Any help?

Thanks

jsmars asked Oct 16 '25 18:10


1 Answer

The encoding CMaps, in particular those for CJK scripts, are shipped in a separate package.

For .NET, use itext7.font-asian via NuGet.

For Java, use com.itextpdf:font-asian via Maven.

The existence of this package is more visible for the Java version than for the .NET version.
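
As a rough sketch of what this looks like on the .NET side: once the package is referenced (for example with `dotnet add package itext7.font-asian`, ideally in the same version as your itext7 package), the extraction code itself should not need to change. As far as I know, iText finds the CMap resources in the referenced assembly at run time, so no explicit registration call is shown here; treat that as an assumption to verify against your iText version. The file names `input.pdf` and `output.txt` are hypothetical placeholders, and writing to a UTF-8 file is only a suggestion, since a console may not render Japanese text correctly.

```csharp
using System.IO;
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class Program
{
    public static void Main(string[] args)
    {
        // Hypothetical default paths, used only for illustration.
        string inputPath = args.Length > 0 ? args[0] : "input.pdf";
        string outputPath = args.Length > 1 ? args[1] : "output.txt";

        var sb = new StringBuilder();

        // Assumption: the project references the itext7.font-asian package,
        // so CMaps such as UniJIS-UTF16-H can be resolved during extraction.
        using (var pdfDoc = new PdfDocument(new PdfReader(inputPath)))
        {
            for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
            {
                // One strategy instance per page, as in the question's code.
                var strategy = new SimpleTextExtractionStrategy();
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
            }
        }

        // Write the result as UTF-8 so the Japanese characters survive intact.
        File.WriteAllText(outputPath, sb.ToString(), Encoding.UTF8);
    }
}
```

If the same exception still appears after adding the package, a version mismatch between itext7 and itext7.font-asian is one thing worth checking.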

mkl answered Oct 19 '25 08:10


