
HTML to PDF with German alphabet

I'm using openhtmltopdf to transform HTML to PDF. Currently I get an exception if the HTML contains German characters written as named entities, for example ä, ö, ü.

  PdfRendererBuilder builder = new PdfRendererBuilder();
  builder.useFastMode();
  builder.withHtmlContent(html,"file://localhost/");
  builder.toStream(out);
  builder.run();

org.xml.sax.SAXParseException; lineNumber: 17; columnNumber: 31; The entity "auml" was referenced, but not declared.

Here is my HTML:

<html>
    <head>
        <meta charset="UTF-8" />
    </head>
    <body>
        k&auml;se
    </body>
</html>

The word I want to export is "käse" (cheese).


UPDATE

I have tried using an entity resolver, like this:

  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
  DocumentBuilder builder = null;
  try {
      builder = factory.newDocumentBuilder();

      ByteArrayInputStream input = new ByteArrayInputStream(html.getBytes("UTF-8"));
      builder.setEntityResolver(FSEntityResolver.instance());
      org.w3c.dom.Document doc = builder.parse(input);
  } catch (Exception e) {
      logger.error(e.getMessage(), e);
  }

but I still get the same exception at the call to parse.

Neo asked Oct 16 '25 04:10


1 Answer

It looks like you need to either provide a DTD that declares the entity, or replace the entity name auml with its corresponding hex or decimal character reference, i.e. &#xE4; or &#228; respectively. See A.2. Entity Sets and HTML 4 Entity Names.

With an internal DTD subset that declares the entity, the HTML content would look like this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html [
        <!ENTITY auml "&#228;">
]>
<html>
    <head>
    </head>
    <body>
        k&auml;se
    </body>
</html>
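
If the HTML arrives as a string, the prolog can also be added in code. The following is only a minimal sketch: the class `DoctypePrepender` and its method names are hypothetical, the entity list covers just a few German characters, and it assumes the incoming string starts at the `<html>` element without its own XML declaration or DOCTYPE. It parses the result with the standard JDK `DocumentBuilder` to show that `&auml;` now resolves.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of openhtmltopdf.
final class DoctypePrepender {

    // Internal DTD subset declaring a few German entities (assumed list, extend as needed).
    private static final String PROLOG =
            "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
            + "<!DOCTYPE html [\n"
            + "  <!ENTITY auml \"&#228;\">\n"
            + "  <!ENTITY ouml \"&#246;\">\n"
            + "  <!ENTITY uuml \"&#252;\">\n"
            + "  <!ENTITY szlig \"&#223;\">\n"
            + "]>\n";

    // Assumes html has no XML declaration or DOCTYPE of its own.
    static String withEntityDeclarations(String html) {
        return PROLOG + html;
    }

    // Parses with the JDK's default XML parser; wraps checked exceptions for brevity.
    static Document parse(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body>k&auml;se</body></html>";
        Document doc = parse(withEntityDeclarations(html));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The string returned by `withEntityDeclarations` could then be handed to `withHtmlContent` in place of the original HTML.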

Alternatively, you can run through the HTML string and replace the entity names with their corresponding decimal or hex values, or simply prepend the DTD to your HTML string before passing it to the PDF builder.
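
The replacement approach could be sketched roughly as follows. `NamedEntityReplacer` and its lookup table are hypothetical and cover only a few German entities; anything not in the table, including XML's predefined entities such as `&amp;`, is left untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites known named entities as decimal character references.
final class NamedEntityReplacer {

    // Assumed, deliberately small table; extend with further HTML 4 entity names as needed.
    private static final Map<String, Integer> CODE_POINTS = Map.of(
            "auml", 0xE4, "ouml", 0xF6, "uuml", 0xFC,
            "Auml", 0xC4, "Ouml", 0xD6, "Uuml", 0xDC,
            "szlig", 0xDF);

    // Matches named references such as &auml; but not numeric ones such as &#228;.
    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String replaceNamedEntities(String html) {
        Matcher m = NAMED_ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            // Unknown names are copied through unchanged.
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceNamedEntities("k&auml;se")); // prints k&#228;se
    }
}
```

The result of `replaceNamedEntities(html)` can be passed to the PDF builder instead of the raw string. Note that entities missing from the table will still trigger the same parse error.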


Update

You might want to give the jsoup library a try. It parses the HTML and can provide you with an org.w3c.dom.Document, e.g.

Document jsoupDoc = Jsoup.parse(html); // org.jsoup.nodes.Document
W3CDom w3cDom = new W3CDom(); // org.jsoup.helper.W3CDom
org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(jsoupDoc);

You can then pass the w3cDoc to the PDF builder like so:

PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withW3cDocument(w3cDoc, "file://localhost/");
builder.toStream(out); // same output stream as in the question
builder.run();

Kenan Güler answered Oct 17 '25 18:10


