I am writing a Python script to parse the content of a WordPress export XML file (WP XML) and generate a LaTeX document. So far the WP XML is parsed via lxml.etree, and the code generates a new XML tree to be processed by TeXML, which in turn generates the .tex file.
Currently I extract each post along with certain metadata (title, publication date, tags, content). The metadata poses no problem, but the content is trickier. Inside the WP XML, the content is stored in a CDATA section as plain HTML/WordPress markup. To convert it into LaTeX I chose pandoc. TeXML supports inline LaTeX, so the converted content is added to the tree as plain LaTeX.
I decided to use pandoc in this case because it already converts most of the HTML tags nicely (a, strong, em, ...); the only problem I have is how it deals with images.
I use a subprocess to interface with pandoc:
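from subprocess import Popen, PIPE

# my_html_string holds the HTML content of one post
args = ['pandoc', '-f', 'html', '-t', 'latex']
p = Popen(args, stdout=PIPE, stdin=PIPE, stderr=PIPE)
# communicate() returns (stdout, stderr); [0] is pandoc's LaTeX output
tex_result = p.communicate(input=my_html_string.encode('utf-8'))[0]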
A sample post might look like this:
<strong>Lorem ipsum dolor</strong> sit amet, consectetur adipiscing elit.
<a href="http://link_to_source_image.jpg"><img class="alignnone size-medium wp-image-id" title="Title_text" src="http://link_to_scaled_down_version.jpg" alt="Some alt text" width="262" height="300" /></a>
Nam nulla ante, vestibulum a euismod sed, accumsan at magna. Cras non augue risus, vitae gravida quam.
I need images with captions embedded as figures, e.g.:
\begin{figure}
\includegraphics{link_to_image.jpg}
\label{fig:some_label}
\caption{Some alt text}
\end{figure}
pandoc seems to convert HTML img tags into a simple inline image, discarding any title or alt text:
\href{http://link\_to\_source\_image.jpg}{\includegraphics{http://link_to_scaled_down_version.jpg}}
I did peek into the source, and it looks like img is only treated as an inline element (in pandoc's HTML parsing function). I don't know Haskell, so this is as far as I got.
If you convert the HTML into Markdown instead, pandoc keeps the alt and title, and the result is similar to
![Some alt text](http://link_to_scaled_down_version.jpg "Title_text")
With Markdown you can have either inline images or figures in the resulting LaTeX document. If you convert this Markdown into LaTeX, the result is:
\begin{figure}[htbp]
\centering
\includegraphics{http://link_to_scaled_down_version.jpg}
\caption{Some alt text}
\end{figure}
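The figure pandoc emits here has a caption but no \label. A minimal sketch of one way to add one afterwards, assuming the LaTeX output is available as a Python string and that every figure ends with a literal \end{figure}; the function name add_figure_labels and the fig:img prefix are made up for illustration:

```python
import itertools
import re


def add_figure_labels(tex, prefix='fig:img'):
    """Insert a numbered label just before every figure environment ends.

    The label lands after the caption, which is where LaTeX needs it
    in order to pick up the figure number.
    """
    counter = itertools.count(1)

    def repl(match):
        # A function replacement is inserted literally, so the
        # backslashes need no extra escaping for re.sub.
        return '\\label{%s%d}\n\\end{figure}' % (prefix, next(counter))

    return re.sub(r'\\end\{figure\}', repl, tex)
```

Calling add_figure_labels(tex_result) on the decoded pandoc output would number the figures in document order; deriving the label from the alt text or the file name instead is equally possible.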
At first pandoc seemed like a simple solution for parsing the content, but I am a bit stuck: pandoc also doesn't support inline LaTeX in HTML input, so I can't convert all the images to LaTeX myself first and then pass the rest through pandoc.
Does anyone have an idea how to (better) process img tags in HTML so that they end up in a figure environment with a caption in LaTeX?
Pandoc treats paragraphs containing only an image specially, as images with captions. These will be turned into LaTeX figures with captions. Thus:
% pandoc -f html -t latex
<p><img src="myimg.jpg" alt="my text" title="my title"/></p>
^D
\begin{figure}[htbp]
\centering
\includegraphics{myimg.jpg}
\caption{my text}
\end{figure}
This might help you.
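To take advantage of that with WordPress content, where the img usually sits inside a link and is not in a paragraph of its own, the HTML could be rewritten before it is handed to pandoc. The following is only a rough sketch using a regular expression: isolate_images is a made-up helper, it assumes the content has no p tags around the image yet (as in the sample post), it drops the link to the full-size image, and it will not cope with attribute values that contain a > character.

```python
import re

# Either an <img> wrapped directly in a link, or a bare <img>.
IMG_RE = re.compile(
    r'<a\b[^>]*>\s*(<img\b[^>]*>)\s*</a>|(<img\b[^>]*>)',
    re.IGNORECASE,
)


def isolate_images(html):
    """Put every <img> into a paragraph of its own.

    A link wrapped directly around the image is dropped, because a
    paragraph holding a linked image no longer contains only an image.
    """
    def repl(match):
        img = match.group(1) or match.group(2)
        return '\n<p>' + img + '</p>\n'

    return IMG_RE.sub(repl, html)
```

The result of isolate_images(my_html_string) would then be passed to pandoc in place of the original string. For anything less regular than the sample post, an HTML parser such as lxml.html is probably the safer tool for this rewrite.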