I am looking for a way to convert HTML-formatted text to plain text while keeping its basic structure, possibly with slight tweaks. For example:
<p>This is a paragraph.</p>
<ol>
<li>List item 1.</li>
<li>List item 2.</li>
</ol>
<p>This is an <a href="www.google.com">anchor</a>.</p>
Becomes:
This is a paragraph.
- List item 1.
- List item 2.
This is an anchor (www.google.com).
Any ideas on how to achieve this effectively for a very large number of HTML-formatted templates?
Use a text-based browser, such as lynx, and have it dump the rendered page to stdout. I'm not sure it will suit all your tweaking needs, but it's a very quick and easy start:
lynx -crawl -dump http://stackoverflow.com/questions/13279364/convert-html-to-plain-text-and-keep-basic-formatting
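Since you mention a very large number of templates, the call can be scripted. Below is a rough sketch, not a finished tool: it assumes lynx is installed and on your PATH, and that the templates are *.html files in one directory. The function name convert_templates and the directory layout are made up for illustration. It feeds each file to lynx on standard input (lynx -stdin -dump) and writes whatever lynx prints to a .txt file of the same name.

```python
import subprocess
from pathlib import Path


def convert_templates(src_dir, dst_dir, command=("lynx", "-stdin", "-dump")):
    """Pipe every *.html file in src_dir through `command`, saving output as .txt.

    `command` defaults to lynx reading HTML from stdin (an assumption about
    your setup); any converter that reads stdin and writes stdout would do.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for html_file in sorted(src.glob("*.html")):
        result = subprocess.run(
            list(command),
            input=html_file.read_bytes(),  # HTML goes in on stdin
            stdout=subprocess.PIPE,        # plain text comes back on stdout
            check=True,                    # raise if the converter exits non-zero
        )
        out_file = dst / (html_file.stem + ".txt")
        out_file.write_bytes(result.stdout)
        written.append(out_file)
    return written
```

Usage would be something like convert_templates("templates", "plain"). Working with bytes on both sides sidesteps guessing the templates' encoding in Python, but lynx may still need a charset hint for non-ASCII input.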
(actually, I would expect your list to be
1. List item 1.
2. List item 2.
since it's an ordered list)
Edit: I looked more closely at your actual use case, and lynx handles it well:
> echo '<p>This is a paragraph.</p>
<ol>
<li>List item 1.</li>
<li>List item 2.</li>
</ol>
<p>This is an <a href="http://www.google.com">anchor</a>.</p>' | lynx -stdin -dump
becomes
This is a paragraph.
1. List item 1.
2. List item 2.
This is an [1]anchor.
References
1. http://www.google.com/
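If the numbered references list is not what you want and you would rather have the URL inline, as in your example, a small custom converter may be easier to tweak than lynx. Here is a minimal sketch using only Python's standard html.parser module. It is not what lynx does internally, and it only understands p, br, ol, ul, li and a; the names TextRenderer and html_to_text are made up for this example.

```python
from html.parser import HTMLParser


class TextRenderer(HTMLParser):
    """Render a small subset of HTML (p, br, ol, ul, li, a) as plain text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []       # finished output lines
        self.current = []     # text fragments of the line being built
        self.list_stack = []  # one [tag, item_counter] entry per open list
        self.hrefs = []       # href of each open <a> (None if it has none)

    def flush(self):
        # Collapse runs of whitespace and emit the pending line, if any.
        text = " ".join("".join(self.current).split())
        if text:
            self.lines.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "br"):
            self.flush()
        elif tag in ("ol", "ul"):
            self.flush()
            self.list_stack.append([tag, 0])
        elif tag == "li":
            self.flush()
            if self.list_stack and self.list_stack[-1][0] == "ol":
                # Ordered list: number the items.
                self.list_stack[-1][1] += 1
                self.current.append(f"{self.list_stack[-1][1]}. ")
            else:
                # Unordered list (or stray <li>): use a dash.
                self.current.append("- ")
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag in ("p", "li"):
            self.flush()
        elif tag in ("ol", "ul"):
            self.flush()
            if self.list_stack:
                self.list_stack.pop()
        elif tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if href:
                # Put the link target inline, right after the link text.
                self.current.append(f" ({href})")

    def handle_data(self, data):
        self.current.append(data)


def html_to_text(html):
    renderer = TextRenderer()
    renderer.feed(html)
    renderer.close()
    renderer.flush()  # emit any text left outside a closed block
    return "\n".join(renderer.lines)
```

Calling html_to_text on the HTML from the question gives the four lines you asked for, except that the list items are numbered because the list is an ol. If you really want dashes there too, drop the "ol" branch in handle_starttag.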