
How can I convert raw HTML to Markdown, server-side using .Net (C#)?

Tags: c#, .net, markdown

I need to take a chunk of raw HTML from a third party, which may contain any number of tags and attributes and potentially dirty or harmful code, strip it right back, and transform it into clean, safe Markdown.

A 'Markdownifier', if you will, much like heckyesmarkdown.com, but running inside my server-side .NET (C#) application rather than on the client. I am happy to use a third-party library (free or paid), but not a third-party hosted REST API or similar, for performance, security and reliability reasons.

There are many libraries for .NET that convert Markdown to HTML. I need the reverse, and I can't seem to find a .NET tool that has already solved this problem (unless I'm being a bit dim and looking in the wrong place!).

TimS asked Oct 26 '25 06:10


2 Answers

I have found this library on GitHub:

https://github.com/baynezy/Html2Markdown

Looks promising for your problem! I have not tried it myself yet though.

There is also a NuGet package:

Install-Package Html2Markdown

Usage is as follows (where html is a string):

var markdown = new Converter().Convert(html);
JDTLH9 answered Oct 27 '25 19:10


You can try Pandoc (http://pandoc.org/). On Windows it is a command-line tool, but it works pretty well. This is how I have interfaced with it before:

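// Requires: using System; using System.Diagnostics; using System.IO;

// Adjust to wherever pandoc is installed on your server.
private const string processName = @"c:\program files (x86)\pandoc\pandoc.exe";
// {0} = output file, {1} = input file.
// pandoc's HTML reader is named "html"; "html5" is listed only as an output format.
private const string args = @"-f html -t markdown -o ""{0}"" ""{1}""";

public void Convert(Stream inputStream, Stream outputStream)
{
    var inputFilename = Path.GetTempFileName();
    var outputFilename = Path.GetTempFileName();

    try
    {
        // Copy the incoming HTML to a temp file that pandoc can read.
        using (var fileStream = File.Create(inputFilename))
        {
            inputStream.CopyTo(fileStream);
        }

        // pandoc reads and writes files here, so no stream redirection is needed.
        var psi = new ProcessStartInfo(processName, string.Format(args, outputFilename, inputFilename))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(psi))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("pandoc exited with code " + process.ExitCode);
        }

        var bytes = File.ReadAllBytes(outputFilename);
        outputStream.Write(bytes, 0, bytes.Length);
    }
    finally
    {
        // Always clean up the temp files, even if pandoc fails.
        File.Delete(inputFilename);
        File.Delete(outputFilename);
    }
}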

EDIT

It should probably be noted that I have not used it to convert to Markdown before, but I have used it to convert other formats to and from HTML. It does a fairly reasonable job, and it doesn't just blow up when it can't handle something, as other tools do. The arguments I used come from http://pandoc.org/README.html, in particular this example:

pandoc -f html -t markdown http://www.fsf.org
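
The same -f html -t markdown arguments can also be used without temporary files, because pandoc reads from standard input when no input file is given and writes to standard output when -o is omitted. The following is only a rough sketch of that idea, not something taken from the answer above: the class name PandocPipe and the helpers BuildStartInfo and HtmlToMarkdown are made up for illustration, and it assumes a pandoc executable exists at whatever path you pass in.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class PandocPipe
{
    // Hypothetical helper: start info for a pandoc run that reads HTML
    // from stdin and writes Markdown to stdout (no input file, no -o).
    public static ProcessStartInfo BuildStartInfo(string pandocPath)
    {
        return new ProcessStartInfo(pandocPath, "-f html -t markdown")
        {
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            StandardOutputEncoding = Encoding.UTF8,
            UseShellExecute = false,
            CreateNoWindow = true
        };
    }

    // Hypothetical helper: converts an HTML string to Markdown via pandoc.
    public static string HtmlToMarkdown(string html, string pandocPath)
    {
        using (var process = Process.Start(BuildStartInfo(pandocPath)))
        {
            // Begin reading stdout before writing stdin so that neither
            // pipe can fill up and block the other side.
            var readTask = process.StandardOutput.ReadToEndAsync();

            // Write raw UTF-8 bytes (no BOM); pandoc expects UTF-8 input.
            var bytes = new UTF8Encoding(false).GetBytes(html);
            process.StandardInput.BaseStream.Write(bytes, 0, bytes.Length);
            process.StandardInput.BaseStream.Flush();
            process.StandardInput.Close(); // signals end of input to pandoc

            string markdown = readTask.Result;
            process.WaitForExit();

            if (process.ExitCode != 0)
                throw new InvalidOperationException("pandoc exited with code " + process.ExitCode);

            return markdown;
        }
    }
}
```

It would be called as PandocPipe.HtmlToMarkdown(html, pathToPandoc). Note that this only converts the markup; whether the result is safe to render still depends on how you treat raw HTML that pandoc may pass through.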
Michael Coxon answered Oct 27 '25 20:10


