
I need HTML that will fail parsing by html.Parse()

Tags: html, go

I am writing a Go function to read an HTML response body and extract the page title. Overall, the function works just great, but I want to test the code path where the response body isn't proper HTML at all. My simplistic attempts to create some invalid HTML for unit tests have come to naught.

Apparently, and according to the html.Parse documentation, this is because:

the HTML5 parsing algorithm […] is very complicated. The resultant tree can contain implicitly created nodes that have no explicit <tag> listed in r's data, and nodes' parents can differ from the nesting implied by a naive processing of start and end <tag>s. Conversely, explicit <tag>s in r's data can be silently dropped, with no corresponding node in the resulting tree.

Here is some code demonstrating the sort of approach I've been taking:

https://play.golang.org/p/T5WjdtjNcqq

package main

import (
    "bytes"
    "fmt"
    "golang.org/x/net/html"
)

func main() {
    inputs := []string{
        "",
        "~",
        "<",
        "<ht",
        "<html",
        "<html>",
        "<html><",
        "<html><titl",
        "<html><title",
        "<html><title>",
        "<html><title>The C Progr",
        "<html><title>The C Programming Language",
        "<html><title>The C Programming Language<",
        "<html><title>The C Programming Language</",
        "<html><title>The C Programming Language</ti",
        "<html><title>The C Programming Language</title",
        "<html><title>The C Programming Language</title>",
        "<html><title>The C Programming Language</title><",
        "<html><title>The C Programming Language</title></",
        "<html><title>The C Programming Language</title></ht",
        "<html><title>The C Programming Language</title></html",
        "<html><title>The C Programming Language</title></html>",
    }

    for _, in := range inputs {
        fmt.Printf("%s\n", in)

        r := bytes.NewReader([]byte(in))
        _, err := html.Parse(r)
        if err != nil {
            fmt.Printf("COULD NOT PARSE HTML\n")
            panic(err)
        }
    }
}

Silly me, I would have expected many of these to yield an error, since at face value they are invalid HTML. But the above code sails through all of the input strings without panicking; that is, html.Parse() never returns a non-nil err.

I suppose I am grateful for a lenient / tolerant HTML parser, but: Does anyone have an example of text that would yield an error when fed to Go's html.Parse()?

EDIT 1

Combining ideas from comments by Ferrybig and CreationTribe, I even tried a huge stream of random bytes:

    rand.Seed(time.Now().UnixNano())

    in := make([]byte, 0)
    for i := 0; i < 2147483647; i++ {
        in = append(in, byte(rand.Intn(256))) // Intn(256) covers every byte value, 0 through 255
    }
    fmt.Printf("len(in) : %d\n", len(in))

    r := bytes.NewReader(in)
    _, err := html.Parse(r)

… and it still did not error.

Is there no input that will cause html.Parse() to error out?

landru27 asked Oct 21 '25 06:10


1 Answer

From a quick read of https://github.com/golang/net/blob/master/html/token.go, it seems that the only returned errors can be:

  • io.EOF once r is fully read successfully (html.Parse treats this as the normal end of input and returns a nil error);
  • any other errors returned by the underlying io.Reader; or
  • html.ErrBufferExceeded

It's not obvious to me after an initial read how to trigger ErrBufferExceeded, but you could trigger an error from html.Parse by providing a dummy reader whose Read always fails:

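// ErrReader is an io.Reader whose Read always fails with Error.
type ErrReader struct{ Error error }

func (e *ErrReader) Read([]byte) (int, error) {
    // Read must return an int byte count, so report 0 bytes read alongside the error.
    return 0, e.Error
}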

https://play.golang.org/p/s78HpfMLAI8

Hope that helps

icio answered Oct 23 '25 21:10


