meaning of (\/?) in regex / is (\w+)([^>]*?) a redundancy?

Tags: javascript, regex

This regular expression should match an HTML start tag, I think.

var results = html.match(/<(\/?)(\w+)([^>]*?)>/);

I see it should first match the <, but then I am confused about what the capture (\/?) accomplishes. Am I correct in reasoning that ([^>]*?)> matches any character except > zero or more times? If so, why is the (\w+) capture necessary? Doesn't it fall within the purview of [^>]*?

476

asked Jul 03 '13 16:07

1252748

2 Answers

Take it token by token:

/ begin regex literal
< match a literal <
(\/?) match 0 or 1 (?) literal /, which is escaped by the \
(\w+) match one or more "word characters"
([^>]*?) lazily* match zero or more (*?) of anything that is not a >
> match a literal >
/ end regex literal

lazily* - adding "?" after a repetition quantifier makes it lazy, meaning the regex matches the preceding token as few times as the rest of the pattern allows. See the documentation.
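
To make the lazy/greedy distinction concrete, here is a small sketch. The sample string and the variable names are made up for illustration. It also suggests that with a negated class like [^>] the "?" makes no difference to the result, because the class can never consume the closing > anyway.

```javascript
// Sample input, chosen only for illustration.
var html = '<b>bold</b>';

// With a generic wildcard, greedy and lazy give different matches.
var greedy = html.match(/<.*>/)[0];   // runs on to the last ">"
var lazy = html.match(/<.*?>/)[0];    // stops at the first ">"

// With [^>] the class itself cannot cross a ">", so both forms agree.
var lazyClass = html.match(/<(\/?)(\w+)([^>]*?)>/)[0];
var greedyClass = html.match(/<(\/?)(\w+)([^>]*)>/)[0];

console.log(greedy);      // <b>bold</b>
console.log(lazy);        // <b>
console.log(lazyClass);   // <b>
console.log(greedyClass); // <b>
```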

So essentially this regular expression will match "<", optionally followed by a "/", followed by one or more letters, digits, or underscores, followed by zero or more characters that are not ">", and finally followed by a ">".

That being said, the token (\w+) is not redundant: it ensures there is at least one word character directly after the < (or </), and it captures the tag name separately from the rest of the tag.
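
A quick way to check this is to compare the original pattern with a variant that drops (\w+). The variant (withoutName) and the sample strings are hypothetical, written only for this comparison.

```javascript
// Original pattern from the question.
var withName = /<(\/?)(\w+)([^>]*?)>/;
// Hypothetical variant with the (\w+) group removed.
var withoutName = /<(\/?)([^>]*?)>/;

// Not a tag at all: just "<" and ">" used as comparison operators.
var text = '1 < 2 and 3 > 1';
var strictHit = withName.test(text);    // false: no word character right after "<"
var looseHit = withoutName.test(text);  // true: "< 2 and 3 >" is accepted

// On a real tag, only the original pattern separates the name from the rest.
var strict = '<div class="x">'.match(withName);
var loose = '<div class="x">'.match(withoutName);

console.log(strictHit, looseHit);
console.log(strict[2], '|', strict[3]); // div |  class="x"
console.log(loose[2]);                  // div class="x"
```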

Please be aware that attempting to parse HTML with regular expressions is generally a bad idea.

144

answered Oct 19 '22 23:10

jbabey

Using the power of debuggex to generate you an image :)

<(\/?)(\w+)([^>]*?)>

It will be evaluated like this:

[Image: Debuggex diagram of the regular expression]

As you can see, it matches HTML-tags (opening and closing tags). The regex contains three capture groups, capturing the following:

(\/?) existence of / (it's a closing tag, if present)
(\w+) name of the tag
([^>]*?) everything else until the tag closes (e.g. attributes)
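
The three groups can be inspected directly in JavaScript. This is a minimal sketch; describeTag and its property names are hypothetical helpers, not part of any library.

```javascript
var tagPattern = /<(\/?)(\w+)([^>]*?)>/;

// Hypothetical helper: turns the three capture groups into a small object.
function describeTag(html) {
  var m = html.match(tagPattern);
  if (!m) {
    return null; // no tag-like substring found
  }
  return {
    isClosing: m[1] === '/', // group 1: optional slash
    name: m[2],              // group 2: tag name
    rest: m[3]               // group 3: everything up to the first ">"
  };
}

var opening = describeTag('<a href="#">');
var closing = describeTag('</div>');

console.log(opening); // { isClosing: false, name: 'a', rest: ' href="#"' }
console.log(closing); // { isClosing: true, name: 'div', rest: '' }
```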

This way it matches <a href="#">. Interestingly, it does not match <a data-fun="fun>nofun"> correctly: it stops at the > within the data-fun attribute, although (I think) > is valid inside a quoted attribute value.
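
The early stop is easy to reproduce. The second pattern below (quoteAware) is only a hypothetical sketch of how quoted values could be skipped as a unit; it is an assumption for illustration, not a complete fix, and it still is not an HTML parser.

```javascript
var tag = '<a data-fun="fun>nofun">';

// Original pattern: the match ends at the ">" inside the attribute value.
var original = tag.match(/<(\/?)(\w+)([^>]*?)>/);

// Hypothetical quote-aware variant: a quoted string is consumed whole,
// so a ">" inside quotes no longer ends the tag.
var quoteAware = tag.match(/<(\/?)(\w+)((?:"[^"]*"|'[^']*'|[^>"'])*)>/);

console.log(original[0]);   // <a data-fun="fun>
console.log(original[3]);   //  data-fun="fun
console.log(quoteAware[0]); // <a data-fun="fun>nofun">
```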

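Another funny thing is that the tag-name capture does not capture all theoretically valid XHTML tag names. XHTML allows the name characters Letter | Digit | '.' | '-' | '_' | ':' | .. (source: XHTML spec). (\w+), however, does not match ., -, and :. A tag such as <foo-bar> still matches, but the name group captures only foo, and -bar ends up in the third group. An imaginary <.foobar> tag will not be matched by this regex at all (though, as far as I can tell, a name may not start with . in the first place). This should not have any real-life impact, though.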
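
Here is a short check of how the name group behaves when a tag name contains characters outside \w. The helper nameAndRest and the sample tags are invented for illustration.

```javascript
var tagPattern = /<(\/?)(\w+)([^>]*?)>/;

// Hypothetical helper: returns [name, rest] or null when nothing matches.
function nameAndRest(tag) {
  var m = tag.match(tagPattern);
  return m ? [m[2], m[3]] : null;
}

var plain = nameAndRest('<table>');       // whole name captured
var hyphen = nameAndRest('<my-element>'); // name is cut at the "-"
var colon = nameAndRest('<svg:rect>');    // name is cut at the ":"
var dot = nameAndRest('<.foobar>');       // no word character after "<"

console.log(plain);  // [ 'table', '' ]
console.log(hyphen); // [ 'my', '-element' ]
console.log(colon);  // [ 'svg', ':rect' ]
console.log(dot);    // null
```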

You see that parsing HTML using RegExes is a risky thing. You might be better off with an HTML parser.

answered Oct 20 '22 00:10

tessi
