
What does the regex [^\s]*? mean?

Tags:

python

regex

I am starting to learn how to write a web spider in Python to download some pictures from the web, and I found the code below. I know some basic regex: I know that \.jpg means .jpg and that | means "or". What is the meaning of [^\s]*? in the first line? Why is \s used there? And what is the difference between the two regexes?

http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
LiuHao asked Oct 22 '25 23:10


1 Answer

Alright, so to answer your first question, I'll break down [^\s]*?.

  • The square brackets ([]) indicate a character class. A character class matches exactly one character, at that position, out of the characters listed in it. For example, [abc] matches a single a, b, or c. In this case, your character class is negated using the caret (^) at the beginning. This inverts its meaning, making it match any one character except the ones listed.

  • \s is fairly simple: it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, newlines, and carriage returns.

  • *? is a little harder to explain. The * quantifier is fairly simple: it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy: it matches as few characters as it can, and only takes one more character at a time when the rest of the pattern fails to match.

In this case, what the whole pattern snippet [^\s]*? means is "match the shortest run of non-whitespace characters, possibly empty, that still lets the rest of the pattern match". As mentioned in the comments, this can more succinctly be written as \S*?.
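The pieces described above can be checked with a small sketch using Python's standard re module. The sample strings are made up for illustration and are not from the original code.

```python
import re

# [abc] matches exactly one character out of a, b, c.
one_of = re.findall(r"[abc]", "cabbage")

# [^\s] matches exactly one character that is NOT whitespace (same as \S).
non_space = re.findall(r"[^\s]", "a b\tc\n")

# Greedy * grabs as much as it can; lazy *? grabs as little as it can
# while still letting the rest of the pattern (\.jpg here) match.
text = "x.jpg.jpg"
greedy = re.match(r"[^\s]*\.jpg", text).group()
lazy = re.match(r"[^\s]*?\.jpg", text).group()

# With nothing after it, the lazy class is happy to match the empty string.
alone = re.match(r"[^\s]*?", "abc").group()

print(one_of)     # ['c', 'a', 'b', 'b', 'a']
print(non_space)  # ['a', 'b', 'c']
print(greedy)     # x.jpg.jpg
print(lazy)       # x.jpg
print(repr(alone))  # ''
```

Swapping [^\s] for \S in any of these patterns should give the same results.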

To answer the second part of your question, I'll compare the two regexes you give:

http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)

They both start the same way: by matching the literal protocol text http and the colon (:) that follows it. The first then matches any run of non-whitespace characters that ends with one of the specified file extensions. The second, meanwhile, requires two literal slash characters (/) and then matches any sequence of characters (by default, anything except a newline, so spaces are allowed) followed by one of the extensions.
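As a rough illustration of that difference, the sketch below runs both patterns over a made-up sentence in which the URL itself is not an image. The sample text and the helper name full_matches are assumptions for this example only.

```python
import re

first = re.compile(r"http:[^\s]*?(\.jpg|\.png|\.gif)")
second = re.compile(r"http://.*?(\.jpg|\.png|\.gif)")

# Hypothetical input: a non-image URL followed later by an image file name.
sample = "http://example.com/page then cat.png"

def full_matches(pattern, text):
    # Use group(0) for the whole match; re.findall would return only
    # the captured extension because the pattern contains a group.
    return [m.group(0) for m in pattern.finditer(text)]

first_hits = full_matches(first, sample)
second_hits = full_matches(second, sample)

print(first_hits)   # []
print(second_hits)  # ['http://example.com/page then cat.png']
```

The first pattern stops at the space and finds nothing, while the second runs across the spaces until it reaches an extension.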

Now, it's obvious that both patterns are meant to match a URL, but both are too permissive. The first pattern, for instance, will match strings like

http:foo.bar.png
http:.png

Neither of these is a valid URL. Likewise, the second pattern will permit spaces, allowing stuff like this:

http:// .jpg
http://foo bar.png

Spaces are not allowed in URLs, so these are just as invalid. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:

https?://\S+\.(jpe?g|png|gif)

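In this case, it'll match URLs starting with either http or https, and files ending in either .jpg or .jpeg (as well as .png and .gif). The two slashes are required, and \S+ demands at least one non-whitespace character before the extension.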
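To see how the suggested pattern treats the problem strings from earlier, here is a minimal sketch using re.fullmatch. The two example.com URLs are invented for the test; the other four strings are the invalid examples given above.

```python
import re

better = re.compile(r"https?://\S+\.(jpe?g|png|gif)")

candidates = [
    "http://example.com/cat.jpg",    # hypothetical valid URL
    "https://example.com/dog.jpeg",  # hypothetical valid URL
    "http:foo.bar.png",              # no slashes
    "http:.png",                     # no slashes, no name
    "http:// .jpg",                  # space after the slashes
    "http://foo bar.png",            # space inside the URL
]

# fullmatch requires the entire string to fit the pattern.
results = {s: bool(better.fullmatch(s)) for s in candidates}

for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

Only the first two candidates are accepted. One thing to keep in mind is that \S+ is greedy, so when searching inside longer text, two image URLs with no whitespace between them would be captured as a single match.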

Sebastian Lenartowicz answered Oct 25 '25 13:10


