Ahh, the HREF. This is probably one of the most searched-for regexes on Stack Overflow (link, href, and anchor searches add up to more than 18,000 results).

Okay, so here's the regex:

<a\s+href[^>]+>

There are two regex-specific pieces, \s+ and [^>]+, and the rest is just plain text to be matched exactly as written.

The <a says to look for code that has a less-than sign followed immediately by the letter a.

The next section, \s+, says to look for whitespace. The \s stands for any whitespace character (a space, a tab, or a newline); you have to escape the s with a backslash, otherwise the expression will look for the literal s character. The plus sign + tells the expression to look for 1 or more of them.

Then we are back to literal characters: href.

To summarize thus far: look for <a, followed by one or more whitespace characters, followed by href.
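Here is a minimal sketch of that first half of the expression, using Python's re module (the sample strings are made up for illustration). It shows that \s+ accepts any run of whitespace but refuses to match when there is none at all.

```python
import re

# First half of the expression: "<a", one or more whitespace characters, "href".
prefix = re.compile(r"<a\s+href")

# Hypothetical inputs: single space, several spaces, tab, newline, no whitespace.
samples = ["<a href", "<a   href", "<a\thref", "<a\nhref", "<ahref"]

# True where the prefix is found, False where it is not.
results = [bool(prefix.search(s)) for s in samples]
print(results)
```

Only the last sample fails, because the + demands at least one whitespace character between <a and href.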

The next section is a regex character class, which is denoted by the square brackets [], in this case [^>]. The caret ^ right after the opening bracket negates the class, so it matches any character that is NOT listed. Okay, so this class is going to look for anything (spaces, single or double quotes, question marks, letters, numbers, whatever), just NOT the greater-than sign (>), which denotes the closing of the anchor tag. The plus sign + after the square brackets says to, that's right, look for 1 or more characters from that character class.
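To see the negated set on its own, here is a small Python sketch (the tag string is an invented example). It runs [^>]+ from the start of a tag and shows that it swallows everything up to, but not including, the first greater-than sign.

```python
import re

# [^>]+ : one or more characters, each of which is anything except ">".
not_gt = re.compile(r"[^>]+")

# Hypothetical tag with quotes, a question mark, and an equals sign inside.
tag = '<a href="page.html?x=1" class=\'nav\'>Home</a>'

# match() starts at the beginning and stops just before the first ">".
first_run = not_gt.match(tag).group()
print(first_run)
```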

Finally, look for a greater-than sign.
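Putting the pieces together, a quick way to try the whole expression is something like the following Python sketch. The HTML snippet and the ANCHOR_TAG name are made up for illustration.

```python
import re

# The full expression: "<a", whitespace, "href", anything but ">", then ">".
ANCHOR_TAG = re.compile(r"<a\s+href[^>]+>")

# Hypothetical HTML: two links with href, one anchor without it.
html = (
    '<p>See <a href="http://example.com">this</a> and '
    '<a  href=\'/about\' title="About">that</a>. '
    '<a name="top">No href here</a></p>'
)

# findall() returns every opening anchor tag the expression matches.
found = ANCHOR_TAG.findall(html)
for tag in found:
    print(tag)
```

Note that the expression expects href to come right after the whitespace that follows <a, so the third anchor, which starts with name, is skipped.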


It's easy enough to switch out the a with img and the href with src, and you are looking for an image tag instead: <img\s+src[^>]+>.
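As a rough sketch of the image-tag variant in Python (the IMAGE_TAG name and the HTML string are assumptions for the example), swapping both the tag name and the attribute name looks like this:

```python
import re

# Same shape as the anchor expression, with "img" and "src" swapped in.
IMAGE_TAG = re.compile(r"<img\s+src[^>]+>")

# Hypothetical HTML containing one image and one link.
html = '<img src="logo.png" alt="Logo"><a href="/home">Home</a>'

# Only the image tag is picked up; the anchor tag is ignored.
images = IMAGE_TAG.findall(html)
print(images)
```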

Capitalization, you wonder? I know that FrontPage 97 likes to capitalize its tags: <A HREF="http://microserfs.com">. Well, there are two ways. First, you can throw in the case-insensitivity flag i at the end, which tells the regex engine to ignore case for the whole expression: /<a\s+href[^>]+>/i. Or, this works too: <[aA]\s+href[^>]+> (but just for the leading a, which denotes the anchor tag; a capitalized HREF would need the same treatment, letter by letter: [hH][rR][eE][fF]).
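Here is a short Python comparison of the two approaches, as a sketch. In Python the case-insensitivity flag is spelled re.IGNORECASE and is passed as an argument rather than tacked onto the end of the expression; the variable names are made up for the example.

```python
import re

# The FrontPage-style tag, capitalized throughout.
html = '<A HREF="http://microserfs.com">Microserfs</A>'

# Without any flag the lowercase expression finds nothing.
case_sensitive = re.findall(r"<a\s+href[^>]+>", html)

# With the flag, case is ignored for the whole expression.
with_flag = re.findall(r"<a\s+href[^>]+>", html, re.IGNORECASE)

# The [aA] class only covers the tag name, so an uppercase HREF still fails...
class_only = re.findall(r"<[aA]\s+href[^>]+>", html)

# ...while an uppercase A with a lowercase href is matched.
mixed = re.findall(r"<[aA]\s+href[^>]+>", '<A href="x">')

print(case_sensitive, with_flag, class_only, mixed)
```

The flag is the simpler choice when you don't control how the markup was capitalized.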