Stack Overflow: RegEx match open tags except XHTML self-contained tags
[+2247] [37] Jeff
[2009-11-13 22:38:26]
[ html regex xhtml ]
[ https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ]

I need to match all of these opening tags:

<p>
<a href="foo">

But not self-closing tags:

<br />
<hr class="foo" />

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

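I believe it says:

  * Find a less-than, then
  * Find (and capture) a-z one or more times, then
  * Find zero or more spaces, then
  * Find any character except /, zero or more times (non-greedy), then
  * Find a greater-than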

Do I have that right? And more importantly, what do you think?
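
One way to check a pattern like this is to run it against the sample tags. The sketch below does that in JavaScript. The pattern is anchored with ^ and $ so each sample is tested as a whole, and the last sample (a URL containing a slash) is an assumption added for illustration, not one of the tags from the question.

```javascript
// The pattern from the question, anchored so each sample is tested as a whole.
const openTag = /^<([a-z]+) *[^/]*?>$/;

// The four tags from the question plus one hypothetical extra case.
const samples = [
  '<p>',
  '<a href="foo">',
  '<br />',
  '<hr class="foo" />',
  '<a href="http://example.com/">', // assumption: an attribute value containing "/"
];

// Map each sample to the captured tag name, or null when the pattern rejects it.
const results = samples.map((tag) => {
  const m = openTag.exec(tag);
  return m ? m[1] : null;
});

console.log(results);
```

The two opening tags are captured and the two self-closing tags are rejected, but so is the last sample: [^/] excludes every slash, including those inside attribute values.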

[+4406] [2009-11-13 23:04:30] bobince [ACCEPTED]

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ


Have you tried using an XML parser instead?


Moderator's Note

This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.


(179) Kobi: I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death. - bobince
1
[+3571] [2009-11-14 06:27:19] Kaitlin Duck Sherwood

While parsing arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use regexes for parsing a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.
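
As a rough illustration of the "limited, known set" case, here is a sketch in JavaScript. The table markup, the class names and the member data are all invented for the example; the real pages are not reproduced here.

```javascript
// Hypothetical, regularly formatted markup; every name and value here is made up.
const page = `
<tr><td class="name">Jane Citizen</td><td class="party">Example Party</td><td class="district">Sampleton</td></tr>
<tr><td class="name">John Doe</td><td class="party">Other Party</td><td class="district">Testville</td></tr>
`;

// One pattern per row; \s* tolerates whitespace between cells.
const row = /<td class="name">([^<]*)<\/td>\s*<td class="party">([^<]*)<\/td>\s*<td class="district">([^<]*)<\/td>/g;

// Collect one record per matched row.
const members = [];
for (const m of page.matchAll(row)) {
  members.push({ name: m[1].trim(), party: m[2].trim(), district: m[3].trim() });
}

console.log(members);
```

This only holds while the template stays fixed: a row whose markup deviates from the pattern is silently skipped rather than reported.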


(161) Also, scraping fairly regularly formatted data from large documents is going to be WAY faster with judicious use of scan & regex than any generic parser. And if you are comfortable with coding regexes, way faster to code than coding xpaths. And almost certainly less fragile to changes in what you are scraping. So bleh. - Michael Johnston
(316) @MichaelJohnston "Less fragile"? Almost certainly not. Regexes care about text-formatting details that an XML parser can silently ignore. Switching between &foo; encodings and CDATA sections? Using an HTML minifier to remove all whitespace in your document that the browser doesn't render? An XML parser won't care, and neither will a well-written XPath statement. A regex-based "parser", on the other hand... - Charles Duffy
(46) @CharlesDuffy for a one-time job it's OK, and for spaces we use \s+ - quantum
(80) @xiaomao indeed, if having to know all the gotchas and workarounds to get an 80% solution that fails the rest of the time "works for you", I can't stop you. Meanwhile, I'm over on my side of the fence using parsers that work on 100% of syntactically valid XML. - Charles Duffy
(445) I once had to pull some data off ~10k pages, all with the same HTML template. They were littered with HTML errors that caused parsers to choke, and all their styling was inline or with <font> etc.: no classes or IDs to help navigate the DOM. After fighting all day with the "right" approach, I finally switched to a regex solution and had it working in an hour. - Paul A Jungwirth
(44) @CharlesDuffy: definitely less fragile. When the third-party changes their html, they are much more likely to change the structure (breaking your xpaths) than the leaf nodes you are scraping. Scraping is not parsing. Scraping is pulling specific bits of data from a puddle of designer contaminated crap you don't care about. You DO NOT WANT to parse that puddle. You want to do only the absolute minimum amount of "parsing" that will get you your data. You don't CARE about the structure. You care about your bits of data. - Michael Johnston
2
[+2332] [2009-11-18 18:42:40] NealB

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) [1] and a regular expression is a Chomsky Type 3 grammar (regular grammar) [2]. Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy [3]), you can't possibly make this work.

But many will try, and some will even claim success, but only until others find the fault and totally mess you up.

[1] http://en.wikipedia.org/wiki/Context-free_grammar
[2] http://en.wikipedia.org/wiki/Regular_grammar
[3] http://en.wikipedia.org/wiki/Chomsky_hierarchy
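
A small JavaScript sketch of what goes wrong with nesting. The <div> snippets are made-up inputs; the point is only that a pattern with no notion of depth pairs the wrong tags.

```javascript
// Made-up inputs: one nested element, and two siblings.
const nested = '<div><div>x</div></div>';
const siblings = '<div>a</div><div>b</div>';

// Lazy version: stops at the first closing tag, so it cuts a nested element short.
const lazy = /<div>.*?<\/div>/;
// Greedy version: runs to the last closing tag, so it merges siblings into one match.
const greedy = /<div>.*<\/div>/;

const lazyOnNested = nested.match(lazy)[0];
const greedyOnSiblings = siblings.match(greedy)[0];

console.log(lazyOnNested);     // the outer <div> paired with the inner </div>
console.log(greedyOnSiblings); // both siblings swallowed as a single "element"
```

Neither quantifier can be right for both inputs, because choosing the matching close tag requires counting how many elements are currently open.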

(278) The OP is asking to parse a very limited subset of XHTML: start tags. What makes (X)HTML a CFG is its potential to have elements between the start and end tags of other elements (as in a grammar rule A -> s A e). (X)HTML does not have this property within a start tag: a start tag cannot contain other start tags. The subset that the OP is trying to parse is not a CFG. - LarsH
(131) In CS theory, regular languages are a strict subset of context-free languages, but regular expression implementations in mainstream programming languages are more powerful. As noulakaz.net/weblog/2007/03/18/… describes, so-called "regular expressions" can check for prime numbers in unary, which is certainly something that a regular expression from CS theory can't accomplish. - Adam Mihalcin
(2) @LarsH, that may be strictly true, but only if you can depend on well-formedness. To be able to reliably parse even start tags one would have to accommodate a number of workarounds for (X)HTML syntax errors implemented in real world user agents, and even then probably cross fingers and pray. - eyelidlessness
(18) @eyelidlessness: the same "only if" applies to all CFGs, does it not? I.e. if the (X)HTML input is not well-formed, not even a full-blown XML parser will work reliably. Maybe if you give examples of the "(X)HTML syntax errors implemented in real world user agents" you're referring to, I'll understand what you're getting at better. - LarsH
(102) @AdamMihalcin is exactly right. Most extant regex engines are more powerful than Chomsky Type 3 grammars (eg non-greedy matching, backrefs). Some regex engines (such as Perl's) are Turing complete. It's true that even those are poor tools for parsing HTML, but this oft-cited argument is not the reason why. - dubiousjim
(5) You are correct, I just want to add this: you can write a finite-state machine that is equivalent to any regular expression (finite-state machines are equivalent in power to Type 3 languages). But to parse HTML you need memory: there is a thing called a pushdown automaton, which is basically a state machine with a stack to store values. That can be used to parse HTML. Unfortunately, it is pretty hard to describe a pushdown automaton in a text language (the way you can represent state machines as regexes), so it is easier to program it yourself. - Hoffmann
(3) I used regex for a long time before learning the theory and when I was first told you can't parse things with regex I said that was nonsense which is kind of true, it depends what you mean. Language is ambiguous. You can't parse HTML using only regex. I was using regex but additionally with functions that provided the higher level elements for parsing such as maintaining a stack of the current nesting. I wouldn't try using regex alone for parsing anything complex even if it had added features to make it possible. - jgmjgm
(7) To say the grammar of language A dictates its parsing capabilities of another language B based on its grammar, is not valid. For example, just because HTML is a Chomsky Type 2 language doesn't mean you could write pure HTML which could parse any Chomsky Type 3 language. HTML itself is not a language with any features that give it the ability to parse other languages. Please don't say "JavaScript", because JavaScript is not parsed by something written in HTML. - AaronLS
(7) However, you are on the right track. Since HTML is a Chomsky Type 2 language, then to parse a Chomsky Type 2 language you generally need a stack capability within the parsing language(to track context). Regex doesn't have stack management capability: not because it is a Chomsky Type 3, but simply because the language was not designed with that capability. With some support for some extended constructs like recursion, it is possible in Regex, but would not be easy. Just as HTML lacks any capabilities to parse other languages, Regex lacks a stack capability needed to parse HTML. - AaronLS
(3) FWIW HTML is not context-free (although the ways in which it is not are not relevant to this problem). For example, you cannot have unique ids in a context-free grammar. - Tgr
(7) RegEx hasn't been limited to regular languages for 30 years now. - Erik Reppen
@LarsH One exception to XHTML start tag finding with regex : If it includes <?php ?> inline dynamic code for string interpolation. - Milind R
@MilindR You're right, that would make it impossible to match with a regexp; but that's also outside the scope of what the OP asked (as I interpret it). - LarsH
@LarsH Fair. Any suggestions how to parse such intermingled HTML+PHP(or any other server script)? Lex+Yacc? Or are there any simpler tools for mortals? - Milind R
@MilindR Sorry, I don't know. I guess you would start with a PHP parser... but I have very little experience with PHP-related tools. - LarsH
Coolest thing I learnt today! - ᐅdevrimbaris
3
[+1177] [2009-11-15 06:37:18] itsadok

Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.

I guess to make it not match self contained tags, you'd either want to use Kobi [1]'s negative look-behind:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

or just combine if and if not.

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...

[1] https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732395#1732395
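
Both patterns can be tried in JavaScript, whose recent engines also accept variable-length look-behind. This is only a sketch: the global flag is added to collect every match, and the test string simply strings together the question's four tags and the malformed tag mentioned above.

```javascript
// The tag pattern from this answer, with the global flag added to collect all matches.
const anyTag = /<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>/g;
// The look-behind variant; variable-length look-behind needs an engine that
// supports it (.NET does, and so do recent JavaScript engines).
const notSelfClosed = /<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!\/\s*)>/g;

// The question's four tags plus the malformed tag mentioned above.
const html = '<p> <a href="foo"> <br /> <hr class="foo" /> <a name="badgenerator"">';

const allTags = html.match(anyTag);
const openOnly = html.match(notSelfClosed);

console.log(allTags);
console.log(openOnly);
```

The first pattern returns all five tags; the second drops <br /> and <hr class="foo" />. Neither one tells opening tags from closing tags, so </p> matches both.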

(123) I would go with something that works on sane things than weep about not being universally perfect :-) - prajeesh kumar
(24) so you do not actually solve the parsing problem with regexp only but as a part of the parser this may work. PS: working product doesn't mean good code. No offence, but this is how industrial programming works and gets their money - mishmashru
(43) Your regex starts to fail on the very shortest possible valid HTML: <!doctype html><title><</title>. A simple '<!doctype html><title><</title>'.match(/<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>/g) returns ["<!doctype html>", "<title>", "<</title>"] while it should return ["<title>", "</title>"]. - user1180790
(3) if we're just trying to match & not match the examples given, /<.([^r>][^>]*)?>/g works :-) // javascript: '<p> <a href="foo"> <br /> <hr class="foo" />'.match(/<.([^r>][^>]*)?>/g) - imma
There is a difference between pattern matching a fragment of html and pattern matching html while ensuring a valid structure is returned. Tokenising HTML is probably easy with regex but then try dealing with: <a></b> - jgmjgm
(2) "Is someone using CDATA inside HTML?" - yes, I do. It takes less bytes if you show HTML source code in <pre> tags. - cweiske
Your regex fails with three false positives in the basic set (first group after the blank initial line). The regex I use to detect possible HTML (detection, not parsing!) does not suffer these failures, additionally covers DOCTYPE and encoded entities. - amcgregor
(1) This one for Python regex101.com/r/6IbTnI/1 , this one for PCRE regex101.com/r/p0t1H8/1 - user13843220
This partially matches comments, but not the whole thing. - Cody
<42>These shouldn't match either</42>. Credit to LiveOverflow's Generic HTML Sanitizer Bypass Investigation - Cameron Tacklind
It didn't work for me, check this demo, it shows me 3 matches. - Amine KOUIS
4
[+595] [2011-03-08 13:30:46] xanatos

There are people that will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid if they want to use strange words). They are lying.

There are people that will tell you that Regular Expressions shouldn't be recursive. They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance.

You can live in their reality or take the red pill.

Like Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the Underverse Stack Based Regex-Verse and returned with powers knowledge you can't imagine. Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult.

I think the XML case is quite simple. The RegEx (in the .NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this:

7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28
995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F
86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169
OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq
i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv
p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf
LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e
Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7
O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm
rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv
z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme
nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e
vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y
gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs
mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH
W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52
MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU
1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn
xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ
GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY
12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37
R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn
3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25
D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP
mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS
mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX
X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8
DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c
etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3
zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS
ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ
j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX
/ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d
mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u
v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj
4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq
GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6
mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K
MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z
0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26
7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29
7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9
r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va
j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd
w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa
2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm
AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C
j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8
fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+
+fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx
+r/vD34mUADO1P4/AQAA//8=

The option to set is RegexOptions.ExplicitCapture. The capture group you are looking for is ELEMENTNAME. If the capture group ERROR is not empty then there was a parsing error and the Regex stopped.

If you have problems reconverting it to a human-readable regex, this should help:

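using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Decodes a base64 string, inflates the payload, and returns it as UTF-8 text.
static string FromBase64(string str)
{
    byte[] byteArray = Convert.FromBase64String(str);

    using (var msIn = new MemoryStream(byteArray))
    using (var msOut = new MemoryStream()) {
        // DeflateStream reads raw DEFLATE data (no zlib or gzip header).
        using (var ds = new DeflateStream(msIn, CompressionMode.Decompress)) {
            ds.CopyTo(msOut);
        }

        return Encoding.UTF8.GetString(msOut.ToArray());
    }
}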
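
Outside .NET, the same two steps (base64-decode, then inflate) can be sketched in Node.js. This assumes the payload is raw DEFLATE with no zlib or gzip wrapper, which is what .NET's DeflateStream reads. The fromBase64 function is just a port of the helper above, and the round-trip string is a stand-in, not the regex itself.

```javascript
const zlib = require('zlib');

// Port of FromBase64: base64-decode, inflate raw DEFLATE data, read as UTF-8.
function fromBase64(str) {
  const compressed = Buffer.from(str, 'base64');
  return zlib.inflateRawSync(compressed).toString('utf8');
}

// Round-trip check on a stand-in string (not the regex from this answer).
const sample = 'hello <tag/>';
const encoded = zlib.deflateRawSync(Buffer.from(sample, 'utf8')).toString('base64');
const decoded = fromBase64(encoded);

console.log(decoded === sample);
```

To decode the real blob, the base64 lines would be joined into one string and passed to fromBase64.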

If you are unsure, no, I'm NOT kidding (but perhaps I'm lying). It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests [1]. It's a tokenizer, not a full-blown parser, so it will only split the XML into its component tokens. It won't parse/integrate DTDs.

Oh... if you want the source code of the regex, with some auxiliary methods:

regex to tokenize an xml [2] or the full plain regex [3]

[1] http://www.w3.org/XML/Test/
[2] http://pastebin.com/hzYazFVb
[3] https://topaz.github.io/paste/#XQAAAQD5hQAAAAAAAAAUD8Q6Ijb26igjgaUO/S4VLr/Od1fatGY8ycZ79EV23K5OCMWdbg2gH+s7o5uxCPlMSN1JtgtVM2MKR6CqK1eEDhtb5JZyw5spb/FtqvAc3ed4JkSFjzVZF7RTA0u9sRtmbSyVgOdqUpqnibi1CDqHGXGOzOlBKLxSopincGbR0sbzm+mA3nrgLtwe1kqAj3MWoPyOrU8e7ipjvkI+e0LALD6uam6dq+hXtGQJ8LYSeoUpKjGW3LDV7Oh3mE3OBu9AaQF7PiSsUTC2b/AqI1rEOqBWwwkUevXnMnpPYZ+FlYhJ4zgvOyR3YStbExN6Q8h79n9w8lEqI1rr4B2xDaqTgsFd+rg0Iu3S3aaRhII9wdUaipKiEKuDujWemedqT6P+ohRi9CC/lGr8Kz5+QlErsB/97LiffPcTizNflkF8TnInJba8R0w9nhL70OX9IijnRbrHYLnEK62mliz7JFFmSWu9KqzbyrC+OkAQIi0hdmLzITt7lz8OCUKWocUyBeP3JSgXOGX/P8sw3WF6q6QBu0XmN4EgtHfcBb130ewOQ34MhCEw8q79ycePiduoP7MlbzbG5Iw8202AlrfjFp96dawcaALWOIMDGEaM7X1ZC5RFAfcpHNLu/KxctKOoyhIzYWS+LTMMPBx13L4IYXiDysJuG4acbJiDiKfla4i8Z0QGrPLvF7/1A5ufy7yLck9adE1aXZUD7yxX6qXICx+Ue6Fq+PHDslFeU6Q74LWjj/tu8CGM55EMItBrpz5EcTgeoBxNuA/vrYi/Ybm7hMscw/pYGL9RG5H+ok3OzKrWdjintjxvVV+cGNWsN/LNWC3bGp5OJaArP5OCehsMwcAQMQkNi8cpSX+cP6nRaV5nO/5borKcXufMdw8g1zmgTqul+0qISwn3MNK/Y0Qd+KgBIumvIUQT1HzLpbehbjAkYFg+PBUr4BPDAGiEN+lvtSsn3R3yFMyX0TcYe0a5dSBSMpq4P/ZCRJy+2pFLvtIMYJwph34zhLPJOoFK0LiiT+Vgt4yjHLQwGfzSug2oT5TaUAFwOWY2SeTxb5SfaxTB+DX8B+jhlX2DvEVV/EUWcoEkImMx1v9u+yuIshY69ikFaZfcrcCFPRLu6RVog+sLNgXuk/Q+OnoUuoeok367pwuiw26/byFpSFogS2DIRIG2J3agwqa0XPtcHY2j3H2niOigKaOX1oeansYqIjvGykcysm43IhAR2QEcoPKZOhi1bwSwpP98hpin+dkVJDD8f0w/ipDIMpIDRTv45VQWAzdK4yLqaauZRR76QeiAi618bOSiO0LnUYcbyRsU32v9UJ5LMZjzKo/trYrBgY/F4rZG6X+GSl03MbbQM3CHqo1iNc9voknMrNfmuSb7eGB2sNN/B5l0fk57pspZsJ2EuE1v5NtBjwrS9qMQzehoE7sh5YxbNyj9x44FSZDbV/2PXhAgkVZ63td5m8AfPngjAReF4bTvL/rlIWMCbJL6IQKAt2jH4l4wpfFm0qssBl2vdsfNXPhTzRWbB+UPJmxUBGv8YF0rd4Ol3SpuF8fF368DUP96pt96T8W56LIhPULh6yECYWX83QwMyoEvkcgeEJIEm08InYo7UWKRiQml0BTb+YOcy+V20V+k+YAZM2hEjbTNNnXqCvtmVytw1fA6OESzlpcOWzmFwKqwhRAtRJ+Z/YhQLhC7J1xdbFc3cG9hihArqtMRXCCFLcf24zl5rhtV9NJRZdn56s2qspoMtk8m+vGXaLFKdt3j8O5KEaPCILeUbXLS6gtm+ByiGuIF4GWAWcstCh0IQ5j+0J/+5SRp27y/Q0kvZNhD/HrqNmONDE6h7qaE6fKrhrmCLo8XcM59eiEeJuO/KWSDVbpwaDhrx+DS0ngI5TeWmAliRXYUISI/B+hhjFwawuXlK1FAm0Ohyf6XBo4dwoU/SYOHva8wB2qiP
lVCvRvs7vK9FkWQjzNw0v/sDHy+nd49LiIdJkvBPsYS72H/E7kLt7P7WVJgpENY4AqXXGtZ6/L5lcByXgFxDgZbiWMKf1GCfb5QNLauPHZBjxI45JvZsDlG3sUaHwnRyYLiDE+ly+w53l2GgVX4wpPQ1JPjCIvLJ8fmKy4B5HOC5uJYTfUyjAeKP5aIloVVGESb8SGbXRfcme11BZmPyBvjivWZ8kABDh6aKGZdUZCvMnlbZnwKYUWl1ZSFi5AMlw0nEu9pFy5h/AIE+yRTioJ9VYn7ZC4njk5p7V7g+ynr8xGDRAcwLQPVUuCVCDVDSx1eGfWa6IT9G6aVHA1+SHx+sPvHNmWCMYpYWPY5b6l5DYXlTPqChQBwMxcGQnusdNEsEvQYV4FBJhYjgLMxfjBoLPPvysNmpg+qItxnBaDZgMEFa4I3Ek1e7f412UaMloHzTKuzotNQE3quvOH0/9zORWQ=

(74) not-sure-if-serious.jpg -- hopefully this is brilliant satire - Brad Mace
(8) @bemace You can try it... In a VM... disconnected from the Internet and from your LAN... Using a 10 foot pole to run it! :-) - xanatos
(86) Good Lord, it's massive. My biggest question is why? You realize that all modern languages have XML parsers, right? You can do all that in like 3 lines and be sure it'll work. Furthermore, do you also realize that pure regex is provably unable to do certain things? Unless you've created a hybrid regex/imperative code parser, but it doesn't look like you have. Can you compress random data as well? - Justin Morgan
(155) @Justin I don't need a reason. It could be done (and it wasn't illegal/immoral), so I have done it. There are no limitations to the mind except those we acknowledge (Napoleon Hill)... Modern languages can parse XML? Really? And I thought that THAT was illegal! :-) - xanatos
(6) Well, we're talking about the theoretical limits of the language; it's not like we just haven't figured out how to do it yet. If you use pure regex, there's always going to be some (X)HTML valid code that breaks it. Maybe it's <foo bar="baz > quux"> or maybe it's a certain nesting depth. Not that I didn't find your post funny (Old Ones - hah!) - Justin Morgan
(10) Don't try it! If you do, the entire internet will be compressed to a quantum singularity and sucked down the rabbit hole! - Christian Hayter
(99) Sir, I'm convinced. I'm going to use this code as part of the kernel for my perpetual-motion machine--can you believe those fools at the patent office keep rejecting my application? Well, I'll show them. I'll show them all! - Justin Morgan
(6) @Justin Unless there is a bug, or unless the memory becomes full, I'm pretty sure that any VALID XML can be tokenized. As @John-David Dalton noticed, the specifics of XML are given in Regex-like expressions, so it was quite easy (if long and arduous) - xanatos
(4) Does it work for <foo bar="baz > quux">? Because that is in fact valid. So is stuff like <foo bar='"baz" > \'quux\''>. There's a LOT of stuff that looks "wrong" but is still valid XML. - Justin Morgan
(6) @justin The second one is illegal, you can't escape the quotes with a \ . The first one will work if you close the foo. - xanatos
(4) @justin And if you don't trust me, stackoverflow.com/questions/1222367/… - xanatos
(3) Huh, looks like I stand corrected on that one. Fair enough. The basic point is more or less the same, though: Just because you haven't figured out how to break it yet, that doesn't mean it always works. And if it doesn't always work, what's the point? An HTML parser will be more practical AND catch everything. - Justin Morgan
(36) @Justin So an Xml Parser is by definition bug free, while a Regex isn't? Because if an Xml Parser isn't bug free by definition, there could be an XML document that makes it crash and we are back to step 0. Let's say this: both the Xml Parser and this Regex try to be able to parse all the "legal" XML. They CAN parse some "illegal" XML. Bugs could crash both of them. C# XmlReader is surely more tested than this Regex. - xanatos
(45) No, nothing is bug free: 1) All programs contain at least one bug. 2) All programs contain at least one line of unnecessary source code. 3) By #1 and #2 and using logical induction, it's a simple matter to prove that any program can be reduced to a single line of code with a bug. (from Learning Perl) - Scott Weaver
(24) Nope, not all XML parsers are free of defects. Not all cars are free of defects, but it's a lot easier to buy one than it is to grow a mutant ant big enough to ride. You've spent hours constructing a mutant ant recipe which may or may not work at all, and may even be biologically impossible. Meanwhile, there's a perfectly good car sitting there. - Justin Morgan
(9) The .NET regular language does span XML, and so you can tokenize XML with a .NET regular expression. Tokenization and parsing are very very far from the same. The above regex will accept <A><B><C></B></C></A> (I presume). It is a valid sequence of Xml tokens. This string is in the language of XML tokens. It is not in the language of XML, but that doesn't mean it cannot be tokenized. Just because you cannot parse something doesn't mean that it can't be tokenized. LR(0) parsers can't parse binary strings "00011", but the regex [01]* sure can tokenize them. - Michael Graczyk
LR(0) can't parse them because they are left recursive, btw. - Michael Graczyk
(3) @MichaelGraczyk If I remember correctly, the above regex won't accept <A><B><C></B></C></A>. It is a stateful tokenizer, not a stateless tokenizer. - xanatos
(4) In that case, it isn't a regular expression proper. If it uses backtracking at all, or for whatever reason cannot be expressed in terms of empty strings, literals, and the empty set with concatenations, alternations, Kleene stars thereof (inductive concatenations), then it isn't formally regular. A lot of people here seem to think that all "regular expressions" are true, formal regular expressions. It's true that Thompson NFA regular expressions are not true regular expressions, and are not recursive. However, PCRE regex is not truly regular, and hence can be recursive. - Michael Graczyk
(2) Of course, you never said that it was a proper regular expression. You just accurately stated that it would tokenize XML. - Michael Graczyk
(1) Actually, an algorithm CAN be made to compress random data. It would search through all seeds and compare them to the random data to find what seed was used, and find the algorithm used. Then all of that random data can be compressed down to a seed, length, and algorithm. Only problem: it would take days. :-) - uınbɐɥs
(5) @ShaquinTrifonoff - Then it's not actually random. I think your idea of random data is different from mine. - Justin Morgan
@JustinMorgan - You're right. I was thinking of 'normal' random numbers produced by the browser - see Temporary User Tracking In Major Browsers. That document was originally here, but has since been removed. - uınbɐɥs
(3) This isn't regex, this is a full-blown parser! While it's perfectly fine to tokenize bits of a string by regex, you need some functions around that to make it into a real parser, and that's exactly what this is. By "parsing html with regex", I imagine a regex-only "parser", which is obviously impossible. This is just a C# parser that uses regexes every now and then, not a regex parser. - enrey
This is certainly the second best answer here, particularly because the other top answers make fun of the idea of parsing HTML with regex while this one literally succeeds at just that. The only dispute is that PCRE is not mathematically considered a regular expression language: the naysayers hinge their argument on the opinion that regular expressions cannot be recursive. My response? The OP said nothing about regular expressions and only asked for a regex. And yes there is a difference, you just need to take the red pill to see it: rexegg.com/regex-vs-regular-expression.html - Pluto
(1) If anyone wants to deflate the base64 string (saved in my case as xml-parser.b64) on the command prompt, you can try this: printf "\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00" | cat - <(base64 -d xml-regex.b64 ) | gunzip - Alexander Stumpf
@AlexanderStumpf I get an "invalid argument" when I execute that in zsh/MacOS. I assume xml-regex.b64 is the name of the file you saved the snippet above in? - d-b
@d-b you are correct, I should've written "saved in my case as xml-regex.b64" - Alexander Stumpf
5
[+318] [2010-02-15 00:55:24] dubiousjim

In shell, you can parse HTML [1] using sed [2]:

  1. Turing.sed [3]
  2. Write HTML parser (homework)
  3. ???
  4. Profit!

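Related (why you shouldn't use regex match):

  * If You Like Regular Expressions So Much, Why Don't You Marry Them? [4]
  * Regular Expressions: Now You Have Two Problems [5]
  * Hacking stackoverflow.com's HTML sanitizer [6]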

[1] https://en.wikipedia.org/wiki/HTML
[2] https://en.wikipedia.org/wiki/Sed
[3] http://sed.sourceforge.net/grabbag/scripts/turing.sed
[4] https://blog.codinghorror.com/if-you-like-regular-expressions-so-much-why-dont-you-marry-them/
[5] https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/
[6] http://danlec.com/blog/hacking-stackoverflow-com-s-html-sanitizer

(4) I’m afraid you did not get the joke, @kenorb. Please, read the question and the accepted answer once more. This is not about HTML parsing tools in general, nor about HTML parsing shell tools, it’s about parsing HTML via regexes. - Palec
(1) @Palec I don't get the joke either. Is it nearly impossible to parse HTML with regex? - Honinbo Shusaku
(3) No, @Abdul. It is completely, provably (in the mathematical sense) impossible. - Palec
@Palec Is that mathematical sense in relation to VladGudim's answer on the grammar types? Or something else? - Honinbo Shusaku
(6) Yes, that answer summarizes it well, @Abdul. Note that, however, regex implementations are not really regular expressions in the mathematical sense -- they have constructs that make them stronger, often Turing-complete (equivalent to Type 0 grammars). The argument breaks with this fact, but is still somewhat valid in the sense that regexes were never meant to be capable of doing such a job, though. - Palec
(2) And by the way, the joke I referred to was the content of this answer before kenorb's (radical) edits, specifically revision 4, @Abdul. - Palec
(12) The funny thing is that OP never asked to parse html using regex. He asked to match text (which happens to be HTML) using regex. Which is perfectly reasonable. - Paralife
6
[+295] [2011-09-27 04:01:04] Sam

I agree that the right tool to parse XML and especially HTML is a parser and not a regular expression engine. However, like others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format.

Microsoft actually has a section of Best Practices for Regular Expressions in the .NET Framework [1] and specifically talks about Consider[ing] the Input Source [2].

Regular Expressions do have limitations, but have you considered the following?

The .NET framework is unique when it comes to regular expressions in that it supports Balancing Group Definitions [3].

For this reason, I believe you CAN parse XML using regular expressions. Note however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible since the "Balancing Group Definition" will allow the regular expression engine to act as a PDA.

Quote from article 1 cited above:

.NET Regular Expression Engine

As described above properly balanced constructs cannot be described by a regular expression. However, the .NET regular expression engine provides a few constructs that allow balanced constructs to be recognized.

  • (?<group>) - pushes the captured result on the capture stack with the name group.
  • (?<-group>) - pops the top most capture with the name group off the capture stack.
  • (?(group)yes|no) - matches the yes part if there exists a group with the name group otherwise matches no part.

These constructs allow for a .NET regular expression to emulate a restricted PDA by essentially allowing simple versions of the stack operations: push, pop and empty. The simple operations are pretty much equivalent to increment, decrement and compare to zero respectively. This allows for the .NET regular expression engine to recognize a subset of the context-free languages, in particular the ones that only require a simple counter. This in turn allows for the non-traditional .NET regular expressions to recognize individual properly balanced constructs.

Consider the following regular expression:

(?=<ul\s+id="matchMe"\s+type="square"\s*>)
(?>
   <!-- .*? -->                  |
   <[^>]*/>                      |
   (?<opentag><(?!/)[^>]*[^/]>)  |
   (?<-opentag></[^>]*[^/]>)     |
   [^<>]*
)*
(?(opentag)(?!))

Use the flags:

  • Singleline
  • IgnorePatternWhitespace (not necessary if you collapse regex and remove all whitespace)
  • IgnoreCase (not necessary)

Regular Expression Explained (inline)

(?=<ul\s+id="matchMe"\s+type="square"\s*>) # match start with <ul id="matchMe"...
(?>                                        # atomic group / don't backtrack (faster)
   <!-- .*? -->                 |          # match xml / html comment
   <[^>]*/>                     |          # self closing tag
   (?<opentag><(?!/)[^>]*[^/]>) |          # push opening xml tag
   (?<-opentag></[^>]*[^/]>)    |          # pop closing xml tag
   [^<>]*                                  # something between tags
)*                                         # match as many xml tags as possible
(?(opentag)(?!))                           # ensure no 'opentag' groups are on stack

You can try this at A Better .NET Regular Expression Tester [7].

I used the sample source of:

<html>
<body>
<div>
   <br />
   <ul id="matchMe" type="square">
      <li>stuff...</li>
      <li>more stuff</li>
      <li>
          <div>
               <span>still more</span>
               <ul>
                    <li>Another &gt;ul&lt;, oh my!</li>
                    <li>...</li>
               </ul>
          </div>
      </li>
   </ul>
</div>
</body>
</html>

This found the match:

   <ul id="matchMe" type="square">
      <li>stuff...</li>
      <li>more stuff</li>
      <li>
          <div>
               <span>still more</span>
               <ul>
                    <li>Another &gt;ul&lt;, oh my!</li>
                    <li>...</li>
               </ul>
          </div>
      </li>
   </ul>

although it actually came out like this:

<ul id="matchMe" type="square">           <li>stuff...</li>           <li>more stuff</li>           <li>               <div>                    <span>still more</span>                    <ul>                         <li>Another &gt;ul&lt;, oh my!</li>                         <li>...</li>                    </ul>               </div>           </li>        </ul>
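
For readers without .NET at hand, the push / pop / "is the stack empty" idea can be sketched in Python. This is only an illustration under stated assumptions, not the regex above: Python's built-in re module has no balancing groups, so an explicit depth counter stands in for the opentag capture stack. The names TOKEN, START and match_balanced are made up for this sketch.

```python
import re

# Token alternatives mirror the .NET pattern: comment, self-closing tag,
# closing tag, opening tag, or text between tags.
TOKEN = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<selfclose><[^>]*/>)"
    r"|(?P<close></[^>]*>)"
    r"|(?P<open><[^>]*>)"
    r"|(?P<text>[^<>]+)",
    re.DOTALL,
)
START = re.compile(r'<ul\s+id="matchMe"\s+type="square"\s*>')


def match_balanced(html):
    """Return the <ul id="matchMe" ...> element with its balanced content, or None."""
    start = START.search(html)
    if start is None:
        return None
    depth = 0                      # plays the role of the 'opentag' stack
    pos = start.start()
    while pos < len(html):
        tok = TOKEN.match(html, pos)
        if tok is None:            # stray '<' or '>' that no alternative accepts
            return None
        if tok.lastgroup == "open":
            depth += 1             # push
        elif tok.lastgroup == "close":
            depth -= 1             # pop
        pos = tok.end()
        if depth == 0:             # stack empty again: the element is complete
            return html[start.start():pos]
    return None
```

Like the regex, the sketch only counts tags; it never checks that a closing tag has the same name as the tag it closes.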

Lastly, I really enjoyed Jeff Atwood's article: Parsing Html The Cthulhu Way [8]. Funny enough, it cites the answer to this question that currently has over 4k votes.

[1] https://learn.microsoft.com/dotnet/standard/base-types/best-practices
[2] https://learn.microsoft.com/dotnet/standard/base-types/best-practices#consider-the-input-source
[3] https://learn.microsoft.com/dotnet/standard/base-types/grouping-constructs-in-regular-expressions#balancing_group_definition
[4] https://weblogs.asp.net/whaggard/377025
[5] https://learn.microsoft.com/archive/blogs/bclteam/net-regular-expressions-regex-and-balanced-matching-ryan-byington
[6] https://learn.microsoft.com/dotnet/standard/base-types/grouping-constructs-in-regular-expressions#balancing_group_definition
[7] http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
[8] https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

(19) System.Text is not part of C#. It's part of .NET. - John Saunders
(8) In the first line of your regex ((?=<ul\s*id="matchMe"\s*type="square"\s*>) # match start with <ul id="matchMe"...), in between "<ul" and "id" should be \s+, not \s*, unless you want it to match <ulid=... ;) - C0deH4cker
@C0deH4cker You are correct, the expression should have \s+ instead of \s*. - Sam
(4) Not that I really understand it, but I think your regex fails on <img src="images/pic.jpg" /> - Scheintod
(3) @Scheintod Thank you for the comment. I updated the code. The previous expression failed for self closing tags that had a / somewhere inside which failed for your <img src="images/pic.jpg" /> html. - Sam
It's an interesting regex approach! I just noticed that if you add <br / > (with a space after the slash) inside a child of the <ul> you are looking for, then the pattern will also match the closing </div> because the self-closing tag pattern should be <[^>]*/\s*>. Spaces are also allowed in closing tags. I came up with these minor corrections: regex101.com/r/tUtKXj/1 So this typically shows that even advanced regular expressions like your very nice one may break on valid HTML if we don't think of everything :-/ - Patrick Janser
7
[+267] [2009-11-13 23:44:50] John Fiala

I suggest using QueryPath [1] for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.

[1] http://querypath.org/

(9) @Kyle—jQuery does not parse XML, it uses the client's built–in parser (if there is one). Therefore you do not need jQuery to do it, but as little as two lines of plain old JavaScript. If there is no built–in parser, jQuery will not help. - RobG
(2) @RobG Actually jQuery uses the DOM, not the built-in parser. - Qix - MONICA WAS MISTREATED
(12) @Qix—you'd better tell the authors of the documentation then: "jQuery.parseXML uses the native parsing function of the browser…". Source: jQuery.parseXML() - RobG
(6) Having come here from the meme question (meta.stackexchange.com/questions/19478/the-many-memes-of-meta/…), I love that one of the answers is 'Use jQuery' - user8681
8
[+236] [2010-01-27 12:54:35] moritz

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:

<([a-z]+) *[^/]*?>

If you add something to the regex, by backtracking it can be forced to match silly things like <a >>, because [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.

My suggestion would be

<([a-z]+)[^>]*(?<!/)>

Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.
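
A quick way to check the suggested pattern against the tags from the question is a few lines of Python (whose re module also supports negative look-behind). This is just a sketch; the helper name open_tag_names is made up for illustration.

```python
import re

# The suggested pattern, unchanged.
OPEN_TAG = re.compile(r"<([a-z]+)[^>]*(?<!/)>")


def open_tag_names(html):
    """Return the captured tag name of every match in html."""
    return [m.group(1) for m in OPEN_TAG.finditer(html)]


samples = '<p> <a href="foo"> <br /> <hr class="foo" />'
print(open_tag_names(samples))  # ['p', 'a']
```

The same harness makes the known limits easy to see: for <h1> the group captures only "h", and for <a title="5>3"> the match stops at the first >, inside the attribute value.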


(31) +1 for noting that the question is not about parsing full (X)HTML, it's about matching (X)HTML open tags. - LarsH
(11) Something else most of the answers seem to ignore, is that an HTML parser can very well use regular expressions in its implementation for parts of HTML, and I would be surprised if most parsers didn't do this. - Thayne
(2) @Thayne Exactly. When parsing individual tags, a regular expression is the right tool for the job. It is quite ridiculous that one has to scroll halfway down the page to find a reasonable answer. The accepted answer is incorrect because it mixes up lexing and parsing. - kasperd
(4) The answer given here will fail when an attribute value contains a '>' or '/' character. - Martin L
(1) This will work incorrectly on HTML containing comments or CData sections. It will also not work correctly if a quoted attribute contains a > character. I agree what OP suggests can be done with a regex, but the one presented here is far too simplistic. - JacquesB
(2) The <h1> tag would like a word with you (easily fixed, I know, but still)... - jimbobmcgee
Thanks, this works for me. demo - Amine KOUIS
9
[+197] [2012-05-17 10:13:03] cytinus

Sun Tzu, an ancient Chinese strategist, general, and philosopher, said:

It is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss. If you only know yourself, but not your opponent, you may win or may lose. If you know neither yourself nor your enemy, you will always endanger yourself.

In this case your enemy is HTML and you are either yourself or regex. You might even be Perl with irregular regex. Know HTML. Know yourself.

I have composed a haiku describing the nature of HTML.

HTML has
complexity exceeding
regular language.

I have also composed a haiku describing the nature of regex in Perl.

The regex you seek
is defined within the phrase
<([a-zA-Z]+)(?:[^>]*[^/]*)?>

How many syllables does <([a-zA-Z]+)(?:[^>]*[^/]*)?> have?? - LarsH
Amazing, also this works fine. demo - Amine KOUIS
10
[+193] [2009-11-13 22:50:48] Kobi

Try:

<([^\s]+)(\s[^>]*?)?(?<!/)>

It is similar to yours, but the last > must not be after a slash, and also accepts h1.


(116) <a href="foo" title="5>3"> Oops </a> - Gareth
(68) > is valid in an attribute value. Indeed, in the ‘canonical XML’ serialisation you must not use &gt;. (Which isn't entirely relevant, except to emphasise that > in an attribute value is not at all an unusual thing.) - bobince
(5) @Kobi: what does the exclamation mark (the one you placed toward the end) mean in a regexp? - Marco Demaio
(6) @bobince: are u sure? I don't understand anymore, so is this valid HTML too: <div title="this tag is a <div></div>">hello</div> - Marco Demaio
(3) @MarcoDemaio - > does not have to be escaped in an attribute value, but < does. So this would be valid HTML: <div title="this tag is a &lt;div>&lt;/div>">hello</div> - Daniel Haley
This also matches <!-- some comment -->. - fritzmg
Sorry, it doesn't work demo - Amine KOUIS
11
[+161] [2009-11-15 14:37:06] meder omuraliev
<?php
$selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

$html = '
<p><a href="#">foo</a></p>
<hr/>
<br/>
<div>name</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$els = $dom->getElementsByTagName('*');
foreach ( $els as $el ) {
    $nodeName = strtolower($el->nodeName);
    if ( !in_array( $nodeName, $selfClosing ) ) {
        var_dump( $nodeName );
    }
}

Output:

string(4) "html"
string(4) "body"
string(1) "p"
string(1) "a"
string(3) "div"

Basically just define the element node names that are self closing, load the whole html string into a DOM library, grab all elements, loop through them, keep the ones which aren't self closing and operate on them.

I'm sure you already know by now that you shouldn't use regex for this purpose.
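
For comparison, roughly the same filtering can be sketched with Python's standard html.parser module. This is an assumption-laden port, not the PHP above: html.parser is an event-based tokenizer rather than a DOM, so it does not add the implied html and body elements that DOMDocument reports. The names VOID, OpenTagCollector and open_tags are invented for this sketch.

```python
from html.parser import HTMLParser

# Same list of self-closing element names as in the PHP snippet.
VOID = {"area", "base", "basefont", "br", "col", "frame", "hr", "img",
        "input", "isindex", "link", "meta", "param", "embed"}


class OpenTagCollector(HTMLParser):
    """Collect the names of opening tags that are not self-closing."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Called for <p>, <br>, ...; skip elements that never take a close tag.
        if tag not in VOID:
            self.names.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style <br/> and <hr />; ignore them entirely.
        pass


def open_tags(html):
    collector = OpenTagCollector()
    collector.feed(html)
    collector.close()
    return collector.names
```

Because a tokenizer does the work, a > inside a quoted attribute value is not a problem here.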


(1) If you're dealing with real XHTML then append getElementsByTagName with NS and specify the namespace. - meder omuraliev
12
[+153] [2009-11-16 23:15:03] GONeale

I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack [1]?

Excerpt:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML.

[1] http://www.codeplex.com/htmlagilitypack

CodePlex closed down (but this one is in the CodePlex archive). Perhaps update? - Peter Mortensen
13
[+141] [2009-11-13 22:47:17] Jherico

You want the first > not preceded by a /. Look here [1] for details on how to do that. It's referred to as negative lookbehind.

However, a naïve implementation of that will end up matching <bar/></foo> in this example document

<foo><bar/></foo>

Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programmatically?
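
To make the pitfall concrete, here is a small Python sketch. The pattern NAIVE is an assumption: it is one possible naïve reading of "first > not preceded by a /", not a pattern taken from the linked page.

```python
import re

# Naive reading: '<', then as little as possible, then a '>' not preceded by '/'.
NAIVE = re.compile(r"<.*?(?<!/)>")

print(NAIVE.findall("<foo><bar/></foo>"))  # ['<foo>', '<bar/></foo>']
```

The look-behind rejects the > that ends <bar/>, so the lazy .*? simply keeps going and swallows the following </foo> as well.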

[1] http://www.regular-expressions.info/lookaround.html

14
[+110] [2009-11-16 19:02:48] SamGoody

If you need this for PHP:

The PHP DOM [1] functions [2] won't work properly unless it is properly formatted XML. No matter how much better their use is for the rest of mankind.

simplehtmldom [3] is good, but I found it a bit buggy, and it is quite memory heavy [Will crash on large pages.]

I have never used querypath [4], so can't comment on its usefulness.

Another one to try is my DOMParser [5] which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted.

For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.

[1] http://www.php.net/manual/en/function.dom-import-simplexml.php
[2] http://php.net/manual/en/class.domdocument.php
[3] http://simplehtmldom.sourceforge.net/
[4] http://querypath.org/
[5] http://github.com/siteroller/domparser

15
[+99] [2011-07-25 14:35:59] yodabar

Here's the solution:

<?php
// here's the pattern:
$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*(\/>|>)/';
    
// a string to parse:
$string = 'Hello, try clicking <a href="#paragraph">here</a>
    <br/>and check out.<hr />
    <h2>title</h2>
    <a name ="paragraph" rel= "I\'m an anchor"></a>
    Fine, <span title=\'highlight the "punch"\'>thanks<span>.
    <div class = "clear"></div>
    <br>';
    
// let's get the occurrences:
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);
    
// print the result:
print_r($matches[0]);
?>

To test it deeply, I entered in the string auto-closing tags like:

  1. <hr />
  2. <br/>
  3. <br>

I also entered tags with:

  1. one attribute
  2. more than one attribute
  3. attributes whose value is enclosed in either single or double quotes
  4. attributes containing single quotes when the delimiter is a double quote and vice versa
  5. "unpretty" attributes with a space before the "=" symbol, after it and both before and after it.

Should you find something which does not work in the proof of concept above, I am happy to analyze the code to improve my skills.

<EDIT> I forgot that the question from the user was to avoid the parsing of self-closing tags. In this case the pattern is simpler, turning into this:

$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*>/';

The user @ridgerunner noticed that the pattern does not allow unquoted attributes or attributes with no value. In this case a fine tuning brings us the following pattern:

$pattern = '/<(\w+)(\s+(\w+)(\s*\=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/';

</EDIT>

Understanding the pattern

If someone is interested in learning more about the pattern, here is a step-by-step breakdown:
  1. the first sub-expression (\w+) matches the tag name
  2. the second sub-expression contains the pattern of an attribute. It is composed of:
  3. one or more whitespaces \s+
  4. the name of the attribute (\w+)
  5. zero or more whitespaces \s* (it is possible or not, leaving blanks here)
  6. the "=" symbol
  7. again, zero or more whitespaces
  8. the delimiter of the attribute value, a single or double quote ('|"). In the pattern, the single quote is escaped because it coincides with the PHP string delimiter. This sub-expression is captured with the parentheses so it can be referenced again to parse the closure of the attribute, that's why it is very important.
  9. the value of the attribute, matched by almost anything: (.*?); in this specific syntax, using the non-greedy (lazy) match (the question mark after the asterisk), the RegExp engine matches as little as possible and stops as soon as what follows this sub-expression can match
  10. here comes the fun: the \4 part is a backreference operator, which refers to a sub-expression defined before in the pattern, in this case, I am referring to the fourth sub-expression, which is the first attribute delimiter found
  11. zero or more whitespaces \s*
  12. the attribute sub-expression ends here, with the specification of zero or more possible occurrences, given by the asterisk.
  13. Then, since a tag may end with a whitespace before the ">" symbol, zero or more whitespaces are matched with the \s* subpattern.
  14. The tag to match may end with a simple ">" symbol, or a possible XHTML closure, which makes use of the slash before it: (/>|>). The slash is, of course, escaped since it coincides with the regular expression delimiter.

Small tip: to better analyze this code, look at the generated source code, since I did not escape any HTML special characters.
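
For anyone who wants to experiment outside PHP, the variant that skips self-closing tags can be tried in Python. This is a hedged port, not the original code: the PHP string escaping is dropped in favour of a raw string, and the helper name open_tags is invented here. The \4 backreference still points at the opening quote character.

```python
import re

# Second pattern from the answer, ported to a Python raw string.
OPEN_TAG = re.compile(r"""<(\w+)(\s+(\w+)\s*=\s*('|")(.*?)\4\s*)*\s*>""")


def open_tags(html):
    """Return the full text of every opening tag the pattern accepts."""
    return [m.group(0) for m in OPEN_TAG.finditer(html)]


sample = ('<a href="#p">here</a><br/><hr /><h2>t</h2>'
          '<a name ="p" rel= "I\'m">x</a><br>')
for tag in open_tags(sample):
    print(tag)
```

On this sample the pattern accepts both <a ...> tags, <h2> and <br>, and skips <br/> and <hr />.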


(12) Does not match valid tags having attributes with no value, i.e. <option selected>. Also does not match valid tags with unquoted attribute values, i.e. <p id=10>. - ridgerunner
(1) @ridgerunner: Thanks very much for your comment. In that case the pattern must change a bit: $pattern = '/<(\w+)(\s+(\w+)(\s*\=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/'; I tested it and works in case of non-quoted attributes or attributes with no value. - yodabar
How about a space before the tag name: < a href="http://wtf.org" > I'm pretty sure it is legal, but you don't match it. - Floris
(7) NO sorry, whitespaces before a tagname are illegal. Beyond being "pretty sure" why don't you provide some evidences of your objection? Here are mine, w3.org/TR/xml11/#sec-starttags referred to XML 1.1, and you can find the same for HTML 4, 5 and XHTML, as a W3C validation would also warn if you make a test. As a lot of other blah-blah-poets around here, I did not still receive any intelligent argumentation, apart some hundred of minus to my answers, to demonstrate where my code fails according to the rules of contract specified in the question. I would only welcome them. - yodabar
XML tags can contain colons, e.g. <namespace:name>, is that not so in HTML? - Qwertie
16
[+94] [2009-11-18 14:50:26] Sembiance

Whenever I need to quickly extract something from an HTML document, I use Tidy to convert it to XML and then use XPath or XSLT to get what I need. In your case, something like this:

//p/a[@href='foo']
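
As a rough sketch of the second half of that workflow, assuming the Tidy step has already produced well-formed XHTML: Python's standard xml.etree.ElementTree supports a limited XPath subset that is enough for this query. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed to be the well-formed output of a tool such as Tidy.
xhtml = ('<html><body>'
         '<p><a href="foo">first</a><a href="bar">second</a></p>'
         '<br />'
         '</body></html>')

root = ET.fromstring(xhtml)

# ElementTree wants a relative path, hence the leading '.'.
hits = root.findall(".//p/a[@href='foo']")
print([a.text for a in hits])  # ['first']
```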

17
[+91] [2009-11-16 18:34:50] wen

I used an open source tool called HTMLParser [1] before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML into a tree of nodes and you can easily use its API to get attributes out of a node. Check it out and see if this can help you.

[1] http://htmlparser.sourceforge.net/

18
[+87] [2011-07-11 17:13:17] Sam Watkins

I like to parse HTML with regular expressions. I don't attempt to parse idiot HTML that is deliberately broken. This code is my main parser (Perl edition):

$_ = join "",<STDIN>; tr/\n\r \t/ /s; s/</\n</g; s/>/>\n/g; s/\n ?\n/\n/g;
s/^ ?\n//s; s/ $//s; print

It's called htmlsplit, splits the HTML into lines, with one tag or chunk of text on each line. The lines can then be processed further with other text tools and scripts, such as grep [1], sed [2], Perl, etc. I'm not even joking :) Enjoy.

It is simple enough to rejig my slurp-everything-first Perl script into a nice streaming thing, if you wish to process enormous web pages. But it's not really necessary.
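
For those who do not read Perl, here is a hedged line-by-line port of the same substitutions to Python. It is a sketch of what the one-liner does, not the htmlsplit tool itself, and the function name is simply borrowed from the script.

```python
import re


def htmlsplit(html):
    """Put every tag and every chunk of text on its own line."""
    text = re.sub(r"[\n\r \t]+", " ", html)     # tr/\n\r \t/ /s : squeeze whitespace
    text = text.replace("<", "\n<")              # s/</\n</g
    text = text.replace(">", ">\n")              # s/>/>\n/g
    text = re.sub(r"\n ?\n", "\n", text)         # s/\n ?\n/\n/g : drop empty lines
    text = re.sub(r"^ ?\n", "", text, count=1)   # s/^ ?\n//s   : trim the start
    text = re.sub(r" $", "", text, count=1)      # s/ $//s      : trim the end
    return text


print(htmlsplit("<p>Hello <b>world</b></p>"), end="")
```

As in the original, an unescaped < or > inside text or an attribute value will split a line in the wrong place.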

HTML Split [3]


Some better regular expressions:

/(<.*?>|[^<]+)\s*/g    # Get tags and text
/(\w+)="(.*?)"/g       # Get attributes

They are good for XML / XHTML.

With minor variations, it can cope with messy HTML... or convert the HTML -> XHTML first.
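
The two expressions can be tried as-is in Python; this is only a sketch with made-up sample input and invented constant names.

```python
import re

TAGS_AND_TEXT = re.compile(r"(<.*?>|[^<]+)\s*")   # get tags and text
ATTRIBUTES = re.compile(r'(\w+)="(.*?)"')          # get attributes

print(TAGS_AND_TEXT.findall('<p class="x">Hello <b>world</b></p>'))
print(ATTRIBUTES.findall('<a href="foo" title="bar">'))
```

The attribute expression only knows about double quotes, so a tag such as <a href='foo'> yields no attributes at all.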


The best way to write regular expressions is in the Lex [4] / Yacc [5] style, not as opaque one-liners or commented multi-line monstrosities. I didn't do that here, yet; these ones barely need it.

[1] http://en.wikipedia.org/wiki/Grep
[2] http://en.wikipedia.org/wiki/Sed
[3] http://sam.nipl.net/code/nipl-tools/bin/htmlsplit
[4] http://en.wikipedia.org/wiki/Lex_%28software%29
[5] http://en.wikipedia.org/wiki/Yacc

(62) "I don't attempt to parse idiot HTML that is deliberately broken." How does your code know the difference? - Kevin Panko
Well it doesn't matter much if the HTML is broken or not. The thing will still split HTML into tags and text. The only thing that could foul it up is if people include unescaped < or > characters in text or attributes. In practise, my tiny HTML splitter works well. I don't need an enormous monstrosity chock full of heuristics. Simple solutions are not for everyone...! - Sam Watkins
I added some simpler regexps for extracting tags, text, and attributes, for XML / XHTML. - Sam Watkins
(5) (get attributes bug 1) /(\w+)="(.*?)"/ assumes double quotes. It will miss values in single quotes. In html version 4 and earlier unquoted value is allowed, if it is a simple word. - David Andersson
(4) (get attributes bug 2) /(\w+)="(.*?)"/ may falsely match text that looks like an attribute within an attribute, e.g. <img title="Nope down='up' for aussies" src="..." />. If applied globally, it will also match such things in ordinary text or in html comments. - David Andersson
(3) (get attributes bug 3) /(\w+)="(.*?)"/ Optional whitespace should be allowed around the equal sign. - David Andersson
(4) (html split bug 1) s/>/>\n/g Since ">" is allowed in data, this may split text lines and confuse subsequent processing. - David Andersson
@DavidAndersson, as I said above those regexps are for XML / XHTML, which has a more strict syntax and is easier to parse. If you want a "perfect" parser that can cope with any difficult or malformed HTML input, maybe it's better not to use regexps. But I've found that regexps are more than sufficient for all the html processing tasks I have ever needed to do, and if they don't work for some reason I can tweak them to make them work. In fact regexps can be more flexible with some types of broken html that would be rejected by a fussy parser. - Sam Watkins
(2) The four bugs above apply to xhtml too. (Except the last part of get attributes bug 1 does not apply to xhtml.) If you put these regexps into production code, would you notice if they occasionally miss input data and/or generate additional fake data? (You have a case for using regexps when trying to parse malformed html, but I think these are too simplistic.) - David Andersson
19
[+72] [2010-07-05 14:16:50] yodabar

About the question of the regular expression methods to parse (x)HTML, the answer to all of the ones who spoke about some limits is: you have not been trained enough to rule the force of this powerful weapon, since nobody here spoke about recursion.

A regular expression-agnostic colleague pointed me to this discussion, which is certainly not the first on the web about this old and hot topic.

After reading some posts, the first thing I did was looking for the "?R" string in this thread. The second was to search about "recursion".

No, holy cow, no match found. Since nobody mentioned the main mechanism a parser is built onto, I was soon aware that nobody got the point.

If an (x)HTML parser needs recursion, a regular expression parser without recursion is not enough for the purpose. It's a simple construct.

The black art of regular expressions is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)

Here's the magic pattern:

$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s";

Just try it. It's written as a PHP string, and the "s" modifier makes the dot (.) match newlines as well.

Here's a sample note on the PHP manual I wrote in January: Reference [1]

(Take care. In that note I wrongly used the "m" modifier; it should be erased, notwithstanding it is discarded by the regular expression engine, since no ^ or $ anchoring was used).

Now, we could speak about the limits of this method from a more informed point of view:

  1. according to the specific implementation of the regular expression engine, recursion may have a limit in the number of nested patterns parsed, but it depends on the language used
  2. corrupted (x)HTML does not cause severe errors, but it is not sanitized either.

Anyhow, it is only a regular expression pattern, but it opens up the possibility of developing a lot of powerful implementations.

I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performance is really great, both in execution time and in memory usage (nothing to do with other template engines which use the same syntax).
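
Python's built-in re module has no (?R), so the pattern cannot be pasted there directly. As a rough illustration of what the recursion does, the structure of the pattern can be mirrored with an ordinary recursive function. This is a sketch under that assumption, not the template engine's code, and the names OPEN, FILLER and match_element are invented.

```python
import re

OPEN = re.compile(r"<(\w+)([^>]*?)(\s*/>|>)")              # <name attrs /> or <name attrs>
FILLER = re.compile(r"[^<]+|<!--.*?-->", re.DOTALL)        # text or a comment


def match_element(html, pos=0):
    """Return the index just past the element starting at pos, or None."""
    m = OPEN.match(html, pos)
    if m is None:
        return None
    pos = m.end()
    if m.group(3) != ">":            # self-closing form: nothing to recurse into
        return pos
    close = re.compile(r"</" + re.escape(m.group(1)) + r"\s*>")   # the \1 part
    while True:
        end_tag = close.match(html, pos)
        if end_tag:
            return end_tag.end()
        filler = FILLER.match(html, pos)
        if filler:
            pos = filler.end()
            continue
        child_end = match_element(html, pos)   # plays the role of (?R)
        if child_end is None:
            return None                        # unbalanced or unexpected markup
        pos = child_end
```

Like the pattern, the sketch stops at the first > it sees inside a tag, so a greater-than sign inside an attribute value still confuses it.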

[1] http://php.net/manual/en/regexp.reference.recursive.php

(41) I'll put this in the "Regex which doesn't allow greater-than in attributes" bin. Check it against <input value="is 5 > 3?" /> - Gareth
(74) If you put something like that in production code, you would likely be shot by the maintainer. A jury would never convict him. - aehiilrs
@Gareth: thanks for your objection, but are you sure that putting a greater-than inside an attribute is a valid code? Well, also if not, this evidences another limit to add to the ones I listed above in case to create a greed parser for the real world... But it is not too much to demonstrate the way is not good, do you agree? There are other useful operators in RegExp which allow to check for next occurrences, this should be a proper use for them. - yodabar
(1) @Bart K.: it is valid only in an HTML 4- document. XHTML documents need the five XML entities encoded. - yodabar
If your comments are aimed to nothing but criticize, I see no good results this discussion may reach. - yodabar
(3) I was the first to say that my solution has some limits, but of course I am available to listen anyone who can help me in improving it. I posted something which costed me time and work, and which results are effective in a number of projects up and running. I thought it could help, proposing the way of a RegExp solution which nobody nearly spoke about (recursion), and which is the only way to parse nested markup patterns (through RegExp, of course). - yodabar
(32) Regular expressions can't work because by definition they are not recursive. Adding a recursive operator to regular expressions basically makes a CFG only with poorer syntax. Why not use something designed to be recursive in the first place rather than violently insert recursion into something already overflowing with extraneous functionality? - Welbog
(3) Once again... > is valid pretty much everywhere in XML, and thus in XHTML, see section 2.4 of the XML spec (at xml.com/axml/target.html#syntax for example) - mirod
(1) You are right, the lesser-than only is not valid inside XML attributes. Thanks to your criticism, I implemented my solution so that it can parse anything inside the attributes :) Beside this, I implemented the parsing of XML prologue, DTDs and CDATA. The only upset is that the mod closed the possibility to answer this discussion for users with less than 10 points, so that I cannot post it. I twitted him the request to unlock it, but had no response. Come to me, enemies, I wait you! :) The more you are, the stronger I become! - yodabar
(20) My objection isn't one of functionality it is one of time invested. The problem with RegEx is that by the time you post the cutesy little one liners it appears that you did something more efficiently ("See one line of code!"). And of course no one mentions the half hour (or 3) that they spent with their cheat-sheet and (hopefully) testing every possible permutation of input. And once you get past all that when the maintainer goes to figure out or validate the code they can't just look at it and see that it is right. They have to dissect the expression and essentially retest it all over again... - Oorang
(17) ... to know that it is good. And that will happen even with people who are good with regex. And honestly I suspect that overwhelming majority of people won't know it well. So you take one of the most notorious maintenance nightmares and combine it with recursion which is the other maintenance nightmare and I think to myself what I really need on my project is someone a little less clever. The goal is to write code that bad programmers can maintain without breaking the code base. I know it galls to code to the least common denominator. But hiring excellent talent is hard, and you often... - Oorang
20
[+71] [2010-04-25 16:38:42] sblom

There are some nice regexes for replacing HTML with BBCode here [1]. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.

For example:

$store =~ s/http:/http:\/\//gi;
$store =~ s/https:/https:\/\//gi;
$baseurl = $store;

if (!$query->param("ascii")) {
    $html =~ s/\s\s+/\n/gi;
    $html =~ s/<pre(.*?)>(.*?)<\/pre>/\[code]$2\[\/code]/sgmi;
}

$html =~ s/\n//gi;
$html =~ s/\r\r//gi;
$html =~ s/$baseurl//gi;
$html =~ s/<h[1-7](.*?)>(.*?)<\/h[1-7]>/\n\[b]$2\[\/b]\n/sgmi;
$html =~ s/<p>/\n\n/gi;
$html =~ s/<br(.*?)>/\n/gi;
$html =~ s/<textarea(.*?)>(.*?)<\/textarea>/\[code]$2\[\/code]/sgmi;
$html =~ s/<b>(.*?)<\/b>/\[b]$1\[\/b]/gi;
$html =~ s/<i>(.*?)<\/i>/\[i]$1\[\/i]/gi;
$html =~ s/<u>(.*?)<\/u>/\[u]$1\[\/u]/gi;
$html =~ s/<em>(.*?)<\/em>/\[i]$1\[\/i]/gi;
$html =~ s/<strong>(.*?)<\/strong>/\[b]$1\[\/b]/gi;
$html =~ s/<cite>(.*?)<\/cite>/\[i]$1\[\/i]/gi;
$html =~ s/<font color="(.*?)">(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<font color=(.*?)>(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<link(.*?)>//gi;
$html =~ s/<li(.*?)>(.*?)<\/li>/\[\*]$2/gi;
$html =~ s/<ul(.*?)>/\[list]/gi;
$html =~ s/<\/ul>/\[\/list]/gi;
$html =~ s/<div>/\n/gi;
$html =~ s/<\/div>/\n/gi;
$html =~ s/<td(.*?)>/ /gi;
$html =~ s/<tr(.*?)>/\n/gi;

$html =~ s/<img(.*?)src="(.*?)"(.*?)>/\[img]$baseurl\/$2\[\/img]/gi;
$html =~ s/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/\[url=$baseurl\/$2]$4\[\/url]/gi;
$html =~ s/\[url=$baseurl\/http:\/\/(.*?)](.*?)\[\/url]/\[url=http:\/\/$1]$2\[\/url]/gi;
$html =~ s/\[img]$baseurl\/http:\/\/(.*?)\[\/img]/\[img]http:\/\/$1\[\/img]/gi;

$html =~ s/<head>(.*?)<\/head>//sgmi;
$html =~ s/<object>(.*?)<\/object>//sgmi;
$html =~ s/<script(.*?)>(.*?)<\/script>//sgmi;
$html =~ s/<style(.*?)>(.*?)<\/style>//sgmi;
$html =~ s/<title>(.*?)<\/title>//sgmi;
$html =~ s/<!--(.*?)-->/\n/sgmi;

$html =~ s/\/\//\//gi;
$html =~ s/http:\//http:\/\//gi;
$html =~ s/https:\//https:\/\//gi;

$html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi;
$html =~ s/\r\r//gi;
$html =~ s/\[img]\//\[img]/gi;
$html =~ s/\[url=\//\[url=/gi;
[1] http://www.garyshood.com/htmltobb/source.txt

(20) Don't do this. Please. - maletor
21
[+66] [2012-05-10 13:53:54] daghan
<\s*(\w+)[^/>]*>

The parts explained:

<: Starting character

\s*: It may have whitespaces before the tag name (ugly, but possible).

(\w+): tags can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt I guess. If curious, use ([a-zA-Z0-9]+) instead.

[^/>]*: Anything except > and / until closing >

>: Closing >

UNRELATED

And to the fellows, who underestimate regular expressions, saying they are only as powerful as regular languages:

a^n b a^n b a^n (the same number of a's in each run), which is not regular and not even context-free, can be matched with ^(a+)b\1b\1$

Backreferencing FTW [1]!

[1] http://en.wiktionary.org/wiki/FTW
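
Both claims in this answer can be checked with a short sketch. It assumes Python's re module, which the answer does not name; the sample strings are made up for illustration.

```python
import re

# The tag pattern from the answer above.
TAG = re.compile(r"<\s*(\w+)[^/>]*>")

sample = '<p> <a href="foo"> <br /> <hr class="foo" /> </p>'
names = TAG.findall(sample)  # captured tag names of the opening tags

# A backreference lets the engine require three equally long runs of "a",
# i.e. a^n b a^n b a^n, which a true regular expression cannot describe.
ANBANBAN = re.compile(r"^(a+)b\1b\1$")

balanced = bool(ANBANBAN.match("aaabaaabaaa"))   # runs of 3, 3, 3
unbalanced = bool(ANBANBAN.match("aaabaabaaa"))  # runs of 3, 2, 3
```

On this sample, names contains only p and a; the self-closing and closing tags are skipped, and only the balanced string matches the backreference pattern.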

@GlitchMr, that was his point. Modern regular expressions are not technically regular, nor is there any reason for them to be. - alanaktion
(5) @alanaktion: The "modern" regular expressions (read: with Perl extensions) cannot match within O(MN) (M being regular expression length, N being text length). Backreferences are one of the causes of that. The implementation in awk doesn't have backreferences and matches everything within O(MN) time. - 0..
(3) <a href="foo" title="5>3"> Oops </a> (quoting @Gareth - odd how people keep posting answers with this specific deficiency over and over. CDATA is kind of easy to overlook, but this is rather more basic) - Qwertie
(1) This regex will not work if the HTML tag contains a / anywhere inside it. For example: <a href="example.com/test/example.html"> - Rohìt Jíndal
Sorry, it doesn't match <br /> check this demo. - Amine KOUIS
@AmineKOUIS that is exactly what the question asked: "But not self-closing tags:" - daghan
22
[+63] [2010-02-04 16:22:00] Corey Sanders

As many people have already pointed out, HTML is not a regular language which can make it very difficult to parse. My solution to this is to turn it into a regular language using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written using Java with the jtidy [1] library to turn the HTML into XML and then Jaxen to xpath into the result.

[1] http://jtidy.sourceforge.net/

23
[+58] [2012-06-01 05:13:26] Lonnie Best

If you're simply trying to find those tags (without ambitions of parsing) try this regular expression:

/<[^/]*?>/g

I wrote it in 30 seconds, and tested here: https://regexr.com/

It matches the types of tags you mentioned, while ignoring the types you said you wanted to ignore.


(2) FYI, you don't need to escape angle brackets. Of course, it does no harm to escape them anyway, but look at the confusion you could have avoided. ;) - Alan Moore
I sometimes escape unnecessarily when I'm unsure if something is special character or not. I've edited the answer; it works the same but more concise. - Lonnie Best
I checked it here, but it seems it doesn't match <br /> - Amine KOUIS
@AmineKOUIS You're right; it doesn't match the self-closing tags found in XHTML. However, the OP requested to only match opening tags, and HTML5 validators warn that <br /> should simply be <br> instead. To match both opening tags and self-closing tags, without matching closing tags, try /<[^/>][^>]*?>/g. - Lonnie Best
24
[+54] [2012-05-28 23:27:06] slevithan

It's true that when programming it's usually best to use dedicated parsers and APIs instead of regular expressions when dealing with HTML, especially if accuracy is paramount (e.g., if your processing might have security implications). However, I don’t subscribe to a dogmatic view that XML-style markup should never be processed with regular expressions. There are cases when regular expressions are a great tool for the job, such as when making one-time edits in a text editor, fixing broken XML files, or dealing with file formats that look like but aren’t quite XML. There are some issues to be aware of, but they're not insurmountable or even necessarily relevant.

A simple regex like <([^>"']|"[^"]*"|'[^']*')*> is usually good enough, in cases such as those I just mentioned. It's a naive solution, all things considered, but it does correctly allow unencoded > symbols in attribute values. If you're looking for, e.g., a table tag, you could adapt it as </?table\b([^>"']|"[^"]*"|'[^']*')*>.
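
As a rough illustration of the two patterns just described, here is a sketch assuming Python's re module (the answer itself is engine-agnostic); the sample markup is invented.

```python
import re

# Quote-aware tag matcher: quoted attribute values are consumed whole,
# so a ">" inside them does not end the tag early.
TAG = re.compile(r"""<([^>"']|"[^"]*"|'[^']*')*>""")

html = '<a href="foo" title="5>3"> Oops </a>'
tags = [m.group(0) for m in TAG.finditer(html)]

# The same idea restricted to opening and closing table tags.
TABLE = re.compile(r"""</?table\b([^>"']|"[^"]*"|'[^']*')*>""")

table_html = '<table class="a>b"><tr></tr></table>'
table_tags = [m.group(0) for m in TABLE.finditer(table_html)]
```

The first search yields the complete opening tag, including the title attribute, plus the closing </a>; the second yields only the two table tags and skips the <tr> pair.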

Just to give a sense of what a more "advanced" HTML regex would look like, the following does a fairly respectable job of emulating real-world browser behavior and the HTML5 parsing algorithm:

</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)

The following matches a fairly strict definition of XML tags (although it doesn't account for the full set of Unicode characters allowed in XML names):

<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?|/([_:A-Z][-.:\w]*)\s*)>

Granted, these don't account for surrounding context and a few edge cases, but even such things could be dealt with if you really wanted to (e.g., by searching between the matches of another regex).

At the end of the day, use the most appropriate tool for the job, even in the cases when that tool happens to be a regex.


25
[+53] [2009-11-15 17:13:19] manixrock

It seems to me you're trying to match tags without a "/" at the end. Try this:

<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>

(11) This does not work. For the input '<x a="<b>"/><y>' the matches are x and y, although x is terminated. - ceving
@ceving you're right demo - Amine KOUIS
26
[+51] [2010-02-09 03:59:27] Emre Yazici

Although regular expressions are neither suitable nor efficient for that purpose, they sometimes provide quick solutions for simple matching problems, and in my view it's not that horrible to use them for trivial work.

There is a definitive blog post [1] about matching innermost HTML elements written by Steven Levithan.

[1] http://blog.stevenlevithan.com/archives/match-innermost-html-element

27
[+43] [2010-11-24 10:11:39] morja

If you only want the tag names, it should be possible to do this via a regular expression.

<([a-zA-Z]+)(?:[^>]*[^/] *)?>

should do what you need. But I think the solution of "moritz" is already fine. I didn't see it in the beginning.

For all downvoters: In some cases it just makes sense to use a regular expression, because it can be the easiest and quickest solution. I agree that in general you should not parse HTML with regular expressions.

But regular expressions can be a very powerful tool when you have a subset of HTML where you know the format and you just want to extract some values. I did that hundreds of times and almost always achieved what I wanted.


please it doesn't match 'br' tag, here is a demo - Amine KOUIS
@AmineKOUIS thanks, but the OP specifically asked not to match <br /> - morja
28
[+39] [2011-03-06 12:38:47] Jonathan Wood

The OP doesn't seem to say what he needs to do with the tags. For example, does he need to extract inner text, or just examine the tags?

I'm firmly in the camp that says a regular expression is not the be-all, end-all text parser. I've written a large amount of text-parsing code including this code to parse HTML tags [1].

While it's true I'm not all that great with regular expressions, I consider regular expressions just too rigid and hard to maintain for this sort of parsing.

[1] http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c

29
[+32] [2010-04-23 06:38:31] Paul

This may do:

<.*?[^/]>

Or without the ending tags:

<[^/].*?[^/]>

What's with the flame wars on HTML parsers? HTML parsers must parse (and rebuild!) the entire document before they can categorize your search. Regular expressions may be faster / more elegant in certain circumstances. My 2 cents...


(6) <a href="foo" title="5>3"> Oops </a> (quoting @Gareth) - Qwertie
@Qwertie If you escape > it works, you can check this demo - Amine KOUIS
30
[+22] [2012-05-26 13:25:16] Cylian

I think this might work

<[a-z][^<>]*(?:(?:[^/]\s*)|(?:\s*[^/]))>

And that could be tested here [1].


As per W3Schools [2]...

XML Naming Rules

XML elements must follow these naming rules:

  • Names can contain letters, numbers, and other characters
  • Names cannot start with a number or punctuation character
  • Names cannot start with the letters xml (or XML, Xml, etc.)
  • Names cannot contain spaces
  • Any name can be used, and no words are reserved.

And the pattern I used is going to adhere to these rules.

[1] http://regexr.com?312s9
[2] http://www.w3schools.com/xml/xml_elements.asp

(81) Warning: w3schools should not be treated as an authoritative or reliable reference (ref). Anyway, the rules you listed only apply to the names of elements and attributes; attribute values are much more flexible. You might get away with disallowing > (which is legal but rarely used), but imagine an HREF attribute with no slashes! ;) - Alan Moore
(7) This expression will work for many element names, however, the XML spec uses letter in the Unicode sense. There are legitimate element names which this won't match. - JamieSee
@AlanMoore href attribute with no slashes: href="some_other_page.html" - Solomon Ucko
31
[+8] [2020-06-04 02:20:16] b7kich

Here's a PCRE [1] regular expression for XML/XHTML, built from a simplified EBNF [2] syntax definition:

/
(?(DEFINE)
(?<tag> (?&tagempty) | (?&tagopen) ((?&textnode) | (?&tag) | (?&comment))* (?&tagclose))
(?<tagunnested> (?&tagempty) | (?&tagopen) ((?&textnode) | (?&comment))* (?&tagclose))
(?<textnode> [^<>]+)
(?<comment> <!--([\s\S]*?)-->)
(?<tagopen> < (?&tagname) (?&attrlist)? (?&ws)* >)
(?<tagempty> < (?&tagname) (?&ws)* (?&attrlist)? (?&ws)* \/>)
(?<tagclose> <\/ (?&tagname) (?&ws)* >)
(?<attrlist> ((?&ws)+ (?&attr))+)
(?<attr> (?&attrunquoted) | (?&attrsinglequoted) | (?&attrdoublequoted) | (?&attrempty))
(?<attrempty> (?&attrname))
(?<attrunquoted> (?&attrname) (?&ws)* = (?&ws)* (?&attrunquotedvalue))
(?<attrsinglequoted> (?&attrname) (?&ws)* = (?&ws)* ' (?&attrsinglequotedvalue) ')
(?<attrdoublequoted> (?&attrname) (?&ws)* = (?&ws)* " (?&attrdoublequotedvalue) ")
(?<tagname> (?&alphabets) ((?&alphabets) | (?&digits))*)
(?<attrname>(?&alphabets)+((?&alphabets)|(?&digits)|[:-])* )
(?<attrunquotedvalue> [^\s"'=<>`]+)
(?<attrsinglequotedvalue> [^']+)
(?<attrdoublequotedvalue> [^"]+)
(?<alphabets> [a-zA-Z])
(?<digits> [0-9])
(?<ws> \s)
)
(?&tagopen)
/x

This illustrates how to build regular expressions for context-free grammars [3]. You can match other parts of the definition by changing the match on the last line from (?&tagopen) to e.g. (?&tagunnested)

The real question is: Should you do it?

For XML/XHTML the consensus is no!

Credits to nikic [4] for supplying the idea.

[1] https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions
[2] https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
[3] https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html
[4] https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html

32
[+7] [2021-05-10 07:50:00] JacquesB

First, to answer the direct question: Your regex has a bug since it will exclude a tag with a slash anywhere, not just at the end. For example it would exclude this valid opening tag: <a href="foo/bar.html"> because it has a slash in an attribute value.
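
The bug is easy to reproduce. A minimal sketch, assuming Python's re module (the question does not say which engine it targets):

```python
import re

# The pattern from the question.
QUESTION_RE = re.compile(r"<([a-z]+) *[^/]*?>")

# fullmatch() requires the whole string to be one tag.
plain = QUESTION_RE.fullmatch('<a href="foo">')
with_slash = QUESTION_RE.fullmatch('<a href="foo/bar.html">')
```

Here plain is a match object with "a" in group 1, while with_slash is None because [^/] cannot step over the slash inside the attribute value.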

We can fix that, but more seriously, this regex will lead to false positives, because it will also match inside comments and CDATA sections, where the same characters don't represent a valid tag. For example:

<!-- <foo> -->

or

<![CDATA[ <foo> ]]>

HTML strings embedded in scripts are especially likely to trigger false positives, and so is the regular use of < and > as comparison operators in JavaScript. The same goes, of course, for sections of HTML that are commented out with <!-- -->.

So to match only actual tags, you also need to be able to skip past comments and CDATA sections. That means the regex must also match comments and CDATA, but capture only the opening tags. This is still possible using a regexp, but it becomes significantly more complex, for example:

(
    <!-- .*? --> # comment
  | <!\[CDATA\[ .*? \]\]> # CDATA section
  | < \w+ ( " [^"]* " | ' [^']* ' | [^>/'"] )* /> # self-closing tag
  | (?<tag> < \w+ ( " [^"]* " | ' [^']* ' | [^>/'"] )* > ) # opening tag - captured
  | </ \w+ \s* > # end tag
)

And this is just for XHTML conforming to the HTML compatibility guidelines. If you want to handle arbitrary XHTML you should also handle processing instructions and internal DTDs, since they can also embed false positives. If you also want to handle HTML there are additional complexities like the <script> tag. And if you also want to handle invalid HTML it gets yet more complex.
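
The tokenizing pattern above can be tried out with a short sketch. It is adapted to Python's re syntax, which is an assumption on my part: the named group becomes (?P<tag>...), and re.VERBOSE and re.DOTALL stand in for the extended and single-line modes the pattern relies on. The sample input is invented.

```python
import re

TOKEN = re.compile(
    r"""
      <!-- .*? -->                                   # comment
    | <!\[CDATA\[ .*? \]\]>                          # CDATA section
    | < \w+ (?: "[^"]*" | '[^']*' | [^>/'"] )* />    # self-closing tag
    | (?P<tag> < \w+ (?: "[^"]*" | '[^']*' | [^>/'"] )* > )  # opening tag, captured
    | </ \w+ \s* >                                   # end tag
    """,
    re.VERBOSE | re.DOTALL,
)

sample = (
    '<!-- <foo> --><![CDATA[ <bar> ]]>'
    '<p><a href="x/y.html" title="5>3"><br /></a></p>'
)

# Every token is matched, but only opening tags populate the "tag" group.
open_tags = [m.group("tag") for m in TOKEN.finditer(sample) if m.group("tag")]
```

Only <p> and the <a ...> tag end up in open_tags; the <foo> inside the comment and the <bar> inside the CDATA section are consumed by the earlier alternatives and never reported.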

Given the complexity, I would not recommend going down that road. Instead, look for an off-the-shelf (X)HTML parsing library which can solve your problem.

A parser typically uses regular expressions (or similar) under the hood to split the document into "tokens" (doctype, start tags, end tags, text content etc.). But someone else will have debugged and tested these regexes for you! Depending on the type of parser it may further build a tree structure of elements by matching start tags to end tags. This will almost certainly save you a lot of time.

The exact parser library to use depends on your language and platform and the task you are solving. If you need access to the actual tag substrings (e.g. if you are writing a syntax highlighter for HTML) you need to use a SAX-style parser which exposes the syntax tokens directly.
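
A minimal sketch of the token-level approach, assuming Python and its standard-library html.parser module (an illustrative choice, not one this answer prescribes). The class name OpenTagCollector is hypothetical. It records the source text of each start tag and ignores XHTML-style self-closing tags:

```python
from html.parser import HTMLParser


class OpenTagCollector(HTMLParser):
    """Collect the source text of start tags, skipping self-closing ones."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as written in the input.
        self.open_tags.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style tags such as <br />; deliberately ignored.
        pass


collector = OpenTagCollector()
collector.feed(
    '<!-- <foo> --><p><a href="foo/bar.html" title="5>3">x</a>'
    '<br /><hr class="foo" /></p>'
)
collector.close()
```

After this runs, collector.open_tags holds the <p> tag and the full <a ...> tag; the commented-out <foo> and the two self-closing tags are not reported.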

If you are only performing the tag-matching in order to manually build a syntax tree of elements, then a DOM parser does this work for you. But a DOM parser does not expose the underlying tag syntax, so does not solve the exact problem you describe.

You should also consider whether you need to parse invalid HTML. This is a much more complex task, but on the wild web most HTML is actually invalid. Something like Python's html5lib can parse invalid HTML.


33
[+3] [2020-10-01 18:55:20] user13843220

RegEx match open tags except XHTML self-contained tags
All other tags (and content) are skipped.


This regex does that. If you need to match only specific Open tags, make a list
in an alternation (?:p|br|<whatever tags you want>) and replace the [\w:]+ construct
in the appropriate place below.

<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>)(*SKIP)(*FAIL))|(?:[\w:]+\b(?=((?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)*)>)\2(?<!/))|(?:(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))(*SKIP)(*FAIL))>

https://regex101.com/r/uMvJn0/1

 # Mix html/xml     
 # https://regex101.com/r/uMvJn0/1     
 
 <
 (?:
    
    # Invisible content gets failed
    
    (?:
       (?:
                               # Invisible content; end tag req'd
          (                    # (1 start)
             script
           | style
           | object
           | embed
           | applet
           | noframes
           | noscript
           | noembed 
          )                    # (1 end)
          (?:
             \s+ 
             (?>
                " [\S\s]*? "
              | ' [\S\s]*? '
              | (?:
                   (?! /> )
                   [^>] 
                )?
             )+
          )?
          \s* >
       )
       
       [\S\s]*? </ \1 \s* 
       (?= > )
       (*SKIP)(*FAIL)
    )
    
  | 
    
    # This is any open html tag we will match
    
    (?:
       [\w:]+ \b 
       (?=
          (                    # (2 start)
             (?:
                " [\S\s]*? " 
              | ' [\S\s]*? ' 
              | [^>]? 
             )*
          )                    # (2 end)
          >
       )
       \2 
       (?<! / )
    )
    
  | 
    # All other tags get failed
    
    (?:
       (?: /? [\w:]+ \s* /? )
     | (?:
          [\w:]+ 
          \s+ 
          (?:
             " [\S\s]*? " 
           | ' [\S\s]*? ' 
           | [^>]? 
          )+
          \s* /?
       )
     | \? [\S\s]*? \?
     | (?:
          !
          (?:
             (?: DOCTYPE [\S\s]*? )
           | (?: \[CDATA\[ [\S\s]*? \]\] )
           | (?: -- [\S\s]*? -- )
           | (?: ATTLIST [\S\s]*? )
           | (?: ENTITY [\S\s]*? )
           | (?: ELEMENT [\S\s]*? )
          )
       )
    )
    (*SKIP)(*FAIL)
 )
 >

34
[+2] [2022-12-27 11:39:21] Ahmed Kolsi
<([a-z][^>\s]*)(?:\s+[^>]+)?>

This regular expression matches tags whose name starts with a lowercase letter (e.g. <p>, <a>, etc.), optionally followed by whitespace and any run of characters other than > before the closing > character. It therefore matches tags with attributes, and it will not match tags whose names start with a character other than a-z. Note, however, that it does not exclude self-closing tags: [^>] also accepts /, so <br /> and <hr class="foo" /> match as well.


35
[-1] [2024-02-12 17:24:17] Charlotte Briggs

Your regular expression is mostly correct. However, there's a small adjustment that makes it more accurate:

<([a-z]+)(?:\s[^\/]*?)?>

Explanation:

'<': Matches the opening '<'.

'([a-z]+)': Captures one or more lowercase letters.

'(?:\s[^/]*?)?': Non-capturing group for optional whitespace ('\s') followed by any character except / (zero or more times, non-greedy), making sure not to match self-closing tags. The (?: ... ) is a non-capturing group.

Your understanding is correct, and the adjustment ensures better accuracy for matching your desired opening tags.


36
[-3] [2023-08-27 12:32:46] Ehsan

To match open tags (start tags) except XHTML self-contained tags, you can use the following regular expression:

<[^/][^>]*>
  1. <: Matches the opening angle bracket.
  2. [^/]: Matches any character except the forward slash /, ensuring the tag is not a closing tag.
  3. [^>]*: Matches zero or more characters, not the closing angle bracket >, allowing any attributes to be present.
  4. >: Matches the closing angle bracket, completing the tag.

This regular expression is wrong. It does exclude normal closing XML tags, but also matches self-closing tags. For example, this regex will successfully match against <tag/>, because the < matches, then [^/] matches the t, then [^>]* matches ag/ and finally > matches, so the overall match is successful for a self-closing tag. To match just opening tags and ignore closing and self-closing tags, here is a simple regex that mostly works correctly: <[^\/>]+> (However, this won't match valid opening XML tags with attribute values containing literal / or > characters.) - Deven T. Corzine
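
The difference between the two patterns in this comment can be seen in a small sketch, assuming Python's re module; the sample tags are made up.

```python
import re

ANSWER_RE = re.compile(r"<[^/][^>]*>")   # pattern from the answer
STRICTER_RE = re.compile(r"<[^\/>]+>")   # pattern suggested in the comment

samples = ["<tag>", "</tag>", "<tag/>", '<a href="x/y">']

answer_hits = [s for s in samples if ANSWER_RE.fullmatch(s)]
stricter_hits = [s for s in samples if STRICTER_RE.fullmatch(s)]
```

The answer's pattern accepts <tag/> along with the two opening tags, while the stricter one accepts only <tag> and, as the comment warns, also rejects the opening tag whose attribute value contains a slash.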
37