Extract the content between all of certain HTML tags and put in array

They have: 426 posts

Joined: Feb 2005

What I want to do is fetch a web page in PHP with file_get_contents('www.http.com'), parse the resulting string for a certain tag, take the content inside that tag (between the opening and closing tags), and put that content into an array.

I want to do this for all divs in the string.

I don't know regex, but I found this and it doesn't work. Perhaps one of you experts can explain why it doesn't work and what I can change to get it working?

"^(.*)(<[ \\n\\r\\t]*$tagname(>|[^>]*>))(.*)(<[ \\n\\r\\t]*/[ \\n\\r\\t]*$tagname(>|[^>]*>))(.*)$"

He has: 698 posts

Joined: Jul 2005

Is the variable 'tagname' defined elsewhere in the code? I'm guessing it holds the "certain tag" you were referring to?

Anyway, your expression is confusing me a bit. I'm certainly no expert (and I know you didn't write the expression), but why does it search for either a ">" or any number of non-">" characters followed by a ">"? Wouldn't the second alternative alone match the same thing, since it also matches a bare ">"?

...(>|[^>]*>)...

Also, maybe I'm not seeing exactly what you want, but I don't see the reasoning behind all the line feeds, tabs, etc...

Here's what I would use, though I haven't tested it completely:

"/^(.*)(<$tagname([^>]*>))(.*)(<\/$tagname>)(.*)$/"

Edit:
Thinking over it again, if you're just looking for the data between the tags, this would probably be a better expression:

"/^.*<$tagname[^>]*>(.*)<\/$tagname>.*$/"

As far as I understand, anything in parentheses is captured when the expression is matched...
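To get the content of every matching tag into an array, something like the following might work. It is only a minimal sketch, assuming PCRE's preg_match_all() is used instead of a single anchored match. The anchored pattern above is greedy and yields at most one capture, so this version uses a lazy (.*?) and the "s" modifier so that "." can span newlines. The function name extract_tag_contents and the sample string are made up for illustration, and nested tags of the same name are not handled.

```php
<?php
// Hypothetical helper (not from the thread): collect the inner content of
// every <$tagname>...</$tagname> pair found in $html.
function extract_tag_contents(string $html, string $tagname): array
{
    // preg_quote() escapes regex metacharacters in the tag name,
    // including the "/" used as the pattern delimiter.
    $tag = preg_quote($tagname, '/');

    // \b keeps "div" from also matching "<divider>".
    // (.*?) is lazy, so each match stops at the nearest closing tag.
    // "s" lets "." match newlines; "i" makes the tag name case-insensitive.
    $pattern = "/<{$tag}\\b[^>]*>(.*?)<\\/{$tag}\\s*>/si";

    // $matches[1] holds the first capture group of every match.
    preg_match_all($pattern, $html, $matches);

    return $matches[1];
}

// Sample input, assumed for illustration only.
$html = "<div class=\"a\">first</div>\n<p>skip</p>\n<div>second\nline</div>";
$contents = extract_tag_contents($html, 'div');
print_r($contents);
```

In real use, $html would be the string returned by file_get_contents(), which needs a full URL including the scheme (http:// or https://) to fetch a remote page.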

Kurtis

Abhishek Reddy

He has: 3,348 posts

Joined: Jul 2001

Is the markup likely to be well-formed? If so, you can run it through a suitable HTML or XML parser. Traverse the resulting DOM object to access the nodes you're interested in. Look at the SimpleXML extension for a start.
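As a rough illustration of the parser route, here is a minimal sketch using SimpleXML. It assumes the markup is well-formed XML/XHTML; the sample string and variable names are made up for the example. For a real page, the string would come from file_get_contents() instead.

```php
<?php
// Sample input, assumed for illustration; SimpleXML only loads well-formed markup.
$markup = '<body><div>first</div><p>skip</p><div id="b">second</div></body>';

$xml = simplexml_load_string($markup);
if ($xml === false) {
    // simplexml_load_string() returns false when the markup cannot be parsed.
    exit("Markup is not well-formed\n");
}

$divs = [];
// The XPath expression //div selects every <div> element in the document.
foreach ($xml->xpath('//div') as $div) {
    // Casting an element to string gives its own text, not nested child markup.
    $divs[] = (string) $div;
}

print_r($divs);
```

If the divs contain nested elements and the full inner markup is needed, calling asXML() on each element may be a better fit than the string cast.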

For malformed markup, a regular expression may be best. But if it's feasible, a simpler expression that is more specific to the document will be easier to maintain and should work well enough. Very general regexps like the one you've posted will be hard to debug and to adapt to changing inputs.
