A distant voice says "perldoc"

japhy — Wed, 06 Dec 2000 05:10:45 +0000

All regex metacharacters are explained in the perlre documentation.

. matches any single character (except for \n, but that can be changed by adding the /s modifier to the regex)

* matches 0 or more occurrences of the preceding pattern, and prefers as many occurrences as possible

? matches 0 or 1 occurrence of the preceding pattern, and prefers 1 occurrence

If you have perl installed, you have perldoc, and can read the perlre documentation by typing perldoc perlre at your nearest shell. Or, read the documentation online, at http://www.perldoc.com/.
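
To make the three descriptions above concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration and are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "ab\ncd";

# Without /s, . does not match \n, so .* stops at the end of the first line.
my ($no_s) = $str =~ /(.*)/;

# With /s, . also matches \n, so .* runs to the end of the string.
my ($with_s) = $str =~ /(.*)/s;

# * is greedy: a* takes every "a" it can get.
my ($stars) = "aaab" =~ /(a*)/;

# ? is greedy too: the optional "u" is taken when it is present.
my ($with_u)    = "colour" =~ /(colou?r)/;
my ($without_u) = "color"  =~ /(colou?r)/;

print "no /s:   [$no_s]\n";
print "with /s: [$with_s]\n";
print "a*:      [$stars]\n";
print "colou?r: [$with_u] [$without_u]\n";
```

The first capture is "ab", the second is "ab\ncd", and the third is "aaa". Both spellings match in the last two lines because the "u" is optional.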

Mark Hensler — Wed, 06 Dec 2000 05:01:32 +0000

uh huh...

so can someone tell me what all these do...
. * ?

I always thought . was one or more characters, * was zero or more characters, and ? was one character. But I'm probably wrong. I may be remembering that from the DOS days...

.* is GREEDY

japhy — Wed, 06 Dec 2000 03:39:54 +0000

Max - try that out first. First, you need the /s modifier in there to allow . to match newlines. Second, if you use it:

#!/usr/bin/perl

<< "HTML" =~ m!<BODY(.*)>(.*)</BODY>!is;
<html>
<body>
<b>Hello world!</b> What's up?
</body>   
</html>
HTML

print $2;

then you'll get too little content in $2. It will print " What's up?\n", instead of "\n<b>Hello world!</b> What's up?\n". This is because .* is greedy: it will match AS MUCH AS POSSIBLE while still allowing the overall match to succeed.

As it is, the first .* matches from the ">" at the end of "<body>" to the "b" in "</b>". Then the ">" matches, and then you match " What's up?\n" in the second (.*).

Greediness can bite you. And in cases like this, it's IMPERATIVE to use a properly formed tokenizer.
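
As a rough sketch of the greedy behaviour described above, the snippet below runs the same pattern against the same HTML and then tries a tightened variant. The variant, with [^>]* for the tag's attributes and a non-greedy .*? for the body, is an assumption added for illustration and is not something the thread proposes. It still is not a parser: an attribute value containing ">" would break it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = <<"HTML";
<html>
<body>
<b>Hello world!</b> What's up?
</body>
</html>
HTML

# Greedy version: the first (.*) runs on to the last ">" that still
# lets the rest of the pattern match, which is the one closing </b>.
my ($attr_greedy, $body_greedy) = $html =~ m!<BODY(.*)>(.*)</BODY>!is;

# Tightened version (illustrative assumption): [^>]* cannot leave the
# opening tag, and .*? stops at the first </BODY>.
my ($attr, $body) = $html =~ m!<BODY([^>]*)>(.*?)</BODY>!is;

print "greedy \$1:    [$attr_greedy]\n";
print "greedy \$2:    [$body_greedy]\n";
print "tightened \$2: [$body]\n";
```

With the greedy pattern, $1 ends up holding ">\n<b>Hello world!</b" and $2 holds only " What's up?\n". The tightened pattern leaves $1 empty and captures the whole body in $2.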

Mark Hensler — Wed, 06 Dec 2000 03:09:28 +0000

Will there ever be anything in the <body> tag? This should work:

$html =~ /<BODY(.*)>(.*)<\/BODY>/i;
$text = $2;

Andrew — Tue, 05 Dec 2000 20:05:21 +0000

Thank you both Rob and japhy.

I did look at the HTML::Parser module but unfortunately I could not figure out how to use it properly. (The documentation is vague AND I'm a horrible Perl programmer...a bad combination.)

I ended up using something from the Perl Cookbook which is very similar to what Rob wrote:

($text) = ($html =~ m#<body>\s*(.*?)\s*</body>#is);

Luckily for me, the HTML I get back is always simple, so it works.

Thanks again!
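
A small sketch of how a Cookbook-style extraction like this behaves, assuming the pattern wraps its capture in <body>...</body>. The sample HTML is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = "<html>\n<body>\n  i need this\n</body>\n</html>\n";

# In list context the match returns its captured groups, so $text
# receives $1 directly. The \s* on either side keeps leading and
# trailing whitespace out of the capture.
my ($text) = $html =~ m#<body>\s*(.*?)\s*</body>#is;

# $text is undef when the pattern does not match, so check before use.
print defined $text ? "[$text]\n" : "no <body> found\n";
```

This prints "[i need this]". As noted in the post, it is only reliable when the incoming HTML is simple and the <body> tag carries no attributes.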

Parsing HTML is not easy

japhy — Tue, 05 Dec 2000 19:53:49 +0000

Technically, to properly parse HTML, you need some sort of tokenizer. Take, for instance, this regex, which supposedly removes all HTML tags:

$text =~ s/<.*?>//sg;

This breaks in several places:

<!-- comment out the <hr> tag -->

<img src="arrow.gif" alt="-->">

if X < 1 + Y, then Z > 2 - X

Therefore, you need a module that can properly break things down. There is such a thing, HTML::Parser, on CPAN. I've also developed one, called YAPE::HTML, that will be available very soon. Look into using modules for this type of thing.
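
To see the breakage concretely, here is a minimal sketch that runs the tag-stripping substitution over the three inputs listed above. The @samples and @stripped names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @samples = (
    '<!-- comment out the <hr> tag -->',
    '<img src="arrow.gif" alt="-->">',
    'if X < 1 + Y, then Z > 2 - X',
);

my @stripped;
for my $sample (@samples) {
    # Copy first so the original string is left untouched.
    (my $text = $sample) =~ s/<.*?>//sg;
    push @stripped, $text;
    print "[$sample] => [$text]\n";
}
```

The comment leaves " tag -->" behind, the img tag leaves '">', and the plain-text inequality loses "< 1 + Y, then Z >" even though it contains no tags at all.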

Rob Pengelly — Mon, 04 Dec 2000 22:55:30 +0000

Use a regular expression to do what you want. Run
perldoc perlre
for more information about regular expressions.

my $html = "<html><body>i need this</body></html>";
# Only use $1 if the match actually succeeded.
if ($html =~ m/<body>(.+?)<\/body>/i) {
	print $1;
}

output:
i need this

Hope that helps

[Edited by Rob Pengelly on Dec. 04, 2000 at 06:29 PM]