<?xml version="1.0" encoding="utf-8" ?><rss version="2.0" xml:base="https://www.webmaster-forums.net/crss/node/1012956" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title></title>
    <link>https://www.webmaster-forums.net/crss/node/1012956</link>
    <description></description>
    <language>en</language>
          <item>
    <title>A distant voice says &quot;perldoc&quot;</title>
    <link>https://www.webmaster-forums.net/serverside-scripting/parsing-htlml#comment-1075225</link>
    <description> &lt;p&gt;All regex metacharacters are explained in the &lt;strong&gt;perlre&lt;/strong&gt; documentation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;.&lt;/strong&gt; matches any character (except for &lt;strong&gt;\n&lt;/strong&gt;, but that can be changed by using the &lt;strong&gt;/s&lt;/strong&gt; modifier to a regex)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;*&lt;/strong&gt; matches 0 or more occurrences of the preceding pattern, and is greedy: it takes as many occurrences as it can&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;?&lt;/strong&gt; matches 0 or 1 occurrence of the preceding pattern, and prefers 1 occurrence when both would match&lt;/p&gt;
&lt;p&gt;If you have &lt;strong&gt;perl&lt;/strong&gt; installed, you have &lt;strong&gt;perldoc&lt;/strong&gt;, and can read the &lt;strong&gt;perlre&lt;/strong&gt; documentation by typing &lt;strong&gt;perldoc perlre&lt;/strong&gt; at your nearest shell.  Or, read the documentation online, at &lt;a href=&quot;http://www.perldoc.com/&quot; class=&quot;bb-url&quot;&gt;http://www.perldoc.com/&lt;/a&gt;.&lt;/p&gt;
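&lt;p&gt;A minimal sketch of those three metacharacters in action, on made-up strings (the inputs and variable names are only for illustration, they are not from perlre):&lt;/p&gt;

```perl
use strict;
use warnings;

# . does not match "\n" unless the /s modifier is given
my $dot_plain = ("a\nb" =~ /a.b/)  ? 1 : 0;   # 0
my $dot_s     = ("a\nb" =~ /a.b/s) ? 1 : 0;   # 1

# * is greedy: it takes as many occurrences as it can
my ($star) = "aaab" =~ /(a*)/;                # "aaa"

# ? matches 0 or 1 occurrence, and prefers 1 when it is available
my ($opt)      = "ab" =~ /(a?)/;              # "a"
my ($opt_none) = "b"  =~ /(a?)/;              # "" (zero occurrences)

print "dot: $dot_plain, dot with /s: $dot_s\n";
print "star: '$star', optional: '$opt' and '$opt_none'\n";
```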
 </description>
     <pubDate>Wed, 06 Dec 2000 05:10:45 +0000</pubDate>
 <dc:creator>japhy</dc:creator>
 <guid isPermaLink="false">comment 1075225 at https://www.webmaster-forums.net</guid>
  </item>
  <item>
    <title></title>
    <link>https://www.webmaster-forums.net/serverside-scripting/parsing-htlml#comment-1075223</link>
    <description> &lt;p&gt;uh huh...&lt;/p&gt;
&lt;p&gt;so can someone tell me what all these do...&lt;br /&gt;
&lt;strong&gt;.&lt;/strong&gt;  &lt;strong&gt;*&lt;/strong&gt;  &lt;strong&gt;?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I always thought &lt;strong&gt;.&lt;/strong&gt; was one or more characters, &lt;strong&gt;*&lt;/strong&gt; was zero or more characters, and &lt;strong&gt;?&lt;/strong&gt; was one character.  But I&#039;m probably wrong.  I may be remembering that from the DOS days...&lt;/p&gt;
 </description>
     <pubDate>Wed, 06 Dec 2000 05:01:32 +0000</pubDate>
 <dc:creator>Mark Hensler</dc:creator>
 <guid isPermaLink="false">comment 1075223 at https://www.webmaster-forums.net</guid>
  </item>
  <item>
    <title>.* is GREEDY</title>
    <link>https://www.webmaster-forums.net/serverside-scripting/parsing-htlml#comment-1075215</link>
    <description> &lt;p&gt;Max - try that out first.  First, you need the &lt;strong&gt;/s&lt;/strong&gt; modifier in there to allow &lt;strong&gt;.&lt;/strong&gt; to match newlines.  Second, if you use it:&lt;/p&gt;
&lt;p&gt;&lt;div class=&quot;codeblock&quot;&gt;&lt;code&gt;#!/usr/bin/perl&lt;br /&gt;&lt;br /&gt;&amp;lt;&amp;lt; &amp;quot;HTML&amp;quot; =~ m!&amp;lt;BODY(.*)&amp;gt;(.*)&amp;lt;/BODY&amp;gt;!is;&lt;br /&gt;&amp;lt;html&amp;gt;&lt;br /&gt;&amp;lt;body&amp;gt;&lt;br /&gt;&amp;lt;b&amp;gt;Hello world!&amp;lt;/b&amp;gt; What&amp;#039;s up?&lt;br /&gt;&amp;lt;/body&amp;gt;&amp;nbsp;&amp;nbsp; &lt;br /&gt;&amp;lt;/html&amp;gt;&lt;br /&gt;HTML&lt;br /&gt;&lt;br /&gt;print $2;&lt;/code&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;then you&#039;ll get too little content in &lt;strong&gt;$2&lt;/strong&gt;.  It will print &quot; What&#039;s up?\n&quot;, instead of &quot;\n&amp;lt;b&amp;gt;Hello world!&amp;lt;/b&amp;gt; What&#039;s up?\n&quot;.  This is because the &lt;strong&gt;.*&lt;/strong&gt; is greedy, and will match AS MUCH AS POSSIBLE while still allowing a valid match.&lt;/p&gt;
&lt;p&gt;As it is, the first &lt;strong&gt;.*&lt;/strong&gt; matches from the &quot;&amp;gt;&quot; at the end of &quot;&amp;lt;body&amp;gt;&quot; to the &quot;b&quot; in &quot;&amp;lt;/b&amp;gt;&quot;.  Then the &quot;&amp;gt;&quot; matches, and then you match &quot; What&#039;s up?\n&quot; in the second &lt;strong&gt;(.*)&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Greediness can bite you.  And in cases like this, it&#039;s IMPERATIVE to use a properly formed tokenizer.&lt;/p&gt;
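&lt;p&gt;To see the same backtracking on a shorter string, here is a rough sketch with a made-up input whose only delimiter is &quot;&amp;gt;&quot; (the input and variable names are purely illustrative):&lt;/p&gt;

```perl
use strict;
use warnings;

my $str = "one>two>three";

# greedy: the first (.*) runs to the LAST ">" that still lets the match succeed
my ($g1, $g2) = $str =~ /^(.*)>(.*)$/;        # "one>two", "three"

# non-greedy: (.*?) stops at the FIRST ">"
my ($n1, $n2) = $str =~ /^(.*?)>(.*)$/;       # "one", "two>three"

# negated class: [^>]* cannot cross a ">" at all
my ($c1, $c2) = $str =~ /^([^>]*)>(.*)$/;     # "one", "two>three"

print "greedy:     [$g1] [$g2]\n";
print "non-greedy: [$n1] [$n2]\n";
print "negated:    [$c1] [$c2]\n";
```

&lt;p&gt;The non-greedy and negated-class forms fix this particular surprise, but neither one turns a regex into a real HTML tokenizer.&lt;/p&gt;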
 </description>
     <pubDate>Wed, 06 Dec 2000 03:39:54 +0000</pubDate>
 <dc:creator>japhy</dc:creator>
 <guid isPermaLink="false">comment 1075215 at https://www.webmaster-forums.net</guid>
  </item>
  <item>
    <title></title>
    <link>https://www.webmaster-forums.net/serverside-scripting/parsing-htlml#comment-1075213</link>
    <description> &lt;p&gt;Will there ever be anything in the &amp;lt;body&amp;gt; tag?  This should work:&lt;/p&gt;
&lt;p&gt;$html =~ /&amp;lt;body(.*)&amp;gt;(.*)&amp;lt;\/body&amp;gt;/i;&lt;br /&gt;
$text = $2;&lt;/p&gt;
 </description>
     <pubDate>Wed, 06 Dec 2000 03:09:28 +0000</pubDate>
 <dc:creator>Mark Hensler</dc:creator>
 <guid isPermaLink="false">comment 1075213 at https://www.webmaster-forums.net</guid>
  </item>
  <item>
    <title></title>
    <link>https://www.webmaster-forums.net/serverside-scripting/parsing-htlml#comment-1075185</link>
    <description> &lt;p&gt;Thank you both Rob and japhy.&lt;/p&gt;
&lt;p&gt;I did look at the HTML::Parser module but unfortunately I could not figure out how to use it properly. (The documentation is vague AND I&#039;m a horrible Perl programmer...a bad combination.)&lt;/p&gt;
&lt;p&gt;I ended up using something from the Perl Cookbook which is very similar to what Rob wrote:&lt;/p&gt;
&lt;p&gt;($text) = ($html =~ m#&amp;lt;body&amp;gt;\s*(.*?)\s*&amp;lt;/body&amp;gt;#is);&lt;/p&gt;
&lt;p&gt;Luckily for me, the HTML I get back is always simple, so it works.&lt;/p&gt;
&lt;p&gt;Thanks again!&lt;/p&gt;
 </description>
     <pubDate>Tue, 05 Dec 2000 20:05:21 +0000</pubDate>
 <dc:creator>Andrew</dc:creator>
 <guid isPermaLink="false">comment 1075185 at https://www.webmaster-forums.net</guid>
  </item>
  <item>
    <title>Parsing HTML is not easy</title>
    <link>https://www.webmaster-forums.net/serverside-scripting/parsing-htlml#comment-1075183</link>
    <description> &lt;p&gt;Technically, to properly parse HTML, you need some sort of tokenizer.  Take, for instance, this regex, which supposedly removes all HTML tags:&lt;/p&gt;
&lt;p&gt;&lt;div class=&quot;codeblock&quot;&gt;&lt;code&gt;$text =~ s/&amp;lt;.*?&amp;gt;//sg;&lt;/code&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;This breaks in several places:&lt;/p&gt;
&lt;p&gt;&lt;div class=&quot;codeblock&quot;&gt;&lt;code&gt;&amp;lt;!-- comment out the &amp;lt;hr&amp;gt; tag --&amp;gt;&lt;br /&gt;&lt;br /&gt;&amp;lt;img src=&amp;quot;arrow.gif&amp;quot; alt=&amp;quot;--&amp;gt;&amp;quot;&amp;gt;&lt;br /&gt;&lt;br /&gt;if X &amp;lt; 1 + Y, then Z &amp;gt; 2 - X&lt;/code&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Therefore, you need a module that can properly break things down.  There is such a thing, HTML::Parser, on CPAN.  I&#039;ve also developed one, called YAPE::HTML, that will be available very soon.  Look into using modules for this type of thing.&lt;/p&gt;
 </description>
     <pubDate>Tue, 05 Dec 2000 19:53:49 +0000</pubDate>
 <dc:creator>japhy</dc:creator>
 <guid isPermaLink="false">comment 1075183 at https://www.webmaster-forums.net</guid>
  </item>
  <item>
    <title></title>
    <link>https://www.webmaster-forums.net/serverside-scripting/parsing-htlml#comment-1075118</link>
    <description> &lt;p&gt;Use a regular expression to do what you want.&lt;br /&gt;
Run &lt;strong&gt;perldoc perlre&lt;/strong&gt; for more information about regular expressions.&lt;br /&gt;
&lt;div class=&quot;codeblock&quot;&gt;&lt;code&gt;my $html = &amp;quot;&amp;lt;html&amp;gt;&amp;lt;body&amp;gt;i need this&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;&amp;quot;;&lt;br /&gt;# only use $1 if the match succeeded&lt;br /&gt;if ($html =~ m/&amp;lt;body&amp;gt;(.+?)&amp;lt;\/body&amp;gt;/i) {&lt;br /&gt;	print $1;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;output:&lt;br /&gt;i need this&lt;/code&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Hope that helps&lt;/p&gt;
&lt;p&gt;[Edited by Rob Pengelly on Dec. 04, 2000 at 06:29 PM]&lt;/p&gt;
 </description>
     <pubDate>Mon, 04 Dec 2000 22:55:30 +0000</pubDate>
 <dc:creator>Rob Pengelly</dc:creator>
 <guid isPermaLink="false">comment 1075118 at https://www.webmaster-forums.net</guid>
  </item>
  </channel>
</rss>
