Wil — Fri, 29 Aug 2003 07:42:24 +0000

Or try it free online from Safari for 30 days (the online Oreilly bookshop).

Mark Hensler — Thu, 28 Aug 2003 19:06:00 +0000

Alright, thanks for the help!

Prospero — Thu, 28 Aug 2003 08:39:37 +0000

Quote: Originally posted by Mark Hensler
The mechanics happens to be exactly what I'm interested in.

I've taken several programming courses (in college), but they don't teach much farther than the language's syntax and grammar. Assembly (i386+) came the closest to "How's that work?", but still left me wondering about higher level languages. Regular Expressions are a whole other animal, and I'm quite interested to know what makes them tick.

Do you have any online resources you could refer me to?

I don't, offhand, know of online resources which give a good introduction to the mechanics. The information is certianly out there, but it's usually kind of raw. If I run across something, I'll let you know.

On the other hand, I can easily refer you to a printed resource: chapter 4 of 'Mastering Regular Expressions', from O'Reilly & Associates. Great introduction.

Mark Hensler — Thu, 28 Aug 2003 08:11:22 +0000

The mechanics happens to be exactly what I'm interested in.

I've taken several programming courses (in college), but they don't teach much farther than the language's syntax and grammar. Assembly (i386+) came the closest to "How's that work?", but still left me wondering about higher level languages. Regular Expressions are a whole other animal, and I'm quite interested to know what makes them tick.

Do you have any online resources you could refer me to?

Prospero — Thu, 28 Aug 2003 07:56:04 +0000

You know, I'm feeling more and more squeamish about this explanation... so I wanted to say again, don't take the details here too seriously. Particularly the mechanical details, which, I hope, will get across a sense of how this sort of thing appears to work.

The model we're using to understand things actually describes what are called DFA regular expressions. It's a simpler model, which makes it nice for explanations like these. The problem is, Perl doesn't actually use DFA, but the other kind, NFA regular expressions. And technically, they're modified NFA, so they're not really 'regular' expressions at all.

You can probably see why I was loathe to go into this. Oversimplification is sometimes useful. However I'd hate to feel responsible for spreading misinformation, so I'm adding a caveat. This isn't a story you want to rely upon too closely.

And don't worry too much about this stuff, either. It's not that important for 99% of what you need in Perl.

Prospero — Thu, 28 Aug 2003 04:47:00 +0000

Quote: Originally posted by Mark Hensler
So would this be correct...

Because the pattern matched. The search was continued at the end of the last successfull match within the string.

Had the match failed, the search would have resumed at the letter after the beging of the last failed match within the string.

"3rd place" ^ first pass starts here and fails, as "3" is not \w ------------------------------------------------------------------ "3rd place" ^ second pass starts here and suceeds, as "rd" matches \w{1,3} ------------------------------------------------------------------ "3rd place" ^ third pass starts here and fails, as " " is not \w ------------------------------------------------------------------ "3rd place" ^ fourth pass starts here and suceeds, as "place" matches \w{1,3}\w*
'

That's pretty close. Actually, though, there's only one pass. The engine simply goes along one character at a time, moving to different states as it goes. When it gets to the end, it's done.

Let's say our pattern was /ri\w/g, and our string is "3rd prize".

"3rd prize"
 ^
Not a possible match. Discard.

"3rd prize"
  ^
A possible match. Keep the 'r', move on.

"3rd prize"
   ^
No longer a possible match. Discard the 'r', move on.

"3rd prize"
    ^
Not a possible match. Discard.

"3rd prize"
     ^
Not a possible match. Discard.

"3rd prize"
      ^
A possible match. Keep the 'r', move on.

"3rd prize"
       ^
A possible match. Keep the 'ri', move on.

"3rd prize"
        ^
A match. Keep the 'riz', move on, since /g is in effect.

"3rd prize"
         ^
Not a possible match. Discard.

But you've essentially got the idea, I think.

Another, less mechanical way to think about it is this: each character in the string is only going to be part of a single pattern match. Hence, if the 'r' in 'George' was part of the character sequence matched by (\w{1,3})\w* (which it was, since the \w* is greedy), it won't be used in any future matches. That's why I put \w* in there: to prevent the rest of the word from matching the part of the regex I actually care about, (\w{1,3})

Mark Hensler — Thu, 28 Aug 2003 02:46:54 +0000

So would this be correct...

Because the pattern matched. The search was continued at the end of the last successfull match within the string.

Had the match failed, the search would have resumed at the letter after the beging of the last failed match within the string.

"3rd place"
 ^
first pass starts here and fails, as "3" is not \w
------------------------------------------------------------------
"3rd place"
  ^
second pass starts here and suceeds, as "rd" matches \w{1,3}
------------------------------------------------------------------
"3rd place"
    ^ 
third pass starts here and fails, as " " is not \w
------------------------------------------------------------------
"3rd place"
     ^
fourth pass starts here and suceeds, as "place" matches \w{1,3}\w*

Prospero — Thu, 28 Aug 2003 01:40:20 +0000

Quote: Originally posted by Mark Hensler
Yes, welcome!

Alright, I tested Prospero's code, and I see that it works.

I understand why \w* matches up to the end of the word string. But why does \w only match the begining of the word string. Seems like you'd need \W\w to match the begining of a word string.

I guess what's confusing me is that I've recently read about "Once-only subpatterns" -- or "(?>" -- and that's lingering in the back of my mind.

Nothing quite so fancy as that. I'll try to explain it in a few different ways, and hope one will make sense.

Remember, to be satisfied, the string needs to satisfy the whole pattern. The pattern says, "Find instances where between one and three 'word characters' are immediately followed by zero or more word characters, and capture the one-to-three part."

So the pattern actually matches the whole word each time. If it finds between 1 and 3 word consecutive word characters, it will always match... because it can't fail to match the zero-or-more part. That'll match on nothing at all, if it has to, though it will try to match as many characters as it can. Regexes are greedy, that way. Hence it's a lot like saying \w+ That's exactly what we'd say, in fact, if we didn't need to capture the first several letters.

So, say you have a sequence like ' a '. The pattern first looks at the leading space. It realizes that no matter what comes next, a space cannot be the beginning of what it's looking for. So it ignores the space. Next step: look at the 'a'. Is 'a' a possible match? Yep. It matches \w, and also \w{1,3}.

But, it's greedy. It doesn't want 1, when it might be able to get 3. So it looks at the next character: another space. Drat. It's not getting any more than one... but that's enough. It satisfies \w{1,3}.

As it turns out, it also satisfies \w{1,3}\w*. It found one word character, followed immediately by zero word characters. So we have a valid match. Now, because the pattern is actually (\w{1,3})\w*, it 'captures' the \w{1,3} part, and puts it in a buffer.

(Incidentally, this is not a perfectly accurate description of what happens, but it should give you a pretty good model with which to work.)

Now, what if our string had been ' a priori '? Well, the matching for ' a ' went pretty much as described. The regex engine tosses out the second space, of course, because it has no use for it. Then it keeps going, because we used the /g switch, so it's not done yet. It comes upon a 'p'. Is that a possible match? Yep. And if the regex weren't greedy, it'd be satisfied already; 'p' matches (\w{1,3})\w* just as 'a' did.

But its job is to make the match as long as is possible. And good thing, too, or patterns like .* which would match an infinite number of times, and we'd never get anywhere. It's zero or more of anything, after all, and if it keeps stopping at zero, it's no fun.

So, 'p' is a satisfactory beginning. Next it looks and sees 'r', which matches \w. 'pr', too, matches \w{1,3} but isn't the maximum (3), so onward! Now it sees 'i'. 'pri' not only matches \w, it's the longest possible match for \w{1,3}, so there's no sense looking at the next one to see what it is, at this point. That part of the pattern matched.

The pattern, however, calls for \w{1,3} followed immediately by \w*. Well, the nice thing about * is, it's always a match. Whatever it is, you have at least zero of it. But again, * is greedy. So we look at the next character. Turns out it's 'o', and 'o' matches \w. But 'o' isn't the longest possible match... with *, there is no longest possible match. So we keep going. Next comes 'r', which matches \w, but 'or' again isn't the best \w* could hope for, so onward; 'i' matches \w, and we have 'ori'. Next comes a space. Oops. That doesn't match \w, so now our search for that pattern is finished. Stuff \w{1,3}, which is 'pri', in a buffer, and keep going.

What comes after space? The end of the string. So, no more matches, and we're done. Return the buffers, which contain 'a' and 'pri', and go back to our regularly scheduled programming.

Again, this description is subtly wrong, particularly the order in which things happen. But don't worry about that. The order doesn't interest you, just the results. If you ever get into writing parsers, it'll be more interesting to sweat the details.

Now, you were wondering why that pattern didn't match "Geo", "eor", "org", and so on. Think about this: what if our pattern had been .* and our string 'foobar'. Would you expect that to match 'foobar', 'oobar', 'obar', 'bar', 'ar', and 'r'? I wouldn't. The matching engine doesn't find 'f' 'o' 'o' 'b' 'a' 'r', see that it matches .*, and then head back to the first 'o', and try again.

Well, there. Hopefully you're not more confused than when I started.

Mark Hensler — Thu, 28 Aug 2003 00:07:59 +0000

Yes, welcome!

Alright, I tested Prospero's code, and I see that it works.

I understand why \w* matches up to the end of the word string. But why does \w only match the begining of the word string. Seems like you'd need \W\w to match the begining of a word string.

I guess what's confusing me is that I've recently read about "Once-only subpatterns" -- or "(?>" -- and that's lingering in the back of my mind.

Suzanne — Wed, 27 Aug 2003 21:18:36 +0000

welcome Prospero! And thanks for a super explanation.