Monday, March 7, 2011

Difficulty with Simple Regex (match prefix/suffix)

I'm trying to develop a regex that will be used in a C# program.

My initial regex was:

(?<=\()\w+(?=\))

This successfully matches against "(foo)": the open and close parens are required but excluded from the output, so the match is simply "foo".

However, if I modify the regex to:

\[(?<=\()\w+(?=\))\]

and I try to match against "[(foo)]", it fails to match. This is surprising: I'm simply prepending and appending the literal open and close brackets around my previous expression. I'm stumped. I use Expresso to develop and test my expressions.

Thanks in advance for your kind help.

Rob Cecil

From stackoverflow
  • Your look-behind is the problem. Here's how the string is being processed:

    1. We see [ in the string, and it matches the \[ at the start of the regex.
    2. The look-behind then asks whether the character just before the current position is a '('. This fails, because that character is the '[' we just matched.

    At least that's what I would guess is causing the problem.

    Try this regex instead:

    (?<=\[\()\w+(?=\)\])
    
    configurator : My thoughts exactly.
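
The diagnosis above can be checked with a short script. This is only a sketch in Python's `re` module, chosen here for illustration; the question is about C#, and the assumption is that .NET's `Regex` class treats these three patterns the same way. The helper `first_match` is a hypothetical name, not part of any library.

```python
import re

# Patterns taken from the question and from the answer above.
original = r"(?<=\()\w+(?=\))"       # works on "(foo)"
wrapped = r"\[(?<=\()\w+(?=\))\]"    # the failing variant
fixed = r"(?<=\[\()\w+(?=\)\])"      # brackets moved inside the lookarounds


def first_match(pattern, text):
    """Hypothetical helper: return the matched text, or None if no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


print(first_match(original, "(foo)"))   # foo
print(first_match(wrapped, "[(foo)]"))  # None
print(first_match(fixed, "[(foo)]"))    # foo
```

The wrapped pattern fails because, right after `\[` consumes the bracket, the look-behind inspects that same bracket and finds it is not `(`.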
  • Out of context, it is hard to judge, but the look-behind here is probably overkill. Lookarounds are useful to exclude strings (as in strager's example) and in some other special circumstances where simple REs fail, but I often see them used where simpler expressions are easier to write, work in more RE flavors and are probably faster.
    In your case, you could probably write (\b\w+\b) for example, or even (\w+) using natural bounds, or, if you want to distinguish (foo) from -foo- (for example), use \((\w+)\) and read the word from the capture group.
    Now, perhaps the context dictates this convoluted use (or perhaps you were just experimenting with look-behind), but it is good to know the alternatives.
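
A minimal sketch of the capture-group alternative, again in Python's `re` for illustration. The first pattern is the one suggested above; the second, `bracketed`, is an assumed extension to the "[(foo)]" input from the question and does not appear in the original answer. In C#, the captured word should be available as `match.Groups[1].Value`.

```python
import re

# Delimiters are matched as ordinary literals; only the word is captured.
paren_word = re.compile(r"\((\w+)\)")      # pattern from the answer
bracketed = re.compile(r"\[\((\w+)\)\]")   # assumed extension for "[(foo)]"

# With a single group, findall returns the captured text of each match.
words = paren_word.findall("(foo) -bar-")
print(words)  # ['foo']

m = bracketed.search("[(foo)]")
whole, word = m.group(0), m.group(1)
print(whole)  # [(foo)]
print(word)   # foo
```

The trade-off is that the overall match now includes the delimiters, so the caller has to read group 1 instead of the whole match.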

    Now, if you are just curious why the second expression doesn't work: these constructs are known as "zero-width assertions". They check that what follows or precedes the current position conforms to what is expected, but they don't consume any of the string, so whatever comes after a look-ahead in the pattern (or before a look-behind) must itself match the asserted text. E.g. if you put something after a positive look-ahead that doesn't match what is asserted, you can be sure the RE will fail.
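
The zero-width behaviour can be seen directly in a small experiment. This is a sketch using Python's `re`; the sample string and variable names are made up for illustration.

```python
import re

text = "(foo)]"

# The look-ahead asserts ")" without consuming it, so the next token in the
# pattern is tested against that very same ")".
consistent = re.search(r"\w+(?=\))\)", text)     # next token agrees with it
contradictory = re.search(r"\w+(?=\))\]", text)  # next token demands "]"

print(consistent.group(0))  # foo)
print(contradictory)        # None

# The lookarounds add nothing to the match itself: only "foo" is spanned.
span = re.search(r"(?<=\()\w+(?=\))", "(foo)").span()
print(span)  # (1, 4)
```

The second pattern can never match any input, because the look-ahead requires the next character to be `)` while the following token requires it to be `]`.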
