Saturday, February 12, 2011

How do I match part of a string only if it is not preceded by certain characters?

I've created the following regex pattern in an attempt to match a string 6 characters in length ending in either "PRI" or "SEC", unless the string is "SIGSEC". For example, I want to match ABCPRI, XYZPRI, ABCSEC and XYZSEC, but not SIGSEC.

(\w{3}PRI$|[^SIG].*SEC$)

It is very close and sort of works (if I pass in "SINSEC", it returns a partial match on "NSEC"), but I don't have a good feeling about it in its current form. Also, I may have a need to add more exclusions besides "SIG" later and realize that this probably won't scale too well. Any ideas?

BTW, I'm using System.Text.RegularExpressions.Regex.Match() in C#

Thanks, Rich
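
The partial match described above most likely comes from `[^SIG]` being a negated character class (one character that is not S, I or G) rather than "not the string SIG". A minimal sketch that reproduces it, assuming a C# version with top-level statements:

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged.
const string pattern = @"(\w{3}PRI$|[^SIG].*SEC$)";

// [^SIG] consumes ONE character that is not 'S', 'I' or 'G'.
// For "SINSEC" the engine skips 'S' and 'I' and starts the match at 'N'.
Match partial = Regex.Match("SINSEC", pattern);
Console.WriteLine(partial.Value);      // NSEC

// "SIGSEC" is rejected only because each of its first three
// characters happens to be in the class, not because "SIG" is excluded.
Match rejected = Regex.Match("SIGSEC", pattern);
Console.WriteLine(rejected.Success);   // False
```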

  • Assuming your regex engine supports negative lookaheads, try this:

    ((?!SIGSEC)\w{3}(?:SEC|PRI))
    

    Edit: A commenter pointed out that .NET does support negative lookaheads, so this should work fine (thanks, Charlie).

    Charlie : .NET regular expressions do support negative lookaheads, so this will work
    Dan : Ah, good to know, thanks Charlie. I'm really not a .NET guy ;)
    Rich : This works perfectly Dan, thanks! Ran a quick test and it will be trivial to add the additional exclusion matches.
    Pop Catalin : As a side note, .Net regex supports unlimited length lookaround on all kinds of lookarounds. Actually .Net regex and JGsoft engines are the only regex engines that allow "full regular expressions inside lookbehind"
    From Dan
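
As a rough illustration, Dan's lookahead pattern could be exercised from C# as below. The sample text, the anchored variant with `^` and `$`, and the variable names are assumptions added for this sketch and are not part of the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The pattern from the answer, unchanged: finds candidates anywhere in a text.
var extractor = new Regex(@"((?!SIGSEC)\w{3}(?:SEC|PRI))");
string[] found = extractor.Matches("ABCPRI SIGSEC XYZSEC")
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToArray();
Console.WriteLine(string.Join(", ", found));   // ABCPRI, XYZSEC

// Anchored variant for validating a whole string. The ^ and $ anchors
// are an addition that is not part of the answer's pattern.
var validator = new Regex(@"^(?!SIGSEC)\w{3}(?:SEC|PRI)$");
foreach (var s in new[] { "ABCPRI", "SIGSEC", "SINSEC" })
    Console.WriteLine($"{s}: {validator.IsMatch(s)}");
// ABCPRI: True
// SIGSEC: False
// SINSEC: True
```

Further exclusions can be alternated inside the lookahead, for example `(?!SIGSEC|FOOSEC)`, where FOOSEC is a made-up value.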
  • Personally, I'd be inclined to build up the exclusion list in a second variable, then include it in the full expression. It's the approach I've used in the past when having to build any complex expression.

    Something like exclude = 'someexpression'; prefix = 'list of prefixes'; suffix = 'list of suffixes'; expression = '{prefix}{exclude}{suffix}';

    From warren
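
One way warren's idea might look in C#, as a sketch only: the exclusion list, the variable names and the use of a negative lookahead to express the exclusions are all assumptions. It also assumes the list has at least one entry, since an empty `(?!)` never matches.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical exclusion list; in practice it might be read from a config file.
string[] excluded = { "SIGSEC", "FOOSEC" };

// Regex.Escape keeps any metacharacters in the entries literal.
string exclude = "(?!" + string.Join("|", excluded.Select(e => Regex.Escape(e))) + ")";
string body = @"\w{3}";
string suffix = "(?:PRI|SEC)";

// Assemble the full expression from its named parts.
string expression = "^" + exclude + body + suffix + "$";
Console.WriteLine(expression);   // ^(?!SIGSEC|FOOSEC)\w{3}(?:PRI|SEC)$

var regex = new Regex(expression);
Console.WriteLine(regex.IsMatch("ABCSEC"));   // True
Console.WriteLine(regex.IsMatch("FOOSEC"));   // False
```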
  • "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." -Jamie Zawinski

  • You may not even want to do the exclusions in the regex. For example, if this were Perl (I don't know C#, but you can probably follow along), I'd do it like this

    if ( ( $str =~ /^\w{3}(?:PRI|SEC)$/ ) && ( $str ne 'SIGSEC' ) )
    

    to be clear. It's doing exactly what you wanted:

    • Three word characters, followed by PRI or SEC, and
    • It's not SIGSEC

    Nobody says you have to force everything into one regex.

    Dan : I agree, this is probably the most sensible way to do it. However it looks like he's trying to extract these things from text with a regular expression - not having to worry about dealing with matches you don't want could potentially lead to a cleaner solution.
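
A rough C# counterpart of the Perl check above, keeping the shape test in the regex and the exclusions in ordinary code. The `IsWanted` name and the use of a `HashSet<string>` are assumptions made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Shape check: three word characters followed by PRI or SEC.
var shape = new Regex(@"^\w{3}(?:PRI|SEC)$");

// Exclusions live outside the regex, so adding one is a data change only.
var excluded = new HashSet<string>(StringComparer.Ordinal) { "SIGSEC" };

// Hypothetical helper: true when the string has the right shape and is not excluded.
bool IsWanted(string str) => shape.IsMatch(str) && !excluded.Contains(str);

Console.WriteLine(IsWanted("ABCPRI"));   // True
Console.WriteLine(IsWanted("SIGSEC"));   // False
```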
  • You can try this one:

    @"\w{3}(?:PRI|(?<!SIG)SEC)"
    
    • Matches 3 "word" characters
    • Matches PRI or SEC (but not SEC after SIG, i.e. SIGSEC is excluded)

    (?<!x)y is a negative lookbehind: it matches y only if it is not preceded by x.

    "Also, I may have a need to add more exclusions besides 'SIG' later and realize that this probably won't scale too well"

    With this pattern you can easily add more exclusions. For example, the following excludes both SIGSEC and FOOSEC:

    @"\w{3}(?:PRI|(?<!SIG|FOO)SEC)"
    
    From aku
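
To see how the two lookbehind patterns behave, they can be run side by side. This is a minimal sketch; the sample strings are chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Both patterns are taken from the answer above, unchanged.
var single = new Regex(@"\w{3}(?:PRI|(?<!SIG)SEC)");
var multiple = new Regex(@"\w{3}(?:PRI|(?<!SIG|FOO)SEC)");

foreach (var s in new[] { "ABCPRI", "ABCSEC", "SIGSEC", "FOOSEC" })
    Console.WriteLine($"{s}: {single.IsMatch(s)} {multiple.IsMatch(s)}");
// ABCPRI: True True
// ABCSEC: True True
// SIGSEC: False False
// FOOSEC: True False
```

Neither pattern is anchored, so a longer string that contains a valid six-character run would still match. Wrapping the pattern in `^` and `$` is one option when the whole string must be exactly six characters.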
  • To help break down Dan's (correct) answer, here's how it works:

    (           // outer capturing group to bind everything
     (?!SIGSEC) // negative lookahead: a match only works if "SIGSEC" does not appear next
     \w{3}      // exactly three "word" characters
     (?:        // non-capturing group - we don't care which of the following things matched
       SEC|PRI  // either "SEC" or "PRI"
     )
    )
    

    All together: ((?!SIGSEC)\w{3}(?:SEC|PRI))

    Dan : Nicely summarised :)
    Charlie : Thanks for the fixup of my final listing.
    From Charlie
  • Why not use more readable code? In my opinion this is much more maintainable.

    private Boolean HasValidEnding(String input)
    {
        if (input.EndsWith("SEC",StringComparison.Ordinal) || input.EndsWith("PRI",StringComparison.Ordinal))
        {
            if (!input.Equals("SIGSEC",StringComparison.Ordinal))
            {
                return true;
            }
        }
        return false;
    }
    

    or in one line

    private Boolean HasValidEnding(String input)
    {
        return (input.EndsWith("SEC",StringComparison.Ordinal) || input.EndsWith("PRI",StringComparison.Ordinal)) && !input.Equals("SIGSEC",StringComparison.Ordinal);
    }
    

    It's not that I don't use regular expressions, but in this case I wouldn't use them.

    Rich : Yep, I had actually started with something exactly along those lines but the requirements changed and I decided to externalize the logic. I opted for using a regex inside a config file so as not to have to make code changes when new exclusion strings need to be added.
  • Go and get RegexBuddy from RegExBuddy.com. It is an amazingly simple tool that will help you figure out even the most complicated regex easily.

    From Toby Allen
