Code Answer: Regular Expressions: Is there an AND operator?

Obviously, you can use | (pipe?), to represent OR, but can you match 'and' as well?

Specifically, I want to match paragraphs of text that contain ALL of a certain set of phrases, in no particular order.

From stackoverflow

The order is always implied in the structure of the regular expression. To accomplish what you want, you'll have to match the input string multiple times against different expressions.

What you want to do is not possible with a single regexp.

Robert P : It's not technically impossible, but not worthwhile to implement. I dunno why someone downvoted though...

Alan Moore : Probably because it's not only possible, it's simple, assuming your regex flavor supports lookaheads. And that's a good bet; most of today's major programming languages do support them.
Is it not possible in your case to AND together the results of several matches? In pseudocode:
```
regexp_match(pattern1, data) && regexp_match(pattern2, data) && ...
```
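
The pseudocode above translates almost directly into most languages. Here is a minimal Python sketch, assuming the standard `re` module; the helper name `contains_all` and the sample patterns are made up for illustration:

```python
import re

def contains_all(patterns, data):
    """Return True only if every pattern matches somewhere in data."""
    # all() stops at the first pattern that fails, like a chain of &&.
    return all(re.search(p, data) for p in patterns)

text = "some stuff and things"
print(contains_all([r"things", r"some", r"stuff"], text))  # True
print(contains_all([r"things", r"missing"], text))         # False
```

Each pattern is searched independently, so the order in which the phrases appear in the text does not matter.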
Use a non-consuming regular expression.

The typical (i.e. Perl/Java) notation is:

(?=expr)

This means "match expr but after that continue matching at the original match-point."

You can do as many of these as you want, and this will be an "and." Example:

(?=match this expression)(?=match this too)(?=oh, and this)

You can even add capture groups inside the non-consuming expressions if you need to save some of the data therein.
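
As a rough illustration of capture groups inside non-consuming expressions, here is a small Python sketch. The sample text and patterns are hypothetical, and the leading `.*` inside each lookahead is an assumption added so that the lookahead can scan forward from the start of the string:

```python
import re

# Two lookaheads, each with its own capture group.
pattern = re.compile(r"(?=.*(\d{4}))(?=.*\b(apple|pear)\b)")

m = pattern.match("pear harvest 2011")
print(m.group(1), m.group(2))  # 2011 pear
print(m.end())                 # 0, because the lookaheads consumed nothing

# If either lookahead fails, the whole match fails.
print(pattern.match("harvest 2011"))  # None
```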

Hugoware : Do you just place them all in a row, no separators between them? i.e. (?=apple)(?=orange)(?=pear)

Robert P : Separators (or any characters) would indicate that those must exist before the next non-consuming group.

Robert P : Giving this a try...I don't think this is exactly what he means.

Robert P : perl -e "q{some stuff and things} =~ /(?=some)(?=stuff)(?=things)/ ? print 'yes' : print 'no'" prints 'no'.

Jason Cohen : Thanks for the good comments; I've updated the answer to include examples.

strager : It should be mentioned that this particular example is called a positive lookahead assertion. It has other uses than "and". Note that the text isn't consumed.

Cirno de Bergerac : Using (?=) like this results in a regex that can never succeed. But it *is* the conjunction analog to |. The OP is just wrong in what he thinks will solve his problem.

kriss : perl -e "q{some stuff and things} =~ /(?=.*some)(?=.*stuff)(?=.*things)/ ? print 'yes' : print 'no'"

e-satis : I just love you so much right now.
If you use Perl regular expressions, you can use positive lookahead:

For example
```
(?=[1-9][0-9]{2})[0-9]*[05]\b
```
would match numbers of at least three digits (that is, 100 or greater) that are divisible by 5.
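
To see how the lookahead and the main expression combine, here is a quick check in Python, whose `re` module supports the same lookahead syntax. The sample string is made up. Note that 100 itself matches, since the lookahead only requires three digits with a non-zero first digit:

```python
import re

# Lookahead: three digits, the first non-zero. Main part: digits ending in 0 or 5.
pattern = re.compile(r"(?=[1-9][0-9]{2})[0-9]*[05]\b")

print(pattern.findall("95 100 105 1234 2500 7"))
# ['100', '105', '2500']
```

Both conditions have to hold at the same starting position: 95 is divisible by 5 but fails the lookahead, while 1234 passes the lookahead but does not end in 0 or 5.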
You can do that with a regular expression, but you'll probably want to do something else. For example, use several regexps and combine them in an if clause.

You can enumerate all possible permutations with a standard regexp, like this (matches a, b and c in any order):
```
(abc)|(bca)|(acb)|(bac)|(cab)|(cba)
```
However, this makes for a very long and probably inefficient regexp if you have more than a couple of terms.
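
To get a feel for how fast the enumeration grows, here is a small Python sketch that builds the alternation for any list of terms. The function name `permutation_pattern` is made up for this example:

```python
import re
from itertools import permutations

def permutation_pattern(terms):
    """Build an alternation listing every ordering of the given terms."""
    return "|".join(
        "(" + "".join(re.escape(t) for t in order) + ")"
        for order in permutations(terms)
    )

print(permutation_pattern(["a", "b", "c"]))
# (abc)|(acb)|(bac)|(bca)|(cab)|(cba)

# The number of alternatives is n factorial: six terms already need 720.
print(len(list(permutations(range(6)))))  # 720
```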

If you are using an extended regexp flavor, like Perl's or Java's, there are better ways to do this. Other answers have suggested using positive lookahead.
You need to use lookahead as some of the other responders have said, but the lookahead has to account for other characters between its target word and the current match position. For example:
```
(?=.*word1)(?=.*word2)(?=.*word3)
```
The .* in the first lookahead lets it match however many characters it needs to before it gets to "word1". Then the match position is reset and the second lookahead seeks out "word2". Reset again, and the final part matches "word3"; since it's the last word you're checking for, it isn't necessary that it be in a lookahead, but it doesn't hurt.
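
The difference the `.*` makes can be checked with a short Python sketch. It reuses the sample string from the Perl one-liners in the comments above; the variable names are just for illustration:

```python
import re

text = "some stuff and things"

# Without .*, all three words would have to start at the same position.
bare = re.compile(r"(?=some)(?=stuff)(?=things)")
# With .*, each lookahead may skip ahead to find its word.
scanning = re.compile(r"(?=.*some)(?=.*stuff)(?=.*things)")

print(bare.search(text))                           # None
print(bool(scanning.search(text)))                 # True
print(bool(scanning.search("things, some only")))  # False ("stuff" is missing)
```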

In order to match a whole paragraph, you need to anchor the regex at both ends and add a final .* to consume the remaining characters. Using Perl-style notation, that would be:
```
/^(?=.*word1)(?=.*word2)(?=.*word3).*$/m
```
The 'm' modifier is for multiline mode; it lets the ^ and $ match at paragraph boundaries ("line boundaries" in regex-speak). It's essential in this case that you not use the 's' modifier, which lets the dot metacharacter match newlines as well as all other characters.
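
Here is a sketch of the same idea in Python, where `re.MULTILINE` plays the role of the 'm' modifier and `re.DOTALL` the role of 's'. The input text is hypothetical, with one paragraph per line:

```python
import re

text = (
    "word2 comes first, then word1 and word3\n"
    "this line has word1 and word2 only\n"
    "word3 word2 word1\n"
)

regex = r"^(?=.*word1)(?=.*word2)(?=.*word3).*$"

# Multiline mode: each paragraph is tested on its own.
per_line = re.compile(regex, re.MULTILINE)
for paragraph in per_line.findall(text):
    print(paragraph)
# word2 comes first, then word1 and word3
# word3 word2 word1

# With DOTALL the dot crosses newlines, so the whole text becomes one match.
dotall = re.compile(regex, re.MULTILINE | re.DOTALL)
print(dotall.findall(text) == [text])  # True
```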

Finally, you want to make sure you're matching whole words and not just fragments of longer words, so you need to add word boundaries:
```
/^(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b).*$/m
```
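
If the list of words varies, the pattern can be assembled programmatically. The following Python sketch uses a made-up helper name, `all_words_pattern`, and assumes each word begins and ends with a word character, since `\b` only behaves as intended in that case:

```python
import re

def all_words_pattern(words):
    """Compile a per-line regex requiring every word, as a whole word, in any order."""
    # re.escape keeps any regex metacharacters in the words literal.
    lookaheads = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in words)
    return re.compile(rf"^{lookaheads}.*$", re.MULTILINE)

pattern = all_words_pattern(["cat", "dog"])
print(bool(pattern.search("the dog chased the cat")))    # True
print(bool(pattern.search("the dog read the catalog")))  # False
```

The second string fails because "cat" appears only as a fragment of "catalog", which the word boundaries rule out.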

Thursday, March 24, 2011
