Friday, May 6, 2011

Match overlapping patterns with capture using a MATLAB regular expression

I'm trying to parse a log file that looks like this:

%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29

This excerpt contains two time periods I'd like to extract, from the first delimiter to the second, and from the second to the third. I'd like to use a regular expression to extract the start and stop times for each of these intervals. This mostly works:

p = '%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?%{4} (?<stop>.*?)\n';
times = regexp(c,p,'names');

times = 

1x16 struct array with fields:
    start
    name
    stop
The problem is that this only captures every other period, since the second delimiter is consumed as part of the first match.
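
To see the effect on the excerpt above, here is a minimal sketch. It assumes the log text sits in a char array c with a newline after every line; the logLines variable is only there to build the sample. The excerpt has three delimiters and therefore two periods, yet only one match should come back.

```matlab
% Rebuild the excerpt; sprintf('%s\n', ...) appends a newline to every line.
logLines = {'%%%% 09-May-2009 04:10:29', '% Starting foo', 'this is stuff', ...
            'to ignore', '%%%% 09-May-2009 04:10:50', '% Starting bar', ...
            'more stuff', 'to ignore', '%%%% 09-May-2009 04:11:29'};
c = sprintf('%s\n', logLines{:});

p = '%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?%{4} (?<stop>.*?)\n';
times = regexp(c, p, 'names');

% The first match ends after the 04:10:50 delimiter, so the search resumes at
% '% Starting bar' and never sees that delimiter as the start of a period.
fprintf('%d match(es); first runs %s to %s\n', ...
        numel(times), times(1).start, times(1).stop);
```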

In other languages, you can use lookaround operators (lookahead, lookbehind) to solve this problem. The documentation on regular expressions explains how these work in MATLAB, but I haven't been able to get them to work while still capturing the matches. That is, I need not only to match every delimiter, but also to extract part of that match (the timestamp).

Is this possible?

P.S. I realize I can solve this problem by writing a simple state machine or by matching on the delimiters and post-processing, if there's no way to get this to work.
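
For reference, the "match on the delimiters and post-process" fallback could look roughly like the sketch below. It is only an illustration: the variable names (logLines, stamps, labels, periods) are made up here, and it assumes that every delimiter except the last is followed by exactly one "% Starting" line, so that there is one name per period.

```matlab
% Sample log, one line per cell, joined with a newline after every line.
logLines = {'%%%% 09-May-2009 04:10:29', '% Starting foo', 'this is stuff', ...
            'to ignore', '%%%% 09-May-2009 04:10:50', '% Starting bar', ...
            'more stuff', 'to ignore', '%%%% 09-May-2009 04:11:29'};
c = sprintf('%s\n', logLines{:});

% Step 1: capture the timestamp of every delimiter line.
stamps = regexp(c, '%{4} ([^\n]*)\n', 'tokens');
stamps = [stamps{:}];            % flatten to a 1-by-N cell array of strings

% Step 2: capture the name on every "% Starting" line.
labels = regexp(c, '% Starting ([^\n]*)\n', 'tokens');
labels = [labels{:}];

% Step 3: pair consecutive timestamps. struct() builds a struct array from
% equally sized cell arrays and errors out if the counts do not line up.
periods = struct('start', stamps(1:end-1), ...
                 'name',  labels, ...
                 'stop',  stamps(2:end));
```

Because each timestamp is matched exactly once, nothing has to overlap; the pairing happens after the regular expressions are done.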

Update: Thanks for the workaround ideas, everyone. I heard from the developer and there's currently no way to do this with the regular expression engine in MATLAB.

From stackoverflow
  • MATLAB seems unable to capture characters as a token without removing them from the string (or, I should say, I was unable to do so using MATLAB REGEXP). However, by noting that the stop time for one block of text is equal to the start time of the next, I was able to capture just the start times and the names using REGEXP, then do some simple processing to get the stop times from the start times. I used the following sample text:

    c =
    %%%% 09-May-2009 04:10:29
    % Starting foo
    this is stuff
    to ignore
    %%%% 09-May-2009 04:10:50
    % Starting bar
    more stuff
    to ignore
    %%%% 09-May-2009 04:11:29
    some more junk

    ...and applied the following expression:

    p = '%{4} (?<start>[^\n]*)\n% Starting (?<name>[^\n]*)[^%]*|%{4} (?<start>[^\n]*).*';

    The processing can then be done with the following code:

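    names = regexp(c,p,'names');
    % Each block's stop time is the next block's start time; the last element gets [].
    [names.stop] = deal(names(2:end).start,[]);
    % Drop the last element, which only holds the closing delimiter.
    names = names(1:end-1);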

    ...which gives us these results for the above sample text:

    >> names(1)
    ans = 
        start: '09-May-2009 04:10:29'
         name: 'foo'
         stop: '09-May-2009 04:10:50'
    >> names(2)
    ans = 
        start: '09-May-2009 04:10:50'
         name: 'bar'
         stop: '09-May-2009 04:11:29'
    Matthew Simoneau : I fixed the anti-causal-ness of the log. These were supposed to be sequential.
  • All you should have to do is to wrap a lookahead around the part of the regex that matches the second timestamp:

    '%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?(?=%{4} (?<stop>.*?)\n)'

    EDIT: Here it is without named groups:

    '%{4} (.*?)\n% Starting (.*?)\n.*?(?=%{4} (.*?)\n)'
    Matthew Simoneau : Alan, thanks for taking a crack at it. When I try what you suggest (and add the closing paren), it no longer captures the values.
    Alan Moore : Try it without named groups, then. In every other regex flavor I know of that supports lookaheads and capturing groups, you can capture things inside a lookahead. But it may be that MATLAB doesn't allow that.
    gnovice : Even without using named groups, it still doesn't seem to work. I think MATLAB has trouble resolving the token/grouping operations and the lookahead operations simultaneously.
  • If you do a lot of parsing like this, consider calling Perl from within MATLAB. That gives you access to Perl's powerful regex engine and may also make many other problems easier to solve.
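
Since capturing inside a lookahead did not work for the commenters above, another way to get overlapping matches is to run the pattern one match at a time and restart each search at the stop delimiter. The following is a minimal sketch, not something taken from the answers: it assumes the log is in a char array c with a newline after every line, relies on the 'once' and 'tokenExtents' options of regexp, and uses illustrative variable names (found, pos, periods).

```matlab
logLines = {'%%%% 09-May-2009 04:10:29', '% Starting foo', 'this is stuff', ...
            'to ignore', '%%%% 09-May-2009 04:10:50', '% Starting bar', ...
            'more stuff', 'to ignore', '%%%% 09-May-2009 04:11:29'};
c = sprintf('%s\n', logLines{:});

p = '%{4} (?<start>[^\n]*)\n% Starting (?<name>[^\n]*)\n.*?%{4} (?<stop>[^\n]*)\n';

found = {};
pos = 1;
while pos <= numel(c)
    % Outputs come back in the order the options are listed.
    [ext, m] = regexp(c(pos:end), p, 'tokenExtents', 'names', 'once');
    if isempty(ext)
        break;                   % no further complete period
    end
    found{end+1} = m;            %#ok<AGROW>
    % ext(3,1) is where the stop token starts within c(pos:end). Step back over
    % the 5-character '%%%% ' prefix so the next search begins at that delimiter.
    pos = pos + ext(3,1) - 6;
end
periods = [found{:}];            % 1-by-N struct array with start, name, stop
```

Re-slicing c(pos:end) on every pass copies the remaining text, which is fine for small logs; for very large files the delimiter-matching approach is likely cheaper.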

