Tuesday, May 3, 2011

Why do on-line parsers seem to stop at regexps?

I've been wondering for long why there doesn't seem to be any parsers for, say, BNF, that behave like regexps in various libraries.

Sure, there's things like ANTLR, Yacc and many others that generate code which, in turn, can parse a CFG, but there doesn't seem to be a library that can do that without the intermediate step.

I'm interested in writing a Packrat parser, to boot all those nested-parenthesis-quirks associated with regexps (and, perhaps even more so, for the sport of it), but somehow I have this feeling that I'm just walking into another halting problem -like class of swamps.

Is there a technical/theoretical limitation for these parsers, or am I just missing something?

From stackoverflow
  • Because full-blown context-free grammars are confusing enough as they are without some cryptically dense and incomprehensible syntax to make them even more confusing?

    It's hard to know what you're asking. Are you trying to create something like a regular expression, but for context-free grammars? Like, using $var =~ /expr = expr + expr/ (in Perl) and having that match "1 + 1" or "1 + 1 + 1" or "1 + 1 + 1 + 1 + 1 + ..."? I think one of the limitations of this is going to be syntax: Having more than about three rules is going to make your "grammar-expression" even more unreadable than any modern-day regular expression.

    Henrik Paul : It seems like you are arguing the (yet unspecified) implementation, than the ability of parsing a CFG in itself? Sure, regular expressions are cryptic to the untrained eye. Perhaps, a context free language could be even more cryptic. But that wasn't the point. The point was, why are there only code generators, and not stuff I can just put into a function/object and get matched blocks of text out of them, like I do with today's regular expressions.?
    Chris Lutz : Usually, when people use a parser, they're looking to do a whole lot more than just look at some text and see if it matches their grammar or not. Not that there's anything wrong with that, but most parsers do quite a bit more.
    Chris Lutz : Plus, the implementation is a technical limitation you are going to have to deal with at some point, and you did ask about technical/theoretical limitations.
  • I think it's more of a cultural thing. The use of context-free grammars is mostly confined to compilers, which typically have code associated with each production rule. In some languages, it's easier to output code than to simulate callbacks. In others, you'll see parser libraries: parser combinators in Haskell, for example. On the other hand, regular expressions see wide use in tools like grep, where it's inconvenient to run the C compiler every time the user gives a new regular expression.

  • Side effect are the only thing I see thing that will get you. Most of the parser generators include embedded code for processing and you would need an eval to make that work.

    One way around that would be to name actions and then make an "action" function that takes the name of the action to do and the args to do it with.

  • You could theoretically do it with Boost Spirit in C++, but it is mainly made for static grammars. I think the reason this is not common is that CFGs are not as commonly used as regexs. I've never had to use a grammar except for compiler construction, but I have used regexs many times. CFGs are generally much more complex than regexs, so it makes sense to generate code statically with a tool like YACC or ANTLR.

    Henrik Paul : I would agree that the use of regular expressions are more common than what seems to be today's demand for CFG:s. But, since I've encountered a numerous questions on getting a correctly nested set pf parentheses on Stack Overflow alone, I'm sure that there is _some_ use of CFG:s. Maybe they aren't being asked for, because people know only of regexps?
    BCS : IIRC Boost::Spirit is a pure compile time implementation. I can't see any reason for it to allow run time definition of grammars.
    Zifre : Boost Spirit can be used at runtime - just shift tokens onto the end of the grammar: grammar = grammar >> ch_p('a');
  • tcllib has something like that, if you can put up with Parse Expression Grammars and also TCL. If Perl is your thing CPAN has Parse::Earley. Here's a pure Perl variation which looks promising. PLY seems to be a plausible solution for Python

  • Boost.Spirit looks like what you are after.

    If you are looking to make your own, I've used BNFC for my latest compiler project and it provides the grammar used in its own implementation. This might be a good starting point...

    Zifre : As I said in my answer, Boost.Spirit works but is mainly meant for static grammars. Thanks for mentioning BNFC though, I might use that for my compiler now. It looks really great!
  • There isn't and technical/theoretical limitation lurking in the shadows. I can't say why they aren't more popular, but I know of at least one library that provides this sort of "on-line" parsing that you seek.

    SimpleParse is a python library that lets you simply paste your hairy EBNF grammar into your program and use it to parse things right away, no itermediate steps. I've used it for several projects where I wanted a custom input language but really didn't want to commit to any formal build process.

    Here's a tiny example off the top of my head:

    decl = r"""
        root := expr
        expr := term, ("|", term)*
        term := factor+
        factor := ("(" expr ")") / [a-z]
    parser = Parser(decl) 
    success, trees, next = parser.parse("(a(b|def)|c)def")

    The parser combinator libraries for Haskell and Scala also let your express your the grammar for your parser in the same chunk of code that uses it. However you can't, say, let the user type in a grammar at runtime (which might only be of interest to people making software to help people understand grammars anyway).

    Henrik Paul : Thanks for your informative answer!
  • Pyparsing (http://pyparsing.wikispaces.com) has built-in support for packrat parsing and it is pure Python, so you can see the actual implementation.


Post a Comment