Question

I'm currently shopping for a FOSS parser generator for a project of mine. It has to support either C or C++.

I've looked at bison/flex and at boost::spirit.

I went from writing my own to spirit to bison to spirit to bison to spirit, each time hit by some feature I found unpleasant.

The thing I hate most about bison/flex is that they actually generate C/C++ source for you. There are a number of disadvantages to this, e.g. debugging. I like spirit from this point of view, but I find it very very heavy on syntax.

I am curious about what you are using, what you would recommend, and general thoughts about the state of the art in parser generators. I am also curious to hear about approaches being used in other languages for parsing problems.

Answer 1

Antlr ^[1] isn't bad and it has a built-in debugger. The package also comes with an API ^[2] for C (among other available languages).

[1] http://www.antlr.org/
[2] http://www.antlr.org/api/index.html

Answer 2

Please don't use bison/flex or yacc/lex. They parse very efficiently but are really hard on the programmer. Use a more modern parser generator with a better user interface. ANTLR ^[1] is a good suggestion, and you might also consider

A packrat parser ^[2]
The Elkhound ^[3] GLR parser generator

[1] http://www.antlr.org/
[2] http://bford.info/packrat/
[3] http://scottmcpeak.com/elkhound/
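
To give a feel for what "packrat" means, here is a minimal hand-written sketch in C++17 (not the output of any particular tool). It parses a toy arithmetic PEG by recursive descent and memoizes every (rule, position) result, so backtracking never re-parses the same span. The grammar, the `Packrat` class and the `evaluate` helper are all made up for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Toy PEG (an assumption for this sketch):
//   Expr   <- Term '+' Expr / Term
//   Term   <- Factor '*' Term / Factor
//   Factor <- '(' Expr ')' / [0-9]+
struct Result { std::size_t end; long value; };  // where the match ended, and its value

class Packrat {
public:
    explicit Packrat(std::string text) : text_(std::move(text)) {}

    // Succeeds only if the whole input is consumed.
    std::optional<long> parseAll() {
        auto r = expr(0);
        if (r && r->end == text_.size()) return r->value;
        return std::nullopt;
    }

private:
    enum Rule { kExpr, kTerm, kFactor };

    char peek(std::size_t i) const { return i < text_.size() ? text_[i] : '\0'; }

    // The packrat part: each (rule, position) pair is evaluated at most once.
    template <typename F>
    std::optional<Result> memoized(Rule rule, std::size_t pos, F body) {
        const auto key = std::make_pair(static_cast<int>(rule), pos);
        if (auto it = memo_.find(key); it != memo_.end()) return it->second;
        std::optional<Result> r = body();
        memo_.emplace(key, r);
        return r;
    }

    std::optional<Result> expr(std::size_t pos) {
        return memoized(kExpr, pos, [&]() -> std::optional<Result> {
            // Alternative 1: Term '+' Expr
            if (auto lhs = term(pos); lhs && peek(lhs->end) == '+') {
                if (auto rhs = expr(lhs->end + 1))
                    return Result{rhs->end, lhs->value + rhs->value};
            }
            // Alternative 2: Term (answered from the memo table, not re-parsed)
            return term(pos);
        });
    }

    std::optional<Result> term(std::size_t pos) {
        return memoized(kTerm, pos, [&]() -> std::optional<Result> {
            // Alternative 1: Factor '*' Term
            if (auto lhs = factor(pos); lhs && peek(lhs->end) == '*') {
                if (auto rhs = term(lhs->end + 1))
                    return Result{rhs->end, lhs->value * rhs->value};
            }
            // Alternative 2: Factor
            return factor(pos);
        });
    }

    std::optional<Result> factor(std::size_t pos) {
        return memoized(kFactor, pos, [&]() -> std::optional<Result> {
            if (peek(pos) == '(') {
                if (auto inner = expr(pos + 1); inner && peek(inner->end) == ')')
                    return Result{inner->end + 1, inner->value};
                return std::nullopt;  // the number alternative cannot match '(' either
            }
            std::size_t end = pos;
            long value = 0;
            while (std::isdigit(static_cast<unsigned char>(peek(end)))) {
                value = value * 10 + (peek(end) - '0');
                ++end;
            }
            if (end == pos) return std::nullopt;
            return Result{end, value};
        });
    }

    std::string text_;
    std::map<std::pair<int, std::size_t>, std::optional<Result>> memo_;
};

// Convenience wrapper: parse and evaluate a whole string.
std::optional<long> evaluate(const std::string& text) {
    return Packrat(text).parseAll();
}
```

Calling `evaluate("2+3*4")` yields an engaged optional, while malformed input such as `"2+"` yields `std::nullopt`. The memo table is what buys linear time at the cost of memory; a real packrat tool would generate this kind of code from a grammar file for you.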

Answer 3

I'd recommend looking a little at the Lemon parser generator used in SQLite

Lemon ^[1]

Lemon is an LALR(1) parser generator for C or C++. It does the same job as "bison" and "yacc". But lemon is not another bison or yacc clone. It uses a different grammar syntax which is designed to reduce the number of coding errors. Lemon also uses a more sophisticated parsing engine that is faster than yacc and bison and which is both reentrant and thread-safe. Furthermore, Lemon implements features that can be used to eliminate resource leaks, making it suitable for use in long-running programs such as graphical user interfaces or embedded controllers.

[1] http://www.hwaci.com/sw/lemon/

Answer 4

I've been very happy using spirit ^[1]. Yes, the syntax can take some getting used to but it's flexible and powerful.

If your code is in C++ it's the most elegant solution IMHO since a) it integrates beautifully with your code (particularly with the design of actions) and b) you don't need to run a code generator as a separate build step.

I'd suggest looking into it some more before dismissing it.

Antlr ^[2] is great if you're using other languages, but when I'm using C++ Antlr feels clunky and awkward compared to using spirit. I've drunk the kool-aid; spirit FTW! ;)

[1] http://www.boost.org/doc/libs/1_37_0/libs/spirit/classic/index.html
[2] http://www.antlr.org/

Answer 5

I use FLEX and Bison.
Both have the ability to generate C++ code (via command line flags or directives in the file).

I hear Antlr is good but have never used it personally.

Answer 6

A state-of-the-art parser generator is the DMS Software Reengineering Toolkit ^[1]. (I'm the architect).

It isn't FOSS, but you asked specifically about the state of the art.

It isn't so much just a parser generator, as a complete ecosystem for building tools that process formal documents (programs, specifications, hardware designs, anything that has a "formal syntax/semantics").

DMS provides

lexers with full Unicode capability and ability to read a huge variety of input encoding formats (ascii, UTF-8/16, EBCDIC, ...)
full context-free parsing (infinite lookahead and built-in error recovery)
automatically builds abstract syntax trees, determining which productions are lists. The syntax trees capture comments in the text.
provides direct support for building tree-structured analyzers called "attribute grammar evaluators"
provides symbol table construction support that has been proven to be capable of handling nasty languages such as C++
provides pretty printers to regenerate valid source text from the trees, including regenerating valid comments
source-to-source rewrite rules to allow you to define program transformations using the syntax of the language of interest
provides control flow, data flow, call graph, and global points-to analysis machinery
has tested front ends for C, C++, Java, and COBOL, all of which build symbol tables and construct the various flow analyses above
has front ends for a variety of other languages, including C# (4.0), PHP, Ada, ...

One of the tests of fire for a "state of the art" parser generator is its ability to parse C++. DMS parses C++, does all the symbol table construction, etc. and has been used to carry out massive transformations automatically on C++ code.

Other "parser generators" tend to provide at best parsing ability and leave you to build your own trees and all of the rest of the above stuff if you have the heart and the years to do it.

ANTLR is a bit better in that it does provide support for tree building and some syntax-directed pattern matching. ANTLR sort of passes the C++ trial by fire: there is a C++ front end for ANTLR. To the best of my knowledge, it is incomplete and doesn't have symbol table support, and I don't know of any uses of it for production tasks.

ELSA succeeds at C++ (and symbol tables) by virtue of being focused on parsing C++. The foundation machinery behind ELSA (Elkhound) uses the same GLR parsing algorithm as DMS. But I don't believe that Elkhound is widely used for anything but supporting ELSA.

At the risk of being immodest, I would suggest that DMS defines the state of the art. (I'll agree that ANTLR is pretty good for what it does).

You can get more detailed comparisons of DMS to many other systems here ^[2].

[1] http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html
[2] http://www.semdesigns.com/Products/DMS/DMSComparison.html

Answer 7

I am curious about what you are using, what you would recommend, and general thoughts about the state of the art in parser generators.

I'm using the GOLD parser at http://www.devincook.com/goldparser/ ... because:

I'm not experienced with or formally educated in parsing, and I found it easy to learn and use
It says that it supports several languages (including C, C++, and C#).

Answer 8

There is plenty of good documentation on Antlr, and it has a very nice Eclipse plugin, so I recommend it. Unfortunately, I have no experience with the other options.

Answer 9

If you understand the theory of lexing and parsing you can use Flex and Bison to generate the state machine tables for you and implement the lexer and parser yourself (or re-implement the templates that come with Bison and Flex) to get rid of the things you don't like about them.

I did this once, and it's nice insofar as you can have your own lexer and parser written to your specifications, in your application's style, with your own coding standards and debugging features, while still using the well-coded algorithms inside Flex and Bison to generate the state transition tables for you. And I'd wager that creating the tables is the more complicated problem.

So in summary: Use flex and bison to generate your state transition tables, which are then used by your own lexer and parser.
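
A rough sketch of what the "your own parser" half can look like, in C++. The tables below are written by hand for a toy grammar and only stand in for what a generator would compute; their layout (`kAction`, `kGotoL`, `kRuleLength`) is my own assumption and is not Bison's actual table format. The point is that the driver loop is small and entirely yours to style and debug.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy grammar (an assumption for this sketch):
//   rule 1: L -> '(' L ')'
//   rule 2: L -> 'x'
enum Kind { Error, Shift, Reduce, Accept };
struct Action { Kind kind; int arg; };  // arg = target state (Shift) or rule number (Reduce)

constexpr Action E{Error, 0};

// SLR(1) action table, one row per state.
// Columns: '('  ')'  'x'  end-of-input
const Action kAction[6][4] = {
    /* 0 */ {{Shift, 2}, E, {Shift, 3}, E},
    /* 1 */ {E, E, E, {Accept, 0}},
    /* 2 */ {{Shift, 2}, E, {Shift, 3}, E},
    /* 3 */ {E, {Reduce, 2}, E, {Reduce, 2}},
    /* 4 */ {E, {Shift, 5}, E, E},
    /* 5 */ {E, {Reduce, 1}, E, {Reduce, 1}},
};

// Goto table for the single nonterminal L (-1 = no transition).
const int kGotoL[6] = {1, -1, 4, -1, -1, -1};

// Number of right-hand-side symbols per rule (index 0 unused).
const std::size_t kRuleLength[3] = {0, 3, 1};

// Trivial "lexer": maps a character to a table column, or -1 if unknown.
int tokenColumn(char c) {
    switch (c) {
        case '(': return 0;
        case ')': return 1;
        case 'x': return 2;
        default:  return -1;
    }
}

// The hand-written driver: a plain LR shift/reduce loop over the tables.
bool accepts(const std::string& input) {
    std::vector<int> states{0};  // stack of parser states
    std::size_t pos = 0;
    for (;;) {
        const int col = pos < input.size() ? tokenColumn(input[pos]) : 3;
        if (col < 0) return false;  // character outside the alphabet
        const Action a = kAction[states.back()][col];
        switch (a.kind) {
            case Shift:
                states.push_back(a.arg);
                ++pos;
                break;
            case Reduce: {
                // Pop one state per right-hand-side symbol, then follow the goto.
                states.resize(states.size() - kRuleLength[a.arg]);
                const int next = kGotoL[states.back()];
                if (next < 0) return false;
                states.push_back(next);
                break;
            }
            case Accept:
                return true;
            case Error:
                return false;
        }
    }
}
```

With these tables, `accepts("((x))")` returns true and `accepts("(x")` returns false. Semantic actions, error messages and tracing would hang off the `Reduce` and `Error` cases, which is exactly the part you get to write to your own standards.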

Answer 10

Flex has a way to configure it to generate C++ (and perhaps Bison does as well, though I'm unsure of that). I recall trying to use this in the final project for my compilers class and finding it nearly undocumented, so I fell back to using C. That was a year and a half ago, so maybe it's gotten better since then. There's definitely a section in the man page on it though. I'm not sure that's helpful, but at least it's something you can try :)

Answer 11

Visual ++ Parser