Parsing with Co-routines: Help Me Wrap My Head Around This

Posted on September 3, 2010

Parsing can be a fascinating excercise if you haven’t done it before. I’ve personally never written a low-level parser. I’m familiar with the basic tools for generating parsers as most people should at least know about (lex/yacc/etc). These tools let you define the basic syntax and grammar tables and do all the dirty work of generating all the code you need to parse an input source. Yet one should stop and write a low-level parser once or twice just to get a feel of what’s going on under the hood. The excercise can be quite revealing.

I decided to take up Code Quarterly’s inaugral coding challenge. The challenge is to write a parser for their specification of a markup language aptly named, “Markup.” The specification isn’t overly complex and the rules aren’t so strict as to make it overly competitive. It makes for a neat time-sink to learn a few new things.

So my first idea off the bat was to use co-routines to do my parsing instead of a traditional state-machine. I got the idea from David Beazley’s excellent talk on Python co-routines a while ago. It seems like a good fit: lexing usually reads a stream of bytes and emits tokens and parsers read lists of tokens and generate data-structures. When thought about in this way, the process of parsing seems a bit like a pipeline which is pretty much how David describes how co-routines can operate in his talk. In his examples he used it to process XML files — which isn’t exactly “parsing” at the level this challenge demands, but I figure the problem could theoretically scale down… right?

Well I’m having a bit of difficulty wrapping my head around it. I keep thinking of the problem as a state-machine and invariably end up grafting such notions into my chain of co-routines. It never quite looks right so I end up deleting a bunch of code and heading back to the old drawing board. That is, I keep trying to enforce the notion of state despite much literature that suggests state is embedded in the co-routines themselves.

The Code Quarterly challenge is slightly more complex than many of the examples that suggest we can replace state machines with co-routines. Yet it’s not so complex as to require a full lex/yacc solution. It might even be a small enough problem to not require seperate lexical analysis and parsing steps. Yet I can’t seem to wrap my head around how to scale out the problem without requiring some sort of state management.

I might eventually figure it out… but I could use some ideas at the moment. I’ve tried a number of approaches and feel like I’m hitting the same wall.