Today Stacey and I were pair programming on a parser for requests to an IRC server. We started with this and this. Using the instaparser library, we ended up with this:

(def request-parser
  (insta/parser "message = prefix? command params?
                prefix = <':'> word <SPACE+>
                command = word | number number number
                params = middle? trailing?
                middle = (<SPACE+> word)+
                trailing = (<SPACE+> <':'> #'[a-zA-Z0-9 ]+')
                word = #'[a-zA-z]+'
                number = #'\\d'
                <SPACE> = #' '"))

If you take a look at the IRC RFCs, you'll immediately see that there are some differences and discrepancies. For example, the issue of nick and message lengths are left untouched by this grammar.

As Stacey and I developed the parser, we began to wonder where certain errors like "invalid nick, length > 9" would be thrown or handled. Specifically, should these be parsing errors? Generally, what parts of the program's logic should be handled by this grammar?

Somehow I recalled Chomsky's famous line "Colorless green ideas sleep furiously" which is a sentence (grammatically correct, or in our CS context, it parses), but meaningless. This way of thinking about grammar applies here too. The IRC command FOOBAR is grammatically correct but meaningless. Errors for meaningless commands should certainly be handled, but not in the grammar.

I went to the wikipedia page on "Colorless green ideas sleep furiously," saw this:

Alt text

and was enlightened.


That tree resembles a parse tree. If I didn't know any better, I would say that it is a parse tree.

Happy Halloween!