maradydd ([personal profile] maradydd) wrote 2009-08-25 02:50 am

There's gotta be a way to do this ...

... and I'm just too much of an idiot to see it, so maybe one of you smart people can help me.

I'm writing a parser for a context-sensitive grammar in Haskell (for a language that I didn't design). At the moment I'm using Happy but I'm completely open to using parsec, uuagc, whatever will get the job done. Here's my problem:

All tokens of this language are bytes. One of the nonterminals has, as its right-hand side, a byte (which evaluates to an integer value -- I have an attribute rule for this) followed by N bytes, where N is whatever that first byte evaluated to. The usual attribute-grammar way to handle this seems to look a bit like this (using an inherited attribute, and OBTW this is pseudocode):

StringOfBytes : byte NBytes { $$.value = $2.value; $2.length = $1.value }

NBytes : {- empty -} { $$.value = []; unless $$.length == 0 error }
       | byte NBytes { $$.value = $1.value : $2.value; $2.length = $$.length - 1 }

Unfortunately, in most cases there's more content past the end of the StringOfBytes, and the way this rule is written, NBytes will happily keep shifting tokens (bytes) until it runs out of input and finally reduces. This is obviously bad.

If I had some way to force NBytes to reduce whenever the length of the nested NBytes reaches 0, that would solve it, but I don't see a way to impose a precedence on an attribute rule. But that's not even a very Haskelly way to do it; what I'd really like to do is more like:

StringOfBytes : byte TakeN { $$.value = $2.value; $2.length = $1.value }

TakeN : (take $$.length INPUTSTREAM) { $$ = $1 }

...because, I mean, that's what take is for, lazily grabbing N elements of a list. How to do this, though, is not immediately obvious. Having one of my token types be a partial function that I then somehow bind to the appropriate length value would also be cool, but again I don't see how to do this.
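
Just to pin down the behaviour I'm describing, here's a hand-rolled sketch over a plain list of bytes (parseStringOfBytes is a name I'm making up for illustration, not something we actually have):

import Data.Word (Word8)

-- Peel off one length-prefixed string: the first byte gives the length,
-- and exactly that many following bytes are the payload. Returns the
-- payload plus the leftover input, or Nothing if the input runs short.
parseStringOfBytes :: [Word8] -> Maybe ([Word8], [Word8])
parseStringOfBytes []       = Nothing
parseStringOfBytes (n:rest)
  | length payload == len   = Just (payload, remainder)
  | otherwise               = Nothing
  where
    len                  = fromIntegral n
    (payload, remainder) = splitAt len rest

The trick is getting that splitAt/take behaviour to happen inside the parser rather than by hand.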

I'm thinking monadic parsing might be a solution, since monadic parsing lets you keep track of state and do various context-dependent things ... but I'm at a loss as to how to actually do this. I've pored through the sample code that ships with Happy, but lack sufficient clue to figure out what to do with it. If anyone can provide some guidance with this, I'd be extremely grateful.
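
The closest I can come to sketching the shape I'm imagining is something like this with parsec, treating plain Chars as stand-ins for bytes (completely untested, so take it with a large grain of salt):

import Text.Parsec
import Text.Parsec.String (Parser)
import Data.Char (ord)

-- Parse the length byte first; monadic bind then makes its value
-- available to decide how many payload bytes to consume.
stringOfBytes :: Parser String
stringOfBytes = do
  lenByte <- anyChar                 -- the length prefix
  count (ord lenByte) anyChar        -- exactly that many payload bytes

Whether that's actually the right way to structure it over real byte input, I have no idea -- hence this post.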

This kind of parsing is problematic in ANTLR as well

(Anonymous) 2009-08-25 05:13 am (UTC)
Don't know if this is the answer, but in the ANTLR world of Java/C++ parsers, byte-level parsers are considered "beneath the tool's scope", and ANTLR's Unicode support gets in the way. Essentially, the abstraction boundary is set too high to work well with run-length parsing.

Haskell, with Char being Unicode, may have the same "overproduced" aspect that sometimes impedes ANTLR. In ANTLR, this "feature" makes parsing "legacy" byte-level data extremely hard to deal with -- my experience with this came from needing to create Yet Another GDS-II Parser and finding that you can't really do it using the ANTLR model of quick-and-dirty parser building.

GDS-II comes from the 9-track-tape era of computing, so it has byte-level opcodes, operands, and run-length codes throughout to deal with multi-tape data linking. Nobody uses 9-track tape anymore, but the file format retains the feature, and for really big designs it still gets used to split data across multiple files. Nonetheless, this 1960s-vintage format is still the de facto lingua franca for representing integrated-circuit design layouts.

The semi-official answer in the ANTLR Java/C++ realm is to "skip ANTLR and code the parser in C - it'll give you the parser you need and it's faster" (from the horse's mouth), which wasn't exactly what I was hoping for, since I wanted to trivially leverage ANTLR's parse-tree features, but it worked in the end anyway.

I'm wondering if the Unicode-ness of Haskell could prevent you from doing what seems like the obvious means of production-rule matching. Haskell may be too dignified -- I'm just learning Haskell, but I can already see how such a parser could be more of a challenge than parsing regular text inputs. But I could just be a newbie on that.

Re: This kind of parsing is problematic in ANTLR as well

[identity profile] maradydd.livejournal.com 2009-08-25 07:32 am (UTC)
Haskell actually has some nice features, like bytestrings, lazy bytestrings, and the Data.Word types, which make handling byte-level data not so bad. My partner in crime on this project has in fact already hand-rolled some Haskell to process a different byte-level format, and it worked just fine, but for ideological and maintenance reasons (we have a lot of byte-level formats to deal with) we're after a parser-based approach that we can standardise.
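
Just to give the flavour of what that toolkit looks like in use, here's a quick sketch with a lazy bytestring (takeField is a made-up name, not code from our actual project):

import qualified Data.ByteString.Lazy as BL

-- Split one length-prefixed field off the front of a lazy ByteString:
-- the first byte is the length, the next N bytes are the payload.
takeField :: BL.ByteString -> Maybe (BL.ByteString, BL.ByteString)
takeField bs = case BL.uncons bs of
  Nothing        -> Nothing
  Just (n, rest) ->
    let len     = fromIntegral n
        payload = BL.take len rest
    in if BL.length payload == len
         then Just (payload, BL.drop len rest)
         else Nothing                 -- input too short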

Fortunately [livejournal.com profile] seldomworks came up with a nifty, elegant answer using parsec -- link's below, do check it out. :)