Entry tags:
There's gotta be a way to do this ...
... and I'm just too much of an idiot to see it, so maybe one of you smart people can help me.
I'm writing a parser for a context-sensitive grammar in Haskell (for a language that I didn't design). At the moment I'm using Happy but I'm completely open to using parsec, uuagc, whatever will get the job done. Here's my problem:
All tokens of this language are bytes. One of the nonterminals has, as its right-hand side, a byte (which evaluates to an integer value -- I have an attribute rule for this) followed by N bytes, where N is whatever that first byte evaluated to. The usual attribute-grammar way to handle this seems to look a bit like this (using an inherited attribute, and OBTW this is pseudocode):
Unfortunately, in most cases there's more content past the end of the StringOfBytes, and the way this rule is written, NBytes will happily keep shifting tokens (bytes) until it runs out of input and finally reduces. This is obviously bad.
If I had some way to force NBytes to reduce whenever the length of the nested NBytes reaches 0, that would solve it, but I don't see a way to impose a precedence on an attribute rule. But that's not even a very Haskelly way to do it; what I'd really like to do is more like:
...because, I mean, that's what take is for, lazily grabbing N elements of a list. How to do this, though, is not immediately obvious. Having one of my token types be a partial function that I then somehow bind to the appropriate length value would also be cool, but again I don't see how to do this.
I'm thinking monadic parsing might be a solution, since monadic parsing lets you keep track of state and do various context-dependent things ... but I'm at a loss as to how to actually do this. I've pored through the sample code that ships with Happy, but lack sufficient clue to figure out what to do with it. If anyone can provide some guidance with this, I'd be extremely grateful.
I'm writing a parser for a context-sensitive grammar in Haskell (for a language that I didn't design). At the moment I'm using Happy but I'm completely open to using parsec, uuagc, whatever will get the job done. Here's my problem:
All tokens of this language are bytes. One of the nonterminals has, as its right-hand side, a byte (which evaluates to an integer value -- I have an attribute rule for this) followed by N bytes, where N is whatever that first byte evaluated to. The usual attribute-grammar way to handle this seems to look a bit like this (using an inherited attribute, and OBTW this is pseudocode):
StringOfBytes : byte NBytes { $$.value = $2.value; $2.length = $1.value } NBytes : {- empty -} { $$.value = []; unless $$.length == 0 error } | byte NBytes { $$.value = $1.value : $2.value; $2.length = $$.length - 1 }
Unfortunately, in most cases there's more content past the end of the StringOfBytes, and the way this rule is written, NBytes will happily keep shifting tokens (bytes) until it runs out of input and finally reduces. This is obviously bad.
If I had some way to force NBytes to reduce whenever the length of the nested NBytes reaches 0, that would solve it, but I don't see a way to impose a precedence on an attribute rule. But that's not even a very Haskelly way to do it; what I'd really like to do is more like:
StringOfBytes : byte TakeN { $$.value = $2.value; $2.length = $1.value } TakeN : (take $$.length INPUTSTREAM) { $$ = $1 }
...because, I mean, that's what take is for, lazily grabbing N elements of a list. How to do this, though, is not immediately obvious. Having one of my token types be a partial function that I then somehow bind to the appropriate length value would also be cool, but again I don't see how to do this.
I'm thinking monadic parsing might be a solution, since monadic parsing lets you keep track of state and do various context-dependent things ... but I'm at a loss as to how to actually do this. I've pored through the sample code that ships with Happy, but lack sufficient clue to figure out what to do with it. If anyone can provide some guidance with this, I'd be extremely grateful.
no subject
no subject
This kind of parsing is problematic in ANTLR as well
(Anonymous) 2009-08-25 05:13 am (UTC)(link)Haskell, with Char being Unicode, may have this same "overproduced" aspect that impeds ANTLR sometimes. In ANTLR, this "feature" makes parsing "legacy" byte-level data extremely hard to deal with - my experience with this came from of needing to create Yet Another GDS-II Parser and finding you can't really do it using the ANTLR "model" of quick-and-dirty parser building.
GDS-II comes from a 9-track tape era of computing so it has byte-level op codes, operands and run-length codes through out to deal with multi-tape data linking. Nobody uses 9-track anymore but the file format has the feature and for really big designs it still gets used to handle splitting into multiple files. Nonetheless, this 1960s vintage format is still the de facto lingua franca for representing integrated circuit design layouts.
The answer in the ANTLR Java/C++ realm is semi-officially to "skip ANTLR and code the parser in C - it'll give you the parser you need and it's faster" (from the horse's mouth) which wasn't exactly what I was hoping for as I wanted to trivially leverage the parse tree features in ANTLR, but it worked in the end anyway.
I'm wondering if the Unicode-ness of Haskell could prevent you from doing what seems the obvious means production rule matching. Haskell may be too dignified - I just learning Haskell but I can already see how such parser could be a challenge over parsing regular text inputs. But I could just be a newbie on that.
Re: This kind of parsing is problematic in ANTLR as well
Fortunately
no subject
(Anonymous) 2009-08-25 05:54 am (UTC)(link)So nb is the length byte, the length of the content to come. bs is the content.
no subject
I extracted pieces of some old code to write this.
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=8551#a8551
It parses a StringOfInts. There is a test at the that you can run on the command line. Hopefully you will be able to plugin a byte parse instead of my Int parser and use it.
Hope it helps.
Anish
no subject
Thank you very much for the example -- it's quite clear and gives me some other ideas for how to use monads to design the entire parser more elegantly. I'll give it a try with Data.Word8 as the token type.
no subject
<<NBytes:1/int, Rest/bytes>> = Binary,
<<StringOfBytes:NBytes/bytes, Rest2/bytes>> = Rest
and there is a BitSyntax (http://hackage.haskell.org/package/BitSyntax) package on Hackage, which should let you do something like that.
For monadic parsing,
no subject
no subject
(Anonymous) 2009-08-25 09:01 pm (UTC)(link)