maradydd: (fail)
maradydd ([personal profile] maradydd) wrote2009-08-25 02:50 am

There's gotta be a way to do this ...

... and I'm just too much of an idiot to see it, so maybe one of you smart people can help me.

I'm writing a parser for a context-sensitive grammar in Haskell (for a language that I didn't design). At the moment I'm using Happy but I'm completely open to using parsec, uuagc, whatever will get the job done. Here's my problem:

All tokens of this language are bytes. One of the nonterminals has, as its right-hand side, a byte (which evaluates to an integer value -- I have an attribute rule for this) followed by N bytes, where N is whatever that first byte evaluated to. The usual attribute-grammar way to handle this seems to look a bit like this (using an inherited attribute, and OBTW this is pseudocode):

StringOfBytes : byte NBytes { $$.value = $2.value; $2.length = $1.value }

NBytes : {- empty -} { $$.value = []; unless $$.length == 0 error }
       | byte NBytes { $$.value = $1.value : $2.value; $2.length = $$.length - 1 }

Unfortunately, in most cases there's more content past the end of the StringOfBytes, and the way this rule is written, NBytes will happily keep shifting tokens (bytes) until it runs out of input and finally reduces. This is obviously bad.

If I had some way to force NBytes to reduce whenever the length of the nested NBytes reaches 0, that would solve it, but I don't see a way to impose a precedence on an attribute rule. But that's not even a very Haskelly way to do it; what I'd really like to do is more like:

StringOfBytes : byte TakeN { $$.value = $2.value; $2.length = $1.value }

TakeN : (take $$.length INPUTSTREAM) { $$ = $1 }

...because, I mean, that's what take is for, lazily grabbing N elements of a list. How to do this, though, is not immediately obvious. Having one of my token types be a partial function that I then somehow bind to the appropriate length value would also be cool, but again I don't see how to do this.

I'm thinking monadic parsing might be a solution, since monadic parsing lets you keep track of state and do various context-dependent things ... but I'm at a loss as to how to actually do this. I've pored through the sample code that ships with Happy, but lack sufficient clue to figure out what to do with it. If anyone can provide some guidance with this, I'd be extremely grateful.

[identity profile] neoliminal.livejournal.com 2009-08-25 01:46 am (UTC)(link)
Helping you in this regard would require reading at least 12 wikipedia articles and I think someone else will have an answer much faster than that.

[identity profile] http://users.livejournal.com/_rck_/ 2009-08-25 03:20 am (UTC)(link)
look at the way the average function is defined in the article for attribute grammars that you cited some posts ago; that is really the most relevant thing, since the lenght-has-been-reached predicate has to be iteratively evaluated just like the avg function

This kind of parsing is problematic in ANTLR as well

(Anonymous) 2009-08-25 05:13 am (UTC)(link)
Don't know if this is the answer but in the ANTLR world of Java/C++ parsers, byte-level parsers are considered "beneath tool's scope" and the Unicode support in ANTLR gets in the way. Essentially the abstraction boundary is set too high to work well with run-length parsing.

Haskell, with Char being Unicode, may have this same "overproduced" aspect that impeds ANTLR sometimes. In ANTLR, this "feature" makes parsing "legacy" byte-level data extremely hard to deal with - my experience with this came from of needing to create Yet Another GDS-II Parser and finding you can't really do it using the ANTLR "model" of quick-and-dirty parser building.

GDS-II comes from a 9-track tape era of computing so it has byte-level op codes, operands and run-length codes through out to deal with multi-tape data linking. Nobody uses 9-track anymore but the file format has the feature and for really big designs it still gets used to handle splitting into multiple files. Nonetheless, this 1960s vintage format is still the de facto lingua franca for representing integrated circuit design layouts.

The answer in the ANTLR Java/C++ realm is semi-officially to "skip ANTLR and code the parser in C - it'll give you the parser you need and it's faster" (from the horse's mouth) which wasn't exactly what I was hoping for as I wanted to trivially leverage the parse tree features in ANTLR, but it worked in the end anyway.

I'm wondering if the Unicode-ness of Haskell could prevent you from doing what seems the obvious means production rule matching. Haskell may be too dignified - I just learning Haskell but I can already see how such parser could be a challenge over parsing regular text inputs. But I could just be a newbie on that.

Re: This kind of parsing is problematic in ANTLR as well

[identity profile] maradydd.livejournal.com 2009-08-25 07:32 am (UTC)(link)
Haskell actually has some nice features, like bytestrings, lazy bytestrings, and the Data.Word types, which make handling byte-level data not so bad. My partner in crime on this project in fact already handrolled some Haskell to process a different byte-level format and it worked just fine, but for ideological and maintenance reasons (we have a lot of byte-level formats to deal with) we're after a parser-based approach that we can standardise.

Fortunately [livejournal.com profile] seldomworks came up with a nifty, elegant answer using parsec -- link's below, do check it out. :)

(Anonymous) 2009-08-25 05:54 am (UTC)(link)
I just skimmed this, so sorry if I misunderstood, but I think you just want to do something like

... = do
  nb <- char
  bs <- count (ord nb) char

So nb is the length byte, the length of the content to come. bs is the content.

[identity profile] seldomworks.livejournal.com 2009-08-25 06:16 am (UTC)(link)
You are right about monadic parsers making it easy
I extracted pieces of some old code to write this.
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=8551#a8551

It parses a StringOfInts. There is a test at the that you can run on the command line. Hopefully you will be able to plugin a byte parse instead of my Int parser and use it.

Hope it helps.

Anish

[identity profile] maradydd.livejournal.com 2009-08-25 07:23 am (UTC)(link)
Marvelous! I'd been wanting to try out parsec anyway for some of the other parser-combinator features it provides, plus it sounds like it might have a speed advantage over Happy and that's important for this particular application.

Thank you very much for the example -- it's quite clear and gives me some other ideas for how to use monads to design the entire parser more elegantly. I'll give it a try with Data.Word8 as the token type.

[identity profile] alexey-rom.livejournal.com 2009-08-25 06:48 am (UTC)(link)
This isn't directly useful, but with Erlang bitsyntax you can do

<<NBytes:1/int, Rest/bytes>> = Binary,
<<StringOfBytes:NBytes/bytes, Rest2/bytes>> = Rest

and there is a BitSyntax (http://hackage.haskell.org/package/BitSyntax) package on Hackage, which should let you do something like that.

For monadic parsing, [livejournal.com profile] seldomworks' solution works.

[identity profile] maradydd.livejournal.com 2009-08-25 07:25 am (UTC)(link)
Cool! I'll keep this in the toolbox -- it's bound to prove useful. Thank you!

(Anonymous) 2009-08-25 09:01 pm (UTC)(link)
You can use parsec, it has an instance of ByteString.Lazy for its Stream typeclass (http://hackage.haskell.org/packages/archive/parsec/3.0.0/doc/html/Text-Parsec-ByteString-Lazy.html). If you never used parsec before, start reading rwh (http://book.realworldhaskell.org/read/using-parsec.html). You can also use a state monad transformer to achieve a more idiomatic style. Bye!