[ragel-users] properties list

Thu Dec 3 00:59:36 UTC 2009

Thanks for the response, Adrian.

I got much further today.

> Yes. Ragel makes no assumptions about how the programmer wishes to
> allocate memory for input buffers. Avoiding such assumptions precludes
> automatic capture of matched items.
>
> Your choices are to copy characters into a buffer byte by byte, or to
> retain pointers. The latter approach requires more care if it is
> expected that interesting items span input buffers.

Great. That's essentially what I've been doing now.

  key = '"' @key (any - '"' )* @key_append '"';
  value = '"' @value (any - '"' )* @value_append '"';
  assignment = whitespace* key whitespace* "=" whitespace* value
               whitespace* @assignment;

One thing that still seems problematic, though, is escaped quotes.

 "this here \"test\" is a"

Wondering what the approach is to express this. I was thinking
something along the lines of

  key = '"' @key (any - ([^\\] '"') )* @key_append '"';

...but that obviously doesn't work as hoped. Any pointers here?
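
A likely reason the attempt above has no effect: subtraction works on whole
matched strings, and `any` is always one character while `[^\\] '"'` is always
two, so nothing is removed. One common way to express escapes, as far as I
can tell, is to make the backslash consume the following character:
'"' ( [^"\\] | '\\' any )* '"'. The C function below is a hand-written sketch
of that same state machine, not Ragel output; scan_quoted and its parameters
are hypothetical names used only for illustration.

```c
#include <stddef.h>

/*
 * Hand-written equivalent of the (assumed) Ragel expression
 *     '"' ( [^"\\] | '\\' any )* '"'
 * Copies the unescaped body of the quoted string at `in` into `out`
 * and NUL-terminates it. Returns the body length, or -1 if the input
 * is malformed or `out` is too small.
 * scan_quoted is a hypothetical helper, not part of Ragel.
 */
long scan_quoted(const char *in, size_t in_len, char *out, size_t out_cap)
{
    size_t i, n = 0;

    if (in_len == 0 || in[0] != '"')
        return -1;                        /* must open with a quote */
    i = 1;

    while (i < in_len) {
        char ch = in[i++];

        if (ch == '"') {                  /* unescaped quote ends the string */
            if (n >= out_cap)
                return -1;
            out[n] = '\0';
            return (long)n;
        }
        if (ch == '\\') {                 /* backslash: take next byte literally */
            if (i >= in_len)
                return -1;
            ch = in[i++];
        }
        if (n + 1 >= out_cap)             /* keep room for the terminator */
            return -1;
        out[n++] = ch;
    }
    return -1;                            /* input ended before the closing quote */
}
```

In a Ragel grammar the two alternatives would presumably each carry their own
append action, so the backslash itself is never copied.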

>> 2. I've had a look at the C grammar but did not really understand how
>> the comment rules worked. I tried with that approach but I could not
>> capture and access the comment text.
>
> See Chapter Four of the manual.

Cool, I came up with something very similar. But now I have changed it to

  comment_c = "/*" @comment ((any @comment_append)* - (any* "*/" any*)) "*/";
  comment_cpp = "//" @comment (any - "\n")* @comment_append "\n";

Thanks for the pointer.

It just seems that my @comment_append method is not positioned correctly.
I am still getting a trailing "*" for the "comment_c". Not sure I
understand why.
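
A plausible explanation for the trailing "*": when the machine reads the "*"
of the closing "*/", it cannot yet know whether that character belongs to the
body or to the terminator, so the action attached to `any` still fires for
it. Only the following "/" settles the question. One way around this, sketched
below under that assumption, is to keep pointers and back the end pointer up
by one once the "/" arrives. comment_body is a hypothetical hand-written
helper, not something Ragel generates.

```c
#include <stddef.h>

/*
 * Locates the body of a C-style comment that starts at `in`.
 * On success, stores a pointer to the first body byte in *body and
 * returns the body length, excluding the closing star-slash pair.
 * Returns -1 if `in` does not hold a complete comment.
 * comment_body is a hypothetical helper for illustration only.
 */
long comment_body(const char *in, size_t len, const char **body)
{
    const char *pe = in + len;
    const char *mark;
    const char *p;

    if (len < 4 || in[0] != '/' || in[1] != '*')
        return -1;
    mark = in + 2;                    /* what an entering action would record */

    for (p = mark + 1; p < pe; p++) {
        if (p[-1] == '*' && p[0] == '/') {
            /* p sits on the final '/', so the body stops before the
             * star at p - 1; that star is the byte that leaked into
             * the appended text. */
            *body = mark;
            return (long)((p - 1) - mark);
        }
    }
    return -1;                        /* terminator never seen */
}
```

The same idea applies to a byte-by-byte buffer: append as before, then drop
the last appended byte in the action that runs on the closing "/".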

>> 4. What about unicode support? I've read that UTF8 should be possible.
>> What about UTF16?
>
> Yes, parsing UTF16 is possible. Ragel is only concerned with processing
> arrays of fixed size characters. These can be 1, 2, 4, etc bytes wide.
> The rest is up to you.

Sounds like converting UTF16 -> UTF8 and then using the proper byte
sequences might be a little easier.
I found the character sequence definitions here:

 http://git.wincent.com/wikitext.git?a=blob;f=ext/wikitext_ragel.rl

action non_printable_ascii {
    c = *p & 0x7f;
}

action two_byte_utf8_sequence {
    c = ((uint32_t)(*(p - 1)) & 0x1f) << 6 |
        (*p & 0x3f);
}

action three_byte_utf8_sequence {
    c = ((uint32_t)(*(p - 2)) & 0x0f) << 12 |
        ((uint32_t)(*(p - 1)) & 0x3f) << 6 |
        (*p & 0x3f);
}

action four_byte_utf8_sequence {
    c = ((uint32_t)(*(p - 3)) & 0x07) << 18 |
        ((uint32_t)(*(p - 2)) & 0x3f) << 12 |
        ((uint32_t)(*(p - 1)) & 0x3f) << 6 |
        (*p & 0x3f);
}

(0x01..0x1f | 0x7f)                             @non_printable_ascii        |
(0xc2..0xdf 0x80..0xbf)                         @two_byte_utf8_sequence     |
(0xe0..0xef 0x80..0xbf 0x80..0xbf)              @three_byte_utf8_sequence   |
(0xf0..0xf4 0x80..0xbf 0x80..0xbf 0x80..0xbf)   @four_byte_utf8_sequence

Still trying to figure out how to use those though :)
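
For reference, here is a rough standalone sketch of what those actions
compute. Each action runs on the last byte of a sequence, which is why it
reaches back with p - 1, p - 2 and so on; the function below does the same
arithmetic with forward indexing. utf8_decode is a hypothetical name, and
accepting all of 0x00..0x7f as single bytes is my assumption (the quoted
rule only covers the non-printable ones).

```c
#include <stddef.h>
#include <stdint.h>

/* True for a UTF-8 continuation byte, 0x80..0xbf. */
static int is_cont(unsigned char b)
{
    return b >= 0x80 && b <= 0xbf;
}

/*
 * Decodes one UTF-8 sequence from s (at most len bytes) into *c, using
 * the byte ranges and bit arithmetic of the quoted actions.
 * Returns the number of bytes consumed, or 0 if no range matches.
 * utf8_decode is a hypothetical helper for illustration only.
 */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *c)
{
    if (len >= 1 && s[0] <= 0x7f) {
        *c = s[0] & 0x7f;                           /* single byte */
        return 1;
    }
    if (len >= 2 && s[0] >= 0xc2 && s[0] <= 0xdf && is_cont(s[1])) {
        *c = ((uint32_t)s[0] & 0x1f) << 6 |
             (s[1] & 0x3f);
        return 2;
    }
    if (len >= 3 && s[0] >= 0xe0 && s[0] <= 0xef &&
        is_cont(s[1]) && is_cont(s[2])) {
        *c = ((uint32_t)s[0] & 0x0f) << 12 |
             ((uint32_t)s[1] & 0x3f) << 6 |
             (s[2] & 0x3f);
        return 3;
    }
    if (len >= 4 && s[0] >= 0xf0 && s[0] <= 0xf4 &&
        is_cont(s[1]) && is_cont(s[2]) && is_cont(s[3])) {
        *c = ((uint32_t)s[0] & 0x07) << 18 |
             ((uint32_t)s[1] & 0x3f) << 12 |
             ((uint32_t)s[2] & 0x3f) << 6 |
             (s[3] & 0x3f);
        return 4;
    }
    return 0;                                       /* not a recognised sequence */
}
```

Like the quoted ranges, this does not reject overlong three- and four-byte
forms or surrogate code points, so stricter input checking would need
narrower ranges for the second byte.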

Is there any other example available somewhere?

cheers
--
Torsten