[ragel-users] properties list
Torsten Curdt
tcurdt at vafer.org
Thu Dec 3 00:59:36 UTC 2009
Thanks for the response, Adrian.
I got much further today.
> Yes. Ragel makes no assumptions about how the programmer wishes to
> allocate memory for input buffers. Avoiding such assumptions precludes
> automatic capture of matched items.
>
> Your choices are to copy characters into a buffer byte by byte, or to
> retain pointers. The latter approach requires more care if it is
> expected that interesting items span input buffers.
Great. That's essentially what I've been doing now.
    key = '"' @key (any - '"' )* @key_append '"';
    value = '"' @value (any - '"' )* @value_append '"';
    assignment = whitespace* key whitespace* "=" whitespace* value
                 whitespace* @assignment;
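
To make the two options from the quoted reply concrete, here is a minimal C sketch, independent of Ragel. The names `capture_t`, `span_t`, `scan_quoted` and `span_equals` are made up for illustration, and the fixed 64-byte buffer is an arbitrary choice. In a real grammar the append call would live in an action such as @key_append rather than in a hand-written loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Option 1: copy matched bytes into a buffer, one at a time. */
typedef struct {
    char   data[64];   /* illustrative fixed size */
    size_t len;
} capture_t;

static void capture_reset(capture_t *c) {
    c->len = 0;
    c->data[0] = '\0';
}

static void capture_append(capture_t *c, char ch) {
    if (c->len + 1 < sizeof c->data) {   /* keep room for the terminator */
        c->data[c->len++] = ch;
        c->data[c->len] = '\0';
    }
}

/* Option 2: only remember where the item starts and ends in the input. */
typedef struct {
    const char *start;
    const char *end;   /* one past the last byte */
} span_t;

/* Compare a span with a C string without copying it. */
static int span_equals(const span_t *s, const char *text) {
    size_t n = (size_t)(s->end - s->start);
    return strlen(text) == n && memcmp(s->start, text, n) == 0;
}

/* p points just past an opening quote. Fills both captures and returns
   a pointer to the closing quote, or NULL if the input ends first. */
static const char *scan_quoted(const char *p, capture_t *copy, span_t *span) {
    assert(p != NULL && copy != NULL && span != NULL);
    capture_reset(copy);
    span->start = span->end = p;
    for (; *p != '\0'; ++p) {
        if (*p == '"')
            return p;
        capture_append(copy, *p);   /* option 1 */
        span->end = p + 1;          /* option 2 */
    }
    return NULL;
}
```

The copy survives after the input buffer is reused, at the cost of one store per byte. The span costs nothing per byte but is only valid while the buffer it points into is still around, which is the extra care mentioned above for items that span input buffers.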
One thing that still seems problematic, though, is escaped quotes:
    "this here \"test\" is a"
Wondering what the approach is to express this. I was thinking
something along the lines of
    key = '"' @key (any - ([^\\] '"') )* @key_append '"';
...but that obviously doesn't work as hoped. Any pointers here?
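
One formulation often used for this is to treat a backslash together with the character after it as a single unit, roughly `'"' ( [^"\\] | '\\' any )* '"'`, instead of subtracting a two-character pattern from `any`. I have not run that expression against this grammar, so take it as an assumption. The same idea written as a plain C loop (hypothetical helper `scan_escaped`, not part of Ragel):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* p points just past the opening quote. Copies the content into out
   with the escaping backslashes dropped, always NUL-terminating.
   Returns a pointer to the closing quote, or NULL if the string is
   unterminated. Content beyond cap - 1 bytes is silently truncated. */
static const char *scan_escaped(const char *p, char *out, size_t cap) {
    size_t n = 0;
    assert(p != NULL && out != NULL && cap > 0);
    while (*p != '\0' && *p != '"') {
        if (*p == '\\') {
            ++p;                 /* skip the backslash itself */
            if (*p == '\0')
                break;           /* dangling backslash at end of input */
        }
        if (n + 1 < cap)
            out[n++] = *p;       /* plain byte, or the escaped one */
        ++p;
    }
    out[n] = '\0';
    return *p == '"' ? p : NULL;
}
```

The point is that a quote can only end the string when it is seen in the "not after a backslash" position, so no lookbehind on the previous character is needed.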
>> 2. I've had a look at the C grammar but did not really understand how
>> the comment rules worked. I tried with that approach but I could not
>> capture and access the comment text.
>
> See Chapter Four of the manual.
Cool, I came up with something very similar. But now I have changed it to
    comment_c = "/*" @comment ((any @comment_append)* - (any* "*/" any*)) "*/";
    comment_cpp = "//" @comment (any - "\n")* @comment_append "\n";
Thanks for the pointer.
It just seems that my @comment_append method is not positioned correctly.
I am still getting a trailing "*" for the "comment_c". Not sure I
understand why.
>> 4. What about unicode support? I've read that UTF8 should be possible.
>> What about UTF16?
>
> Yes, parsing UTF16 is possible. Ragel is only concerned with processing
> arrays of fixed size characters. These can be 1, 2, 4, etc bytes wide.
> The rest is up to you.
Sounds like converting UTF16 -> UTF8 and then using the proper byte
sequences might be a little easier.
I found the character sequence definitions here:
http://git.wincent.com/wikitext.git?a=blob;f=ext/wikitext_ragel.rl
    action non_printable_ascii {
        c = *p & 0x7f;
    }
    action two_byte_utf8_sequence {
        c = ((uint32_t)(*(p - 1)) & 0x1f) << 6 |
            (*p & 0x3f);
    }
    action three_byte_utf8_sequence {
        c = ((uint32_t)(*(p - 2)) & 0x0f) << 12 |
            ((uint32_t)(*(p - 1)) & 0x3f) << 6 |
            (*p & 0x3f);
    }
    action four_byte_utf8_sequence {
        c = ((uint32_t)(*(p - 3)) & 0x07) << 18 |
            ((uint32_t)(*(p - 2)) & 0x3f) << 12 |
            ((uint32_t)(*(p - 1)) & 0x3f) << 6 |
            (*p & 0x3f);
    }
    (0x01..0x1f | 0x7f) @non_printable_ascii |
    (0xc2..0xdf 0x80..0xbf) @two_byte_utf8_sequence |
    (0xe0..0xef 0x80..0xbf 0x80..0xbf) @three_byte_utf8_sequence |
    (0xf0..0xf4 0x80..0xbf 0x80..0xbf 0x80..0xbf) @four_byte_utf8_sequence
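
As a sanity check on the bit arithmetic, the same masks and shifts can be exercised in plain C outside of Ragel. This is only a sketch: `sequence_length`, `decode_at_last_byte` and `decode_one` are hypothetical names, the lead-byte ranges mirror the expression above except that every byte below 0x80 is treated as a one-byte sequence, and continuation bytes are not validated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by the lead byte, or 0 if it cannot start
   a sequence under the ranges used in the grammar. */
static int sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead >= 0xc2 && lead <= 0xdf) return 2;
    if (lead >= 0xe0 && lead <= 0xef) return 3;
    if (lead >= 0xf0 && lead <= 0xf4) return 4;
    return 0;
}

/* p points at the LAST byte of the sequence, as it does when an
   action embedded with @ runs on the final transition. */
static uint32_t decode_at_last_byte(const unsigned char *p, int len) {
    assert(p != NULL && len >= 1 && len <= 4);
    switch (len) {
    case 1:
        return *p & 0x7f;
    case 2:
        return ((uint32_t)(*(p - 1)) & 0x1f) << 6 |
               (*p & 0x3f);
    case 3:
        return ((uint32_t)(*(p - 2)) & 0x0f) << 12 |
               ((uint32_t)(*(p - 1)) & 0x3f) << 6 |
               (*p & 0x3f);
    default:
        return ((uint32_t)(*(p - 3)) & 0x07) << 18 |
               ((uint32_t)(*(p - 2)) & 0x3f) << 12 |
               ((uint32_t)(*(p - 1)) & 0x3f) << 6 |
               (*p & 0x3f);
    }
}

/* Decode the sequence starting at s into *cp. Returns the number of
   bytes consumed, or 0 if s[0] is not an accepted lead byte. The
   caller must guarantee that the whole sequence is readable. */
static int decode_one(const unsigned char *s, uint32_t *cp) {
    int len;
    assert(s != NULL && cp != NULL);
    len = sequence_length(s[0]);
    if (len == 0)
        return 0;
    *cp = decode_at_last_byte(s + len - 1, len);
    return len;
}
```

For example, the bytes C3 A9 decode to U+00E9 and E2 82 AC decode to U+20AC.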
Still trying to figure out how to use those, though :)
Is there any other example available somewhere?
cheers
--
Torsten