[ragel-users] Re: 'string' ranges
Paul
r.lp... at gmail.com
Sat Apr 7 08:54:09 UTC 2007
> would you mind sending a message to the list to say how it went?
Thanks Adrian, yeah I will remember to do that when I get to a good solution.
> # 0x0A07-0x0D40
> r2 =
> 0x0A ( 0x07 .. 0xFF ) |
> ( 0x0B | 0x0C ) any |
> 0x0D ( 0x00 .. 0x40 );
For now, I will attempt to run with this...
I have a script that takes my Unicode ranges and converts them to UTF-8 byte
ranges, then runs those through a hackish set of regular expressions
to get something like
(
(0xE4 (0xB8 .. 0xBF) (0x80 .. 0xBF)) |
((0xE5 .. 0xE8) (0x80 .. 0xBF){2}) |
(0xE9 (((0x80 .. 0xBD) (0x80 .. 0xBF)) |
(0xBE (0x80 .. 0xA5))))
)
from the Unicode range [0x4E00-0x9FA5].
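
A minimal sketch of what such a conversion script could look like, in Python. This is not the script described above: the function names utf8_ranges and to_ragel are hypothetical, and it assumes the input range is in ascending order and contains no surrogates (U+D800-U+DFFF). It splits a code point range wherever the UTF-8 length changes or a continuation byte would stop covering a whole 0x80 .. 0xBF block, so continuation bytes are always restricted to 0x80 .. 0xBF rather than matched with any.

```python
def utf8_ranges(lo, hi):
    """Yield byte-range sequences that together match code points lo..hi.

    Each yielded item is a list of (low_byte, high_byte) pairs, one per byte.
    Assumes lo <= hi and that the range contains no surrogates.
    """
    # Split where the encoded length changes (1, 2, 3, 4 bytes).
    for boundary in (0x7F, 0x7FF, 0xFFFF):
        if lo <= boundary < hi:
            yield from utf8_ranges(lo, boundary)
            yield from utf8_ranges(boundary + 1, hi)
            return
    if hi <= 0x7F:
        yield [(lo, hi)]
        return
    # Split until, at each continuation-byte level, the two ends either
    # share their high bits or span a complete block of low bits.
    for i in range(1, 4):
        mask = (1 << (6 * i)) - 1
        if lo & ~mask != hi & ~mask:
            if lo & mask != 0:
                yield from utf8_ranges(lo, lo | mask)
                yield from utf8_ranges((lo | mask) + 1, hi)
                return
            if hi & mask != mask:
                yield from utf8_ranges(lo, (hi & ~mask) - 1)
                yield from utf8_ranges(hi & ~mask, hi)
                return
    # Both ends now encode to the same length; pair them up byte by byte.
    yield list(zip(chr(lo).encode("utf-8"), chr(hi).encode("utf-8")))


def to_ragel(lo, hi):
    """Format the sequences as a Ragel union of concatenated byte ranges."""
    def byte_range(a, b):
        return f"0x{a:02X}" if a == b else f"(0x{a:02X} .. 0x{b:02X})"

    alternatives = [
        " ".join(byte_range(a, b) for a, b in seq)
        for seq in utf8_ranges(lo, hi)
    ]
    return "(\n  " + " |\n  ".join(alternatives) + "\n)"


if __name__ == "__main__":
    print(to_ragel(0x4E00, 0x9FA5))
```

For [0x4E00-0x9FA5] this yields four alternatives, with lead bytes 0xE4, 0xE5 .. 0xE8, and 0xE9 (twice, once for second bytes 0x80 .. 0xBD and once for 0xBE). Emitting flat alternatives instead of nesting them keeps the generator simple; Ragel's minimization should merge the shared prefixes anyway.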
> You're probably aware of this but I'll mention it just to put it out
> there ... [snip]
Yeah, although the majority of my transitions are on ASCII characters, and I
only want to handle proper UTF-8 strings instead of (any -- ascii)* to be
ultra neurotic in a few cases. I wanted to see what happens to the number of
states and average performance before moving to another Unicode encoding or
abandoning the extra correctness.
Thanks again,
- Paul