Matching multibyte or wide-chars
rakrok
rak... at gmail.com
Thu Apr 10 17:22:40 UTC 2008
Hello,
I'm trying to tokenize multibyte strings. In C-land, I would read in
a multibyte char, convert it to a wide char, and then use the wide char
to test iswalnum, iswalpha, and so on, and tokenize it accordingly.
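For concreteness, here is a rough sketch of what I mean by the C-land approach. The function name count_alnum_tokens is made up for illustration, and it assumes the caller has already set LC_CTYPE (e.g. with setlocale) to match the input encoding; it only counts tokens rather than emitting them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not part of Ragel or any library): counts runs of
 * alphanumeric characters in a multibyte string, decoding one character at
 * a time under the current LC_CTYPE locale.
 * Returns (size_t)-1 on an invalid or truncated multibyte sequence. */
size_t count_alnum_tokens(const char *s)
{
    assert(s != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);          /* start in the initial shift state */

    size_t remaining = strlen(s);
    size_t tokens = 0;
    int in_token = 0;

    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (n == 0)
            break;                      /* decoded a NUL; defensive only */

        if (iswalnum((wint_t)wc)) {
            if (!in_token) {            /* first character of a new token */
                tokens++;
                in_token = 1;
            }
        } else {
            in_token = 0;               /* any non-alnum char ends the token */
        }

        s += n;                         /* advance by the bytes consumed */
        remaining -= n;
    }
    return tokens;
}
```

Calling count_alnum_tokens on a UTF-8 string after setlocale(LC_CTYPE, "") in a UTF-8 locale should classify non-ASCII letters through the C runtime, which is the behaviour I'd like to get out of a Ragel machine.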
In Ragel there is no multibyte/wide-char counterpart to the alnum/alpha
character classes, as far as I can tell. It would have been nice to be
able to define one like so:
walpha = /./ when { iswalpha(towide(p)); };
Here towide() would be a wrapper around mbrtowc. Unfortunately, semantic
conditions aren't supported for the unsigned long alphtype.
So my question is, is there an [easy] way to do this? Ideally it
would be nice to be able to define the acceptance criteria of a
machine to be the same as that of a code block. In that way, I can
use the built-in widechar support in the C runtime, or use ICU, or
whatnot.
I can always try to explicitly list the multibyte/wide chars I'm
interested in, but that means having to implement locale-specific code,
which sounds complex to me.
Any ideas would be greatly appreciated.