Multi-char terminators
Adrian Thurston
thurs... at cs.queensu.ca
Thu Oct 5 22:24:00 UTC 2006
Hello,
If you wanted to remove buffered items when the termination sequence was
variable length, you might be able to record the length of the buffer when
you start the termination sequence. This might not always work properly though.
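To make the record-the-length idea concrete, here is a minimal sketch in Python rather than Ragel, hand-simulating the matcher for the fixed terminator ']]>'. The function name and the `mark` variable are hypothetical, not part of any Ragel API. Every character is buffered as each_char would do, and the buffer is cut back to the mark once the terminator completes. Note that the mark has to slide forward on input like ']]]>'. Knowing when to move it is the part that gets hard for a truly variable-length terminator.

```python
def read_cdata_truncating(text):
    """Buffer every character, then cut the terminator off at the recorded mark."""
    buf = []
    state = 0      # number of ']' characters matched toward "]]>" (capped at 2)
    mark = None    # buffer length at the point where the candidate terminator began
    for ch in text:
        if state == 2 and ch == '>':
            del buf[mark:]          # drop the "]]" that was buffered too early
            return ''.join(buf)
        if ch == ']':
            if state == 0:
                mark = len(buf)     # a terminator may start here
            elif state == 2:
                mark += 1           # "]]]": the candidate start slides forward by one
            state = min(state + 1, 2)
        else:
            state, mark = 0, None   # false alarm; keep everything buffered so far
        buf.append(ch)              # plays the role of the each_char action
    raise ValueError("unterminated section")
```

Called on the text that follows '<![CDATA[', it returns the content with the trailing ']]' already removed.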
But if you want to avoid undoing work you've done, then you need to delay
buffering. At the moment I can't think of a general way to express the
delayed buffering of ']' using pure regular languages with embedded actions.
The local error action embedding operators are related to this problem, but
are not a good fit in this case.
So, some options:
1. You could build a machine manually. Basically, draw out the state machine
you want and use the , and -> operators to construct it. Note that you can
still embed actions anywhere you want. On the transitions that go back to
start, buffer the necessary number of ']' characters.
    main :=
        start: (
            (any-']') -> start |
            ']' -> one
        ),
        one: (
            ']' -> two |
            [^\]] -> start
        ),
        two: (
            '>' -> final |
            ']' -> two |
            [^>\]] -> start
        );
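As a rough illustration of where the buffering would go in the state chart above, here is a hand-written simulation in Python (not Ragel output; the function name is made up for this sketch). Each ']' is held back until the machine knows it is not part of the terminator, so nothing ever has to be removed from the buffer afterwards.

```python
def read_cdata_delayed(text):
    """Simulate the start/one/two machine, buffering ']' only once it is known to be content."""
    buf = []
    state = 'start'
    for ch in text:
        if state == 'start':
            if ch == ']':
                state = 'one'                 # hold this ']' back for now
            else:
                buf.append(ch)
        elif state == 'one':
            if ch == ']':
                state = 'two'                 # two ']' characters are now held back
            else:
                buf.extend([']', ch])         # back to start: flush one held ']'
                state = 'start'
        else:  # state == 'two'
            if ch == '>':
                return ''.join(buf)           # final: the held "]]" is never buffered
            elif ch == ']':
                buf.append(']')               # the oldest held ']' turns out to be content
            else:
                buf.extend([']', ']', ch])    # back to start: flush both held ']'
                state = 'start'
    raise ValueError("unterminated section")
```

In the Ragel version, the three flushes would be actions embedded on the corresponding transitions back to start and on the ']' self-loop in state two.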
2. Use a mini scanner. This is the kind of thing a scanner does really well,
but it does not give you a machine definition you can embed elsewhere. You
have to call it. This gives me an idea though. Some scanners can be
optimized into a pure state machine with no backtracking. Perhaps we can
allow these to be embedded elsewhere.
3. Take ']' out of CData and add in some patterns like ']' [^\]] which
accept only strings that look like they could start a termination sequence
but never go all the way. When they fail, they can write out the necessary
number of ']' symbols.
Hope this helps.
-Adrian
Colin Fleming wrote:
> Hi all,
>
> As part of parsing XML, I have the following rules for CData sections:
>
> CDStart = '<![CDATA[';
>
> CDEnd = ']]>';
>
> CData = (Char* -- CDEnd) $each_char;
>
> CDSect = CDStart CData CDEnd;
>
> where each_char is a simple action that stores fc in a buffer. The
> problem is that the last two characters in the buffer are always ]],
> because the machine doesn't know until it encounters the > if it
> should exit the CData machine. I work around this with the following:
>
> CDSect = CDStart CData CDEnd %trim_content;
>
> where trim_content strips the last two characters of the buffer, but
> it's a bit ugly. It also wouldn't work if the terminator were some
> variable-length production. Is there any general way to handle this
> case?
>
> Cheers,
> Colin
>
>