Multi-char terminators

Adrian Thurston thurs... at cs.queensu.ca
Thu Oct 5 22:24:00 UTC 2006


Hello,

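If you want to remove buffered items when the termination sequence is 
variable length, you might be able to record the length of the buffer at the 
point where the termination sequence starts, then truncate back to that 
length once the terminator completes. This might not always work properly 
though, because the point where the terminator really starts can be ambiguous 
when you first see it (consider ']]]>').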
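
To make the record-and-truncate idea concrete, here is a minimal hand-written 
sketch in C rather than Ragel. It is only an illustration for the fixed ']]>' 
terminator: the names cdata_buf, buf_push and scan_with_mark and the 256-byte 
capacity are assumptions made up for this example, not anything Ragel 
generates. Note that the mark has to be nudged forward on a third ']' in a 
row, which is the sort of ambiguity that makes this approach fragile.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size buffer; the capacity of 256 is an assumption. */
typedef struct {
    char data[256];
    size_t len;   /* number of characters buffered so far */
    size_t mark;  /* buffer length where the candidate "]]" began */
} cdata_buf;

static void buf_push(cdata_buf *b, char c) {
    assert(b->len + 1 < sizeof b->data);  /* keep room for the NUL */
    b->data[b->len++] = c;
    b->data[b->len] = '\0';
}

/* Buffers every character of input up to the '>' of "]]>", then truncates
   the buffer back to the mark. Returns 1 if "]]>" was found, else 0. */
int scan_with_mark(cdata_buf *b, const char *input) {
    size_t n = strlen(input);
    int run = 0;  /* consecutive ']' seen so far, capped at 2 */
    b->len = 0;
    b->mark = 0;
    b->data[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        char c = input[i];
        if (c == ']') {
            if (run == 0)
                b->mark = b->len;  /* a candidate terminator starts here */
            else if (run == 2)
                b->mark++;         /* "]]]": the oldest ']' is content */
            if (run < 2)
                run++;
            buf_push(b, c);
        } else if (c == '>' && run == 2) {
            b->len = b->mark;      /* undo the buffered "]]" */
            b->data[b->len] = '\0';
            return 1;
        } else {
            run = 0;               /* not a terminator after all */
            buf_push(b, c);
        }
    }
    return 0;
}
```

Calling scan_with_mark(&b, "ab]]>") leaves "ab" in b.data. In a Ragel machine 
the same bookkeeping would live in embedded actions.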

But if you want to avoid undoing work you've done, then you need to delay 
buffering. At the moment I can't think of a general way to express the 
delayed buffering of ']' using pure regular languages with embedded actions.

The local error action embedding operators are related to this problem, but 
they are not a good fit in this case.

So, some options:

1. You could build the machine manually. Draw out the state machine you want 
and construct it with the , (join) and -> (epsilon transition) operators. 
Note that you can still embed actions anywhere you want. On the transitions 
that go back to start, buffer the necessary number of ']' characters.

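main :=
     start: (
         # Ordinary content: buffer it right away.
         (any-']') -> start |
         # Possible terminator: hold the ']' back.
         ']' -> one
     ),
     one: (
         ']' -> two |
         # False alarm: buffer one ']' plus this character.
         [^\]] -> start
     ),
     two: (
         '>' -> final |
         # Third ']' in a row: the oldest one is content, buffer it.
         ']' -> two |
         # False alarm: buffer two ']' plus this character.
         [^>\]] -> start
     );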
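
To show where the delayed buffering happens, the same three states can be 
written out by hand in C. This is only a sketch of what the embedded actions 
would need to do, not code generated by Ragel. The names cd_machine, cd_init, 
cd_feed and cd_run and the 256-byte capacity are assumptions made up for this 
example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum cd_state { CD_START, CD_ONE, CD_TWO, CD_FINAL };

/* Hypothetical machine state; the capacity of 256 is an assumption. */
typedef struct {
    enum cd_state state;
    char out[256];  /* buffered content, always NUL-terminated */
    size_t len;
} cd_machine;

static void emit(cd_machine *m, char c) {
    assert(m->len + 1 < sizeof m->out);  /* keep room for the NUL */
    m->out[m->len++] = c;
    m->out[m->len] = '\0';
}

void cd_init(cd_machine *m) {
    m->state = CD_START;
    m->len = 0;
    m->out[0] = '\0';
}

/* Feeds one character. Returns 1 once "]]>" has been seen. */
int cd_feed(cd_machine *m, char c) {
    switch (m->state) {
    case CD_START:
        if (c == ']') m->state = CD_ONE;  /* hold the ']' back */
        else emit(m, c);
        break;
    case CD_ONE:
        if (c == ']') m->state = CD_TWO;  /* hold the second ']' back too */
        else { emit(m, ']'); emit(m, c); m->state = CD_START; }
        break;
    case CD_TWO:
        if (c == '>') m->state = CD_FINAL;  /* terminator: nothing to undo */
        else if (c == ']') emit(m, ']');    /* oldest ']' was content */
        else { emit(m, ']'); emit(m, ']'); emit(m, c); m->state = CD_START; }
        break;
    case CD_FINAL:
        break;  /* input after the terminator is ignored */
    }
    return m->state == CD_FINAL;
}

/* Resets the machine and feeds a whole string. Held-back ']' characters
   are not flushed if the input ends before the terminator. */
int cd_run(cd_machine *m, const char *input) {
    size_t n = strlen(input);
    int done = 0;
    cd_init(m);
    for (size_t i = 0; i < n && !done; i++)
        done = cd_feed(m, input[i]);
    return done;
}
```

For the input "a]b]]]>" this leaves "a]b]" in m.out, and nothing ever has to 
be removed from the buffer.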


2. Use a mini scanner. This is the kind of thing a scanner does really well, 
but it does not give you a machine definition you can embed elsewhere. You 
have to call it. This gives me an idea though. Some scanners can be 
optimized into a pure state machine with no backtracking. Perhaps we can 
allow these to be embedded elsewhere.

3. Take ']' out of CData and add in some patterns like ']' [^\]] which 
accept only strings that look like they could start a termination sequence 
but never go all the way. When such a pattern matches, meaning the candidate 
terminator has failed, its action can write out the necessary number of ']' 
symbols.

Hope this helps.

-Adrian

Colin Fleming wrote:
> Hi all,
> 
> As part of parsing XML, I have the following rules for CData sections:
> 
> CDStart = '<![CDATA[';
> 
> CDEnd = ']]>';
> 
> CData = (Char* -- CDEnd) $each_char;
> 
> CDSect = CDStart CData CDEnd;
> 
> where each_char is a simple action that stores fc in a buffer. The
> problem is that the last two characters in the buffer are always ]],
> because the machine doesn't know until it encounters the > if it
> should exit the CData machine. I work around this with the following:
> 
> CDSect = CDStart CData CDEnd %trim_content;
> 
> where trim_content strips the last two characters of the buffer, but
> it's a bit ugly. It also wouldn't work if the terminator were some
> variable-length production. Is there any general way to handle this
> case?
> 
> Cheers,
> Colin
> 
> 


