Multi-char terminators

Fri Oct 6 18:27:20 UTC 2006

Hi Colin,

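This might do what you want. It holds back each ']' and only buffers it
once the next character shows that it is not part of the terminator: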

action bchar { buff( fpc ); }
action bbrack1 { buff( "]" ); }
action bbrack2 { buff( "]]" ); }

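main :=
    # No ']' pending: buffer ordinary characters right away.
    start: (
        ']' -> one |
        (any-']') @bchar -> start
    ),
    # One ']' is being held back.
    one: (
        ']' -> two |
        [^\]] @bbrack1 @bchar -> start
    ),
    # Two ']' are being held back; '>' completes the terminator.
    two: (
        '>' -> final |
        ']' @bbrack1 -> two |
        [^>\]] @bbrack2 @bchar -> start
    );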
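
To check the transitions by hand, here is a rough Python simulation of the
three states above. It is only a sketch, not Ragel output: read_cdata is a
made-up name, and the list buf stands in for whatever buff() writes to.

```python
def read_cdata(text):
    """Return (content, chars_consumed) for text terminated by ']]>'.

    Hypothetical helper that mirrors the start/one/two states by hand.
    """
    buf = []
    state = "start"
    for i, ch in enumerate(text):
        if state == "start":
            if ch == "]":
                state = "one"        # hold the ']' back for now
            else:
                buf.append(ch)       # bchar
        elif state == "one":
            if ch == "]":
                state = "two"        # now holding ']]'
            else:
                buf.append("]")      # bbrack1: release the held ']'
                buf.append(ch)       # bchar
                state = "start"
        else:                        # state == "two"
            if ch == ">":
                # final: the terminator was never buffered
                return "".join(buf), i + 1
            elif ch == "]":
                buf.append("]")      # bbrack1: oldest ']' is content
            else:
                buf.append("]]")     # bbrack2: both held ']' are content
                buf.append(ch)       # bchar
                state = "start"
    raise ValueError("input ended before ']]>'")
```

For example, read_cdata("a]b]]]>rest") gives ("a]b]", 7): the stray ']'
characters come out as content and nothing ever has to be removed from
the buffer.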

Colin Fleming wrote:
> Hi Adrian,
> 
> Thanks for the response! I need to think about it a bit more.
> Obviously in this case it's not a huge problem, but it might be if I
> move to marking strings rather than copying and a buffer boundary
> happens to break up the terminator. The problem with constructing the
> machine manually is that I can't really do any better than Ragel does
> - if you have no look-ahead, you never know if you're on a terminator
> until the end of it.
> 
> I'll read up a bit about scanners, too, it sounds interesting.
> 
> Cheers,
> Colin
> 
> On 10/5/06, Adrian Thurston <thurs... at cs.queensu.ca> wrote:
>> Hello,
>>
>> If you wanted to remove buffered items when the termination sequence was
>> variable length, you might be able to record the length of the buffer when
>> you start the termination sequence. This might not always work properly though.
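
A rough sketch of the record-the-length idea, simulated by hand in Python
rather than written in Ragel. The names read_with_mark, mark and held are
made up for illustration, and the terminator is assumed to be ']]>'. Note
the extra bookkeeping for a run of three or more ']', which is one of the
places where this approach can go wrong.

```python
def read_with_mark(text):
    """Buffer everything, remember where a candidate terminator began,
    and truncate back to that point once ']]>' is confirmed."""
    buf = []
    mark = None   # buffer length where the candidate terminator started
    held = 0      # consecutive ']' seen so far, capped at 2
    for i, ch in enumerate(text):
        if ch == "]":
            if held == 0:
                mark = len(buf)      # candidate terminator starts here
            elif held == 2:
                mark += 1            # third ']': candidate starts one later
            held = min(held + 1, 2)
        elif ch == ">" and held == 2:
            del buf[mark:]           # undo the buffered ']]'
            return "".join(buf), i + 1
        else:
            held = 0                 # not a terminator after all
            mark = None
        buf.append(ch)
    raise ValueError("input ended before ']]>'")
```

With "a]]]>" as input the mark has to move from 1 to 2 on the third ']',
otherwise the truncation would also eat a ']' that belongs to the content.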
>>
>> But if you want to avoid undoing work you've done, then you need to delay
>> buffering. At the moment I can't think of a general way to express the
>> delayed buffering of ']' using pure regular languages with embedded actions.
>>
>> The local error action embedding operators are related to this problem, but
>> not a good fit in this case.
>>
>> So, some options:
>>
>> 1. You could build a machine manually. Basically draw out the state machine
>> you want and use the ',' and '->' operators to construct it. Note that you
>> can still embed actions anywhere you want. Wherever you go back to start,
>> buffer the necessary number of ']' characters.
>>
>> main :=
>>      start: (
>>          (any-']') -> start |
>>          ']'-> one
>>      ),
>>      one: (
>>          ']' -> two |
>>          [^\]] -> start
>>      ),
>>      two: (
>>          '>' -> final |
>>          ']' -> two |
>>          [^>\]] -> start
>>      );
>>
>>
>> 2. Use a mini scanner. This is the kind of thing a scanner does really well,
>> but it does not give you a machine definition you can embed elsewhere. You
>> have to call it. This gives me an idea though. Some scanners can be
>> optimized into a pure state machine with no backtracking. Perhaps we can
>> allow these to be embedded elsewhere.
>>
>> 3. Take ']' out of CData and add in some patterns like ']' [^\]] which
>> accept only strings which look like they could start a termination sequence,
>> but never go all the way. When they fail they can write out the necessary
>> number of ']' symbols.
>>
>> Hope this helps.
>>
>> -Adrian
>>
>> Colin Fleming wrote:
>>> Hi all,
>>>
>>> As part of parsing XML, I have the following rules for CData sections:
>>>
>>> CDStart = '<![CDATA[';
>>>
>>> CDEnd = ']]>';
>>>
>>> CData = (Char* -- CDEnd) $each_char;
>>>
>>> CDSect = CDStart CData CDEnd;
>>>
>>> where each_char is a simple action that stores fc in a buffer. The
>>> problem is that the last two characters in the buffer are always ]],
>>> because the machine doesn't know until it encounters the > if it
>>> should exit the CData machine. I work around this with the following:
>>>
>>> CDSect = CDStart CData CDEnd %trim_content;
>>>
>>> where trim_content strips the last two characters of the buffer, but
>>> it's a bit ugly. It also wouldn't work if the terminator were some
>>> variable-length production. Is there any general way to handle this
>>> case?
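
To make the problem concrete, here is a small Python simulation of the
buffer-everything-then-trim workaround described above. It is a sketch
only: read_then_trim is a made-up name, and the loop imitates by hand what
each_char and trim_content would do rather than running a Ragel machine.

```python
def read_then_trim(text):
    """Buffer every character, then strip the trailing ']]' once the
    closing '>' shows that the terminator was reached."""
    buf = []
    for i, ch in enumerate(text):
        if ch == ">" and buf[-2:] == ["]", "]"]:
            del buf[-2:]             # trim_content: fixed-length strip
            return "".join(buf), i + 1
        buf.append(ch)               # each_char: buffer unconditionally
    raise ValueError("input ended before ']]>'")
```

The hard-coded 2 is what ties this to a fixed-length terminator. With a
variable-length terminator there is no single number to strip.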
>>>
>>> Cheers,
>>> Colin