Feature Request: Inline Scanner
Adrian Thurston
thurs... at cs.queensu.ca
Thu Nov 2 22:55:37 UTC 2006
Hi Carlos,
It seems to me that there are actually two separate features here. One being
inline scanners and the other being automatic capture/markup of text. I
think both of these raise their own set of questions so it's easiest to talk
about them as separate problems.
Consider that automatic capture/markup could be implemented on arbitrary
machine definitions and need not be associated with scanners. Scanners do
automatic capture by default because a scanner may need to backtrack, at
most as far as the head of the current pattern. This is solved by marking
the head of the current pattern so the safety of the backtrack can be
guaranteed. The pattern markup is more like a bonus.
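
To make the backtracking point concrete, here is a minimal hand-written
sketch in C. It is not Ragel-generated code; the token kinds and the
scan_one function are hypothetical names made up for illustration. It
matches either [0-9]+ or [0-9]+ '.' [0-9]+ and shows why the head of the
pattern (ts) has to stay marked while the scanner reads ahead: if the longer
pattern fails, the match falls back to the last accepting position (te).

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical token kinds for this sketch; not part of Ragel. */
enum tok_kind { TOK_NONE, TOK_INT, TOK_FLOAT };

struct token {
    enum tok_kind kind;
    const char *ts; /* head of the current pattern (token start) */
    const char *te; /* one past the end of the longest match so far */
};

/* Scan one token starting at p, never reading at or beyond pe. */
static struct token scan_one(const char *p, const char *pe)
{
    struct token t = { TOK_NONE, p, p };
    assert(p <= pe);

    /* First pattern: [0-9]+ */
    while (p < pe && isdigit((unsigned char)*p))
        p++;
    if (p == t.ts)
        return t; /* nothing matched */
    t.kind = TOK_INT;
    t.te = p; /* remember the last accepting position */

    /* Try to extend the match to [0-9]+ '.' [0-9]+ */
    if (p < pe && *p == '.') {
        const char *frac;
        p++; /* read the '.' speculatively */
        frac = p;
        while (p < pe && isdigit((unsigned char)*p))
            p++;
        if (p > frac) {
            t.kind = TOK_FLOAT;
            t.te = p;
        }
        /* Otherwise t.te still points just after the integer: that is
         * the backtrack. It is only safe because everything from t.ts
         * onward is still in the buffer. */
    }
    return t;
}
```

On the input "12.x" this reads past the '.', fails to find a digit, and
reports an integer token ending after "12". The markup of the token
boundaries falls out of the bookkeeping the backtrack needs anyway.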
But if you start doing automatic capture/markup of arbitrary machine
definitions, then for each machine that you want to capture, do you use new
variables or some common variables like tokstart/tokend?
If you use new variables, this allows machines that you capture to overlap
or be contained in one another. But then the question arises, how do you
know where to preserve the input from when you're breaking the stream into
buffer blocks? You have to consult all possible machine capture starting
points. That's a cost to consider.
If you use a common variable like tokstart, you only need to check one
variable to find out whether you need to preserve some prefix of the input.
But then captured patterns cannot overlap.
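
The difference in cost between the two choices can be sketched in C. This is
only an illustration under assumptions: the function names are hypothetical,
a NULL start pointer is taken to mean "this capture is not active", and pe
is taken to be the end of the current buffer block. Each function returns
the earliest position that must be kept when the next block is read.

```c
#include <assert.h>
#include <stddef.h>

/* Separate variables per captured machine: every possible capture start
 * has to be consulted at each block boundary. */
static const char *preserve_from_many(const char *const *starts, size_t n,
                                      const char *pe)
{
    const char *keep = pe; /* by default nothing needs to be kept */
    size_t i;

    assert(pe != NULL);
    for (i = 0; i < n; i++) {
        if (starts[i] != NULL && starts[i] < keep)
            keep = starts[i]; /* earliest active capture wins */
    }
    return keep;
}

/* One common variable (like tokstart): a single check, but only one
 * capture can be active at a time. */
static const char *preserve_from_one(const char *tokstart, const char *pe)
{
    assert(pe != NULL);
    return tokstart != NULL ? tokstart : pe;
}
```

With separate variables the check is linear in the number of captured
machines; with a common variable it is constant, at the price of ruling out
overlapping captures.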
With inline scanners there are a few questions also: What constitutes "would
begin the machine?" Since there can be a number of patterns in a scanner, is
it any pattern at all? Or is it a specific pattern?
On the other end what constitutes "finishing the scanner?" Again, any
pattern at all? I'm not sure about the answers to these questions.
Cheers,
Adrian
Carlos Antunes wrote:
> Hello!
>
> I've been playing with Ragel now for roughly three weeks and I find it
> a wonderful piece of software. There is one particular thing, however,
> that Ragel could do better, in my opinion, that of capturing matched
> input.
>
> Let's look at an example:
>
> # Whitespace including folding
> sp = ( ( '\r'? '\n' )? [ \t] )+;
>
> # From mail header (simplified for illustrative purposes)
> from_header = sp? ( display_name sp? )? '<' email_address '>';
>
> The state machine Ragel implements is wonderful for ensuring correct
> syntax. The scanner Ragel implements is wonderful for repeatedly matching
> tokens. But how about just capturing input matched by the state
> machine? Well, in this case, things get a little bit more complicated.
> In my opinion, it would be wonderful to have what I'm calling an
> inline scanner as a complement to the two currently implemented
> choices.
>
> An example of an inline scanner in action would be something like:
>
> display_name = |> display_name_pattern; { capture_display_name(ts, te); }; <|;
>
> email_address = |> email_address_pattern; { capture_email_address(ts, te); }; <|;
>
> Both capture_display_name and capture_email_address are user-defined
> functions accepting locally declared (automatically by Ragel)
> variables 'ts' and 'te' (for tokstart and tokend, respectively).
>
> With this kind of inline scanner, not only would the syntax be
> enforced but the input easily captured.
>
> Transitions from the state machine to the inline scanner would happen
> if and only if the state machine would transition to the state
> machine defined by the inline scanner pattern. And, once the inline
> scanner finishes matching, it would transition to the following state
> machine as usual.
>
> Although the functionality described above can be achieved with clever
> use of the current state machine and scanner paradigms, in my opinion,
> things would be a lot easier with this inline scanner concept.
>
> So, Adrian and everybody else, what say you?
>
> Thanks!
>
> Carlos
>