Feature Request: Inline Scanner

Thu Nov 2 22:55:37 UTC 2006

Hi Carlos,

It seems to me that there are actually two separate features here: inline 
scanners and automatic capture/markup of text. I think each of these raises 
its own set of questions, so it's easiest to talk about them as separate 
problems.

Consider that automatic capture/markup could be implemented on arbitrary 
machine definitions and need not be associated with scanners. Scanners 
always do automatic capture because the scanner may need to backtrack, at 
most to the head of the current pattern. Marking the head of the current 
pattern is what guarantees that the backtrack is safe. The pattern markup 
is more like a bonus.
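
As a rough illustration of why the marks exist, here is a hand-written C sketch (not Ragel output). It assumes a hypothetical scanner with just two patterns, "ab" and "abcd"; the names token_span and scan_one are made up for the example. On input "abcx" the scan runs one character past the last accepting point before failing, so it has to fall back to tokend, and everything from tokstart onward must still be in the buffer.

```c
#include <stddef.h>

/* Hypothetical span type: tokstart is the head of the current pattern,
 * tokend is the end of the longest match seen so far (NULL if none). */
typedef struct {
    const char *tokstart;
    const char *tokend;
} token_span;

/* Longest-match scan for the two assumed patterns "ab" | "abcd". */
static token_span scan_one(const char *p, const char *pe)
{
    static const char longest[] = "abcd";
    token_span t = { p, NULL };
    size_t n = 0;

    while (p < pe && n < 4 && *p == longest[n]) {
        p++;
        n++;
        if (n == 2 || n == 4)
            t.tokend = p;   /* accepting point: remember it */
    }
    /* If the loop read past tokend before failing, the caller resumes
     * scanning at tokend, which is the backtrack. */
    return t;
}
```

For "abcx" this returns a span of length two, and the next scan would restart at the 'c'.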

But if you start doing automatic capture/markup of arbitrary machine 
definitions, then for each machine that you want to capture, do you use new 
variables or some common variables like tokstart/tokend?

If you use new variables, this allows machines that you capture to overlap 
or be contained in one another. But then the question arises: when you're 
breaking the stream into buffer blocks, how do you know how much of the 
input to preserve? You have to consult all possible machine capture 
starting points. That's a cost to consider.
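
To make that cost concrete, here is a minimal C sketch. It assumes one start pointer per captured machine, with NULL meaning that no capture is in progress; the function name earliest_capture is hypothetical. Before refilling the buffer, every one of those pointers has to be inspected to find the earliest byte that must survive.

```c
#include <stddef.h>

/* Assumed layout: starts[i] is the capture start of machine i,
 * or NULL when that machine is not currently capturing. */
static const char *earliest_capture(const char *const *starts, size_t n)
{
    const char *keep = NULL;

    for (size_t i = 0; i < n; i++) {
        if (starts[i] != NULL && (keep == NULL || starts[i] < keep))
            keep = starts[i];
    }
    /* NULL means nothing needs preserving; otherwise keep everything
     * from this pointer to the end of the current block. */
    return keep;
}
```

The work grows with the number of capturing machines and is paid at every block boundary.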

If you use a common var like tokstart you only need to check one variable to 
find out if you need to preserve some prefix of the input. But then captured 
patterns cannot overlap.
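
With a single shared variable the same decision is one comparison. The sketch below assumes a caller-owned buffer buf whose valid data ends at pe; shift_block is a hypothetical helper, not something Ragel generates. It moves the unfinished token to the front of the buffer and reports how many bytes were kept, so the next read can append after them.

```c
#include <stddef.h>
#include <string.h>

/* Preserve the in-progress token (if any) across a buffer refill.
 * Returns the number of bytes kept at the front of buf. */
static size_t shift_block(char *buf, char *pe, char **tokstart)
{
    if (*tokstart == NULL)
        return 0;               /* no token in progress: keep nothing */

    size_t have = (size_t)(pe - *tokstart);
    memmove(buf, *tokstart, have);  /* regions may overlap */
    *tokstart = buf;                /* rebase the mark */
    return have;
}
```

A tokend mark, if one is set, would need to be rebased by the same offset.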

With inline scanners there are a few questions too. What constitutes "would 
begin the machine"? Since there can be a number of patterns in a scanner, is 
it any pattern at all, or is it a specific pattern?

On the other end, what constitutes "finishing the scanner"? Again, any 
pattern at all? I'm not sure about the answers to these questions.

Cheers,
  Adrian

Carlos Antunes wrote:
> Hello!
> 
> I've been playing with Ragel now for roughly three weeks and I find it
> a wonderful piece of software. There is one particular thing, however,
> that Ragel could do better, in my opinion: capturing matched
> input.
> 
> Let's look at an example:
> 
> # Whitespace including folding
> sp = ( ( '\r'? '\n' )? [ \t] )+;
> 
> # From mail header (simplified for illustrative purposes)
> from_header = sp? ( display_name sp? )? '<' email_address '>';
> 
> The state machine Ragel implements is wonderful for ensuring correct
> syntax. The scanner Ragel implements is wonderful for repeatedly matching
> tokens. But how about just capturing input matched by the state
> machine? Well, in this case, things get a little bit more complicated.
> In my opinion, it would be wonderful to have what I'm calling an
> inline scanner as a complement to the two currently implemented
> choices.
> 
> An example of an inline scanner in action would be something like:
> 
> display_name = |> display_name_pattern; { capture_display_name(ts, te); }; <|;
> 
> email_address = |> email_address_pattern; { capture_email_address(ts, te); }; <|;
> 
> Both capture_display_name and capture_email_address are user-defined
> functions accepting locally declared (automatically by Ragel)
> variables 'ts' and 'te' (for tokstart and tokend, respectively).
> 
> With this kind of inline scanner, not only would the syntax be
> enforced, but the input would also be easily captured.
> 
> Transitions from the state machine to the inline scanner would happen
> if and only if the state machine would transition to the state
> machine defined by the inline scanner pattern. And, once the inline
> scanner finishes matching, it would transition to the following state
> machine as usual.
> 
> Although the functionality described above can be achieved with clever
> use of the current state machine and scanner paradigms, in my opinion,
> things would be a lot easier with this inline scanner concept.
> 
> So, Adrian and everybody else, what say you?
> 
> Thanks!
> 
> Carlos
>