[ragel-users] Ruby buffer code for streaming scanner
Seamus Abshere
seamus at abshere.net
Mon Jun 13 15:52:30 UTC 2011
Hi,
The Ragel Guide has an excellent set of guidelines for how to "take on
some buffer management functions" when using the longest-match operator
(for scanners):
> \begin{itemize}
> \setlength{\parskip}{0pt}
> \item Read a block of input data.
> \item Run the execute code.
> \item If \verb|ts| is set, the execute code will expect the incomplete
> token to be preserved ahead of the buffer on the next invocation of the execute
> code.
> \begin{itemize}
> \item Shift the data beginning at \verb|ts| and ending at \verb|pe| to the
> beginning of the input buffer.
> \item Reset \verb|ts| to the beginning of the buffer.
> \item Shift \verb|te| by the distance from the old value of \verb|ts|
> to the new value. The \verb|te| variable may or may not be valid. There is
> no way to know if it holds a meaningful value because it is not kept at null
> when it is not in use. It can be shifted regardless.
> \end{itemize}
> \item Read another block of data into the buffer, immediately following any
> preserved data.
> \item Run the scanner on the new data.
> \end{itemize}
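The shifting step in the nested list can be sketched in plain Ruby, independent of Ragel. This is only an illustration: `preserve_incomplete_token` is a hypothetical helper name, and it assumes the buffer is an array of bytes and that `ts` is nil when no token is in progress.

```ruby
# Hypothetical helper (not part of Ragel): carries an incomplete token over
# to the next buffer. Returns [prefix, ts, te].
def preserve_incomplete_token(data, ts, te, pe)
  # No token in progress: nothing to keep, ts stays unset.
  return [[], nil, te] if ts.nil?

  prefix = data[ts...pe]   # bytes from ts up to, but not including, pe
  te -= ts unless te.nil?  # shift te by the same distance ts moves
  [prefix, 0, te]          # ts now points at the start of the buffer
end

# Example: the buffer ends in the middle of a token that began at index 2.
data = "xxSTART_FOO ab".unpack("c*")
prefix, ts, te = preserve_incomplete_token(data, 2, 11, data.length)
puts prefix.pack("c*")
```

The next block of input would then be appended to `prefix` before the execute code runs again.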
I believe the following is a correct implementation in Ruby (see the
#scan! method for the buffering):
> =begin
> %%{
>   machine foo_scanner;
>
>   foo_open = 'START_FOO';
>   foo_close = 'STOP_FOO';
>   foo = foo_open any* :>> foo_close;
>
>   main := |*
>     foo => { emit data[ts...te].pack('c*') };
>     any;
>   *|;
> }%%
> =end
>
> class FooScanner
>   # read stuff in 1 meg at a time
>   CHUNK_SIZE = 1_048_576
>
>   attr_reader :target
>
>   def initialize(target)
>     @target = target
>     %% write data;
>   end
>
>   def emit(foo_entity)
>     puts "I found a foo entity!"
>     puts foo_entity
>   end
>
>   def scan!
>     # Set pe so that ragel doesn't try to get it from data.length
>     pe = -1
>     eof = File.size(target)
>
>     %% write init;
>
>     prefix = []
>     File.open(target) do |f|
>       while chunk = f.read(CHUNK_SIZE)
>         # \item Read a block of input data.
>         data = prefix + chunk.unpack("c*")
>
>         # \item Run the execute code.
>         p = 0
>         pe = data.length
>         %% write exec;
>
>         # \item If \verb|ts| is set, the execute code will expect the incomplete token to be preserved ahead of the buffer on the next invocation of the execute code.
>         unless ts.nil?
>           # \begin{itemize}
>           # \item Shift the data beginning at \verb|ts| and ending at \verb|pe| to the beginning of the input buffer.
>           # pe is one past the last element, so the range is exclusive.
>           prefix = data[ts...pe]
>           # \item Shift \verb|te| by the distance from the old value of \verb|ts| to the new value. The \verb|te| variable may or may not be valid. There is no way to know if it holds a meaningful value because it is not kept at null when it is not in use. It can be shifted regardless. [SWAPPED ORDER]
>           if te
>             te = te - ts
>           end
>           # \item Reset \verb|ts| to the beginning of the buffer. [SWAPPED ORDER]
>           ts = 0
>           # \end{itemize}
>         else
>           prefix = []
>         end
>         # \item Read another block of data into the buffer, immediately following any preserved data.
>         # \item Run the scanner on the new data.
>       end
>     end
>   end
> end
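For anyone unfamiliar with the `:>>` (finish-guarded concatenation) operator used in the `foo` pattern, a rough analogy is a non-greedy regular expression: the match ends at the first `STOP_FOO`. The sketch below is only that analogy in plain Ruby, working on a whole string in memory. It is not how Ragel executes the machine and it does no buffering; `FOO_PATTERN` is a name made up for the illustration.

```ruby
# Rough regex analogy for: foo = foo_open any* :>> foo_close;
# .*? is non-greedy, and /m lets . match newlines, similar to Ragel's `any`.
FOO_PATTERN = /START_FOO.*?STOP_FOO/m

text = "junk START_FOO first STOP_FOO junk STOP_FOO START_FOO second\nline STOP_FOO"

# Each match stops at the first STOP_FOO that follows its START_FOO.
text.scan(FOO_PATTERN).each do |foo_entity|
  puts "I found a foo entity!"
  puts foo_entity
end
```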
You can run it with:
> foo_scanner = FooScanner.new 'foo.txt'
> foo_scanner.scan!
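To exercise the buffering path, the input needs a token that crosses a chunk boundary. Here is a small sketch for generating such a file. `write_straddling_sample` is a hypothetical helper, and its default chunk size is assumed to match `CHUNK_SIZE` in the class above.

```ruby
# Hypothetical helper: writes a file in which one foo token straddles the
# boundary between the first and second chunk of chunk_size bytes.
def write_straddling_sample(path, chunk_size = 1_048_576)
  token = "START_FOO hello STOP_FOO"
  # Pad so the token begins half its own length before the chunk boundary.
  padding = "x" * (chunk_size - token.length / 2)
  File.binwrite(path, padding + token + "\n")
  padding.length # byte offset at which the token starts
end
```

Calling `write_straddling_sample('foo.txt')` before the two lines above should produce an input where the scanner has to carry `START_FOO hel` over into the second read.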
If that is good code, then perhaps it could be added as another example
to the Ragel website?
Thanks,
Seamus