[ragel-users] simple parser for #include statements
Mark Olesen
Mark.Olesen at esi-group.com
Mon Apr 23 06:29:36 UTC 2018
Background:
In OpenFOAM (www.OpenFOAM.com) we have a flex-based dependency parser.
It simply goes through the file, finds all the #include "file..."
statements and in turn processes each of them. It has some internal
hashing and a few other bits that make it faster than 'cpp -M'.
However, this flex solution has its own problems, one of which is that
its internal buffer switching means that we can quickly exceed 1024 open
file descriptors and there doesn't seem to be a way to close them after
processing a file.
I thus had a run at writing a ragel-based version that executes about
60% faster than the flex-based version and also does a better job of
closing file descriptors. I was pleased to have found an example to work
from (https://github.com/danmiley/ragel/blob/master/examples/cppscan.rl).
Problem at hand:
In a stripped-down version, I have the following grammar snippet:
    %%{
        machine wmkdep;

        action buffer  { tok = p; /* Local token start */ }
        action process { processFile(std::string(tok, (p - tok))); }

        white = [ \t\f\r];          # Horizontal whitespace
        nl    = white* '\n';        # Newline
        dnl   = [^\n]* '\n';        # Discard up to and including newline

        comment := any* :>> '*/' @{ fgoto main; };

        main := |*
            space*;                 # Discard whitespace, empty lines

            white* '#' white* 'include' white*
                ('"' [^\"]+ >buffer %process '"') dnl;

            '//' dnl;               # 1-line comment
            '/*' { fgoto comment; };    # Multi-line comment
            dnl;                    # Discard all other lines
        *|;
    }%%
However, the stripping of multi-line C-comments was failing and any
#include ... mentioned in a comment was also being seen.
I figured that the example that I'd found with fgoto must be the right
way, but maybe it wasn't switching at the correct parse point so I
experimented with this instead:
    comment := any* :>> '*/' %{ fgoto main; };
But it was still parsing (not stripping) the C comment.
Finally, I did away with the fgoto and coded it straight up:
    %%{
        machine wmkdep;

        action buffer  { tok = p; /* Local token start */ }
        action process { processFile(std::string(tok, (p - tok))); }

        white = [ \t\f\r];          # Horizontal whitespace
        nl    = white* '\n';        # Newline (allow trailing whitespace)
        dnl   = (any* -- '\n') '\n';    # Discard up to and including newline

        dquot = '"';                # double quote
        dqarg = (any+ -- dquot);    # double quoted argument

        main := |*
            space*;                 # Discard whitespace, empty lines

            white* '#' white* 'include' white*
                (dquot dqarg >buffer %process dquot) dnl;

            '//' dnl;               # 1-line comment
            '/*' any* :>> '*/';     # Multi-line comment
            dnl;                    # Discard all other lines
        *|;
    }%%
I'm fine with this solution. It strips the C comments as I wanted, but
I'd like to understand why the first attempt failed.
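For reference, the behaviour wanted from the scanner can be sketched in plain C++. This is a hand-written approximation of the rules above, not Ragel output and not the OpenFOAM code: the name scanIncludes is hypothetical, the file name must not contain a newline, and a directive on an unterminated last line is still accepted. It may be useful as a test oracle when comparing grammar variants.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical reference scanner: returns the names found in
//   #include "name"
// directives, skipping // and /* */ comments that begin at a token start.
std::vector<std::string> scanIncludes(const std::string& src)
{
    std::vector<std::string> names;
    const std::size_t n = src.size();
    std::size_t i = 0;

    // Horizontal whitespace, as in the 'white' definition
    auto isWhite = [](char c)
    {
        return c == ' ' || c == '\t' || c == '\f' || c == '\r';
    };

    // Position just past the next newline (or end of input): the 'dnl' rule
    auto skipLine = [&](std::size_t pos)
    {
        const std::size_t nl = src.find('\n', pos);
        return (nl == std::string::npos) ? n : nl + 1;
    };

    while (i < n)
    {
        // space*: discard whitespace and empty lines
        while (i < n && (isWhite(src[i]) || src[i] == '\n' || src[i] == '\v'))
        {
            ++i;
        }
        if (i >= n) break;

        if (src.compare(i, 2, "//") == 0)
        {
            i = skipLine(i);                        // 1-line comment
        }
        else if (src.compare(i, 2, "/*") == 0)
        {
            const std::size_t end = src.find("*/", i + 2);
            if (end == std::string::npos) break;    // unterminated comment
            i = end + 2;                            // multi-line comment
        }
        else
        {
            std::size_t j = i;
            if (src[j] == '#')
            {
                ++j;
                while (j < n && isWhite(src[j])) ++j;
                if (src.compare(j, 7, "include") == 0)
                {
                    j += 7;
                    while (j < n && isWhite(src[j])) ++j;
                    if (j < n && src[j] == '"')
                    {
                        const std::size_t beg = j + 1;
                        const std::size_t end = src.find_first_of("\"\n", beg);
                        if (end != std::string::npos && end > beg && src[end] == '"')
                        {
                            names.push_back(src.substr(beg, end - beg));
                        }
                    }
                }
            }
            i = skipLine(i);                        // discard rest of the line
        }
    }
    return names;
}
```

Like the grammar, this only recognises a comment that starts at a token boundary; a '/*' that follows other text on the same line is discarded together with that line.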
Additionally, I found the behaviour of the 'dnl' construction (same name
and behaviour as m4 dnl) rather intriguing. Since the purpose is to delete
through to and including the newline, I'd expressed it like this:
    dnl = [^\n]* '\n';
However, I found that the following version
    dnl = (any* -- '\n') '\n';
produced a machine that was slightly tighter. I'd have thought that the
matching would be identical, but the first 'dnl' variant had an
additional intermediate state in the machine. All machines were
generated with ragel 6.9 (since that's what openSUSE Leap 42.3 ships with).
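As a sanity check that the two 'dnl' spellings describe the same set of strings, here is a small C++ sketch. It does not use Ragel: the first predicate expresses [^\n]* '\n' as a std::regex, the second spells out "any text without a newline, followed by one newline" directly. Both function names are hypothetical, and the check says nothing about how Ragel builds or minimises either machine.

```cpp
#include <regex>
#include <string>

// Variant 1: [^\n]* '\n' written as an ECMAScript regular expression.
bool matchesCharClass(const std::string& s)
{
    static const std::regex re("[^\\n]*\\n");
    return std::regex_match(s, re);
}

// Variant 2: (any* -- '\n') '\n', i.e. the only newline in the string
// is its final character.
bool matchesSubtraction(const std::string& s)
{
    return !s.empty() && s.find('\n') == s.size() - 1;
}
```

Feeding both predicates the same inputs (empty string, a bare newline, a line with and without its terminator, two lines) should give identical verdicts, which suggests that any difference lies in the generated machine rather than in what it matches.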
/mark