Primitive lookahead question
Wincent Colaiuta
w... at wincent.com
Thu Sep 20 18:01:04 UTC 2007
Hi,
I'm trying to parse the output of "git diff" and in particular lines
which look like this:
diff --git a/my file b/my file
Where "a/my file" is the "from file" and "b/my file" is the "to
file". This is slightly tricky because as you can see there are no
delimiters between the two paths other than a space, but spaces are
also allowed inside the paths (and Git only uses quotation marks here
when the filenames contain embedded tabs, newlines, double-quotes or
backslash charcters).
This means that the only sign that the "from file" has ended and the
"to file" has begun is when you hit " b/", but by the time you see
that you're already inside the "to file" part. So I made rules to
capture the "from file" and the "to file", but my initial attempt at
a "from file" rule was broken:
from_file = "a/" (any+ -- " b/") ;
The resulting state machine (quite correctly) takes input like:
a/hello b/world
And identifies the "from file" as:
a/hello b
Which is not what we want. One tactic is mash the "from_file" and
"to_file" rules into a single rule:
from_to_files = "a/" (any - linefeed)+ " b/" (any - linefeed)+ ;
But that produces a fairly ugly DFA (especially when you add in rules
for parsing quotes filenames with embedded escape sequences). So I
tried to implement a primitive form of manual lookahead as follows:
from_file = "a/" (any - linefeed)+ %store " b/" @jumpback;
Where "store" is an action which records the recognized file and
"jumpback" is just:
action jumpback { p -= 3; }
The idea being that I have to "lookahead" and see the " b/" to know
that the "from file" has been scanned, but then bump the current
character pointer back by three so that the machine can resume
scanning and looking for the "to file".
The generated DFA for the rule looks correct to me and isn't too ugly
(7 states, about 14 transitions). Is my approach ok, or is there a
better way?
Apart from that the format I am trying to parse is totally regular,
unambiguous, and can be parsed without backtracking, which is nice
for a change!
Cheers,
Wincent
More information about the ragel-users
mailing list