[ragel-users] Primitive lookahead question
Adrian Thurston
thurs... at cs.queensu.ca
Thu Sep 20 19:30:43 UTC 2007
Hi Wincent,
I would suggest using your first solution, then manually shortening the
from_file by 2 characters (the trailing " b"). It's a simple solution
which avoids more elaborate backtracking/scanning approaches.
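A minimal sketch of that trimming step, written in Python rather than
Ragel so it can be run on its own. It only emulates what the
"a/" (any+ -- " b/") machine matches; the function name
from_file_by_trimming and its error handling are hypothetical and not
part of Ragel or of the original grammar.

```python
def from_file_by_trimming(line_tail: str) -> str:
    """Emulate `"a/" (any+ -- " b/")`, then trim two characters.

    line_tail is the text after "diff --git ", e.g. "a/hello b/world".
    (Hypothetical helper, for illustration only.)
    """
    if not line_tail.startswith("a/"):
        raise ValueError("expected the from file to start with 'a/'")

    # Look for the first " b/" after the leading "a/".
    sep = line_tail.find(" b/", 2)
    if sep == -1:
        raise ValueError("no ' b/' separator found")

    # The machine keeps matching until just before the '/' that would
    # complete " b/", so its raw match ends in " b".
    raw_match = line_tail[: sep + 2]

    # Manually shorten by two characters to drop that trailing " b".
    return raw_match[:-2]


print(from_file_by_trimming("a/hello b/world"))
print(from_file_by_trimming("a/my file b/my file"))
```

Note that this sketch, like the machine it imitates, stops at the first
" b/" it sees, so a from file that itself contains " b/" would be cut
short.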
What does git produce if the from file contains " b/"?
-Adrian
Wincent Colaiuta wrote:
> Hi,
>
> I'm trying to parse the output of "git diff" and in particular lines
> which look like this:
>
> diff --git a/my file b/my file
>
> Where "a/my file" is the "from file" and "b/my file" is the "to
> file". This is slightly tricky because as you can see there are no
> delimiters between the two paths other than a space, but spaces are
> also allowed inside the paths (and Git only uses quotation marks here
> when the filenames contain embedded tabs, newlines, double-quotes or
> backslash characters).
>
> This means that the only sign that the "from file" has ended and the
> "to file" has begun is when you hit " b/", but by the time you see
> that you're already inside the "to file" part. So I made rules to
> capture the "from file" and the "to file", but my initial attempt at
> a "from file" rule was broken:
>
> from_file = "a/" (any+ -- " b/") ;
>
> The resulting state machine (quite correctly) takes input like:
>
> a/hello b/world
>
> And identifies the "from file" as:
>
> a/hello b
>
> Which is not what we want. One tactic is to mash the "from_file" and
> "to_file" rules into a single rule:
>
> from_to_files = "a/" (any - linefeed)+ " b/" (any - linefeed)+ ;
>
> But that produces a fairly ugly DFA (especially when you add in rules
> for parsing quoted filenames with embedded escape sequences). So I
> tried to implement a primitive form of manual lookahead as follows:
>
> from_file = "a/" (any - linefeed)+ %store " b/" @jumpback;
>
> Where "store" is an action which records the recognized file and
> "jumpback" is just:
>
> action jumpback { p -= 3; }
>
> The idea being that I have to "lookahead" and see the " b/" to know
> that the "from file" has been scanned, but then bump the current
> character pointer back by three so that the machine can resume
> scanning and looking for the "to file".
>
> The generated DFA for the rule looks correct to me and isn't too ugly
> (7 states, about 14 transitions). Is my approach ok, or is there a
> better way?
>
> Apart from that the format I am trying to parse is totally regular,
> unambiguous, and can be parsed without backtracking, which is nice
> for a change!
>
> Cheers,
> Wincent