[ragel-users] Priority issues when doing a street name parser
William Lachance
wrlach at gmail.com
Thu Sep 24 19:23:44 UTC 2009
(sorry about the duplicated mail-- stupid gmail sent my message before
it was ready) :)
Hi Adrian,
Thanks for the quick response. Trying to unpack what you're saying--
do you mean I should try to define a scanner (as defined in section
6.3 of the manual) which tries the various possibilities for street
names (in order from most preferred to least)?
So one might have
main := |*
streetWithSuffixAndDirection;
streetWithDirection;
streetWithSuffix;
street;
*|;
?
I was also looking a bit more at regular expressions, and it seems
that Perl-compatible regexes have quantifier options which let you
control how much input a match consumes. For example:
http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
"*? Matches the previous atom zero or more times, while consuming as
little input as possible." seems like exactly what I need (a quick
test indicates it gives the desired behaviour). Would it not be
possible for Ragel to do this sort of thing?
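
For illustration, here is a minimal sketch of that lazy-quantifier
behaviour using Python's re module (not Ragel, and not the Boost test
mentioned above). The suffix and direction lists are heavily
abbreviated stand-ins, and the names SUFFIX, DIRECTION, ADDRESS and
parse_address are hypothetical, chosen only for this example.

```python
import re

# Assumed, heavily abbreviated lists; the real ones would be much longer.
SUFFIX = r"(?:street|st|avenue|ave|road|rd)"
DIRECTION = r"(?:north|south|east|west|ne|nw|se|sw|n|s|e|w)"

ADDRESS = re.compile(
    rf"""
    ^\s*
    (?:(?P<number>\d+)[A-Za-z]?\s+)?        # optional street number plus optional letter
    (?P<name>[A-Za-z]+(?:\s+[A-Za-z]+)*?)   # lazy *?: take as few extra name tokens as possible
    (?:\s+(?P<suffix>{SUFFIX}))?            # optional suffix
    (?:\s+(?P<direction>{DIRECTION}))?      # optional direction
    \s*$
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_address(text):
    """Return a dict of address components, or None if nothing matches."""
    match = ADDRESS.match(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_address("5553 Bella Vista Avenue NW"))
    # {'number': '5553', 'name': 'Bella Vista', 'suffix': 'Avenue', 'direction': 'NW'}
```

Because the name group is lazy, the engine only extends the street
name by another token when the remaining input cannot be matched as
an optional suffix and direction, which is roughly the "stop as soon
as a suffix or direction fits" rule. This relies on a backtracking
matcher, so it is not a statement about what Ragel itself can do.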
Will
2009/9/23 Adrian Thurston <thurston at complang.org>:
> Hi William,
>
> I think what you need is a traditional lexer. See section 6.3 of the manual.
>
> -Adrian
>
> William Lachance wrote:
>> Hi,
>>
>> I'm trying to construct a parser for street addresses using Ragel.
>> That is to say, a machine that will take a free form address like
>> "5553 Barrington Street NW" and parse out the individual components
>> (street number, name, suffix, direction). Everything was going
>> swimmingly until I started to try to add support for street names with
>> multiple tokens in them (e.g. "Bella Vista Avenue NW")
>>
>> Right now my main machine looks like this:
>>
>> streetNumber = (digit+ >getStartStr %endNumber);
>> streetName = (alpha+ (space+ alpha+)*) >getStartStr %endName;
>> suffixFull = space+ suffix;
>> dirFull = space+ direction;
>> main := (streetNumber alpha? space+)? streetName suffixFull? dirFull?;
>>
>> The suffix and dir expressions are really long and boring
>> concatenations like this:
>>
>> directionWest = ("w"i|"west"i) >getStartStr %endDirWest;
>>
>> Anyway, the problem with this simple regular expression is that it
>> doesn't give up on parsing the streetName when it begins parsing the
>> direction and suffix. So in the above example, it will correctly parse
>> "Bella Vista", but then overwrite it with "Avenue", and later "NW". I
>> thought that perhaps adding a few ":>>"'s (to stop the processing of
>> the streetname when suffixes and directions appear) would help:
>>
>> main := (streetNumber alpha? space+)? streetName :>> suffixFull? :>> dirFull? 0;
>>
>> Unfortunately, that seems to have the side effect of terminating
>> parsing of the street name prematurely (bringing us back to square
>> one).
>>
>> It _seems_ like what I'm doing should be straightforward. Basically
>> the rule should be: "keep on parsing the street until you find a token
>> that unambiguously matches a suffix and/or direction; at that point,
>> stop, only keeping the previous tokens". Surely there's a way of
>> expressing that in Ragel?
>>
>
> _______________________________________________
> ragel-users mailing list
> ragel-users at complang.org
> http://www.complang.org/mailman/listinfo/ragel-users
>
--
William Lachance
wrlach at gmail.com