I have posted a question on <a href="http://stackoverflow.com/questions/8784903/failed-to-convert-url-parser-regular-expression-to-ragel">Stack Overflow</a> about it. <div><br></div><div><div>I found a URL-parsing regular expression in RFC 2396 and RFC 3986.</div>
<div><br></div><div> ^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?</div><div><br></div><div>I converted it to Ragel:</div><div><br></div><div> %%{ </div><div> # RFC 3986 URI Generic Syntax (January 2005)</div>
<div> machine url_parser;</div><div> </div><div> action pchar {</div><div> printf("%c", fc);</div><div> }</div><div> action scheme { printf("scheme\n"); }</div>
<div> action scheme_end { printf("\nscheme_end\n"); }</div><div> action authority { printf("authority\n"); }</div><div> action authority_end { printf("\nauthority_end\n"); }</div>
<div> action path { printf("path\n"); }</div><div> action path_end { printf("\npath_end\n"); }</div><div> action query { printf("query\n"); }</div>
<div> action query_end { printf("\nquery_end\n"); }</div><div> action fragment { printf("fragment\n"); }</div><div> action fragment_end { printf("\nfragment_end\n"); }</div>
<div> </div><div> scheme = (any - [:/?#])+ >scheme $pchar %scheme_end ;</div><div> authority = (any - [/?#])* >authority $pchar %authority_end ;</div><div> path = (any - [?#])* >path $pchar %path_end ;</div>
<div> query = (any - [#])* >query $pchar %query_end ;</div><div> fragment = (any)* >fragment $pchar %fragment_end ; </div><div> main := (( scheme ":" )?) <: (( "//" authority )?) <: path ( "?" query )? ( "#" fragment )?;</div>
<div> }%%</div><div> </div><div> #include &lt;cstdio&gt;</div><div> #include &lt;cstdlib&gt;</div><div> #include &lt;string&gt;</div><div> </div><div> /** Data **/</div><div> %% write data;</div><div>
</div><div> int main(int argc, char **argv) {</div><div> std::string str(argv[1]);</div><div> char const* p = str.c_str();</div><div> char const* pe = p + str.size();</div><div> char const* eof = pe;</div>
<div> int cs = 0;</div><div> </div><div> %% write init;</div><div> %% write exec;</div><div> </div><div> return p - str.c_str();</div><div> }</div><div><br></div><div>It works when I input an absolute URI:</div>
<div><br></div><div> liangxu@dev64:~$ ./uri_test "<a href="http://www.ics.uci.edu/pub/ietf/uri/?c=www&rot=1&e=%20%20">http://www.ics.uci.edu/pub/ietf/uri/?c=www&rot=1&e=%20%20</a>"</div><div> scheme</div>
<div> http</div><div> scheme_end</div><div> authority</div><div> <a href="http://www.ics.uci.edu">www.ics.uci.edu</a></div><div> authority_end</div><div> path</div><div> /pub/ietf/uri/</div><div> path_end</div>
<div> query</div><div> c=www&rot=1&e=%20%20</div><div> query_end</div><div><br></div><div>It also succeeds when I input only an authority and a path:</div><div><br></div><div> liangxu@dev64:~$ ./uri_test "//<a href="http://www.ics.uci.edu/pub/ietf/uri/?c=www&rot=1&e=%20%20">www.ics.uci.edu/pub/ietf/uri/?c=www&rot=1&e=%20%20</a>"</div>
<div> authority</div><div> <a href="http://www.ics.uci.edu">www.ics.uci.edu</a></div><div> authority_end</div><div> path</div><div> /pub/ietf/uri/</div><div> path_end</div><div> query</div><div> c=www&rot=1&e=%20%20</div>
<div> query_end</div><div><br></div><div>But it fails when I input only a path:</div><div><br></div><div> liangxu@dev64:~$ ./uri_test "/pub/ietf/uri"</div><div><br></div><div>What's wrong?</div><br>-- <br>
Liang Xu<br><br></div>
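<div>For comparison, the RFC regular expression quoted above can be run directly against the same three inputs. This is only a rough sketch using <code>std::regex</code>; <code>UriParts</code> and <code>split_uri</code> are hypothetical names introduced here for illustration and are not part of the Ragel program.</div>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper type (an assumption for this sketch): the five
// components captured by the regular expression from RFC 3986.
struct UriParts {
    std::string scheme;
    std::string authority;
    std::string path;
    std::string query;
    std::string fragment;
};

// Splits a URI reference with the RFC regular expression.
// Returns empty components if the input does not match at all
// (for example, when it contains a line break).
UriParts split_uri(const std::string& uri) {
    static const std::regex re(
        R"re(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)re");

    UriParts parts;
    std::smatch m;
    if (std::regex_match(uri, m, re)) {
        // Nine capture groups plus the whole match.
        assert(m.size() == 10);
        parts.scheme    = m[2].str();  // text before ":"
        parts.authority = m[4].str();  // text after "//"
        parts.path      = m[5].str();
        parts.query     = m[7].str();  // text after "?"
        parts.fragment  = m[9].str();  // text after "#"
    }
    return parts;
}
```

<div>With this sketch, <code>split_uri("/pub/ietf/uri")</code> leaves the scheme and authority empty and puts the whole input in <code>path</code>, so the regular expression itself accepts the path-only input that the Ragel machine rejects.</div>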