[ragel] Fixing issues with ragel HTML grammar.
Adrian Thurston
thurston at colm.net
Mon Jan 23 07:28:18 UTC 2017
Ah sorry Michael, I didn't look at all my mail before I started
responding and so I didn't notice you already responded.
Adrian
On 2017-01-19 19:54, Michael Laing wrote:
> Try changing the definition of 'content' to:
>
> content = (
> any - (space | '<')
> )+;
>
> Cheers,
> ml
>
>> On Jan 18, 2017, at 13:38 , Andrey Kulikov <amdeich at gmail.com> wrote:
>>
>> Hello,
>>
>> In my project I need to extract links from an HTML document. For this purpose I've prepared a Ragel HTML grammar, based primarily on this work:
>> https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl [1] (mentioned here: http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript [2] )
>>
>> Almost everything works well (thanks for the great tool!), except for one issue I haven't been able to overcome:
>>
>> If I give this text as input:
>> bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">
>> my parser correctly extracts the first link, but not the second. The difference between them is that there is a space between 'bbbb' and '<a', but no space between 'cccc' and '<a'.
>> In general, if any text other than spaces comes directly before an '<a' tag, the parser treats the tag as content and does not recognize the tag open.
>>
>> Could anyone please give me a hint on how to improve the existing grammar so that it recognizes the tag open?
>>
>> Please find attached an intentionally simplified sample with the grammar, meant to be built as a C program ( ngx_url_html_portion.rl ). There is also an input file, input-nbsp.html, which is expected to contain the input for the application.
>>
>> To try it out, generate the .c file from the grammar: ragel ngx_url_html_portion.rl
>>
>> then compile the resulting .c file and run the program.
>> The input file should be in the same directory.
>>
>> I would be sincerely grateful for any clue.
>>
>> --
>> Andrey
>> <input-nbsp.html><ngx_url_html_portion.rl>
>> _______________________________________________
>> ragel mailing list
>> ragel at colm.net
>> http://www.colm.net/cgi-bin/mailman/listinfo/ragel
Links:
------
[1] https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl
[2] http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript
[3] http://www.colm.net/cgi-bin/mailman/listinfo/ragel