Changes to Ragel in Response to the Cloudflare Incident
Intro
If you’re reading this, then you’re likely familiar with the Cloudflare incident that was disclosed a few days ago. A lot has been written about the problem, following a very detailed blog post by Cloudflare. Unfortunately, due to the initial wording of that blog post, there was a lot of blame placed on Ragel by people who are unfamiliar with Ragel and the realities of using it to produce parsers.
Ragel fgoto Command
Ragel is a state machine compiler. It helps you specify a state machine using regular language constructs. In the Ragel model it is possible to jump from sub-sections of a state machine with an fgoto command in an action. This alters the current state. It is outside of the regular language model and is something you do once in a while.
An fgoto command does not affect the input increment operation. When you issue an fgoto, processing resumes on the next input character. If you want this, then you just use fgoto. If you want to resume processing on the current character you issue an fhold first. If you want to resume somewhere else, you can as well.
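The interaction between fgoto, fhold and the input increment can be modelled in a few lines of C. This is a hand-written illustration, not code produced by Ragel; the function name `resume_char` and its parameters are hypothetical, and it assumes the driver loop advances the input pointer by one after every action.

```c
#include <stddef.h>

/*
 * Hypothetical model (not Ragel output): returns the character on which
 * processing resumes after a jump issued while reading input[at].
 *   hold == 0 models a bare fgoto
 *   hold != 0 models fhold followed by fgoto
 */
char resume_char(const char *input, size_t at, int hold)
{
    size_t p = at;

    if (hold)
        p--;        /* fhold: step the input pointer back by one */

    /* fgoto: only the current state changes, p is left alone */

    p++;            /* the driver loop's usual increment still runs */
    return input[p];
}
```

With the input "abc" and a jump issued on 'b' (offset 1), a bare fgoto resumes on 'c', while fhold followed by fgoto resumes on 'b' again.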
Ragel Error Actions
Ragel allows the programmer to embed actions into a state machine. These execute in various circumstances. Actions are blocks of code written in the host language (in this case C). The Cloudflare incident involved error actions. Error actions execute when the state machine cannot continue normal processing. They are executed in two distinct cases.
- Failure occurs on the current character.
- Failure occurs at EOF (not terminating in a final state).
Combining Error Actions and fgoto Command
In the first error action case it is okay to not issue an fhold before fgoto. In the second case it is not. Since we are already past the end of the buffer, we must fix up the input pointer before resuming. This is true for any kind of action that executes in the EOF case.
This requirement is where things went wrong. There was a missing fhold before the fgoto. However, since Cloudflare was never telling Ragel where the end of the input was, the action was not executing in the EOF case.
Cloudflare would have heavily tested the buggy error action when it executed in the first case and it would have worked just fine. However, turn on the second case and that’s where the problem arises.
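The difference between the two cases can be reproduced with a small hand-written scanner. This is a sketch under stated assumptions, not Cloudflare's grammar and not Ragel output: the states `ST_TEXT` and `ST_TAG`, the function `count_overreads`, and the `cap` safety limit are all hypothetical, and the loop's end test is modelled as an equality check (`++p != pe`). The logical input occupies the first `len` bytes of a larger array of `cap` bytes so that the simulation can count reads past the end without actually leaving its own memory.

```c
#include <stddef.h>

enum state { ST_TEXT, ST_TAG };   /* ST_TAG is not a final state */

/*
 * Hypothetical model of a scanner loop with one error action that jumps
 * back to ST_TEXT. Returns how many reads happened at offsets >= len.
 * hold_before_goto selects whether the action steps p back before jumping.
 */
size_t count_overreads(const char *buf, size_t len, size_t cap,
                       int hold_before_goto)
{
    size_t p = 0, pe = len, eof = len;
    size_t overreads = 0;
    enum state cs = ST_TEXT;
    char c;

    if (p == pe)
        goto test_eof;

resume:
    if (p >= cap)                 /* simulation guard only */
        return overreads;
    if (p >= len)
        overreads++;              /* this read is past the real input */
    c = buf[p];

    if (cs == ST_TEXT) {
        if (c == '<')
            cs = ST_TAG;
    } else if (c == '>') {
        cs = ST_TEXT;
    } else if (c == '<') {
        /* Error action, case 1: failure on the current character. */
        if (hold_before_goto)
            p--;
        cs = ST_TEXT;             /* the jump */
        goto again;
    }

again:
    if (++p != pe)                /* end of input tested by equality */
        goto resume;

test_eof:
    if (p == eof && cs == ST_TAG) {
        /* Error action, case 2: input ended in a non-final state. */
        if (hold_before_goto)
            p--;
        cs = ST_TEXT;             /* the jump */
        goto again;
    }
    return overreads;
}
```

For the input "a<<b>" (failure on the second '<'), the action without the hold produces no overreads, so testing only this case shows nothing wrong. For "ab<cd" (input ends inside a tag) held in a 16-byte array, the same action without the hold moves p to one past pe, the equality test never matches again, and the model counts 10 reads beyond the input before the `cap` guard stops it. With the hold in place both inputs yield 0.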
The fact that error actions execute in two distinct cases is certainly a part of working with Ragel that requires some attention. A single construct has two contexts of sorts, and it’s possible to get a false sense of security when your code works okay in one of those contexts, the more commonly occurring one.
This is part of Ragel’s design that I have debated over the years. However, I maintain that this design is better than the alternative, which would be for a parsing error handler to not execute because it occurred on EOF and not on an input character. It is better for “error action” to cover all senses of “error”.
Changes to Ragel
What can be done here without really altering existing programs much is to enhance the generated fgoto code in the context of EOF actions. Such a change would catch failures to fix up the input pointer when it is vital and forgotten.
The alternative is a warning on a lack of fhold when an fgoto is present in an EOF action. However, there could be no static guarantee that the fhold was correctly placed, and it would be possible for bad code to make it through with no warning.
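One possible shape for such a check is sketched below. It is an illustration only and makes no claim about the code Ragel actually emits: the names `reenter_after_eof_goto` and `reentry_result` are hypothetical. The idea modelled here is that a jump issued from an EOF action first tests whether the input pointer is still at the end of the buffer, and if so routes control back to the EOF check instead of through the increment.

```c
#include <stddef.h>

typedef struct {
    size_t p;              /* input offset after re-entry */
    int    out_of_bounds;  /* 1 if p ended up past pe */
} reentry_result;

/*
 * Hypothetical model of re-entering the scanner after a jump issued in an
 * EOF action, where p == pe on entry.
 *   did_hold : the action stepped p back before jumping (requires p > 0)
 *   guarded  : the jump checks p against pe before re-entering
 */
reentry_result reenter_after_eof_goto(size_t p, size_t pe,
                                      int did_hold, int guarded)
{
    reentry_result r = { p, 0 };

    if (did_hold)
        r.p--;                      /* input pointer was fixed up */

    if (guarded && r.p == pe)
        return r;                   /* not fixed up: go to the EOF check */

    r.p++;                          /* normal re-entry via the increment */
    r.out_of_bounds = (r.p > pe);
    return r;
}
```

In this model a forgotten hold with the guard enabled leaves p at pe and never sets `out_of_bounds`, while programs that already fix up the pointer take exactly the same path as before, which is why existing code would be largely unaffected.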