A simple intro to writing a lexer with Ragel.
It seems that there is a fair variety of tools designed to make writing lexers, scanners and tokenizers easier, but Ragel has a reputation for being simple to use and for consistently producing the fastest final code. This is a short and simple intro to Ragel for a common use case: writing a lexer for a programming language.
Start by making sure you have Ragel installed. This process varies by OS, so I'm not going to cover it in detail. On OS X it's as easy as installing MacPorts, opening a terminal, and typing 'sudo port install ragel'. You should then be able to use Ragel from the command line.
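On OS X with MacPorts that looks like this (most other package managers carry a 'ragel' package too):

```shell
sudo port install ragel
ragel --version    # confirms the install worked
```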
Next you need to decide on your 'host language'. The host language is preferably the same language that the rest of your project is written in, though many people opt to use C for their lexer because of the dramatic increase in speed that it provides. Ragel supports a number of host languages including: C, C++, Objective-C, D, Java and Ruby. For the purpose of this intro we will be using Ruby as it won't obscure the Ragel-specific code as much as other languages would.
First let's create a blank file called 'lexer.rl' and then define an empty state machine inside it. Ragel state machines are defined inside blocks demarcated with '%%{ }%%' like so:
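The empty machine might look like this (the machine name and the 'write data' directive are discussed next):

```ragel
%%{
  machine test_lexer;
}%%

%% write data;
```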
In this Ragel code block we are defining a blank state machine called 'test_lexer', then telling Ragel that the state machine should be compiled in this file using the 'write data' directive. The ability to define where the state machine should be compiled is more useful once we start defining machines that span multiple files. Regions outside of '%%{ }%%' blocks and lines that do not start with '%%' are assumed to be written in the host language.
In Ragel the syntax for defining a lexer/scanner differs from that required for creating a normal state-machine and looks like this:
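The general shape is as follows ('scanner_name', 'token_description' and 'action' are placeholders, not real names):

```ragel
%%{
  machine test_lexer;

  scanner_name := |*
    token_description => { action };
    token_description => { action };
  *|;
}%%
```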
If the 'scanner_name' is 'main' then it is automatically run when our state machine is executed. An 'action' is a section of host-language code that is executed whenever the token represented by 'token_description' is found. It could be something as simple as printing out the token, or it could be code to affect the state of something external like a parser.
Token descriptions can either be a string literal (like 'keyword') or a regular expression literal (like [0-9]*). However it is much easier to read our final scanner if we store our token descriptions inside variables with useful names like so:
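For example (the token names and patterns here are purely illustrative):

```ragel
%%{
  machine test_lexer;

  keyword = 'end';
  integer = [0-9]+;

  main := |*
    keyword => { puts "keyword" };
    integer => { puts "integer" };
  *|;
}%%
```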
We can compile our state machine by passing the file to Ragel like so:
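```shell
ragel -R lexer.rl
```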
The '-R' switch tells Ragel that we're using Ruby as our host language. This command will produce an output file called 'lexer.rb' which we can run by itself or incorporate into a larger project.
In order to actually run a state machine we need to encapsulate several Ragel directives (lines starting with '%%') into a function definition like this:
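One way to write this wrapper in Ruby might be (a sketch; 'write init' also sets up Ragel's internal variables such as 'p', 'pe' and 'cs'):

```ragel
def run_lexer(data)
  # Unpack the string into an array of ordinal values.
  data = data.unpack("c*") if data.is_a?(String)
  # Tell the state machine where the data ends.
  eof = data.length
  # A blank array to collect tokens into.
  token_array = []

  %% write init;
  %% write exec;

  # Report the tokens that were found.
  puts token_array.inspect
end
```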
Our 'run_lexer' function serves several purposes: it unpacks our data string into an array of ordinal values, tells the state machine how long our data is via the 'eof' variable, creates a blank array for our tokens, initializes the state machine with 'write init', executes it with 'write exec', and finally reports the tokens found by printing a human-readable version of our token_array to stdout.
Now that these pieces are in place we're ready to start defining our lexer. The first step is defining our token description for an integer:
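One plausible definition, using Ragel's built-in 'digit' character class and allowing an optional leading sign (the signed examples later in this article rely on the sign being part of the token):

```ragel
int = ('+'|'-')?digit+;
```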
We have to define a description for each token that may be present in the source data. It's worth noting that in Ragel, encountering an unknown pattern/token counts as an error, so later we will have to define a pattern for everything that can appear in the grammar we're lexing, including whitespace, even though we're not actually doing anything with it.
Now we will create a basic scanner definition that looks for an integer, but does nothing with it:
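Inside the machine definition this looks like (an entry with no '=> { }' action simply consumes the token):

```ragel
%%{
  machine test_lexer;

  int = ('+'|'-')?digit+;

  main := |*
    int;
  *|;
}%%
```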
Next we'll create our first action based on a token:
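```ragel
main := |*
  int => { puts "Integer" };
*|;
```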
Sections of Ragel code inside braces like '{ puts "Integer" }' are actually written in the host language, in this case Ruby. So at this point if we were to run the lexer against a string like "190" you would simply see "Integer" written to stdout.
If we want to do something more useful with the token we have captured then we need to use the 'ts' and 'te' variables that are defined by the Ragel scanner. 'ts' stands for 'token start', while 'te' stands for 'token end'. These represent the indices from our data array that match the start and end of the current token. In Ruby we could use them like so:
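Something like this (remember that 'data' is an array of ordinals at this point, so we pack the slice back into a string):

```ragel
main := |*
  int => { puts "Integer: " + data[ts...te].pack("c*") };
*|;
```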
If we were to run this scanner against a string like "-999" we would see "Integer: -999" on stdout. This shows that it's pretty easy to capture the data we care about; from here we just need to devise a method for storing it. Let's set up a function called 'emit' that will append the current token to an array:
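A minimal version of 'emit' might look like this (storing each token as a [type, value] pair; the exact shape is an assumption):

```ruby
# Append a [type, value] pair to the token array.
# 'data' is the array of ordinals produced by unpack,
# 'ts'/'te' are the token start/end indices supplied by Ragel.
def emit(token_array, type, data, ts, te)
  token_array << [type, data[ts...te].pack("c*")]
end
```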
And then incorporate it into the action associated with the integer token:
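```ragel
main := |*
  int => {
    emit(token_array, :integer, data, ts, te)
  };
*|;
```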
You'll notice the action definition is now spread across multiple lines for readability. Running this lexer against a string like "-101" will now produce an array like:
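Assuming 'emit' stores [type, value] pairs as above, the array would look like:

```ruby
[[:integer, "-101"]]
```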
So you can see we now have everything in place to build out our lexer to handle the full target grammar, which we can do quite simply by adding further token descriptions and their associated actions. The code for our full lexer will look like this:
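Here is a sketch of what the complete 'lexer.rl' might look like for a small arithmetic-style grammar (the tokens beyond 'int' are illustrative, not from the original):

```ragel
%%{
  machine test_lexer;

  int      = ('+'|'-')?digit+;
  plus     = '+';
  minus    = '-';
  multiply = '*';
  divide   = '/';

  main := |*
    int      => { emit(token_array, :int, data, ts, te) };
    plus     => { emit(token_array, :plus, data, ts, te) };
    minus    => { emit(token_array, :minus, data, ts, te) };
    multiply => { emit(token_array, :multiply, data, ts, te) };
    divide   => { emit(token_array, :divide, data, ts, te) };
    space;
  *|;
}%%

%% write data;

# Append a [type, value] pair to the token array.
def emit(token_array, type, data, ts, te)
  token_array << [type, data[ts...te].pack("c*")]
end

def run_lexer(data)
  data = data.unpack("c*") if data.is_a?(String)
  eof = data.length
  token_array = []

  %% write init;
  %% write exec;

  puts token_array.inspect
end
```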
You'll notice that our last token description, 'space', is not defined anywhere; that's because it is a token description built into Ragel. There is no action associated with this token because we don't want to do anything with it; we simply need to define it to say that whitespace is valid in our target grammar.
We can run our new lexer against data very easily like so:
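For example (any string that fits the grammar will do):

```ruby
run_lexer("9 - 2 * -4")
```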
This obviously provides a pretty solid base on which we can start implementing something more serious. If you're interested in learning more about Ragel and its possible applications, check out the Ragel site and the Ragel user guide.