A quick intro to writing a parser with Treetop
Treetop is a Ruby library that allows you to create parsers easily by describing them using a Parsing Expression Grammar (PEG). Writing a parser using Treetop is a fairly pain-free process, but getting started can be non-trivial, especially if you’re not familiar with PEGs, so this is going to be a fairly short ‘getting started’ guide. To keep things simple our example parser is going to process a subset of S-Expression syntax like so:
1 (define test (lambda ()
2 (begin
3 (display "something")
4 (display 1)
5 (display 3.08))))
6
7 (test)
It will be a fairly simplistic example, but it should be enough to get you started!
Installing Treetop
Provided that you already have Ruby and RubyGems installed, installing Treetop is a very simple process. Simply enter this at your command line:
1 gem install treetop
If you do not have Ruby and RubyGems installed then you should have a look at the Ruby website for installation instructions.
Getting Started
There are going to be three main parts to our parser:
- The Ruby class
Parser
which will contain the API for interacting with the parser - The Treetop grammar file in which we will create all the rules that define how our parser will behave
- A series of simple classes that subclass
Treetop::Runtime::SyntaxNode
, these will allow us to easily describe our parsed structure by having each type of entity map directly to a custom syntax node class
The first thing we need to do is create skeletons for the three files involved:
1 # In file parser.rb
2 require 'treetop'
3
4 # Find out what our base path is
5 base_path = File.expand_path(File.dirname(__FILE__))
6
7 # Load our custom syntax node classes so the parser can use them
8 require File.join(base_path, 'node_extensions.rb')
9
10 class Parser
11
12 # Load the Treetop grammar from the 'sexp_parser' file, and
13 # create a new instance of that parser as a class variable
14 # so we don't have to re-create it every time we need to
15 # parse a string
16 Treetop.load(File.join(base_path, 'sexp_parser.treetop'))
17 @@parser = SexpParser.new
18
19 end
1 # In file sexp_parser.treetop
2 grammar Sexp
3
4 end
1 # In file node_extensions.rb
2 module Sexp
3
4 end
Basic Parser API & Error Reporting
Our API is going to be very simple, the Parser
class will only have only public method: parse
. Error reporting for the moment will also be minimal, however if you are making a parser that will be used widely then this is definitely not an area you should skimp on. Inscrutable error messages can make an otherwise well-written parser very frustrating to deal with.
1 # In file parser.rb
2 class Parser
3
4 ...
5
6 def self.parse(data)
7 # Pass the data over to the parser instance
8 tree = @@parser.parse(data)
9
10 # If the AST is nil then there was an error during parsing
11 # we need to report a simple error message to help the user
12 if(tree.nil?)
13 raise Exception, "Parse error at offset: #{@@parser.index}"
14 end
15
16 return tree
17 end
18 end
Abstract Syntax Tree Nodes
In order to be able to build a meaningful tree structure from our input we need to be able to distinguish between different types of nodes, for example:
Expression
eg: (1 2 3)Identifier
eg: foo, bar, other_thingIntegerLiteral
eg: 1, 2, 99StringLiteral
eg: “test”
and so on
1 # In file node_extensions.rb
2 module Sexp
3 class IntegerLiteral < Treetop::Runtime::SyntaxNode
4 end
5
6 class StringLiteral < Treetop::Runtime::SyntaxNode
7 end
8
9 class FloatLiteral < Treetop::Runtime::SyntaxNode
10 end
11
12 class Identifier < Treetop::Runtime::SyntaxNode
13 end
14
15 class Expression < Treetop::Runtime::SyntaxNode
16 end
17
18 class Body < Treetop::Runtime::SyntaxNode
19 end
20 end
The Parser Rules
Now that we have all of our other components in place we can start actually defining our PEG rules. We’ll start with the smallest parts: Identifier
, FloatLiteral
, StringLiteral
, space
and IntegerLiteral
1 # In file sexp_extensions.rb
2 grammar Sexp
3 rule integer
4 ('+' / '-')? [0-9]+ <IntegerLiteral>
5 end
6
7 rule float
8 ('+' / '-')? [0-9]+ (('.' [0-9]+) / ('e' [0-9]+)) <FloatLiteral>
9 end
10
11 rule string
12 '"' ([^"\\] / "\\" . )* '"' <StringLiteral>
13 end
14
15 rule identifier
16 [a-zA-Z\=\*] [a-zA-Z0-9_\=\*]* <Identifier>
17 end
18
19 rule space
20 [\s]+
21 end
22 end
Each rule above comes in two parts: first the regular expression that defines what this entity looks like, and second what type of syntax node it should be turned into. Treetop looks inside the <...>
and matches the name up with the custom syntax nodes that we defined earlier.
Rules in Treetop are written using what amounts to a superset of regular expressions, more info can be found at the Treetop website.
You might be wondering why we have a PEG rule for space
… This is so we can allow whitespace in between entities in the input string, if we did not have a rule for this we would have to be very strict about how source files were laid out.
Next let’s define the rule for expressions and bodies (the inner part of an expression), these will build upon the rules we have defined already and are an example of what makes Treetop so great: Treetop allows you to write rules by easily composing sets of smaller rules to make a larger one. This code re-use makes parsers based on PEGs like Treetop very easy to maintain and expand.
1 # In file sexp_parser.treetop
2 rule expression
3 space? '(' body ')' space? <Expression>
4 end
5
6 rule body
7 (expression / identifier / float / integer / string / space )* <Body>
8 end
At this point we should be able to parse the entire of our example S-Expression shown above, let’s try it with a simple example and see what the result is:
1 Parser.parse('(this "is" a test( 1 2.0 3))')
2
3 => Expression+Expression0 offset=0, "...s\" a test( 1 2.0 3))" (body):
4 SyntaxNode offset=0, ""
5 SyntaxNode offset=0, "("
6 Body offset=1, "...is\" a test( 1 2.0 3)":
7 Identifier+Identifier0 offset=1, "this":
8 SyntaxNode offset=1, "t"
9 SyntaxNode offset=2, "his":
10 SyntaxNode offset=2, "h"
11 SyntaxNode offset=3, "i"
12 SyntaxNode offset=4, "s"
13 SyntaxNode offset=5, " ":
14 SyntaxNode offset=5, " "
15 StringLiteral+String1 offset=6, "\"is\"":
16 SyntaxNode offset=6, "\""
17 ...
18 ...
You can see we have created a tree from our input, but it has a lot of extraneous nodes that we don’t need or care about. Let’s put in a system for cleaning up the tree:
1 # In file parser.rb
2 class Parser
3
4 def self.parse
5 ...
6 ...
7
8 self.clean_tree(tree)
9
10 return tree
11 end
12
13 private
14
15 def self.clean_tree(root_node)
16 return if(root_node.elements.nil?)
17 root_node.elements.delete_if{|node| node.class.name == "Treetop::Runtime::SyntaxNode" }
18 root_node.elements.each {|node| self.clean_tree(node) }
19 end
20
21 end
This method is going to traverse the entire tree and strip out any nodes that are not one of our custom classes, that way we are only left with nodes we care about. Let’s try our example again and see what we get:
1 Parser.parse('(this "is" a test( 1 2.0 3))')
2
3 => Expression+Expression0 offset=0, "...s\" a test( 1 2.0 3))" (body):
4 Body offset=1, "...is\" a test( 1 2.0 3)":
5 Identifier+Identifier0 offset=1, "this"
6 StringLiteral+String1 offset=6, "\"is\""
7 Identifier+Identifier0 offset=11, "a"
8 Identifier+Identifier0 offset=13, "test"
9 Expression+Expression0 offset=17, "( 1 2.0 3)" (body):
10 Body offset=18, " 1 2.0 3":
11 IntegerLiteral+Integer0 offset=19, "1"
12 FloatLiteral+Float2 offset=21, "2.0"
13 IntegerLiteral+Integer0 offset=25, "3"
Much better! Now we have a nice clean representation of our input. But to be honest for most cases this is still more information than we need. Let’s alter our custom syntax node classes so they can output a simpler representation of themselves using simple nested sets of arrays and native Ruby data-types, we can achieve this in a simple fashion by giving each node a to_array
method and using the delegation pattern:
1 # In file node_extensions.rb
2 module Sexp
3 class IntegerLiteral < Treetop::Runtime::SyntaxNode
4 def to_array
5 return self.text_value.to_i
6 end
7 end
8
9 class StringLiteral < Treetop::Runtime::SyntaxNode
10 def to_array
11 return eval self.text_value
12 end
13 end
14
15 class FloatLiteral < Treetop::Runtime::SyntaxNode
16 def to_array
17 return self.text_value.to_f
18 end
19 end
20
21 class Identifier < Treetop::Runtime::SyntaxNode
22 def to_array
23 return self.text_value.to_sym
24 end
25 end
26
27 class Expression < Treetop::Runtime::SyntaxNode
28 def to_array
29 return self.elements[0].to_array
30 end
31 end
32
33 class Body < Treetop::Runtime::SyntaxNode
34 def to_array
35 return self.elements.map {|x| x.to_array}
36 end
37 end
38 end
You can see here that each complex node (Expression
, Body
) delegates the job of producing the array down to it’s children, and each primitive child node (StringLiteral
, FloatLiteral
, Identifier
) knows how to return a simple version of itself. Let’s add some code to the Parser
class to use this and then try it out:
1 # In file parser.rb
2 class Parser
3
4 def self.parse
5 ...
6 ...
7
8 return tree.to_array
9 end
10
11 end
1 Parser.parse('(this "is" a test( 1 2.0 3))')
2 #=> [:this, "is", :a, :test, [1, 2.0, 3]]
Perfect!
While the example here is fairly simplistic it touches all the parts needed to make a more complex parser using Treetop. You can have a look at the completed code from this example at GitHub, and if you want to have a look at a more involved example you can check out the reference parser for the Koi programming language which is also written using Treetop.