Image of Aaron Gough, Software Developer

A quick intro to writing a parser with Treetop

Treetop is a Ruby library that allows you to create parsers easily by describing them using a Parsing Expression Grammar (PEG). Writing a parser using Treetop is a fairly pain-free process, but getting started can be non-trivial, especially if you’re not familiar with PEGs, so this is going to be a fairly short ‘getting started’ guide. To keep things simple our example parser is going to process a subset of S-Expression syntax like so:

1 (define test (lambda ()
2   (begin
3     (display "something")
4     (display 1)
5     (display 3.08))))
6     
7 (test)

It will be a fairly simplistic example, but it should be enough to get you started!

Installing Treetop

Provided that you already have Ruby and RubyGems installed, installing Treetop is a very simple process. Simply enter this at your command line:

1 gem install treetop

If you do not have Ruby and RubyGems installed then you should have a look at the Ruby website for installation instructions.

Getting Started

There are going to be three main parts to our parser:

The first thing we need to do is create skeletons for the three files involved:

 1 # In file parser.rb
 2 require 'treetop'
 3 
 4 # Find out what our base path is
 5 base_path = File.expand_path(File.dirname(__FILE__))
 6 
 7 # Load our custom syntax node classes so the parser can use them
 8 require File.join(base_path, 'node_extensions.rb')
 9 
10 class Parser
11   
12   # Load the Treetop grammar from the 'sexp_parser' file, and 
13   # create a new instance of that parser as a class variable 
14   # so we don't have to re-create it every time we need to 
15   # parse a string
16   Treetop.load(File.join(base_path, 'sexp_parser.treetop'))
17   @@parser = SexpParser.new
18   
19 end
1 # In file sexp_parser.treetop
2 grammar Sexp
3 
4 end
1 # In file node_extensions.rb
2 module Sexp
3 
4 end

Basic Parser API & Error Reporting

Our API is going to be very simple, the Parser class will only have only public method: parse. Error reporting for the moment will also be minimal, however if you are making a parser that will be used widely then this is definitely not an area you should skimp on. Inscrutable error messages can make an otherwise well-written parser very frustrating to deal with.

 1 # In file parser.rb
 2 class Parser
 3 
 4   ...
 5   
 6   def self.parse(data)
 7     # Pass the data over to the parser instance
 8     tree = @@parser.parse(data)
 9     
10     # If the AST is nil then there was an error during parsing
11     # we need to report a simple error message to help the user
12     if(tree.nil?)
13       raise Exception, "Parse error at offset: #{@@parser.index}"
14     end
15     
16     return tree
17   end
18 end

Abstract Syntax Tree Nodes

In order to be able to build a meaningful tree structure from our input we need to be able to distinguish between different types of nodes, for example:

and so on

 1 # In file node_extensions.rb
 2 module Sexp
 3   class IntegerLiteral < Treetop::Runtime::SyntaxNode
 4   end
 5   
 6   class StringLiteral < Treetop::Runtime::SyntaxNode
 7   end
 8   
 9   class FloatLiteral < Treetop::Runtime::SyntaxNode
10   end
11   
12   class Identifier < Treetop::Runtime::SyntaxNode
13   end
14   
15   class Expression < Treetop::Runtime::SyntaxNode
16   end
17   
18   class Body < Treetop::Runtime::SyntaxNode
19   end
20 end

The Parser Rules

Now that we have all of our other components in place we can start actually defining our PEG rules. We’ll start with the smallest parts: Identifier, FloatLiteral, StringLiteral, space and IntegerLiteral

 1 # In file sexp_extensions.rb
 2 grammar Sexp
 3   rule integer
 4     ('+' / '-')? [0-9]+ <IntegerLiteral>
 5   end
 6   
 7   rule float
 8     ('+' / '-')? [0-9]+ (('.' [0-9]+) / ('e' [0-9]+)) <FloatLiteral>
 9   end
10   
11   rule string
12     '"' ([^"\\] / "\\" . )* '"' <StringLiteral>
13   end
14   
15   rule identifier
16     [a-zA-Z\=\*] [a-zA-Z0-9_\=\*]* <Identifier>
17   end
18   
19   rule space
20     [\s]+
21   end
22 end

Each rule above comes in two parts: first the regular expression that defines what this entity looks like, and second what type of syntax node it should be turned into. Treetop looks inside the <...> and matches the name up with the custom syntax nodes that we defined earlier.

Rules in Treetop are written using what amounts to a superset of regular expressions, more info can be found at the Treetop website.

You might be wondering why we have a PEG rule for space… This is so we can allow whitespace in between entities in the input string, if we did not have a rule for this we would have to be very strict about how source files were laid out.

Next let’s define the rule for expressions and bodies (the inner part of an expression), these will build upon the rules we have defined already and are an example of what makes Treetop so great: Treetop allows you to write rules by easily composing sets of smaller rules to make a larger one. This code re-use makes parsers based on PEGs like Treetop very easy to maintain and expand.

1 # In file sexp_parser.treetop
2 rule expression
3   space? '(' body ')' space? <Expression>
4 end
5 
6 rule body
7   (expression / identifier / float / integer / string / space )* <Body>
8 end

At this point we should be able to parse the entire of our example S-Expression shown above, let’s try it with a simple example and see what the result is:

 1   Parser.parse('(this "is" a test( 1 2.0 3))')
 2   
 3   =>  Expression+Expression0 offset=0, "...s\" a test( 1 2.0 3))" (body):
 4         SyntaxNode offset=0, ""
 5         SyntaxNode offset=0, "("
 6         Body offset=1, "...is\" a test( 1 2.0 3)":
 7           Identifier+Identifier0 offset=1, "this":
 8             SyntaxNode offset=1, "t"
 9             SyntaxNode offset=2, "his":
10               SyntaxNode offset=2, "h"
11               SyntaxNode offset=3, "i"
12               SyntaxNode offset=4, "s"
13           SyntaxNode offset=5, " ":
14             SyntaxNode offset=5, " "
15           StringLiteral+String1 offset=6, "\"is\"":
16             SyntaxNode offset=6, "\""
17             ...
18             ...

You can see we have created a tree from our input, but it has a lot of extraneous nodes that we don’t need or care about. Let’s put in a system for cleaning up the tree:

 1 # In file parser.rb
 2 class Parser
 3 
 4   def self.parse
 5     ...
 6     ...
 7     
 8     self.clean_tree(tree)
 9     
10     return tree
11   end
12   
13   private
14   
15     def self.clean_tree(root_node)
16       return if(root_node.elements.nil?)
17       root_node.elements.delete_if{|node| node.class.name == "Treetop::Runtime::SyntaxNode" }
18       root_node.elements.each {|node| self.clean_tree(node) }
19     end
20   
21 end

This method is going to traverse the entire tree and strip out any nodes that are not one of our custom classes, that way we are only left with nodes we care about. Let’s try our example again and see what we get:

 1   Parser.parse('(this "is" a test( 1 2.0 3))')
 2   
 3   =>  Expression+Expression0 offset=0, "...s\" a test( 1 2.0 3))" (body):
 4         Body offset=1, "...is\" a test( 1 2.0 3)":
 5           Identifier+Identifier0 offset=1, "this"
 6           StringLiteral+String1 offset=6, "\"is\""
 7           Identifier+Identifier0 offset=11, "a"
 8           Identifier+Identifier0 offset=13, "test"
 9           Expression+Expression0 offset=17, "( 1 2.0 3)" (body):
10             Body offset=18, " 1 2.0 3":
11               IntegerLiteral+Integer0 offset=19, "1"
12               FloatLiteral+Float2 offset=21, "2.0"
13               IntegerLiteral+Integer0 offset=25, "3"

Much better! Now we have a nice clean representation of our input. But to be honest for most cases this is still more information than we need. Let’s alter our custom syntax node classes so they can output a simpler representation of themselves using simple nested sets of arrays and native Ruby data-types, we can achieve this in a simple fashion by giving each node a to_array method and using the delegation pattern:

 1 # In file node_extensions.rb
 2 module Sexp
 3   class IntegerLiteral < Treetop::Runtime::SyntaxNode
 4     def to_array
 5       return self.text_value.to_i
 6     end
 7   end
 8   
 9   class StringLiteral < Treetop::Runtime::SyntaxNode
10     def to_array
11       return eval self.text_value
12     end
13   end
14   
15   class FloatLiteral < Treetop::Runtime::SyntaxNode
16     def to_array
17       return self.text_value.to_f
18     end
19   end
20   
21   class Identifier < Treetop::Runtime::SyntaxNode
22     def to_array
23       return self.text_value.to_sym
24     end
25   end
26   
27   class Expression < Treetop::Runtime::SyntaxNode
28     def to_array
29       return self.elements[0].to_array
30     end
31   end
32   
33   class Body < Treetop::Runtime::SyntaxNode
34     def to_array
35       return self.elements.map {|x| x.to_array}
36     end
37   end
38 end

You can see here that each complex node (Expression, Body) delegates the job of producing the array down to it’s children, and each primitive child node (StringLiteral, FloatLiteral, Identifier) knows how to return a simple version of itself. Let’s add some code to the Parser class to use this and then try it out:

 1 # In file parser.rb
 2 class Parser
 3 
 4   def self.parse
 5     ...
 6     ...
 7     
 8     return tree.to_array
 9   end
10   
11 end
1 Parser.parse('(this "is" a test( 1 2.0 3))')
2 #=> [:this, "is", :a, :test, [1, 2.0, 3]]

Perfect!

While the example here is fairly simplistic it touches all the parts needed to make a more complex parser using Treetop. You can have a look at the completed code from this example at GitHub, and if you want to have a look at a more involved example you can check out the reference parser for the Koi programming language which is also written using Treetop.