Writing a Simple Parser
Now, let’s try writing a very simple parser. The Scala standard library already comes with a parser-combinators library, so we won’t need to install anything.
We start with a new package and add an import to the parsing library.
package chapter2
import scala.util.parsing.combinator.RegexParsers
RegexParsers comes with a few handy functions to use regular expressions and string literals that will serve as the basic building blocks for our parser.
We again start with an object, but this time let it extend RegexParsers
. A RegexParser
skips whitespace characters by default, but for pedagogical purposes, we want to do this ourselves, so we disable the skipping of whitespace by setting the value to false
. The override
keyword is necessary whenever we override a concrete method or value.
object Parser extends RegexParsers {
override val skipWhitespace = false
// insert the rest of the code from this chapter here
}
Now, we’ll define a parser that recognizes one of the symbols allowed in Scheme identifiers.
val symbol = regex("[!#$%&|*+/:<=>?@^_~-]".r)
The regex
method we inherited from RegexParsers
lets us create a parser from a regular expression. Regular expressions can be created with the r
method on any String
instance. The parser is assigned to the value symbol
. The type of symbol
is Parser[String]
, but Scala can figure this out from the regex
expression, so we can omit the type. If you prefer to see the types in the source code, you can also write:
val symbol: Parser[String] = regex("[!#$%&|*+/:<=>?@^_~-]".r)
Let’s define a method to call our parser and handle any possible errors:
def readExpr(input: String): String = {
parse(symbol, input) match {
case Success(res, _) => "Found value: «"+ res +"»"
case NoSuccess(msg, _) => "No match: «" + msg +"»"
}
}
As you can see from the type signature, readExpr is a method from a String
to a String
. We name the parameter input
, and pass it, along with the symbol
parser we defined above to the method parse
.
parse
returns a value of type ParseResult[String]
and there are exactly two kinds of results we care about: either the parsed String
value or an error message. To distinguish between the two possible cases, we use a pattern-matching expression. Pattern matching is similar to C and Java’s switch-case statements, except that you’re not limited to a predefined set of types, but can switch on all types and do much more.
The block with the case statements is a function which transforms a ParseResult
into a String
, and match
applies the function to the result of parse
. Pattern matching is often done on algebraic data types because Scala provides a convenient way to deconstruct an object and to bind the object’s components to values. In our code, we bind the result to the name res
, which can then be used in the code following the fat arrow. And in the failure case, we bind the message to msg
.
case Success(res, _) => ... res
case NoSuccess(msg, _) => ... msg
Don’t worry if you don’t fully understand pattern matching yet. It’s a very powerful feature and we will be using it a lot in this tutorial.
What about the underscore? Success
and NoSuccess
also provide us with the remainder of the input, but we don’t care about it and ignore it by binding it to _
. You will encounter the underscore in various places in Scala. It’s typically used when you don’t want to give something an explicit name, either because you want to ignore it or because an explicit name would not provide any useful additional information.
Finally we also need a main
method to be able to run our program!
def main(args: Array[String]) {
println(readExpr(args(0)))
}
How to run your code depends on your chosen IDE, or you can run your program from the command line. The command line has the benefit that it’s easy to play with different arguments, which can be painful (i.e. needs several UI actions) in an IDE. If you follow this tutorial’s code organization, you should have a directory hierarchy with several class files, similar to this:
~/Write-Yourself-a-Scheme-Code/.target/scala-2.9.0.1/classes % tree
.
├── chapter1
│ ├── HelloWorld.class
│ └── HelloWorld$.class
├── chapter2
│ ├── Parser$$anonfun$10.class
│ ├── Parser.class
│ ├── Parser$.class
│ ├── ...
You can now run your program with scala chapter2.Parser "-"
. Note that you need to invoke this from the root of your package hierarchy, don’t enter the chapter2
folder.
% scala chapter2.Parser :
Found value: «-»
% scala chapter2.Parser abc
No match: «string matching regex `[!#$%&|*+/:<=>?@^_~-]' expected but `a' found»
Whitespace
Next, we’ll add a series of improvements to our parser that’ll let it recognize progressively more complicated expressions. The current parser chokes if there’s whitespace preceding our symbol:
% scala chapter2.Parser " :"
No match: «string matching regex `[!#$%&|*+/:<=>?@^_~-]' expected but ` ' found»
Let’s fix that, so that we ignore whitespace.
First, lets define a parser that recognizes any number of whitespace characters. Incidentally, this is why we overrode the skipWhitespace
val when we derived our parser, because without it our parser would already skip whitespace by itself.
val space = rep1(" ")
Our space
parser simply accepts a non-emtpy sequence of spaces. We also need to adapt our readExpr
function to use the parser:
parse(space ~> symbol, input) match {
case Success(res, _) => "Found value: «"+ res +"»"
case NoSuccess(msg, _) => "No match: «"+ msg +"»"
}
We used the ~>
operator to combine the two parsers. ~>
attempts to match the first parser, then attempts to match the second with the remaining input, and fail if either fails. The result of the first parser is ignored, that’s why the operator “points” to the second parser.
Compile and run this code. Note that since we defined spaces in terms of rep1
, it will no longer recognize a plain old single character. Instead you have to precede a symbol with some whitespace. We’ll see how this is useful shortly:
% scala chapter2.Parser " %"
Found value: «%»
% scala chapter2.Parser "%"
No match: «` ' expected but `%' found»
% scala chapter2.Parser " abc"
No match: «string matching regex `[!#$%&|*+/:<=>?@^_~-]' expected but `a' found»
Return Values
Right now, the parser doesn’t do much of anything – it just tells us whether a given string can be recognized or not. Generally, we want something more out of our parsers: we want them to convert the input into a data structure that we can traverse easily. In this section, we learn how to define a data type, and how to modify our parser so that it returns this data type.
First, we need to define a data type that can hold any Lisp value:
abstract class LispVal
case class Atom(name: String) extends LispVal
case class LispList(elems: List[LispVal]) extends LispVal
case class DottedList(elems: List[LispVal], last: LispVal) extends LispVal
case class Number(i: Int) extends LispVal
case class LispString(s: String) extends LispVal
case class Bool(b: Boolean) extends LispVal
This is an example of an algebraic data type: it defines a set of possible values that a variable of type LispVal
can hold. Scala models algebraic data types with inherintance: we have a LispVal
class and one derived subclass for each of the possible Lisp values our parser will recognize. Our subclasses are not ordinary classes but case classes. In this example, a LispVal
can be:
- An
Atom
, which stores aString
naming the atom - A
LispList
, which stores an ordinary ScalaList
of other LispVals; also called a proper list - A
DottedList
, representing the Scheme form(a b . c)
; also called an improper list. This stores a list of all elements but the last, and then stores the last element as another field - A
Number
, containing a Scala Int - A
LispString
, containing a ScalaString
- A
Bool
, containing a ScalaBoolean
value
The parameter list following the class name is the parameter list of the primary constructor. You can have several constructors in Scala, but there always needs to be a primary constructor that has to be called eventually. Because we defined our class to be a case
class, the Scala compiler will generate several useful methods for us, and most importantly, we can use the classes in pattern matching to access their attributes, which is also where their name comes from.
Next, let’s add a few more parsing functions to create values of these types. A string is a double quote mark, followed by any number of non-quote characters, followed by a closing quote mark:
val parseString = "\"" ~> rep("[^\"]".r) <~ "\"" map {
ss: List[String] => LispString(ss.mkString)
}
Instead of just returning the matched string, we map the Scala String
into our own LispString
data type. The argument to map is an (anonymous) function (or lambda if you’re coming from C++11), which takes an argument ss
of type List[String]
and returns a LispString
. The mkString
method simply concatenates all the single character strings. A case class also acts like a function to create new instances of itself, but you could also have written new
LispString(...)
instead.
Type inference can also help us to shorten anonymous function expressions; it’s allowed to omit the type annotation of the argument:
val parseString = "\"" ~> rep("[^\"]".r) <~ "\"" map {
ss => LispString(ss.mkString)
}
What type does parseString
have? symbol
was of type Parser[String]
, but because we mapped our String
to a LispString
, parseString
is now of type Parser[LispString]
. This is an important and recurring concept, so let us repeat it: We start with a Parser[String]
, and an anonymous function that
takes a String
and gives us a LispString
(the functions is of type String
using map
, and what we end up with is a Parser[LispString]
.
Now let’s move on to Scheme variables. An atom is a letter or symbol, followed by any number of letters, digits, or symbols:
val letter = regex("[a-zA-Z]".r)
val digit = regex("[0-9]".r)
val parseAtom = (letter | symbol) ~ rep(letter | symbol | digit) map {
case first ~ rest =>
val atom = first + rest.mkString
atom match {
case "#t" => Bool(true)
case "#f" => Bool(false)
case _ => Atom(atom)
}
}
The letter
and digit
parser are straight forward, we simply use a regular expression to do all the work.
Then, we introduce another parser combinator, the choice operator |
. This tries the first parser, then if it fails, tries the second. If either succeeds, then it returns the value returned by the succeeding parser.
We also want to get a LispVal
, so we have to map the parser’s result. Note that this time, we have to handle the results of two parsers that are combined with ~
, we don’t ignore one side of the parser like before. The letter | symbol
parser gives us a String
, and the rep
combinator will result in a List[String]
. So to map the result, we need a function of type String ~ List[String] => LispVal
.
Now instead of passing just a function to map
, we pass a pattern matching expression (the same kind of expression we previously used with match
), which allows us to break down the parser result. We bind the result of the first parser to first
and the second one to rest
.
Once we have read the first character and the rest of the atom, we need to put them together. The val
statement defines a new String
variable atom
.
Then we use a pattern matching expression to determine which LispVal
to create and return, matching against the literal strings for true and false. The underscore _
alternative is a readability trick: Scala will try all case
clauses or throw an error if none matches. So if the code falls through to the _
case, it always matches, and returns the value of atom
.
Finally, we create one more parser, for numbers.
val parseNumber: Parser[LispVal] = rep1(digit) map {
n => Number(n.mkString.toInt)
}
Let’s create a parser that accepts either a string, a number, or an atom:
val parseExpr: Parser[LispVal] = parseString | parseNumber | parseAtom
And edit readExpr so it calls our new parser:
def readExpr(input: String) = {
parse(parseExpr, input) match {
case Success(res, _) => "Found value: «"+ res +"»"
case NoSuccess(msg, _) => "No match: «"+ msg +"»"
}
}
Compile and run this code, and you’ll notice that it accepts any number, string, or symbol, but not other input:
% scala chapter2.Parser "\"this is a string\""
Found value: «LispString(this is a string)»
% scala chapter2.Parser 25
Found value: «Number(25)»
% scala chapter2.Parser symbol
Found value: «Atom(symbol)»
% scala chapter2.Parser "(symbol)"
No match: «`"' expected but `(' found»
Exercises
- Change
parseNumber
to support the Scheme standard for different bases. - Add a Character constructor to
LispVal
, and create a parser for character literals as described in R5RS. - Add a
Float
constructor toLispVal
, and support R5RS syntax for decimals. - Add data types and parsers to support the full numeric tower of Scheme numeric types.
Recursive Parsers: Adding lists, dotted lists, and quoted datums
Next, we add a few more parser actions to our interpreter. Start with the parenthesized lists that make Lisp famous:
val parseList = repsep(parseExpr, space) map {
exprs => LispList(exprs)
}
The repsep
combinator takes two parsers and repeatedly uses parseExpr
interleaved with space
.
The dotted-list parser is somewhat more complex, but still uses only concepts that we’re already familiar with:
val parseDottedList = rep1(parseExpr <~ space) ~ ("." ~ space ~> parseExpr) map {
case head ~ tail => DottedList(head, tail)
}
With a clever use of ~>
and <~
we can get a parser that gives us just the parts we are interested in, which we can then convert to a DottedList
.
Next, let’s add support for the single-quote syntactic sugar of Scheme:
val parseQuoted = "'" ~> parseExpr map {
expr => LispList(Atom("quote") :: expr :: Nil)
}
Most of this is fairly familiar stuff: it reads a single quote character, reads an expression and binds it to expr
, and then returns a LispList
with an additional quote
atom. The ::
is used to prepend (or “cons” in Lisp terms) the quote
atom and the expr
to the empty list Nil
. We could also have written it like this:
LispList(List(Atom("quote"), expr))
Finally, edit our definition of parseExpr
to include our new parsers:
val parseExpr: Parser[LispVal] = parseAtom | parseNumber | parseString | parseQuoted |
"(" ~> (parseDottedList | parseList) <~ ")"
Compile and run this code:
% scala chapter2.Parser "(a test)"
Found value: «LispList(List(Atom(a), Atom(test)))»
% scala chapter2.Parser "(a (nested) test)"
Found value: «LispList(List(Atom(a), LispList(List(Atom(nested))), Atom(test)))»
% scala chapter2.Parser "(a (dotted . list) test)"
Found value: «LispList(List(Atom(a), DottedList(List(Atom(dotted)),Atom(list)), Atom(test)))»
% scala chapter2.Parser "(a '(quoted (dotted . list)) test)"
Found value: «LispList(List(Atom(a), LispList(List(Atom(quote),
LispList(List(Atom(quoted), DottedList(List(Atom(dotted)),Atom(list)))))), Atom(test)))»
% scala chapter2.Parser "(a '(imbalanced parens)"
No match: «`)' expected but end of source found»
Note that by referring to parseExpr within our parsers, we can nest them arbitrarily deep. Thus, we get a full Lisp reader with only a few definitions. That’s the power of recursion.
In the next chapter, we’re going to write an evaluator!