Wednesday, February 27, 2013

Pattern Matching with String Interpolation

So far in this series we’ve seen how to use string interpolation in Scala 2.10 to construct values of various types. Optionally, we can perform run-time or compile-time checking to make sure that the processed strings are sensible.

In this post we consider how to use string interpolation with pattern matching to deconstruct values. Instead of interpolating expression values, we will interpolate arbitrary patterns and thereby be able to extend pattern matching in nice ways.

The complete code for all examples in this series will be made available in the accompanying BitBucket project.

A more general version of s

As it stands, the library interpolator s can only be used to construct string values.

println (s"One plus one is ${1 + 1}")
One plus one is 2

Suppose that we want to use s to pattern match as well. In other words, we want to get exact matches for the constant strings and to recursively match interpolated sub-patterns.

Here’s an example of what we want to achieve. Our new interpolator is called mys. Construction works just as for s.

println (mys"One plus one is ${1 + 1}")

We also want to use mys in a pattern-matching context; e.g., in a case of a match expression. In this example, we match the string "The sky is blue" against the pattern "The $thing is $colour".

val msg = "The sky is blue" match {
            case mys"The $thing is $colour" =>
              mys"A $colour thing is $thing"
            case _ =>
              "no match"
          }
println (msg)

The constant strings "The ", " is " and an empty string at the end match exactly. The interpolated patterns thing and colour match the characters between the constant strings. In this case the interpolated patterns are variable patterns so the effect is to bind those variables to the strings that are matched in those positions. The bound variables are used in the body of the case to construct the result.

One plus one is 2
A blue thing is sky

Construction and deconstruction together

Our first step toward building mys is to consider how we can make it both construct and deconstruct. In our earlier examples, the function that performed the interpolation was a method of the context implicit class. Clearly we can’t make a single method both construct and deconstruct and keep the same interface. But we can use an object to provide both directions. We rely on the fact that an object with an apply method can be used as a method.

The overall structure of MySContext is as follows.

implicit class MySContext (val sc : StringContext) {

  object mys {

    def apply (args : Any*) : String =
      sc.s (args : _*)

    ...

  }

}

Instead of a mys method, we now have a mys object. The mys.apply method implements the construction direction by calling s. We can add other methods to mys as necessary. In particular, we will add an unapplySeq method to make mys into an extractor object that can be called in a pattern matching situation.

Desugaring pattern matching

Before we finish writing MySContext, we need to understand how a processed string desugars when it is used in a pattern context. The general form is the same as before, except that the interpolated pieces marked with dollars are patterns now, not expressions.

id"text0${pat1}text1 ... ${patn}textn"

This general form is translated by the compiler into

StringContext ("text0", "text1", ... , "textn").id (pat1, ... , patn)

The assumption is that id is an extractor object. If you are hazy on how extractor objects work you can find a quick overview in the Tour of Scala.

The unappply (or more usually, unapplySeq) method of id is responsible for doing the matching and returning values that are to be recursively matched against pat1 to patn.

Making mys into an extractor object

Now that we know how an interpolated pattern desugars, we need to add an unapplySeq method to the mys object. We will use unapplySeq instead of unapply because we want to be able to return an arbitrary number of results when a match is found. We can’t predict how many nested patterns there will be.

Our method for implementing the matching process is to construct a regular expression and to rely on regular expression matching to do the hard work. The constant strings are left unchanged, while each interpolated pattern is represented by (.+). The parentheses define a group so when we run the match we will be given the value that matched this sub-regexp.

Here is the full version of mys with unapplySeq.

implicit class MySContext (val sc : StringContext) {

  object mys {

    def apply (args : Any*) : String =
      sc.s (args : _*)

    def unapplySeq (s : String) : Option[Seq[String]] = {
      val regexp = sc.parts.mkString ("(.+)").r
      regexp.unapplySeq (s)
    }

  }

}

Many readers will observe that this implementation is ripe for an injection attack. unapplySeq does not deal with a situation where the constant strings contain regular expression notation. You might like to extend the code to escape that notation before trying the match.

Using more complex sub-patterns

Let’s look at a more complex example of using this interpolator where sub-patterns are used to constrain what can match.

Suppose that we want to match strings like "val age = 34". Furthermore, we want to make sure that the value name is a valid identifier and that the assigned value is a number. We assume that identifiers are any string of letters and digits that begins with a letter, and that numbers are a non-empty sequence of digits.

The method matchValDef uses an interpolation to match the overall structure of the strings we want to accept. Sub-patterns defined using regular expressions further constrain the non-constant parts. Since the interpolated parts are just patterns, we can use any valid Scala pattern matching syntax. matchValDef returns a pair of the matched non-constant parts if the match succeeds, or None if it doesn’t.

def matchValDef (s : String) : Option[(String,String)] = {

  val Ident = """(\p{Alpha}\p{Alnum}*)""".r
  val Value = """(\d+)""".r

  s match {
    case mys"val ${Ident (ident)} = ${Value (value)}" =>
      Some ((ident, value))
    case _ =>
      None
  }

}

Now we can use matchValDef to pull different candidate strings apart.

println (matchValDef ("val age = 34"))
println (matchValDef ("val sum88 = 12345"))
println (matchValDef ("val age = bob"))
println (matchValDef ("val 99 = 34"))
println (matchValDef ("age = 34"))
println (matchValDef ("val age 34"))
Some((age,34))
Some((sum88,12345))
None
None
None
None

We could easily convert matchValDef into an extractor that could be used directly in a case.

What’s next?

The ability to extend pattern matching using extractors is very powerful. Connecting extractors to string interpolation as we’ve seen in this post makes that power more natural to use since the fixed parts are written explicitly.

Our examples have been confined to deconstructing strings. We can also build deconstructing interpolators that take non-strings as input and return non-string values. In the following posts in this series, we will show examples where the values being matched and returned are abstract syntax trees that represent more complex structures. Construction and deconstruction will use a formal language syntax but the values being manipulated will be trees. This capability is extremely useful if we want to write programs that manipulate structured data.

No comments: