How to Hash-Lang

8.7

How to Hash-Lang

William J. Bowman <wjb@williamjbowman.com>

My goal is to present a self-contained tutorial so that someone familiar with Racket programming and with implementing interpreters, but unfamiliar with Racket’s metaprogramming and module systems, could implement a #lang. I use a single running example language, but implement it multiple ways. I introduce only as much of the metaprogramming and module systems as necessary to implement a simple #lang. I break abstractions and do things the wrong way to demonstrate how they work and fully explain why we want to do things the way all the guides and tutorials teach. I occassionally over simplify some concepts to get to the heart of them.

My goal is not to replace the existing guides or references, but supplement them. I include references to the Racket documentation for more details.

1 From a whole-program interpreter

Racket is particularly well suited for embedding a new language using macros and the module system.

However, that is not how most people first implement a language. A common starting point is a language implemented using a whole-program recursive interpreter (or, equivalently, the visitor pattern).

How does such a person learn to implement a #lang? Well, if we follow the guide on Creating Languages, we first must learn macros, then the module system, then interposition points, completely restructured our interpreter to fit into these abstractions, and only THEN ...

NO! We want to implement a #lang; we shall only learn as much of those things as we need to get a #lang working.

First, let’s write an interpreter for a little simply-typed λ-calculus. The implementation technique isn’t important; we only need a whole-program interpreter to which we can pass a representation of a program and get back a value.

Below we define the grammar for our STLC.

A	::=	Bool
	\|	Int
	\|	(-> A A)

e	::=	x
	\|	integer?
	\|	boolean?
	\|	(lambda (x : A) e)
	\|	(lambda (x) e)
	\|	(e e)
	\|	(if e e e)
	\|	(primop e ...)
	\|	(letrec ([x : A = e] ...) e)

x	::=	symbol?

primop	::=	+
	\|	*
	\|	-
	\|	zero?

STLC includes integers, booleans, and functions, and sufficient machinery to implement factorial. Integer and boolean literals from Racket are embedded directly, and variables can be any symbol. The primitive operations +, -, *, zero?, and if correspond to the procedures and forms of the same name in Racket, but are required to be statically well typed. letrec binds a set of mutually recursive functions, but it can also bind non-recursive values. The bindings of letrec require explicit type annotations. We include single-parameter lambda, with both an explicit type annotation and without, and application.

We could implement the interpreter for this as follows.

(define (interp e)
  (let interp ([e e]
               [env (make-hasheq)])
    (match e
      [(? symbol?) (dict-ref env e)]
      [(? integer?) e]
      [(? boolean?) e]
      [(or `(lambda (,x : ,_) ,e)
           `(lambda (,x) ,e))
       (lambda (v)
         (define env^ (dict-copy env))
         (dict-set! env^ x v)
         (interp e env^))]
      [`(letrec ([,xs : ,As = ,es] ...)
          ,e)
       (for ([x xs]
             [e es])
         (dict-set! env x (interp e env)))
       (interp e env)]
      [`(if ,e1 ,e2 ,e3)
       (if (interp e1 env)
           (interp e2 env)
           (interp e3 env))]
      [`(,prim-op ,es ...)
       #:when (memq prim-op '(- + * zero?))
       (apply (eval prim-op (module->namespace 'racket/base))
              (map (curryr interp env) es))]
      [`(,e1 ,e2)
       ((interp e1 env) (interp e2 env))]
      [_ (error 'interp "Invalid term: ~a" e)])))

The interpreter is a standard big-step environment-passing interpreter, which matches on the input syntax represented as quasiquoted lists, and evalutes the term to a Racket value. Booleans and integers are embedded directly. Variables are symbols dereferenced from the environment. Functions are embedded as Racket functions, which close over the environment when the function is created, and interpret their body under that environment when called. The environment is mutable to handle recursion; a recursive function bound by letrec is first evaluated to a Racket function, and then its environment is updated to include a binding to itself. Each function creates a local copy of that mutable environment when called to maintain lexical scope. All other operations correspond to their Racket equivalents: conditional expressions conditionally interpreter each branch depending on the scrutinee, primitive operations are interpreted as their corresponding implementation in Racket, and function application applies the interpreted function to the interpreted argument.

The interpreter assumes its input is well typed, and we implement a separate bidirectional type-checking function type-check. The type checker is not important to this tutorial, so we leave it in the appendix.

Examples:

> (require "stlc-interp.rkt")
> (define (fact-template x)
    `(letrec ([fact : (-> Int Int)
                    = (lambda (x)
                        (if (zero? x)
                            1
                            (* x (fact (- x 1)))))])
       (fact ,x)))
> (type-check (fact-template 5))
'Int
> (interp (fact-template 5))
120

So far, so standard. But how do we move from this to a Racket #lang?

We can turn this interpreter directly into a #lang using a single interposition point: #%module-begin. An interposition point is a syntactic form implicitly added around non-syntactic linguistic concepts to enable extension or redefinition. #%module-begin is implicitly wrapped around any module, such as the contents of a file. By redefining #%module-begin to call our interpreter with its contents as the argument, we can define a #lang that calls our interpreter with the contents of any file (or, module). This is the only interposition point we need to know about for now.

See Implicit Form Bindings for more information on these interposition points.

So, how do we redefine or extend an interposition point, and #%module-begin in particular?

1.1 Syntax Transformers

Every interposition point is bound to a syntax transformer. A syntax transformer is a Racket procedure defined at compile time (roughly) that consumes syntax and produces syntax. To define a syntax transformer, we use define-syntax instead of define. We must define a function with one parameter, and the type of the function is syntax to syntax, but otherwise we can write the function using all of Racket, but at compile time. All macros are syntax transformers, so interposition points are macros, but ones that are (typically) implicitly inserted and not called by the user.

For more on syntax transformers, see transformer and General Macro Transformers.

We don’t have to learn much about macros to do this, and we won’t for this first #lang. The only thing we need to do is import all our familiar Racket abstractions at compile time, or "for syntax" using (require (for-syntax racket)).

Below we define an example syntax transformer to introduce syntax objects, and the distinction between compile time and run time.

Examples:

> (require (for-syntax racket))
> (define-syntax a-transformer
    (lambda (syn)
      (begin
        (printf "~a~n" syn)
        (printf "~a~n" (syntax->datum syn))
        #'(begin
            (printf "~a!~n" 42)
            42))))
> (define (foo)
    (a-transformer))
#<syntax:eval:3:0 (a-transformer)>
(a-transformer)
> (define (bar)
    (a-transformer some arbitrary inputs))
#<syntax:eval:4:0 (a-transformer some arbitrary inputs)>
(a-transformer some arbitrary inputs)
> (foo)
42!
42

In this example, we define the syntax transformer a-transformer, which prints its input syntax both as a syntax object and as a list, and then produces the syntax for printing "42!" and producing 42 as a value. We define foo and bar, which use the transformer, and see the compile-time effects printed as those definitions are compiled by the REPL. When we call foo at run time, "42!" is printed and 42 returned as a value to the REPL.

One idiosyncracy is that the syntax transformer receives the entire s-expression, including its own name, as its input.

In this example, we use syntax->datum to convert the syntax object to a list. This is normally something to avoid, as syntax objects contain additional information—such as source location information (for debugging), and binding information for identifiers—that is lost when converted to a list. Abstractions such as syntax-rules, syntax-case, and syntax-parse work directly over syntax objects, but we ignore those for our first #lang, and use pattern matching over lists, just as in our implementation of interp.

1.2 Module Languages

To define a #lang, we start with a module language. A module language is a module that exports the default syntactic forms that are available to be used in the body of a new module.

For more on module languages, see Module Languages.

Each module begins by declaring its language, and only the identifiers provided by that language are bound in the module by default. When we declare a module as being in racket, the new module is written in the racket module language, and we get everything provided from the core Racket language and the Racket standard library.

For more on modules, see Module Basics.

For example, we could define a new module language that only provides #%module-begin as follows.

Examples:

> (module A racket
    (require (for-syntax racket))
    (provide (rename-out [a-transformer #%module-begin]))
    (define-syntax a-transformer
      (lambda (syn)
        (begin
          (printf "~a~n" syn)
          (printf "~a~n" (syntax->datum syn))
          #'(#%module-begin
             (printf "~a!~n" 42)
             42)))))
> (module B 'A
    hello
    This is a module in the 'A module language.)
#<syntax:eval:2:0 (#%module-begin hello This is a module in the (quote A) module language.)>
(#%module-begin hello This is a module in the (quote A) module language.)
> (require 'B)
42!
42

Module A uses the module language racket and has access to the usual Racket features. It defines a version of our earlier a-transformer, but this time, it wraps the run-time code in Racket’s #%module-begin. Racket expects programs to be wrapped in its original #%module-begin, so we usually extend #%module-begin by using Racket’s version in our own version. We accomplish this either by renaming Racket’s version on import using rename-in, or (as done here) renaming our version on export using rename-out.

In module B, we declare module A as the module language. Because #%module-begin is implicitly wrapped around any module, everything in the body of module B is implicitly an argument to our syntax transformer, which we exported under the name #%module-begin. We see the compile-time output in the example when module B is defined (and thus compiled), and only get the run-time output when module B is imported for use at run time.

When working with first-class modules, i.e. those that do not correspond to a file, we must quote their name to refer to them. So to refer to module A and B, we use the names 'A and 'B. Unquoted module names, like racket, correspond to a file path in Racket’s module system. Otherwise, first-class modules and modules-as-files are interchangable.

See Module Paths for more details about module paths and their use in require and provide declarations.

1.3 The STLC Module Language

For our first language, the only form we provide is #%module-begin, which takes the entire program, converts it to list at compile time, and generate a run-time call to type-check and interp.

Example:

> (module STLC racket
    (require
     (for-syntax racket)
     "stlc-interp.rkt")
    (provide (rename-out [new-module-begin #%module-begin]))

    (define-syntax (new-module-begin stx)
      (match (syntax->datum stx)
        [`(,_ ,program)
         #`(#%module-begin
            (let ([p '#,program])
              (and (type-check p) (interp p))))]
        [e (error 'STLC "Expected a single-top level expression, but found ~a"
                  (cdr e))])))

First, we define a module STLC, defined in the racket language. We import "stlc-interp.rkt" (the file that contains the implementation of interp) normally, that is, at run time, so we can use it in the genereated output, and import racket at compile time so we can use it when defining the language. Because the module uses racket as the base language, we also have run-time bindings for Racket, so we can use #%module-begin, let, and and at run time.

The module STLC exports a new definition of #%module-begin. The converts its input syntax object to a list, pattern matches on that list, and expects a single top-level expression following the implicit #%module-begin name. We generate a program, wrapped in Racket’s default #%module-begin, that type checks the program and then runs the interpreter. Because we’re generating run-time syntax, we use quasisyntax (#‘) to build the syntax object of the run-time program, and unsyntax #, to splice in the compile-time value program. We must explicitly quote the compile-time value program in the generated run-time syntax, to transform the compile-time value into a run-time value. We also let-bind program, as otherwise we would duplicate the value program in output.

We can now use STLC as module language.

Examples:

> (module B 'STLC
    (letrec ([fact : (-> Int Int)
                   = (lambda (x)
                       (if (zero? x)
                           1
                           (* x (fact (- x 1)))))])
      (fact 5)))
> (require 'B)
120

We declare the just-defined module STLC as the module language, and can write a single well-typed STLC expression in the body of the module, and expect it to type check and then evaluate.

We get different kinds of errors in this language: syntax errors, which happen at compile time from the perspective of the Racket host language, and type errors, which happen at run time from the perspective of Racket. We can see the difference between compile-time and run-time errors in the following modules C and D.

Examples:

> (module C 'STLC
    (letrec ([x : Bool = 5])
      x))
> (require 'C)
type-error: Expected bound term 5 to have type Bool but has
type Int
> (module D 'STLC
    5
    6)
STLC: Expected a single-top level expression, but found (5
6)

A type error is a run-time error, from the perspective of the syntax transformer, since we generate a run-time call to type checker. This is still before run time from the perspective of the interpreter, however. Having mulitple expressions in an STLC module is a compile-time error from the transformer’s perspective, since the syntax transformer raises that error while generating the call to the type checker and interpreter.

To see what’s going on a little better, we can print the syntax of the run-time program at compile time.

Examples:

> (module STLC-debug racket
    (require
     (for-syntax racket racket/pretty)
     "stlc-interp.rkt")
    (provide (rename-out [new-module-begin #%module-begin]))

    (define-syntax (new-module-begin stx)
      (match (syntax->datum stx)
        [`(,_ ,program)
         (define call-interp-program
           #`(let ([p '#,program])
               (displayln "Running interpreter")
               (and (type-check p) (interp p))))
         (displayln "Generating syntax:")
         (pretty-print (syntax->datum call-interp-program))
         #`(#%module-begin #,call-interp-program)]
        [e (error 'stlc-lang "Expected a single-top level expression, but found ~a"
                  (cdr e))])))
> (module E 'STLC-debug
    (letrec ([fact : (-> Int Int)
                   = (lambda (x)
                       (if (zero? x)
                           1
                           (* x (fact (- x 1)))))])
      (fact 5)))
Generating syntax:
'(let ((p
        '(letrec ((fact
                   :
                   (-> Int Int)
                   =
                   (lambda (x) (if (zero? x) 1 (* x (fact (- x 1)))))))
           (fact 5))))
   (displayln "Running interpreter")
   (and (type-check p) (interp p)))
> (require 'E)
Running interpreter
120

We can see module E gets compiled to a call to the type checker and interpreter at compile time, but when imported at run time, runs the interpreter and produces the expected value.

1.4 An s-exp #lang from a Module Language

To define a #lang, that is, one that can be used in the #lang line that begins a Racket file, we need our module language to be in a file. Since this line is the first line of the file, there can’t be a first-class module before it, and Racket must rely on the module system to locate the module from a file.

We can create a file stlc-lang.rkt that contains the same body as the earlier module STLC.

#lang racket
(require
(for-syntax racket)
"stlc-interp.rkt")
(provide (rename-out [new-module-begin #%module-begin]))

(define-syntax (new-module-begin stx)
  (match (syntax->datum stx)
    [`(,_ ,program)
     #`(#%module-begin
        (let ([p '#,program])
          (and (type-check p) (interp p))))]
    [e (error 'stlc-lang "Expected a single-top level expression, but found ~a"
              (cdr e))]))

This module is equivalent to our previous first-class STLC module, except now that it is a file the module system can find when give the right module path.

We can still use this module as a module language by giving the path, as a string, as the module language.

Examples:

> (module B "stlc-lang.rkt"
(letrec ([x : Int = 5])
x))
> (require 'B)
5

However, the #lang line expects a name that resolves to a path in the module system, and cannot accept a string path directly. So #lang racket works, since the module system knows which file the module racket corresponds to, but #lang "stlc-lang.rkt" does not work.

We can work around this using the s-exp language. This language is a meta-language which expects a module, including an unresolved file path, as an argument to the language. This way, we can use #lang s-exp "stlc-lang.rkt" to get a #lang without any additional work!

See Using #lang s-exp for more details about the s-exp meta-language.

To use our new language as a #lang, we can write the following to a file named stlc-s-exp-test.rkt and run racket -t stlc-s-exp-test.rkt, or copy it into an IDE such as DrRacket and run it.

#lang s-exp "stlc-lang.rkt"

(letrec ([fact : (-> Int Int)
               = (lambda (x)
                   (if (zero? x)
                       1
                       (* x (fact (- x 1)))))])
  (fact 5))

1.5 A Proper #lang

s-exp is a great way to get a #lang working quickly, but it is annoying—the language isn’t "first class", not like racket is, which can be used as in #lang racket.

To implement a proper, first-class #lang, we need to tell the module system how to find the file that contains our module language. Register a proper #lang requires two steps:

Installing a package, and thus registering the files in the package with the module system;
Structuring our module with the interface expected by the #lang system, which is more general than a module language to support features I’m not covering here (in particular, reader extensions).

See Reader Extensions and Using #lang reader for more on reader extensions and languages with custom readers.

Until now, we have been able to work with first-class language concepts in Racket; all our examples worked fine in a REPL. Now, however, we must interact with the file system, directory structures, and command line tools.

First, we create a directory structure for the package. We create the simplest package structure: a directory name stlc-lang, which creates a package of the same name. If we run raco pkg install inside this directory, then the files in the package are registered with the module system. From then on, stlc-lang is a resolved module, as are all the modules inside the directory, and we can use any of them that are module languages in a #lang line and as a module language without quotation, the same as racket.

For simplicity, we can move the previously referenced files, stlc-interp.rkt and stlc-lang.rkt into this directory. We should have the following directory structure now.

stlc-lang/

stlc-interp.rkt

stlc-lang.rkt

Moving the files into the directory isn’t necessary; the modules inside the package could import definitions from outside the package using relative path strings, such as "../stlc-lang.rkt". However, such a package would be difficult to distribute, so we shall keep the package self-contained.

Next, we need to set up the expected module structure in the package. We follow the simplest structure. When the module system encounters stlc-lang on its own, it looks for the file stlc-lang/main.rkt. This file must be a module language. The definitions don’t need to be in that file; we can import and reexport using all-from-out, for example as follows. We create this file with the following contents.

#lang racket

(require "stlc-lang.rkt")
(provide (all-from-out "stlc-lang.rkt"))

Now our directory structure is the following.

stlc-lang/

main.rkt

stlc-interp.rkt

stlc-lang.rkt

Finally, any #lang language requires a submodule named reader which exports at least read-syntax, to figure out how to read a string into a syntax object (in case we want a non-s-expression syntax, for example). We can add a module named reader to our file stlc-lang/main.rkt. But, we cannot simply use Racket’s read-syntax. Our read-syntax must wrap what it reads in a module that explicitly uses our module language stlc-lang. To add this module, we append the following to stlc-lang/main.rkt.

(module reader racket
  (provide
   (rename-out [new-read-syntax read-syntax]))

  (define (new-read-syntax src in)
    #`(module anything stlc-lang
        #,(read-syntax src in))))

Because this pattern is very common, Racket provides an abstraction for implementing this reader module. The syntax/module-reader module language expects a module path in its body, and generates the read and read-syntax definitions we just wrote when given the input stlc-lang. Using this abstraction, we could have written the following instead.

See Using #lang s-exp syntax/module-reader for more on syntax/module-reader.

(module reader syntax/module-reader
stlc-lang)

Finally, we either install or recompile the package. If we already installed the package, we should recompile the package using raco setup --pkgs stlc-lang. If we have not installed it, we should run raco pkg install from inside the stlc-lang directory.

Now we’re able to use #lang stlc-lang; we have a first-class #lang for whole programs that expands to a call to our interpreter.

My implementation of this package can be viewed here.

TODO: Add link to github?

1.6 Summary

The key concepts.

Interposition points, which are syntactic manifestations of linguistic concepts, such as the concept of a module and its contents.
Syntax transformers, which are not only macros that extend the surface language of Racket, but the way we implement arbitrary compile-time extensions and redefinitions, including to the interposition points.
Phased execution. We’ve only seen compile time and run time, where a syntax transformer executes at compile time and generates code that executes at run time. However, since Racket itself is used to define the compile time code, and contains macros, compile time code has its own compile time, so there are an arbitrary number of phases, starting at phase 0 (run time), phase 1 (the compile or expansion time for phase 0), etc. See General Phase Levels for more on phases.
Module languages, which define the set of identifiers included when creating a module. Any module can be used as module langauge, but it needs to at least export #%module-begin.
Packages and module resolution. See Packages and Collections for more on packages.

2 Macro-embedding a #lang

Creating a #lang from an existing, whole-program interpreter is useful at times. However, the resulting #lang is limited. To name a few limitations: it has no module system, no macros, cannot be used in the REPL, and Racket does not understand its binding structure (so IDE support and Scribble hyperlinking do not work properly). Really, all of these are the same problem: the deeply embedded language does not interoperate with Racket, and so cannot reuse host language features as-is. Instead, we want to shallowly embed the language, and use macros to provide essentially the same surface syntax that our deep embedding provides.

TODO: deep embedding vs shallow? bleh

We’ll do this in stages to understand the relation between the macro-embedding and the interpreter.

First, we implement a macro for each form in our language that generates a call to our interpreter. Currently, we have a single macro, #%module-begin, that generates a single call to the interpreter. Instead, each form in our language will be macro and each will generate a call to the interpreter. #%module-begin won’t have to do anything anymore. This works since our interpreter is just a fold over the syntax of terms with some shared state (the environment).
Next, we essentially inline each case of the interp function as the syntax generated by the macro for that syntactic form. This works because the macro expander is already a fold over the syntax of terms; there’s no need to do a second traversal. Instead, the macro can compile to the Racket expressions that the interpreter would evaluate at run time. This shifts evaluation phases so that run time of our interpreter also corresponds to the run time from the perspective of the syntax transformer, rather than a separate phase confined to inside the dynamic extend of evaluation of a call to interp.
Then, we’ll remove the explicit environment and instead use Racket’s environment, by generating binding forms at compile time.
Finally, we’ll move the type checking code into the definition of each macro.

These intermediate stages will actually be more complicated than the final code we end up with. The direct embedding into Racket, using Racket expressions, is simpler than trying to coordinate multiple macros to call an interpreter. However, the attempt is instructive.

We’ll start by defining all of this as a module language, and wrap it up into a #lang by the end.

2.1 Exposing Interpreter Internals

Our current interpreter, interp, hides the internal details of the environment it’s passing around. This is fine when the interpreter has a single entry point, is expected to consume a whole program and produce the value. However, when we expose our language as separate macros, each macro is an entry point to the interpreter. This means the macros need to be able to share any internal state, and in particular, the environment.

In the same module that we will define our macros, we define a new version of the interpreter that exposes the second parameter, the reference to the current environment, and define a global environment that is used by each macro.

(define env (make-hash))

(define (interp e env)
  (match e
    [(? symbol?) (dict-ref env e)]
    [(? integer?) e]
    [(? boolean?) e]
    [(or `(lambda (,x : ,_) ,e)
         `(lambda (,x) ,e))
     (lambda (v)
       (define env^ (dict-copy env))
       (dict-set! env^ x v)
       (interp e env^))]
    [`(letrec ([,xs : ,As = ,es] ...)
        ,e)
     (for ([x xs]
           [e es])
       (dict-set! env x (interp e env)))
     (interp e env)]
    [`(if ,e1 ,e2 ,e3)
     (if (interp e1 env)
         (interp e2 env)
         (interp e3 env))]
    [`(,prim-op ,es ...)
     #:when (memq prim-op '(- + * zero?))
     (apply (eval prim-op (module->namespace 'racket/base))
            (map (curryr interp env) es))]
    [`(,e1 ,e2)
     ((interp e1 env) (interp e2 env))]
    [_ (error 'interp "Invalid term: ~a" e)]))

2.2 Defining the Macros, but Wrong

We need to define a syntax transformer, and more explicitly a macro, for each form in our language. Most of these macros are not implicit forms like #%module-begin; they will define the syntax of our #lang, and a user will use them directly.

The forms we need to define correspond to the keywords in our grammar for STLC: lambda, letrec, if, +, *, -, zero?.

We also need to redefine a few interposition points corresponding to parts of STLC with no syntactic keyword.

For application, we redefine #%app, which is implicitly called when an expression appears in the operator position of an s-expression. That is, the Racket term ((lambda (x) x) 5) implicitly expands as (#%app (lambda (x) x) 5), so by redefining #%app we can redefine application.

Our values, integers and booleans, can be run directly without redefining any interposition points, since they are already Racket values. However, in order to type check them, we need to redefine #%datum, which is implicitly wrapped around any datum literal.

Finally, since we’re managing the environment manually, we need to redefine variable reference. A variable in the surface syntax will no longer be a symbol but a raw identifier.

All of the above is true of our macro embedding. But how to implement all these macros is daunting.

The first thing we can try is to define the macros to simply call the interpreter, the same way #%module-begin did. This will go wrong in many ways, but will at least teach us about macros and the structure of what we’re trying to do.

Let’s start with a definition for +. Since we are defining the syntax of our language, we’ll always be defining a syntax transformer. We can define + as follows:

Examples:

> (module A racket
    (require
     "stlc-exposed-interp.rkt"
     (for-syntax
      racket
      syntax/parse)
     racket/trace)

    (provide +)

    (trace-define-syntax (+ syn)
      (syntax-parse syn
        [(_ e1 e2)
         #`(interp '(+ e1 e2) env)])))
> (require 'A)
> (require "stlc-exposed-interp.rkt")

We use trace-define-syntax from racket/trace, which is just like define-syntax but it also prints the call to the macro, and the output, on each expansion.

We require syntax/parse, which provides syntax-parse, a pattern matching form that works over syntax objects directly. This avoids the need for syntax->datum, and for some syntax unquoting, thus simplifying the definition, although making some details implicit. The pattern language of syntax-parse differs from match; we only specify the shape of the syntax object, without any explicit quotes or unquotes to distinguish symbols from pattern variables. In the pattern (_ e1 e2), we declare that the shape of the syntax object should be an s-expression with 3 subterms. We bind the operator to _, the convention for an irrelevant pattern, and bind the operands to the syntax pattern variables e1 and e2. Syntax pattern variables can be used directly, without unsyntax, inside the syntax template.

See also syntax-pattern-variable? for some advanced details.

We wrapped everything in a module A so we can study how the macro we defined interacts with the module system.

We’ve redefined +, so we can no longer access Racket’s addition procedure of the same name after requireing module A. We need to explicitly reimport it to a new name if we want to use both.

Examples:

> (require (rename-in racket [+ r:+]))
> (r:+ 5 6)
11
> (r:+ 5 6 7)
18
> (+ 5 6)
>(+ #<syntax:eval:7:0 (+ 5 6)>)
<#<syntax:eval:1:0 (interp (quote (+ 5 6)) env)>
11
> (+ 5 6 7)
>(+ #<syntax:eval:8:0 (+ 5 6 7)>)
eval:8:0: +: unexpected term
at: 7
in: (+ 5 6 7)

Racket’s + just adds the numbers and is arbitrary arity, while our + is restricted two operands, calls our interpreter, and prints trace information.

Our definition of + generates exactly what we would write by hand if calling the interpreter manually. To interpret an STLC expression such as '(+ 5 6), we call the interpreter (interp '(+ 5 6) env), using the current environment env defined in the module.

Examples:

> (interp '(+ 5 6) env)
11
> (+ 5 6)
>(+ #<syntax:eval:10:0 (+ 5 6)>)
<#<syntax:eval:1:0 (interp (quote (+ 5 6)) env)>
11

The macro explicitly inserts the quote around the argument to interp, since the interpreter expects a quoted list, but the macro is working over syntax, and is writing the same symbols we would write by hand. Without the quote, the argument to interp would be evaluated as a Racket expression before calling interp. We can see this more clearly using trace-call from racket/trace. This takes a name, a procedure, and all the arguments to the procedure, and displays arguments used in the call and the result of the call.

Examples:

> (require racket/trace)
> (trace-call 'interp
interp (+ 5 6) env)
>(+ #<syntax:eval:12:0 (+ 5 6)>)
<#<syntax:eval:1:0 (interp (quote (+ 5 6)) env)>
>(interp 11 '#hash())
<11
11
> (trace-call 'interp
interp '(+ 5 6) env)
>(interp '(+ 5 6) '#hash())
<11
11

In the first call to interp, we see that (+ 5 6) is expanded using our macro and evaluated before our explicit call to interp, which only receives the number 11 as input. In the second call, interp receives the quoted list '(+ 5 6) as input.

In this case, without the quote, we’d be generating an expression recursively referring to the + macro we’re redefining, resulting in an infinite loop. (+ 5 6) would expand to (interp (+ 5 6) env), which would expand to (interp (interp (+ 5 6) env) env), etc.

There’s a consequence of explicitly quoting the operands e1 and e2: no further macro expansion or other evaluation in Racket can happen in them. They are compiled directly to quoted data and passed to interp.

Examples:

> (+ 5 6)
>(+ #<syntax:eval:14:0 (+ 5 6)>)
<#<syntax:eval:1:0 (interp (quote (+ 5 6)) env)>
11
> (+ (+ 5 6) 7)
>(+ #<syntax:eval:15:0 (+ (+ 5 6) 7)>)
<#<syntax:eval:1:0 (interp (quote (+ (+ 5 6) 7)) env)>
18
> (+ (+ 5 6) (+ 7 8))
>(+ #<syntax:eval:16:0 (+ (+ 5 6) (+ 7 8))>)
<#<syntax:eval:1:0 (interp (quote (+ (+ 5 6) (+ 7 8))) env)>
26

We see in all examples that the + macro is only ever expanded once: producing a single static call to the interpreter, with nested uses of + quoted. Since they’re quoted, they aren’t macros and don’t expand. The interpreter is entered once, taking over all the rest of evaluation.

These examples only produce the expected integer because of a trick: I happened to name the macro with the same name that the interpreter uses internally. This means I can use any of the features the interpreter already knows about in the argument to +, even if I haven’t defined the correspdoning macro yet.

Example:

> (+ ((lambda (x : Int) x) 5) 2)
>(+ #<syntax:eval:17:0 (+ ((lambda (x : Int) x) 5) 2)>)
<#<syntax:eval:1:0 (interp (quote (+ ((lambda (x : Int) x) 5) 2)) env)>
7

The interpreter already knows how to interpret '((lambda (x : Int) x) 5), and since + quotes its argument, it evaluates correctly as an argument to +. But, since I haven’t defined this as syntax, it will not work at the top level:

Examples:

> (interp '((lambda (x : Int) x) 5) env)
5
> ((lambda (x : Int) x) 5)
arity mismatch;
the expected number of arguments does not match the given
number
expected: 3
given: 1

Racket expects (lambda (x : Int) x) to be a Racket lambda, interpreting : and A as parameter names, so this causes an error.

This also means that the + macro does not behave as expected within the Racket module system. For example, if a user decides to import a renamed version of the macro and try to use, we’ll get an error:

Examples:

> (require (rename-in 'A [+ stlc+]))
> (stlc+ (stlc+ 5 6) 6)
>(+ #<syntax:eval:21:0 (stlc+ (stlc+ 5 6) 6)>)
<#<syntax:eval:1:0 (interp (quote (+ (stlc+ 5 6) 6)) env)>
interp: Invalid term: (stlc+ 5 6)

Here, we rename + to stlc+, perhaps because we want STLC to interoperate with Racket. The outer stlc+ expands, transforming the syntax according to out definition. It expands to a call to the interpreter, quoting its arguments. However, since one of those arguments contains a call to stlc+, that name gets quoted and becomes a symbol. The macro expander does not transform data, only syntax, so that symbol does not get renamed. And the interpreter does not know what to do with the symbol 'stlc+.

We could work around this by quasiquoteing the representation of the program, and unquoteing the arguments that might contain macros.

Example:

> (module B racket
    (require "stlc-exposed-interp.rkt")

    (require
     (for-syntax
      racket
      syntax/parse))

    (require racket/trace)

    (provide +)
    (trace-define-syntax (+ syn)
      (syntax-parse syn
        [(_ e1 e2)
         #`(trace-call 'interp interp `(+ ,e1 ,e2) env)])))

Now, we quasiquote the program representing (+ e1 e2), but unquote e1 and e2. Since they could contain programs which we expect to be represented in the surface syntax by macros, we need them to be Racket expression that expand to calls to the interpreter.

Now, if we rename + and use it in an argument, we can see a second expansion of our macro, and a second traced call to the interpreter.

Examples:

> (require (rename-in 'B [+ stlc+]))
> (stlc+ (stlc+ 5 5) 6)
>(+ #<syntax:eval:3:0 (stlc+ (stlc+ 5 5) 6)>)
<#<syntax:eval:1:0 (trace-call (quote interp) interp (quasiquote (+ (unquote (stlc+ 5 5)) (unquote 6))) env)>
>(+ #<syntax:eval:3:0 (stlc+ 5 5)>)
<#<syntax:eval:1:0 (trace-call (quote interp) interp (quasiquote (+ (unquote 5) (unquote 5))) env)>
>(interp '(+ 5 5) '#hash())
<10
>(interp '(+ 10 6) '#hash())
<16
16

We can see the both instances of the stlc+ macro expand to calls to the interpreter. Values are embedded directly, so they don’t need to be wrapped in anything special.

Now that operands are evaluated first as Racket expressions, an expression that we have not defined as a macro will cause an error.

Example:

> (+ ((lambda (x : Int) x) 5) 2)
arity mismatch;
the expected number of arguments does not match the given
number
expected: 3
given: 1

Now we need to finish defining all any syntactic forms that get used in any subexpression. We could continue with this approach, defining lambda and local variable references as follows.

Examples:

> (require "stlc-exposed-interp.rkt")
> (require
   racket/trace
   (for-syntax
    racket
    syntax/parse))
> (trace-define-syntax (lambda syn)
    (syntax-parse syn
      [(_ (x : A) e)
       #`(lambda (x) e)]
      [(_ (x) e)
       #`(trace-call 'interp interp
                     `(lambda (x)
                        ,(let-syntax ([x (lambda (syn) #''x)])
                          e))
                     env)]))
> (+ ((lambda (x : Int) x) 5) 2)
>(lambda #<syntax:eval:8:0 (lambda (x : Int) x)>)
<#<syntax:eval:7:0 (lambda (x) x)>
>(lambda #<syntax:eval:7:0 (lambda (x) x)>)
<#<syntax:eval:7:0 (trace-call (quote interp) interp (quasiquote (lambda (x) (unquote (let-syntax ((x (lambda (syn) (syntax (quote x))))) x)))) env)>
>(interp '(lambda (x) x) '#hash())
<#<procedure:...-exposed-interp.rkt:14:5>
7
> x
x: undefined;
cannot reference an identifier before its definition
  in module: top-level

We define lambda recursively, with two cases: either the parameter is type annotated or not. When its type annotated, we generate a recursive call without the type annotation. When there is no type annotation, we generate a (traced) call to the interpreter, quasiquoting the expression but unquoting the body.

In the body, we bind a local macro using let-syntax. We bind the local variable x to a syntax transformer that generates the syntax object representing the symbol corresponding to x. That is, within the body of the lambda, we expand the variable to a quoted version of itself. This transforms the variable into a symbol, which the interpreter knows how to dereference.

In the run-time trace, we can see that the data the interpreter receives includes '(lambda (x) x), i.e., that by run time, the local macro has expanded the variable into the symbol 'x in the body of the quoted representation of the function. The macro for x is local, so we cannot refer to it at the top level.

We could also add top-level definitions to the language with this structure, by defining define and #%top. The #%top interposition point is wrapped around any free variable, and defaults to raising an error if that variable is not bound in the top-level scope.

Examples:

> (require racket/dict)
> (define-syntax (define syn)
    (syntax-parse syn
      [(_ id e)
       #`(dict-set! env 'id (interp e env))]))
> (define-syntax (#%top syn)
    (syntax-parse syn
      [(_ . id)
       #`(interp 'id env)]))
> (define x 5)
> (+ x 5)
10

We expand define to update the global environment, binding the (quoted) variable name to the interpreted value of the expression. We expand a top-level variable refernece to evaluate the quoted variable name in the interpreter under the current environment. #%top has an idiosyncracy: the argument to #%top always follows a dot; the syntax object is formed from a pair that is not a list.

2.3 Inlining the Interpreter

The above transition to macros that expand to calls to the interpreter sort of makes sense if we started from an interpreter. However, as we see, we must avoid several pitfalls.

The most insidious one, however, is that we are duplicating code and work. This is can be seen when we observe the trace of a call to stlc+, the version that explicitly unquotes operands to support macros in the operands.

Example:

> (stlc+ (stlc+ 5 6) (stlc+ 7 8))
>(+ #<syntax:eval:15:0 (stlc+ (stlc+ 5 6) (stlc+ 7 8))>)
<#<syntax:eval:1:0 (trace-call (quote interp) interp (quasiquote (+ (unquote (stlc+ 5 6)) (unquote (stlc+ 7 8)))) env)>
>(+ #<syntax:eval:15:0 (stlc+ 5 6)>)
<#<syntax:eval:1:0 (trace-call (quote interp) interp (quasiquote (+ (unquote 5) (unquote 6))) env)>
>(+ #<syntax:eval:15:0 (stlc+ 7 8)>)
<#<syntax:eval:1:0 (trace-call (quote interp) interp (quasiquote (+ (unquote 7) (unquote 8))) env)>
>(interp '(+ 5 6) '#hash((x . 5)))
<11
>(interp '(+ 7 8) '#hash((x . 5)))
<15
>(interp '(+ 11 15) '#hash((x . 5)))
<26
26

What’s going on here? Despite having nested expressions, interp is never called with an expression that has non-trivial sub-expressions.

We see that the example is compiled to

(interp `(+ ,(interp `(+ 5 6) env) ,(interp `(+ 7 8) env)) env)

which evaluates as

(interp `(+ 11 15) env)

By contrast, had we manually called the interpreter as (interp '(+ (+ 5 6) (+ 7 8)) env), the interpreter would have recursively interpreted '(+ 5 6) and '(+ 7 8). This recursion is gone in our macro-embedded version. Why?

Well, the interpreter and the macro embedding are both traversing the entire syntax of the program. The interpreter was designed this way, necessarily when we had a deep embedding. The macro expander must traverse the entire syntax of the program to find all the macros, and we explicitly designed stlc+ to unquote operands and enable the macro expand to traverse them, thus inlining all recursive interpreter calls for nested expression.

Right now, we’ve half-inlined each case of the interpreter into the corresponding macro definition. We need to finish this: the macro embedding should be the fully inlined version of the interpreter. Each macro should define one case of our interpreter. We should rely on the macro expander to fully traverse the program rather than writing a recursive traversal ourselves.

To do this, we need two names of every construct that the host language and the object language have in common. Since both contain the name +, we’ll need to either rename ours or Racket’s to use Racket’s in our definition.

Let’s start in a fresh module, and define a new completely inlined version of our +.

Examples:

> (module C racket
    (require
     (for-syntax
      racket
      syntax/parse)
     racket/trace)

    (provide (rename-out [stlc+ +]))
    (trace-define-syntax (stlc+ syn)
      (syntax-parse syn
        [(_ e1 e2)
         #`(+ e1 e2)])))
> (require 'C)
> (+ 5 6)
>(stlc+ #<syntax:eval:3:0 (+ 5 6)>)
<#<syntax:eval:1:0 (+ 5 6)>
11
> (+ (+ 5 6) (+ 7 8))
>(stlc+ #<syntax:eval:4:0 (+ (+ 5 6) (+ 7 8))>)
<#<syntax:eval:1:0 (+ (+ 5 6) (+ 7 8))>
>(stlc+ #<syntax:eval:4:0 (+ 5 6)>)
<#<syntax:eval:1:0 (+ 5 6)>
>(stlc+ #<syntax:eval:4:0 (+ 7 8)>)
<#<syntax:eval:1:0 (+ 7 8)>
26

This definition... trivial. In fact, when the definition of the macro is this trivial, we don’t even need to define it—it’s just the same as Racket’s implementation already. We only need to define those constructs that differ at compile time, so those that are syntactically different.

Examples:

> (module D racket
    (require
     (for-syntax
      racket
      syntax/parse))

    (provide env)
    (define env (make-hasheq))

    (provide (rename-out [new-lambda lambda])
             #%app - * zero? if)

    (define-syntax (new-lambda syn)
      (syntax-parse syn
        [(_ (x : A) e)
         #`(new-lambda (x) e)]
        [(_ (x) e)
         #`(lambda (v)
             (define env^ (dict-copy env))
             (dict-set! env^ 'x v)
             (let-syntax ([x (lambda (stx) #`(dict-ref env^ 'x))])
               e))])))
> (require 'D)
> (+ ((lambda (x : A) x) 5) 6)
>(stlc+ #<syntax:eval:7:0 (+ ((lambda (x : A) x) 5) 6)>)
<#<syntax:eval:1:0 (+ ((lambda (x : A) x) 5) 6)>
11

In this module, we provide Racket’s version of #%app, +, *, zero?, and if as our own, since they evaluate exactly as our interpreter evaluated those forms. Our interpreter recursively evaluated the function and argument in an application, and then function to the argument. That’s exactly how procedure application evaluates in Racket. We don’t need the explicit recursion; the macro expander will recur over subterms implicitly, and the evaluation order is the same in STLC and Racket.

To define our version of functions, which do different syntactically, we define new-lambda and expand to a Racket lambda. The body of the macro is almost a copy/pasted version of the case from our interpreter. We create a local copy of the current environment, extend it with a map from the symbol representing the parameter to the value we get as an argument at run time. As we did earlier, we compile a variable reference using a local macro. But we inline the definition of variable lookup from our environment: the local macro generates a dereference from the local copy of the environment.

To get our desired interface, we rename new-lambda to lambda on export.

We can define letrec similarly.

Example:

> (module E racket
    (require 'D)
    (require
     racket/dict
     (for-syntax
      racket
      syntax/parse))

    (provide (rename-out [new-letrec letrec]))
    (define-syntax (new-letrec syn)
      (syntax-parse syn
        [(_ ([xs : As = es] ...) body)
         #`(begin
             (for ([x '(xs ...)]
                   [e (list es ...)])
               (dict-set! env x e))
             body)]))

    (provide (rename-out [new-top #%top]))
    (define-syntax (new-top syn)
      (syntax-parse syn
        [(_ . x)
         #`(dict-ref env 'x)])))

We import module D in module E to have access to the shared environment.

We copy and paste the implementation from our interpreter into the syntax object generated by the macro for new-letrec. We have to make some small changes to account for differences between the pattern matchers. We explicitly expand the list of variables into a quoted list, since all the variables should be treated as symbols, but build the list of right-hand sides using list, so the the elements can be implicitly recursively expanded and evaluated.

Because new-letrec modified the global environment and doesn’t generate a local binding form, we define new-top to implement top-level variable references. We copy and paste the definition from the interpreter: we dereference the symbol from the environment. We have to insert an explicit quote to cast the identifier to a symbol.

Once we import module E, we can implement our earlier version of factorial in STLC.

Examples:

> (require 'E)
> (letrec ([fact : (-> Int Int)
                 = (lambda (x)
                     (if (zero? x)
                         1
                         (* x (fact (- x 1)))))])
    (fact 5))
120

So far, we’ve only extended Racket with new syntax that corresponds to our STLC syntax. To implement a module language, we need to define a single module that exports all and only those forms in STLC, plus #%module-begin.

Example:

> (module STLC racket
    (require 'E 'D)
    (provide
     (all-from-out 'E 'D)
     #%module-begin
     #%datum

     require provide only-in rename-in rename-out all-from-out all-defined-out))

We define a new module language STLC that imports modules D and E, which defined everything in STLC, and simply re-export. We also export Racket’s default #%module-begin, which is required, but we do not need to extend it. Because we include some Racket literals, namely integer and boolean literals, we also export the default #%datum interposition point, which is wrapped around all literal data.

We can now use this macro-embedded STLC as a module language. If we try to use anything not defined in STLC, such as list, we get an error coming from new-top: the identifier isn’t found in the environment. This is our version of an unbound identifier error.

Examples:

> (module test 'STLC
    (letrec ([fact : (-> Int Int)
                   = (lambda (x)
                       (if (zero? x)
                           1
                           (* x (fact (- x 1)))))])
      (fact 5)))
> (module test2 'STLC
    (list 1 2 3))
> (require 'test2)
hash-ref: no value found for key
  key: 'list

We also export require, provide, and some of their helpers, equipping STLC with a module system by reusing Racket’s. This isn’t necessary, but probably something we want to do. For examepl, we can quickly extend STLC with lists just by importing features from Racket.

Examples:

> (module STLC+Lists 'STLC
    (require
     (only-in racket quote cons car cdr list empty empty?)
     'STLC)
    (provide
     (all-from-out 'STLC)
     cons
     car
     cdr
     empty?
     empty
     list))
> (module test2 'STLC+Lists
    (letrec ([map : (-> (-> Int Int) (-> (List Int) (List Int))) =
                  (lambda (f)
                    (lambda (x)
                      (if (empty? x)
                          empty
                          (cons (f (car x))
                                ((map f) (cdr x))))))])
      ((map (lambda (x : Int) (+ x 1)))
       (list 1 2 3 4))))
'(#<procedure> #<procedure> #<procedure> #<procedure>)
> (require 'test2)

From this module language, we can implement a #lang following the recipe from A Proper #lang. We move the module to a file in the right package structure, add a reader submodule, and install it.

But first, let’s sand off some rough edges.

2.4 Reuse Racket’s Environment

We’ve decided to manage our environment manually, casting identifiers to symbols and using a mutable hash table. This sort of made sense coming from our interpreter, but it’s unnecessary—we could reuse Racket’s. We used symbols in our interpreter because we had to do everything ourself at run time—represent program terms, traverse syntax, implement an environment, But at compile time, in the macro system, we can reuse all of that. We use syntax objects to represent program terms, we reuse the macro expander to traverse syntax, and we can reuse Racket’s run-time environment as our environment by reusing identifiers as our variables.

Racket’s environment is not an explicit data structure that was access with function calls. Instead, we access it indirectly through Racket identifiers and Racket binding forms. When we use Racket’s lambda or letrec, we implicit bind a variable in the environment. When we generate a syntax object containing an identifier, we implicitly dereference from Racket’s environment.

So, let’s rewrite our two binding forms. Since those were the only macros we had to write, we might as well rewrite our entire module language.

Examples:

> (module STLC2 racket
    (require
     (for-syntax
      racket
      syntax/parse))

    (provide
     + * - zero? if #%app
     (rename-out [new-lambda lambda]
                 [new-letrec letrec])

     #%module-begin #%top #%datum

     require provide only-in rename-in rename-out all-from-out all-defined-out)

    (define-syntax (new-lambda syn)
      (syntax-parse syn
        [(_ (x : A) e)
         #`(lambda (x) e)]
        [(_ (x) e)
         #`(lambda (x) e)]))

    (define-syntax (new-letrec syn)
      (syntax-parse syn
        [(_ ([xs : As = es] ...) body)
         #`(letrec ([xs es] ...) body)])))
> (module A 'STLC2
    (letrec ([fact : (-> Int Int)
                   = (lambda (x)
                       (if (zero? x)
                           1
                           (* x (fact (- x 1)))))])
      (fact 5)))
> (require 'A)
120

To implement "lambda", we simply expand to Racket’s lambda but discard the type annotation. Tbe run time behaviour is the same. lambda implicitly binds the parameter x, so when x is used in the body, it will refer to the argument when the function is applied. There’s no need to introduce a syntax transformer that explicitly rewrites the variable to anything—the syntax object representing the variable already implements a variable dereference.

To implement "letrec", we do the same thing, but expand to letrec, which introduces mutually recursive binding structure in Racket’s environment. We don’t need to explicitly implement recursive binding using Landin’s Knot, because Racket already implements recursive binding for us.

We export Racket’s version of #%top, which raises unbound identifier errors when we try to reference an unbound variable. However, Racket’s also tracks the environment at compile time, while we only had a run-time environment This means Racket’s version gives us a compile-time error rather than a run-time error.

Examples:

> (module F 'STLC2
    (list 1 2 3))
eval:21:0: list: unbound identifier
  in: list
> (module G 'STLC
    (list 1 2 3))
> (require 'G)
hash-ref: no value found for key
  key: 'list

2.5 Type Checking in Macros

2.6 Making it all a #lang

3 Appendix: STLC Interpreter Implementation

(require stlc-lang/stlc-interp)

package: stlc-lang

value
fact5 : STLC

An STLC term that implements the recursive factorial function, and calls it on 5.

procedure
(interp e) → any/c
e : STLC

Takes a well-typed e from STLC and produces a Racket value representing the value of e.

Examples:

> (require stlc-lang/stlc-interp)
> fact5
'(letrec ((fact
           :
           (-> Int Int)
           =
           (lambda (x) (if (zero? x) 1 (* x (fact (- x 1)))))))
   (fact 5))
> (interp fact5)
120

procedure
(type-check e) → any/c
e : any/c

If e is a well-typed STLC expression, produces the symbol representing its type. Otherwise, raises an error.

Examples:

> (require stlc-lang/stlc-interp)
> fact5
'(letrec ((fact
           :
           (-> Int Int)
           =
           (lambda (x) (if (zero? x) 1 (* x (fact (- x 1)))))))
   (fact 5))
> (type-check fact5)
'Int
> (type-check `(letrec ([x : Bool = 5]) x))
type-error: Expected bound term 5 to have type Bool but has
type Int
> (type-check (list 5))
type-error: Expected a term but got (5)

4 Appendix: #lang stlc-lang

#lang stlc-lang

package: stlc-lang

Defines a #lang for a single whole-program expression in STLC that is run using interp.

1	From a whole-program interpreter
2	Macro-embedding a #lang
3	Appendix: STLC Interpreter Implementation
4	Appendix: #lang stlc-lang