2 Syntax in Redex
The first part of my language model is defining the collection expressions in my
language.
This includes the expressions of my language, the types, and the data.
Sometimes I encode the meta structures—
decide whether an s-expression matches a BNF nonterminal
decompose a term via pattern matching
decide α-equivalence of two terms
perform substitution
generate expressions the match a nonterminal
2.1 Syntax TLDR
In short, Redex makes defining and working with syntax damn simple. I use define-language to create BNF grammars. With #:binding-forms (binding forms), I get capture-avoiding substitution (substitute), and α-equivalence (alpha-equivalent?), for free while retaining named identifiers rather than annoying representations such as de Bruijn. Using define-extended-language, I can easily extend existing languages, which can make it easy to modularize grammars; although, I usually avoid it because then I have to remember too many language identifiers. I use redex-match?, redex-match, and redex-let to query whether expressions match nonterminals and to let Redex decompose syntax for me. I use the test functions test-predicate, test-equal, test-results, and the parameter default-equiv, for writing test suites about syntax, and redex-check to do random testing of syntactic properties.
One caveat to be aware of is that grammars with side conditions prevent redex-check from working over the grammar.
There are two common pitfalls to avoid when working with syntax in Redex.
First, be careful about using untagged, arbitrary variables, such as by using variable or variable-not-otherwise-mentioned in your syntax definition. This makes it really easy to create arbitrary variable names, but also easy to make typos be interpreted as variables. This can cause very unexpected failures. Apparently valid matches will not fail to match and metafunctions can have undefined behavior. One way to help avoid this problem is to tag variables, either in the grammar or with a symbolic prefix using variable-prefix.
Second, avoid unicode subscripts, and be careful with TeX input mode. Unicode subscripts are different than a nonterminal followed by an underscore, i.e., any₁ is a symbol while any_1 is a pattern variable matching an any pattern. This is made worse since TeX input mode will transparently replace the second expression with the first. DrRacket has better LaTeX input that avoids this problem by default.
2.2 Defining Syntax
> (require redex/reduction-semantics)
> (define-language BoxyL (e ::= x natural (+ e e) (cons e e) (car e) (cdr e) (λ (x : A) e) (e e) (box e) (let ((box y) e) e)) (A B ::= Nat (A × B) (A → B) (□ A)) (x y ::= variable-not-otherwise-mentioned) (v ::= natural (box e) (λ (x : A) e) (cons v v)) #:binding-forms (λ (x : A) e #:refers-to x) (let ((box y) e_1) e_2 #:refers-to y))
I usually start my models by requiring redex/reduction-semantics instead of redex. The latter loads GUI stuff I rarely use, and sometimes don’t even install.
The define-language form takes a language identifier, which is used by some functions to specify which grammar and binding specification to use, and BNF-esque grammar, in s-expression notation. After the grammar, I give a binding specification, from which Redex infers definitions of substitution and α-equivalence.
This language defines expression using meta-variable e, which includes variables x, natural numbers, cons pairs, λ, application, the box introduction form, and a pattern matching form for box elimination. It also defines their types, as meta-variables A and B. Note that while productions need to be s-expressions, they need not be in prefix notation. Variables, x and y, are any symbol not used as a keyword or literal elsewhere in the grammar, variable-not-otherwise-mentioned.
This is essentially equivalent to the grammar from the Coq model in
Preface—
2.3 Pattern Matching
Redex has a sophisticated formal pattern language, but in essence any symbols not recognized as a BNF symbol is treated as a keyword. When I write (car e) in the grammar, this is understood to mean the literal symbol car followed by some expression e is also an expression. The pattern language also has some built-in nonterminals, such as natural which indicates a natural number literal such as 5.
> (define eg1 (term (box 5))) > (define eg2 (term (cdr 5))) > (redex-match? BoxyL e eg1) #t
> (redex-match? BoxyL v eg1) #t
> (redex-match? BoxyL v eg2) #f
This is helpful particularly with large complex languages, or low-level languages with detailed machine configurations. In these kinds of languages, its common to restrict syntax to achieve various properties, and may be non-obvious whether a term is valid.
Note that keywords and literal symbols from the grammar are not variables, but anything else is.
> (redex-match? BoxyL x (term car)) #f
> (redex-match? BoxyL x (term λ)) #f
> (redex-match? BoxyL x (term :)) #f
> (redex-match? BoxyL x (term kar)) #t
> (redex-match? BoxyL x (term lambda)) #t
> (redex-match? BoxyL any (term car)) #t
> (redex-match? BoxyL (any_1 any_1) (term (car car))) #t
> (redex-match? BoxyL any (term kar)) #t
> (redex-match? BoxyL any (term :)) #t
The any pattern can be useful for defining generic metafunctions, grammars, and judgments, or to partially specify contracts on Redex definitions. In large developments, I sometimes create a base language with generic operations on syntax, using any.
When in Racket, we use term to inject syntax into Redex. term acts like quasiquote, and even supports unquote, so we can use Racket to compute terms through templating. Most quoted s-expressions are also valid Redex terms, which can be useful if we want to move between s-expressions in Racket and terms in Redex.
> (define eg3 (term (cdr ,(+ 2 3)))) > eg3 '(cdr 5)
> (redex-match? BoxyL e eg3) #t
> (define eg4 '(cdr 5)) > (redex-match? BoxyL e eg4) #t
> (redex-match BoxyL (e_1 e_2) (term ((λ (x : Nat) x) 5))) (list (match (list (bind 'e_1 '(λ (x : Nat) x)) (bind 'e_2 5))))
> (redex-match BoxyL (box e_1) eg1) (list (match (list (bind 'e_1 5))))
The structure returned by redex-match is a little annoying to look at. Redex supports non-deterministic matching, so we get back the set of all possible ways to match. The elements of the set are matches, which contain lists of bindings of the pattern variable to the term.
I don’t often use redex-match directly for decomposing terms, because redex-let has a much better interface. But redex-let requires a deterministic single match.
> (redex-let BoxyL ([(e_1 e_2) (term ((λ (x : Nat) x) 5))]) (displayln (term e_1)) (displayln (term e_2)))
(λ (x : Nat) x)
5
2.4 Extending Syntax and Syntax with Holes (Contexts)
define-extended-language allows extending a base language with new nonterminals, or extending existing nonterminals from the base language with new productions. For example, I could extend expressions with booleans literals as follows.
> (define-extended-language BoxyBoolL BoxyL (e ::= .... boolean)) > (redex-match? BoxyL e (term #t)) #f
> (redex-match? BoxyL e (term (car #t))) #f
> (redex-match? BoxyBoolL e (term #t)) #t
> (redex-match? BoxyBoolL e (term (car #t))) #t
This defines a new language, BoxyBoolL, which extends BoxyL with an extra production in the nonterminal e. To extend a nonterminal, I use ...., which means "all the current productions of this nonterminal", followed by the new productions.
We can also add new nonterminals by extending a language.
> (define-extended-language BoxyEvalL BoxyL (E ::= hole (E e) (v E) (cons v ... E e ...) (+ v ... E e ...) (car E) (cdr E) (let ((box x) E) e)))
This defines a new language, BoxyEvalL, which extends BoxyL with a call-by-value evaluation context. Redex syntax also support contexts, that is, programs with a hole. I use this for evaluation and program contexts. In this language, we define the context E by listing the usual evaluation contexts for call-by-value. For operators with many arguments, we can easily specify left-to-right evaluation order using non-deterministic ellipses patterns.
After specifying a context, we can easily decompose a term into its evaluation context and its redex.
> (redex-match BoxyEvalL (in-hole E v) (term (car (cons (+ 1 2) 2))))
(list
(match (list (bind 'E '(car (cons (+ hole 2) 2))) (bind 'v 1)))
(match (list (bind 'E '(car (cons (+ 1 hole) 2))) (bind 'v 2))))
> (redex-match BoxyEvalL (in-hole E e) (term (car (cons (+ 1 2) 2))))
(list
(match (list (bind 'E hole) (bind 'e '(car (cons (+ 1 2) 2)))))
(match (list (bind 'E '(car hole)) (bind 'e '(cons (+ 1 2) 2))))
(match (list (bind 'E '(car (cons (+ hole 2) 2))) (bind 'e 1)))
(match (list (bind 'E '(car (cons (+ 1 hole) 2))) (bind 'e 2)))
(match (list (bind 'E '(car (cons hole 2))) (bind 'e '(+ 1 2)))))
> (redex-let BoxyEvalL ([(in-hole E (+ 1 2)) (term (car (cons (+ 1 2) 2)))]) (displayln (term E)) (displayln (term (in-hole E 3))))
(car (cons #<hole> 2))
(car (cons 3 2))
2.5 Binding, Substitution, and α-equivalence
> (default-language BoxyL) > (alpha-equivalent? (term (λ (x : Nat) e)) (term (λ (x : Nat) e))) #t
> (alpha-equivalent? (term (λ (x : Nat) x)) (term (λ (y : Nat) y))) #t
> (alpha-equivalent? (term (λ (x : Nat) y)) (term (λ (y : Nat) y))) #f
> (term (substitute (λ (y : Nat) (x e2)) x (+ y 5))) '(λ (y«0» : Nat) ((+ y 5) e2))
In this tutorial, I give only the simplest possible use of binding forms: a single, lexical name. But Redex’s support for binding is extremely sophisticated. It can support n-ary lexical binding with shadowing, pattern-matching style binding, module-style binding. See binding forms in the Redex docs for many examples of complex binding support.
2.6 Random Generation of Syntax
> (define an-e1 (generate-term BoxyL e 10)) > (define an-e2 (generate-term BoxyL e 10)) > an-e1 '(cdr (car (λ (k : (□ Nat)) (i 0))))
> an-e2 '(box (f 0))
generate-term takes a language name, a generation specification, and a size. It can generate terms based on lots of Redex definitions, although I almost always generate terms from grammars and judgments.
2.7 Testing Syntax
Once we’re done querying, we can move to testing. Redex features a unit testing library. Tests are great for detecting breakages as you start tinkering with your model, and can also report more information than the mostly-boolean query functions.
> (default-language BoxyL) > (test-predicate (redex-match? BoxyL e) (term (car (+ 1 2)))) > (test-predicate (redex-match? BoxyL x) (term kar)) > (test-predicate (redex-match? BoxyL x) (term car))
FAILED :46.0
#<procedure:eval:46:0> does not hold for: 'car
> (test-results) 1 test failed (out of 3 total).
This pattern is hard to abstract over since redex-match? expects a language identifier.
> (test-equal (term (λ (x : Nat) y)) (term (λ (y : Nat) y)) #:equiv alpha-equivalent?)
FAILED :48.0
actual: '(λ (x : Nat) y)
expected: '(λ (y : Nat) y)
> (default-equiv alpha-equivalent?) > (test-equal (term (λ (x : Nat) x)) (term (λ (y : Nat) y)))
> (test-equal (term (substitute (λ (y : Nat) (x e2)) x (+ y 5))) (term (λ (z : Nat) ((+ y 5) e2)))) > (test-results) 1 test failed (out of 3 total).
> (redex-check BoxyL v (redex-match? BoxyL e (term v)) #:attempts 1000) redex-check: no counterexamples in 1000 attempts
When generating syntax, redex-check takes a language identifier, a pattern, a predicate, and (optionally), the number of attempts.
2.8 Caveat: Syntax with Side Conditions
> (define-language Int64L (int64 ::= (side-condition integer_1 (<= (expt -2 31) (term integer_1) (sub1 (expt 2 31)))))) > (redex-match? Int64L int64 4) #t
> (redex-match? Int64L int64 -5) #t
> (redex-match? Int64L int64 (expt 2 31)) #f
The side-condition takes a pattern and a Racket predicate which can refer to the pattern variables. You can use a side-condition pattern anywhere in a pattern.
Unfortunately, using side-conditions in patterns does not work with redex-check. You can still use generate-term, though.
2.9 Pitfall: All Symbols are Variables
In a couple of the examples above, I rely on an anti-pattern. I pun between a real expression and an expression schema, by punning between variables and meta-variables. For example, consider the Redex term ((λ (x : Nat) e_body) e_arg). This is only valid because I used variable-not-otherwise-mentioned, which makes all symbols valid variables. In BoxyL, this expression is valid, but in an unintuitive way. The subterms e_body and e_arg are actually variables, although we are pretending they are meta-variables.
> (redex-match? BoxyL x (term e_body)) #t
> (redex-match? BoxyL x (term e_arg)) #t
This can let us treat an expression as an expression schema. It is a very tempting way to create examples. Instead of being very precise about the entire term, you can create readable meta-examples only specifying parts of the term and pretending this applies to all terms of this pattern. But relying on this behavior can really cause problems if you’re not careful.
For example, consider the very similar term ((λ (x : A) e_body) e_arg). Ah, surely this meta-example represents any valid application... but it isn’t even a valid expression.
> (redex-match? BoxyL e (term ((λ (x : A) e_body) e_arg))) #f
> (redex-match? BoxyL A (term A)) #f
The problem is that A, isn’t a valid type, and we did not allow types to have variables in the grammar.
Worse, when a term doesn’t match a language nonterminal, metafunctions such as substitute can exhibit undefined behavior.
> (term (substitute (λ (y : A) (x e2)) x (+ y 5))) '(λ (y : A) ((+ y 5) e2))
> (alpha-equivalent? (term (λ (x : A) x)) (term (λ (y : A) y))) #f
One should avoid this pattern, and be aware of it because you can do it accidently. For example, if we use the wrong letter or wrong case, we can easily create an invalid term, or a valid term that isn’t what we expected.
> (redex-match? BoxyL e (term (substitute (λ (y : Nat) (x e2)) x (+ y 5)))) #t
> (redex-match? BoxyL e (term (λ (y : nat) (x e2)))) #f
> (redex-match? BoxyL e (term (kar (cons e_1 e_2)))) #t
The root of the problem is that we used variable-not-otherwise-mentioned and didn’t tag variables, so any symbol is a valid expression. This is hard to avoid in Redex unless you exclusively use variable-prefix. Be careful when using the variable specifications that give you infinite unstructured variables. It’s handy, but like in any scripting language, that usefulness comes at a cost. It will cause you problems. It has caused me problems. It bit me twice while writing this tutorial.
Even when you manage to avoid this, term simply quotes any symbol, so typos can result in valid terms, although the term may not match a nonterminal. If you use contracts, which I strongly encourage and will use in the rest of this tutorial, this is less of a problem since the invalid quoted term will hopefully not match a nonterminal.
2.10 Pitfall: Subscripts and Unicode
Redex makes it very easy to mix unicode into your formal syntax. This is handy, since the code looks closer to the paper presentation. Unfortunatley, it can cause problems if you’re using unicode subscripts or TeX input mode in emacs.
The underscore, _, has a special meaning when attached to a nonterminal in a Redex pattern: it creates a distinct pattern variable matching the preceeding nonterminal. You might be tempted to use a unicode subscript instead, but that doesn’t work.
> (redex-match BoxyL (e₁ e₂) (term ((λ (x : Nat) x) 5))) #f
> (redex-match BoxyL (e_1 e_2) (term ((λ (x : Nat) x) 5))) (list (match (list (bind 'e_1 '(λ (x : Nat) x)) (bind 'e_2 5))))
This can be a particular problem if you’re using TeX input mode in emacs. When in TeX input mode, typing _ causes the input mode to being typing a subscript unicode symbol, creating a literal symbol rather than a pattern variable. This result will be a mysterious failure.
I’m not sure if unicode subscripts should be treated the same by Redex or not, but for now, I recommend never ever using unicode subscripts in Redex. I usually use TeX input mode, and have a hotkey for turning it on and off, so I can type underscores.