fast-parse

8.16

fast-parse🔗

wilbowma

A fast monadic parsing combinator library. fast-parse values fast above all else. fast-parse is not general purpose. fast-parse is not stable. fast-parse is not safe. fast-parse is not easy to use. Maybe lesser desiderata will improve in time, but above all, fast-parse is fast.

Here are some examples from the microbenchmarks raco test -l fast-parse/fast-parse-bench:

Big let-expression microbenchmark

------------------------------------------------------------------------

fast-parse fused: cpu time: 3 real time: 3 gc time: 0

fast-parse sexpr: cpu time: 25 real time: 25 gc time: 12

Racket `read`: cpu time: 56 real time: 59 gc time: 4

Racket `read-syntax`: cpu time: 72 real time: 75 gc time: 13

parsack: cpu time: 577 real time: 597 gc time: 103

Generic s-expression microbenchmark

------------------------------------------------------------------------

fast-parse: cpu time: 0 real time: 0 gc time: 0

Racket `read`: cpu time: 3 real time: 3 gc time: 0

Racket `read-syntax`: cpu time: 5 real time: 5 gc time: 0

parsack: cpu time: 37 real time: 38 gc time: 0

How? Cheating, copiously.

Unfortunately, it still has a long way to go. The design of fast-parse is largely stolen stolen from Haskell’s flatparse, by András Kovács, and we still can’t compete.

flatparse sexp comparison

------------------------------------------------------------------------

flatparse: 1.80ms

fast-parse skip sexpr: cpu time: 32 real time: 33 gc time: 5

Racket `read`: cpu time: 182 real time: 190 gc time: 15

Racket `read-syntax`: cpu time: 356 real time: 364 gc time: 129

I may change what I have to in order to achieve my goals. If you want to use this, pin your project to a specific version.

1 Limitations🔗

fast-parse will not work on strings larger than (expt 2 30) characters, or about between 1Gb and 4.29Gb depending on the string. fast-parse will only work on 64 bit machines, with 60 bit fixnums. This may change; see spans for more.

2 Design🔗

fast-parse follows several design principles in service to its number one goal: run-time performance:

0-allocation: fast-parse avoids performing any allocation. It does not allocate intermediate data structures, unless you tell it to. It will not copy out a substring, unless you tell it to. It even avoids creating closures, as much as automatically possible.
0-memory indirects: fast-parse avoids accessing memory as much as possible. It must of course access the string to parse, but it won’t access its length more than once. It will avoid reading from the string more than once, and enable users to do the same. You can commit a peek without touching memory.
No checks: fast-parse avoids unnecessary run-time checks. This includes dynamic type checks and contracts.

There are several design decisions that follow from these principles.

fast-parse does not work over input-port?s. input-port?s violate all our design principles. Instead, fast-parse works over strings. If you give it a port, it will copy the entire port into a string before parsing.

A parser is a function that consumes a string, a nonnegative fixnum indicating the length of the string, and a nonnegtaive fixnum indicating the current position in the string, and return two values: success, or failure, both represented as multiple return values. A success result is an arbitrary type and a nonnegative fixnum indicating the new position. A failure result is a nonnegative fixnum indicating the new position and a failure tag. In Typed Racket, we could express the type of parsers as:

(define-type (Parser A)
(U (All (A) (-> String Natural Natural (Values A Natural)))
(-> String Natural Natural (Values Natural failure))))

A parser combinator is any function that returns a parser.

A consuming parser is one that advances the current position, regardless of success or failure. A nonconsuming parser does not advance the current position, again regardless of success or failure.

The multiple return values follow 0-allocation, improving run-time performance in microbenchmarking.

Essentially all parsers are and should be inlined aggressively. This is also follows from 0-allocation and 0-memory indirects, due to our representation of parsers and parser combinators. A clousre representation of parsers is quite convenient but, without aggressive inlining, it allocates and indirects through too many closures. Thankfully, these closure indirects really are mostly unnecessary, and can be optimized away, as long as Racket is given the right hints. define-parser wraps begin-encourage-inline and is used aggressively, and define-inline is used for many small functions, parsers, and parser combinators.

The aggressive inlining massively improves performance, but has two costs. It approximately doubles object code size compared to no inlining hints; this is an acceptable cost I think. It also severaly impacts debugging. If a parser combinator is used incorrectly, errors seems to come from deep inside fast-parse. I don’t have a good solution for this.

fast-parse uses racket/unsafe/ops and unsafe-assert-unreachable copiously. Any function or form that might cause a dynamic tag check or contract check is avoided. Every comparison is made using eq? or unsafe fixnum operators. It is careful to maintain its own type invariants and these uses shouldn’t leak out of the internals, but they might, and they might corrupt your program. In principle, Typed Racket could be used to make this more safe, but Typed Racket doesn’t have great support for define-inline or multiple return values.

Spans are used to represent a successful parse of a substring, indicating the start and end position of the parse string that successfully matched. Internally, spans are presented as a single 60-bit fixnum, with the 30 high-order bits representing the start, and 30 low-order bits representing the end (exclusive). The start and end are suitable arguments to substring. There is hopefully no reason to worry about this internal representation, which can be efficiently interacted with through its interface: the match expander span, constructor make-span, and accessors span-start and span-end. But if a seemingly arbitrary number is returned from a parser or shows up in an error message, it might be a span.

This representation of spans imposes a limit on the maximum size of strings that can be parsed of (expt 2 30). Microbenchmarking suggests this representation of spans is not significantly more performant than use cons, which would lift this restriction. However, the representation of spans follows from 0-allocation desgin principle so I’m reluctant to deviate without expressed user permission. I may work on a way to give a compile-time option to change the representation.

3 Library🔗

procedure
(parse p input) → A
p : (Parser A)
input : (or/c input-port? string?)

Run the parser p on input, returning either the result of the parser, or raising exn:fast-parse-fail if the parse fails.

Examples:

> (define-parser $open ($do $whitespaces $left))
> (define-parser $close ($do $whitespaces $right))
> (define-parser ($to-symbol x) ($return (string->symbol x)))
> (define-parser $atom (>> (>> $identifier $get-substring) $to-symbol))
> (define-parser $sexp
    ($or
     ($do $whitespaces $atom)
      ($do $open
          ($do0 ($foldr cons ($return '()) $sexp)
                $right))))
> (parse $sexp "(foo (bar (baz biz buzz) boing) bong)")
'(foo (bar (baz biz buzz) boing) bong)
> (parse $sexp "(foo [ (bar (baz biz buzz) boing) bong)")
Parse failed in string:1:5:
(foo [ (bar (baz biz buzz) boing) bong)
     ^

value
failure : Failure

A tag representing failure to parse.

procedure
(failure? p) → boolean?
p : Any

Checks if p is failure. Should be inlined.

procedure
(okay? p) → boolean?
p : Any

Checks if p is not failure. Should be inlined.

procedure
(okay r i) → (Parse-Result A)
r : A
i : Position

Returns a successful parse result returning r with next position i. Currently represent as (Values A Position).

procedure
(fail i) → (Parse-Result A)
i : Position

Produces a parse failure result returning with next position i. Currently represent as (Values Position failure).

syntax
(span start end)

start : Position
end : Position

A match-expander?, and a rename transformer for make-span when used in expression context.

Example:

> (match (span 0 5)
[(span start end) (+ start end)])
5

procedure
(make-span start end) → Span
start : Position
end : Position

Returns a span representing a matched substring of the input string. start must be less than end, and end must not greater than the length of the input string.

procedure
(span-start s) → Position
s : Span

Return the starting position of a span. The result is guaranteed to be within the range of the input string if produced by fast-parse.

procedure
(span-end s) → Position
s : Span

Return the end position of a span. The end position is just after the matched substring, and thus suitable for use with substring. The result is guaranteed to be within the range of the input string if produced by fast-parse.

syntax
(define-parser name expr)
(define-parser (head args ...) expr ... expr0)

Defines a parser or parser combinator, adding inlining hints.

value
$skip : (Parser Void)

Succeeds on character, advancing the current position. Fails if the position is already at the end of the string.

value
$unskip : (Parser Void)

Succeeds on character, moving the position back one. Fails if the position is already at the beginning of the string.

value
$eof : (Parser Void)

Succeeds if the current position is at the end of the string.

value
$any-char : (Parser Char)

Succeeds against any character, returning it. Only fails if the current position is not beyond the end of the string.

procedure
($char c) → (Parser Char)
c : Char

Succeeds if the current position is eq? to c. Fails otherwise, or if the position is beyond the end of the string.

value
$left : (Parser Char)

Succeeds if the current character is #\(.

value
$right : (Parser Char)

Succeeds if the current character is #\).

procedure
($return v) → (Parser A)
v : A

The return of the parser monad. Always succeeds, returning v.

syntax
(>> e ...)

Bind of the parser monad. Monadic bind. Composes a sequence of parsers by passing the successful result of each into the next parser. Each command (x <- e) runs the parser e and binds the result to x in the remaining commands, or fails if e fails. A non-binding command e is simply run for effect.

syntax
($do cmds ...)

cmd = (x <- e)
| e

Syntactic sugar for the bind of the parser monad. Composes a sequence of parsers into a single parser. Each command (x <- e) runs the parser e and binds the result to x in the remaining commands, or fails if e fails. A non-binding command e is simply run for effect.

syntax
($do0 e0 cmds ...)

cmd = (x <- e)
| e

Like $do, but returns the result of e0 after executing the remaining commands.

value
$fail : (Parser A)

Always fails.

value
$get-pos : (Parser Position)

Always succeeds, returning the current position in the string.

procedure
($or pa1 pa2) → (Parser (U A B))
pa1 : (Parser A)
pa2 : (Parser B)

Succeeds if pa1 or pa2 suceeds. Returns the result of the first to match.

syntax
($or* pa ...)

Succeeds if any of pa match, tried in order using $or. Returns the result of the first to match.

procedure
($not pa) → (Parser Void)
pa : (Parser A)

Succeeds if p fails; fails if p succeeds. Nonconsuming.

procedure
($branch pa pt pf) → (Parser (U B C))
  pa : (Parser A)
  pt : (Parser B)
  pf : (Parser C)

If pa succeeds, continues with pt, otherwise with pf. You probably want $with-option.

procedure
($with-option pa just pnothing) → (Parser (U B C))
  pa : (Parser A)
  just : (-> A (Parser B))
  pnothing : (Parser C)

If pa succeeds, passes the result to just and continues. Otherwise, continues with pf.

procedure
($many+ pa) → (Parser Span)
pa : (Parser A)

Succeeds if pa succeeds one or more times, returning the span. pa should be a consuming parser.

procedure
($many* pa) → (Parser Span)
pa : (Parser A)

Succeeds if pa succeeds zero or more times, returning the span. pa should be a consuming parser.

procedure
($until p-stop) → (Parser Span)
p-stop : (Parser A)

Succeeds as long as p-stop fails or until $eof, returning the span.

procedure
($get-substring sp) → (Parser String)
sp : Span

Always succeeds, returning the substring represented by the span sp.

procedure
($foldl f p-base p-ele) → (Parser B)
  f : (-> A B B)
  p-base : (Parser B)
  p-ele : (Parser A)

Folds left to right over the input string until p-ele fails, combining the results using f. f is used to combine results before the next iteration.

Example:

> (parse ($foldl cons ($return '()) $any-char) "abcdef")
'(#\f #\e #\d #\c #\b #\a)

procedure
($foldr f p-base p-ele) → (Parser B)
  f : (-> A B B)
  p-base : (Parser B)
  p-ele : (Parser A)

Folds right to left over the input string until p-ele fails, combining the results using f. f is used to combine results after the next iteration.

Example:

> (parse ($foldr cons ($return '()) $any-char) "abcdef")
'(#\a #\b #\c #\d #\e #\f)

value
$read-line : (Parser Span)

Succeeds until the next newline character, or the end of the string, returning the span.

value
$skip-line : (Parser Void)

Succeeds until the next newline character, or the end of the string, but without computing the span.

value
$spaces : (Parser Span)

Succeeds against zero or more #\spaces.

value
$whitespaces : (Parser Span)

Succeeds against zero or more #\space, #\newline, and #\tab characters.

procedure
($string s) → (Parser Span)
s : String

Succeeds if the characters of s match a substring of the input starting at the current position.

procedure
($satisfy pred) → (Parser Char)
pred : (-> Char Boolean)

Succeeds if pred returns true given the current character.

value
$identifier-char : (Parser Char)

Succeeds against a character that is valid in an identifier?.

value
$identifier : (Parser Span)

Succeeds against a one or more $identifier-char.

value
$digit : (Parser Char)

Succeeds against a character that is char-numeric?.

value
$digits : (Parser Span)

Succeeds against one or more $digit.