fast-parse
| (require fast-parse) | package: fast-parse-lib |
A fast monadic parsing combinator library. fast-parse values fast above all else. fast-parse is not general purpose. fast-parse is not stable. fast-parse is not safe. fast-parse is not easy to use. Maybe lesser desiderata will improve in time, but above all, fast-parse is fast.
Big let-expression microbenchmark |
------------------------------------------------------------------------ |
fast-parse fused: cpu time: 3 real time: 3 gc time: 0 |
fast-parse sexpr: cpu time: 25 real time: 25 gc time: 12 |
Racket `read`: cpu time: 56 real time: 59 gc time: 4 |
Racket `read-syntax`: cpu time: 72 real time: 75 gc time: 13 |
parsack: cpu time: 577 real time: 597 gc time: 103 |
|
Generic s-expression microbenchmark |
------------------------------------------------------------------------ |
fast-parse: cpu time: 0 real time: 0 gc time: 0 |
Racket `read`: cpu time: 3 real time: 3 gc time: 0 |
Racket `read-syntax`: cpu time: 5 real time: 5 gc time: 0 |
parsack: cpu time: 37 real time: 38 gc time: 0 |
How? Cheating, copiously.
Unfortunately, it still has a long way to go. The design of fast-parse is largely stolen stolen from Haskell’s flatparse, by András Kovács, and we still can’t compete.
flatparse sexp comparison |
------------------------------------------------------------------------ |
flatparse: 1.80ms |
fast-parse skip sexpr: cpu time: 32 real time: 33 gc time: 5 |
Racket `read`: cpu time: 182 real time: 190 gc time: 15 |
Racket `read-syntax`: cpu time: 356 real time: 364 gc time: 129 |
I may change what I have to in order to achieve my goals. If you want to use this, pin your project to a specific version.
1 Limitations
fast-parse will not work on strings larger than (expt 2 30) characters, or about between 1Gb and 4.29Gb depending on the string. fast-parse will only work on 64 bit machines, with 60 bit fixnums. This may change; see spans for more.
2 Design
0-allocation: fast-parse avoids performing any allocation. It does not allocate intermediate data structures, unless you tell it to. It will not copy out a substring, unless you tell it to. It even avoids creating closures, as much as automatically possible.
0-memory indirects: fast-parse avoids accessing memory as much as possible. It must of course access the string to parse, but it won’t access its length more than once. It will avoid reading from the string more than once, and enable users to do the same. You can commit a peek without touching memory.
No checks: fast-parse avoids unnecessary run-time checks. This includes dynamic type checks and contracts.
There are several design decisions that follow from these principles.
fast-parse does not work over input-port?s. input-port?s violate all our design principles. Instead, fast-parse works over strings. If you give it a port, it will copy the entire port into a string before parsing.
(define-type (Parser A) (U (All (A) (-> String Natural Natural (Values A Natural))) (-> String Natural Natural (Values Natural failure))))
A consuming parser is one that advances the current position, regardless of success or failure. A nonconsuming parser does not advance the current position, again regardless of success or failure.
The multiple return values follow 0-allocation, improving run-time performance in microbenchmarking.
Essentially all parsers are and should be inlined aggressively. This is also follows from 0-allocation and 0-memory indirects, due to our representation of parsers and parser combinators. A clousre representation of parsers is quite convenient but, without aggressive inlining, it allocates and indirects through too many closures. Thankfully, these closure indirects really are mostly unnecessary, and can be optimized away, as long as Racket is given the right hints. define-parser wraps begin-encourage-inline and is used aggressively, and define-inline is used for many small functions, parsers, and parser combinators.
The aggressive inlining massively improves performance, but has two costs. It approximately doubles object code size compared to no inlining hints; this is an acceptable cost I think. It also severaly impacts debugging. If a parser combinator is used incorrectly, errors seems to come from deep inside fast-parse. I don’t have a good solution for this.
fast-parse uses racket/unsafe/ops and unsafe-assert-unreachable copiously. Any function or form that might cause a dynamic tag check or contract check is avoided. Every comparison is made using eq? or unsafe fixnum operators. It is careful to maintain its own type invariants and these uses shouldn’t leak out of the internals, but they might, and they might corrupt your program. In principle, Typed Racket could be used to make this more safe, but Typed Racket doesn’t have great support for define-inline or multiple return values.
Spans are used to represent a successful parse of a substring, indicating the start and end position of the parse string that successfully matched. Internally, spans are presented as a single 60-bit fixnum, with the 30 high-order bits representing the start, and 30 low-order bits representing the end (exclusive). The start and end are suitable arguments to substring. There is hopefully no reason to worry about this internal representation, which can be efficiently interacted with through its interface: the match expander span, constructor make-span, and accessors span-start and span-end. But if a seemingly arbitrary number is returned from a parser or shows up in an error message, it might be a span.
This representation of spans imposes a limit on the maximum size of strings that can be parsed of (expt 2 30). Microbenchmarking suggests this representation of spans is not significantly more performant than use cons, which would lift this restriction. However, the representation of spans follows from 0-allocation desgin principle so I’m reluctant to deviate without expressed user permission. I may work on a way to give a compile-time option to change the representation.
3 Library
procedure
(parse p input) → A
p : (Parser A) input : (or/c input-port? string?)
> (define-parser $open ($do $whitespaces $left)) > (define-parser $close ($do $whitespaces $right)) > (define-parser ($to-symbol x) ($return (string->symbol x))) > (define-parser $atom (>> (>> $identifier $get-substring) $to-symbol))
> (define-parser $sexp ($or ($do $whitespaces $atom) ($do $open ($do0 ($foldr cons ($return '()) $sexp) $right)))) > (parse $sexp "(foo (bar (baz biz buzz) boing) bong)") '(foo (bar (baz biz buzz) boing) bong)
> (parse $sexp "(foo [ (bar (baz biz buzz) boing) bong)") Parse failed in string:1:5:
(foo [ (bar (baz biz buzz) boing) bong)
^
value
failure : Failure
procedure
(okay r i) → (Parse-Result A)
r : A i : Position
procedure
(fail i) → (Parse-Result A)
i : Position
syntax
(span start end)
start : Position
end : Position
procedure
(make-span start end) → Span
start : Position end : Position
procedure
(span-start s) → Position
s : Span
procedure
(span-end s) → Position
s : Span
syntax
(define-parser name expr)
(define-parser (head args ...) expr ... expr0)
procedure
($return v) → (Parser A)
v : A
syntax
(>> e ...)
syntax
($do cmds ...)
cmd = (x <- e) | e
syntax
($do0 e0 cmds ...)
cmd = (x <- e) | e
value
$fail : (Parser A)
value
$get-pos : (Parser Position)
syntax
($or* pa ...)
procedure
($with-option pa just pnothing) → (Parser (U B C))
pa : (Parser A) just : (-> A (Parser B)) pnothing : (Parser C)
procedure
($many+ pa) → (Parser Span)
pa : (Parser A)
procedure
($many* pa) → (Parser Span)
pa : (Parser A)
procedure
($until p-stop) → (Parser Span)
p-stop : (Parser A)
procedure
($get-substring sp) → (Parser String)
sp : Span
value
$read-line : (Parser Span)
value
$skip-line : (Parser Void)
value
$spaces : (Parser Span)
value
$whitespaces : (Parser Span)
value
$identifier-char : (Parser Char)
value
$identifier : (Parser Span)
value
$digits : (Parser Span)