Performance Internals

Sema's default execution path is a tree-walking interpreter; a significantly faster bytecode VM is available via --vm (see Bytecode VM).

Early optimizations brought the 1 Billion Row Challenge (1BRC) benchmark from ~25s to ~9.6s on 10M rows using a "mini-eval" — a minimal evaluator inlined in the stdlib that bypassed the full trampoline. The mini-eval was later removed for architectural reasons: it drifted semantically from the real evaluator and blocked the path to a bytecode VM. Fast-path optimizations in the real evaluator partially recovered performance, bringing the tree-walker to ~2,900ms on 1M rows (vs ~960ms with the mini-eval). The bytecode VM now achieves ~1,700ms on 1M rows and ~15,900ms on 10M rows, recovering much of the mini-eval's performance through compilation rather than inlining. This page documents each optimization, its history, and measured impact.

All benchmarks were run on Apple Silicon (M-series), processing the 1BRC dataset (semicolon-delimited weather station readings, one per line).

Benchmark Summary

| Stage | 1M rows | 10M rows | Technique | Status |
|---|---|---|---|---|
| Baseline | 2,501 ms | ~25,000 ms | Naive implementation | — |
| + COW assoc | 1,800 ms | ~18,000 ms | In-place map mutation | ✅ Active |
| + Env reuse | 1,626 ms | 16,059 ms | Lambda env recycling (mini-eval) | ❌ Removed |
| + Mini-eval | ~960 ms | ~9,600 ms | Inlined builtins, custom parser | ❌ Removed |
| + String interning | — | — | Spur-based dispatch | ✅ Active |
| + hashbrown | — | — | Amortized O(1) accumulator | ✅ Active |
| Post-removal | ~2,900 ms | ~28,400 ms | Callback architecture + fast paths | ✅ Current (tree-walker) |
| Bytecode VM | ~1,700 ms | ~15,900 ms | Bytecode VM (--vm) | ✅ Active |

Note: The mini-eval and its associated optimizations (env reuse, inlined builtins, custom number parser, SIMD split fast path) were removed to unblock the bytecode VM, which is now implemented and available via --vm. The bytecode VM provides a ~1.7× speedup over the tree-walker (~1,700ms vs ~2,900ms on 1M rows), recovering much of the mini-eval's performance through compilation. The tree-walker remains the default; the current architecture uses sema_core::call_callback to route stdlib → real evaluator. Fast-path optimizations (self-evaluating short-circuit, inline NativeFn dispatch, thread-local EvalContext, deferred cloning) partially recovered performance.

VM compute benchmarks (Feb 2026, post-stdlib intrinsics): TAK 1,234ms, upvalue-counter 440ms, deriv 879ms. The deriv benchmark — dominated by car/cdr/cons/pair? — improved 22% from stdlib intrinsic opcodes. The 1BRC numbers above are I/O-bound and less affected by VM compute optimizations.

1. Copy-on-Write Map Mutation

Problem: Every (assoc map key val) call cloned the entire BTreeMap, even when no other reference existed. For the 1BRC accumulator (~400 weather stations), this was O(400) per row × millions of rows.

Solution: Use Rc::try_unwrap to check if the reference count is 1. If so, take ownership and mutate in place. Otherwise, clone.

rust
// crates/sema-stdlib/src/map.rs — the assoc fast path
// Take ownership when we hold the only reference; otherwise copy-on-write.
let mut map = match Rc::try_unwrap(m) {
    Ok(map) => map,                // refcount == 1: we own it, mutate in place
    Err(m) => m.as_ref().clone(),  // shared: must clone
};
map.insert(key, val);

The key insight is pairing this with Env::take() — by removing the accumulator from the environment before passing it to assoc, the refcount drops to 1, enabling the in-place path. User code looks like:

scheme
(file/fold-lines "data.csv"
  (lambda (acc line)
    (let ((parts (string/split line ";")))
      (assoc acc (first parts) (second parts))))
  {})

The fold-lines implementation moves (not clones) acc into the lambda env on each iteration, keeping the refcount at 1.
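
The whole pattern, as a self-contained sketch (hypothetical standalone code, not Sema's internals): because the accumulator is moved into the call, no second Rc handle survives, Rc::try_unwrap succeeds, and the insert happens in place.

rust
// Hypothetical demo of the refcount-1 fast path.
use std::collections::BTreeMap;
use std::rc::Rc;
fn assoc(m: Rc<BTreeMap<String, i64>>, k: String, v: i64) -> Rc<BTreeMap<String, i64>> {
    let mut map = match Rc::try_unwrap(m) {
        Ok(map) => map,                         // sole owner: mutate in place
        Err(shared) => shared.as_ref().clone(), // shared: O(n) clone
    };
    map.insert(k, v);
    Rc::new(map)
}
fn main() {
    let mut acc = Rc::new(BTreeMap::new());
    for i in 0..3 {
        // `acc` is moved, not cloned, so the refcount stays at 1.
        acc = assoc(acc, format!("station-{i}"), i);
    }
    assert_eq!(acc.len(), 3);
}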

Impact: ~30% of the total speedup. Eliminated the O(n) full-map clone, leaving only the O(log n) BTreeMap insert per row.

Literature:

  • This is the same copy-on-write strategy used by Swift's value types. (Clojure's persistent data structures solve a related problem — avoiding full copies — but via structural sharing rather than refcount-based COW.)
  • Phil Bagwell, "Ideal Hash Trees" (2001) — the paper behind Clojure/Scala persistent collections
  • Rust's Rc::make_mut provides the same semantics with less ceremony

2. Lambda Environment Reuse (removed)

Status: This optimization was part of the mini-eval's hot path in io.rs. It was removed when the mini-eval was deleted. The current file/fold-lines uses sema_core::call_callback, which routes through the real evaluator — each call creates a fresh Env via the standard apply_lambda path.

What it was: For simple lambdas (known arity, no rest params), the mini-eval created the lambda environment once and reused it across all iterations, overwriting bindings in place. Combined with a reusable line_buf, this eliminated per-iteration allocations for Env, string interning, and line buffers.

Why it was removed: The env reuse logic was tightly coupled to the mini-eval's direct lambda dispatch. The callback architecture routes through the real evaluator's apply_lambda, which always creates a fresh child Env — this is correct and avoids subtle bugs from env mutation leaking across calls.

Impact when active: ~15% speedup (2,501ms → 1,626ms combined with COW assoc).

What remains: The reusable line_buf (String::with_capacity(64) cleared each iteration) is still present in file/fold-lines — only the env reuse was lost.

3. Evaluator Callback Architecture (replacing Mini-Eval)

Status: The mini-eval was deleted and replaced with a callback architecture. Stdlib now calls the real evaluator via sema_core::call_callback.

What the mini-eval was: sema-stdlib previously contained its own minimal evaluator (sema_eval_value) that handled common forms via direct recursive calls, inlining builtins like +, =, assoc, string/split, and string->number to skip Env lookup and NativeFn dispatch entirely.

Why it was removed:

  1. Semantic drift: The mini-eval diverged from the real evaluator — new special forms, error handling, and features had to be duplicated or were silently missing.
  2. Blocking bytecode VM: A bytecode compiler can't target two evaluators. Removing the mini-eval ensures a single evaluation path that the VM can replace.

The callback architecture: sema-stdlib cannot depend on sema-eval (circular dependency). Instead, sema-eval registers a thread-local callback (set_call_callback) at startup, and stdlib functions call sema_core::call_callback to invoke the real evaluator. A thread-local EvalContext (with_stdlib_ctx) is shared across calls to avoid per-call context allocation.

rust
// crates/sema-stdlib/src/io.rs — file/fold-lines via callback
sema_core::with_stdlib_ctx(|ctx| {
    let mut line_buf = String::with_capacity(64);
    loop {
        line_buf.clear();
        let n = reader.read_line(&mut line_buf)?;
        if n == 0 { break; }
        // Calls the real evaluator (eval_value) via thread-local callback
        acc = sema_core::call_callback(ctx, &func, &[acc, Value::string(&line_buf)])?;
    }
    Ok(acc)
})
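
The registration side can be sketched as follows (signatures simplified; the real sema_core callback takes an EvalContext, the callee, and an argument slice). sema-eval installs the function pointer once at startup; sema-stdlib only ever sees the pointer, which breaks the circular dependency.

rust
// Illustrative wiring only, not the actual sema_core definitions.
use std::cell::Cell;
// Simplified stand-in for fn(&EvalContext, &Value, &[Value]) -> Result<Value, Error>.
type EvalCallback = fn(&str) -> Result<String, String>;
thread_local! {
    static EVAL_CALLBACK: Cell<Option<EvalCallback>> = Cell::new(None);
}
/// Registered once by the evaluator crate during startup.
pub fn set_call_callback(cb: EvalCallback) {
    EVAL_CALLBACK.with(|c| c.set(Some(cb)));
}
/// Invoked by stdlib builtins; routes the call into whichever evaluator registered itself.
pub fn call_callback(input: &str) -> Result<String, String> {
    let cb = EVAL_CALLBACK
        .with(|c| c.get())
        .ok_or_else(|| "no evaluator registered".to_string())?;
    cb(input)
}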

Performance trade-off: ~960ms → ~2,900ms on 1M rows (~3× regression). The overhead comes from the full trampoline evaluator: call stack management, span tracking, and Trampoline dispatch on every sub-expression.

Fast-path optimizations that partially recovered performance:

  1. Self-evaluating fast path: eval_value short-circuits for integers, floats, strings, keywords, and symbols — skipping depth tracking and step limits for the most common forms (see the sketch after this list).
  2. Inline NativeFn dispatch: When the evaluator sees a Value::NativeFn in call position, it calls the function pointer directly without going through call_callback indirection.
  3. Thread-local shared EvalContext: with_stdlib_ctx reuses a single EvalContext across all stdlib → evaluator callbacks, avoiding per-call allocation of RefCell/Cell fields.
  4. Deferred cloning: eval_value_inner avoids cloning the expression and environment on the first trampoline iteration, only cloning if a tail call (Trampoline::Eval) is returned.
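
A self-contained sketch of the self-evaluating fast path from item 1 (minimal stand-in types, not Sema's real Value, Env, or trampoline):

rust
// Atoms return immediately; only lists take the full trampoline path.
use std::collections::BTreeMap;
#[derive(Clone, Debug)]
enum Value { Int(i64), Str(String), Symbol(u32), List(Vec<Value>) }
type Env = BTreeMap<u32, Value>;
fn eval(expr: &Value, env: &Env) -> Result<Value, String> {
    match expr {
        // Fast path: self-evaluating atoms skip depth tracking and step limits.
        Value::Int(_) | Value::Str(_) => Ok(expr.clone()),
        // Symbols cost one env lookup but still bypass the trampoline.
        Value::Symbol(s) => env.get(s).cloned().ok_or_else(|| "unbound symbol".to_string()),
        // Everything else: depth checks, span tracking, trampoline dispatch.
        Value::List(_) => eval_full(expr, env),
    }
}
fn eval_full(_expr: &Value, _env: &Env) -> Result<Value, String> {
    unimplemented!("full trampoline evaluator")
}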

Remaining gap: The ~3× regression cannot be fully closed within the tree-walking architecture. The bytecode VM (--vm) — the reason the mini-eval was removed — reduces this to ~1.8× vs the mini-eval (~1,700ms vs ~960ms) and is ~1.7× faster than the tree-walker.

Literature:

  • Inline caching, pioneered by Smalltalk-80 and refined in V8's hidden classes, solves the same dispatch overhead problem but at a different architectural level
  • Most production Lisps (SBCL, Chez Scheme) compile to native code, making dispatch overhead negligible — Sema's callback overhead is inherent to tree-walking interpreters
  • Lua 5.x's bytecode VM inlines common operations (OP_ADD, OP_GETTABLE) into the dispatch loop — this is the approach Sema's bytecode VM (sema-vm) takes

4. String Interning (lasso)

Problem: Symbol/keyword equality was O(n) string comparison. Environment lookups keyed by String required comparing the full string on each BTreeMap node visit. Special form dispatch compared against 30+ string literals on every list evaluation.

Solution: Replace Rc<String> in Value::Symbol and Value::Keyword with Spur — a u32 handle from the lasso string interner. Environment bindings keyed by Spur for direct integer lookup.

rust
// Before: O(n) string comparison
Value::Symbol(Rc<String>)
env: BTreeMap<String, Value>

// After: O(1) integer comparison
Value::Symbol(Spur)  // u32
env: BTreeMap<Spur, Value>

Special form dispatch uses pre-cached Spur constants:

rust
// crates/sema-eval/src/special_forms.rs
struct SpecialFormSpurs {
    quote: Spur,
    if_: Spur,
    define: Spur,
    // ... 30 more
}

// Dispatch: integer comparison, no string resolution
if head_spur == sf.if_ {
    return Some(eval_if(args, env));
}

Caveat: The initial implementation was actually slower (2,518ms vs 1,580ms baseline) because resolve() was allocating a new String on every symbol lookup. Fixed by adding with_resolved(spur, |s| ...) which provides a borrowed &str without allocation, and switching Env to use Spur keys directly.
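
Why the closure-shaped accessor avoids the allocation is easiest to see in a sketch (the thread-local wiring here is illustrative, not Sema's exact code): a &str borrowed from inside the RefCell cannot be returned from the function, so a naive accessor must copy into an owned String, while the closure keeps the borrow alive only for the duration of the call.

rust
use lasso::{Rodeo, Spur};
use std::cell::RefCell;
thread_local! {
    static INTERNER: RefCell<Rodeo> = RefCell::new(Rodeo::new());
}
// Naive accessor: allocates a fresh String on every symbol lookup.
fn resolve_owned(spur: Spur) -> String {
    INTERNER.with(|i| i.borrow().resolve(&spur).to_string())
}
// Allocation-free accessor: the interned &str is only borrowed inside the closure.
fn with_resolved<R>(spur: Spur, f: impl FnOnce(&str) -> R) -> R {
    INTERNER.with(|i| f(i.borrow().resolve(&spur)))
}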

Impact: 1,580ms → 1,400ms (11% faster) after fixing the allocation issue.

Literature:

  • String interning is as old as Lisp itself — McCarthy's original LISP 1.5 (1962) interned atoms in the "object list" (oblist)
  • Java interns all string literals and provides String.intern(). The JVM's invokedynamic uses interned method names for O(1) dispatch
  • The string-interner and lasso crates are the two main Rust options; lasso was chosen for its Rodeo thread-local interner which fits Sema's single-threaded architecture

5. hashbrown HashMap

Problem: The 1BRC accumulator uses a map keyed by weather station name (~400 entries). BTreeMap provides O(log n) lookup, but the accumulator is accessed on every row. With 10M rows, the log₂(400) ≈ 9 comparisons per lookup adds up.

Solution: Added a Value::HashMap variant backed by hashbrown (the same hash map used inside Rust's std::collections::HashMap, but exposed directly for no_std compatibility and raw API access).

scheme
;; User code: opt into HashMap for the accumulator
(file/fold-lines "data.csv"
  (lambda (acc line) ...)
  (hashmap/new))  ; amortized O(1) vs O(log n)

;; Convert back to sorted BTreeMap for output
(hashmap/to-map acc)

BTreeMap remains the default for {} map literals because deterministic ordering matters for equality, printing, and test assertions. hashbrown is opt-in for performance-critical paths.

Impact: 1,400ms → 1,340ms (4% faster). Modest because BTreeMap with 400 entries and short string keys is already fast.

Literature:

  • hashbrown uses SwissTable, designed by Google for their C++ absl::flat_hash_map. See CppCon 2017: Matt Kulukundis "Designing a Fast, Efficient, Cache-friendly Hash Table"
  • Clojure's {:key val} maps use HAMTs (hash array mapped tries) which provide O(~1) lookup with structural sharing. Sema's approach is simpler: full COW on the Rc<HashMap> rather than structural sharing, which is viable because the refcount-1 fast path almost always hits

6. SIMD Byte Search (memchr) (removed)

Status: The memchr-based two-part split fast path was part of the mini-eval's inlined string/split and was removed with it. The current string/split in sema-stdlib/src/string.rs uses Rust's standard str::split() followed by map and collect. The memchr crate remains a dependency of sema-stdlib but is no longer used in the split hot path.

What it was: A SIMD-accelerated (SSE2/AVX2/NEON) byte search via the memchr crate, combined with a two-part split fast path that avoided Vec allocation when splitting on a single-byte separator with exactly one occurrence (the common case in 1BRC: "Berlin;12.3" → ["Berlin", "12.3"]).
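
An illustrative sketch of the two-part fast path (hypothetical helper; assumes a single-byte ASCII separator):

rust
use memchr::memchr;
// Returns Some((head, tail)) only when `sep` occurs exactly once, the common
// 1BRC shape, so the caller can skip the allocating general split.
fn split_two(s: &str, sep: u8) -> Option<(&str, &str)> {
    debug_assert!(sep.is_ascii());
    let i = memchr(sep, s.as_bytes())?;      // SIMD-accelerated byte search
    let (head, tail) = (&s[..i], &s[i + 1..]);
    if memchr(sep, tail.as_bytes()).is_some() {
        return None;                         // more than one separator: general path
    }
    Some((head, tail))
}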

Impact when active: Negligible for SIMD specifically (1BRC strings are 10–30 bytes), but the two-part fast path avoided iterator/Vec overhead.

Literature:

  • memchr is maintained by Andrew Gallant (BurntSushi), author of ripgrep. It uses a generic SIMD framework to dispatch to the best available instruction set at runtime

7. Custom Number Parser (removed)

Status: This was part of the mini-eval's inlined string->number and was removed with it. The current string->number in sema-stdlib/src/string.rs uses Rust's standard str::parse::<i64>() with fallback to str::parse::<f64>().

What it was: A hand-rolled decimal parser that handled only [-]digits[.digits], using a precomputed powers-of-10 lookup table for 1–4 fractional digits. It returned None for complex cases (scientific notation, infinity, NaN), falling back to the standard parser.
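
The shape of such a parser, as an illustrative sketch (not the original code):

rust
// Accepts only [-]digits[.digits] with at most 4 fractional digits; returns
// None so the caller falls back to str::parse for everything else.
fn fast_parse(s: &str) -> Option<f64> {
    const POW10: [f64; 5] = [1.0, 10.0, 100.0, 1_000.0, 10_000.0];
    let (neg, body) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s),
    };
    let (int_str, frac_str) = body.split_once('.').unwrap_or((body, ""));
    if int_str.is_empty() || int_str.len() > 15 || frac_str.len() > 4 {
        return None; // empty or overlong: use the standard parser
    }
    let mut value = 0.0_f64;
    for b in int_str.bytes() {
        if !b.is_ascii_digit() { return None; } // e.g. scientific notation, NaN
        value = value * 10.0 + (b - b'0') as f64;
    }
    if !frac_str.is_empty() {
        let mut frac = 0_i64;
        for b in frac_str.bytes() {
            if !b.is_ascii_digit() { return None; }
            frac = frac * 10 + (b - b'0') as i64;
        }
        // Precomputed powers of 10 replace the general dec2flt path.
        value += frac as f64 / POW10[frac_str.len()];
    }
    Some(if neg { -value } else { value })
}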

Impact when active: Part of the combined mini-eval speedup. Difficult to isolate, but avoided the overhead of Rust's dec2flt algorithm.

Literature:

  • Rust's float parser is based on the Eisel-Lemire algorithm (2020), which is fast for a general-purpose parser but still does more work than necessary for simple decimals
  • Daniel Lemire's fast_float C++ library (and its Rust port) takes a similar "fast path for common cases" approach

8. Enlarged I/O Buffer

Problem: BufReader's default 8KB buffer means frequent syscalls for large files.

Solution: 256KB buffer for file/fold-lines.

rust
let mut reader = std::io::BufReader::with_capacity(256 * 1024, file);

Impact: Minor. CPU was the bottleneck, not I/O. But it's a free win — larger buffers amortize syscall overhead and improve sequential read throughput on modern SSDs.

9. Bytecode VM Optimizations

The bytecode VM (--vm) applies several optimizations beyond basic bytecode compilation. These are documented in detail in Bytecode VM; highlights below.

Intrinsic Recognition

The compiler recognizes calls to known builtins and emits inline opcodes instead of function calls:

Arithmetic & comparison (phase 1):

| Source | Compiled to | What it replaces |
|---|---|---|
| (+ a b) | AddInt | CallGlobal("+", 2) → hash lookup → NativeFn downcast → args Vec → function call |
| (- a b) | SubInt | Same overhead |
| (* a b) | MulInt | Same overhead |
| (< a b) | LtInt | Same overhead |
| (> a b) | Gt | Same overhead |
| (not x) | Not | Same overhead |

Stdlib: list operations & type predicates (phase 2, Feb 2026):

| Source | Compiled to | What it replaces |
|---|---|---|
| (car x) / (first x) | Car | Same overhead — pop list, push first element |
| (cdr x) / (rest x) | Cdr | Same — pop list, push tail |
| (cons h t) | Cons | Same — pop head+tail, push new list |
| (null? x) | IsNull | Same — push #t if nil or empty list |
| (pair? x) | IsPair | Same — push #t if non-empty list |
| (list? x) | IsList | Same — push #t if list |
| (number? x) | IsNumber | Same — push #t if int or float |
| (string? x) | IsString | Same — push #t if string |
| (symbol? x) | IsSymbol | Same — push #t if symbol |
| (length x) | Length | Same — push collection length as int |
| (append a b) | Append | Same — concatenate two lists (2-arg only) |
| (get m k) | Get | Same — map lookup, nil default (2-arg only) |
| (contains? m k) | ContainsQ | Same — push #t if key exists in map |

This eliminates global hash lookup, Rc downcast, argument Vec allocation, and function pointer dispatch — the entire call overhead — for the most common operations. The *Int opcodes include NaN-boxed small-int fast paths that operate directly on raw u64 bits, avoiding Clone/Drop overhead entirely.

All standard arithmetic and comparison operators are inlined; the generic opcodes (Div, Gt, Le, Ge) handle int/float coercion correctly.
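
A self-contained sketch of what intrinsic dispatch looks like inside the instruction loop (toy Value and Op types, not Sema's actual NaN-boxed representation):

rust
use std::rc::Rc;
#[derive(Clone, Debug)]
enum Value { Int(i64), Bool(bool), List(Rc<Vec<Value>>) }
enum Op { Const(Value), Car, IsNull, AddInt }
fn run(code: &[Op]) -> Option<Value> {
    let mut stack: Vec<Value> = Vec::new();
    for op in code {
        match op {
            Op::Const(v) => stack.push(v.clone()),
            // Intrinsic: head of a non-empty list, handled inline in the loop --
            // no global hash lookup, no args Vec, no function-pointer call.
            Op::Car => match stack.pop()? {
                Value::List(items) if !items.is_empty() => stack.push(items[0].clone()),
                _ => return None, // the real VM raises a type error here
            },
            Op::IsNull => {
                let v = stack.pop()?;
                stack.push(Value::Bool(matches!(v, Value::List(ref l) if l.is_empty())));
            }
            Op::AddInt => match (stack.pop()?, stack.pop()?) {
                (Value::Int(b), Value::Int(a)) => stack.push(Value::Int(a + b)),
                _ => return None,
            },
        }
    }
    stack.pop()
}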

Impact: Phase 1: TAK 4,352ms → 1,250ms (-71%), upvalue-counter 1,232ms → 450ms (-63%). Phase 2: deriv 1,123ms → 879ms (-22%), closure-storm 1,135ms → 1,029ms (-9%). The deriv benchmark is dominated by car/cdr/cons/pair? — exactly the functions that became intrinsics.

Constant Folding

An optimization pass (optimize.rs) runs on the CoreExpr IR between lowering and variable resolution. It folds compile-time-evaluable expressions (a sketch of the pass follows the list):

  • Arithmetic: (+ 1 2) → 3, (* 3 4) → 12
  • Comparisons: (< 1 2) → #t, (= 3 3) → #t
  • Boolean: (not #t) → #f
  • Control flow: (if #t a b) → a, (and #f x) → #f, (or #t x) → #t
  • Dead code: (begin 42 x) → (begin x) (pure constants before the last expression are eliminated)
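
A minimal sketch of such a folding pass over a toy IR (Sema's CoreExpr is richer; the names here are illustrative):

rust
#[derive(Clone, Debug)]
enum Expr { Int(i64), Bool(bool), Var(String),
            Add(Box<Expr>, Box<Expr>), If(Box<Expr>, Box<Expr>, Box<Expr>) }
fn fold(e: Expr) -> Expr {
    match e {
        Expr::Add(a, b) => match (fold(*a), fold(*b)) {
            (Expr::Int(x), Expr::Int(y)) => Expr::Int(x + y), // (+ 1 2) -> 3
            (a, b) => Expr::Add(Box::new(a), Box::new(b)),
        },
        Expr::If(c, t, f) => match fold(*c) {
            Expr::Bool(true) => fold(*t),  // (if #t a b) -> a
            Expr::Bool(false) => fold(*f), // (if #f a b) -> b
            c => Expr::If(Box::new(c), Box::new(fold(*t)), Box::new(fold(*f))),
        },
        other => other, // atoms and variables are left alone
    }
}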

Impact: Eliminates unnecessary instructions at compile time. Runtime impact on benchmarks is negligible (hot loops operate on variables), but reduces code size and improves startup for programs with constant subexpressions.

Peephole: (if (not X) ...) → JumpIfTrue

The compiler pattern-matches (if (not expr) then else) and emits the condition with an inverted jump, eliminating both the not call and one opcode dispatch:

scheme
;; Before: CallGlobal("not") + JumpIfFalse
;; After:  JumpIfTrue (condition compiled directly)
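
A sketch of the rewrite in compiler terms (toy condition IR and opcode names, not the actual sema-vm compiler):

rust
#[derive(Debug)]
enum Op { EmitCond, JumpIfFalse(usize), JumpIfTrue(usize) }
enum Cond { Plain, Not(Box<Cond>) }
// Compile the branch of `(if cond ...)`: strip a leading `not` and invert the
// jump polarity instead of emitting a `not` call plus an extra dispatch.
fn compile_branch(cond: &Cond, target: usize, out: &mut Vec<Op>) {
    match cond {
        Cond::Not(inner) => {
            compile_cond(inner, out);
            out.push(Op::JumpIfTrue(target)); // inverted polarity
        }
        Cond::Plain => {
            compile_cond(cond, out);
            out.push(Op::JumpIfFalse(target));
        }
    }
}
fn compile_cond(_c: &Cond, out: &mut Vec<Op>) {
    out.push(Op::EmitCond); // stand-in for compiling the condition expression
}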

Fused CallGlobal

Non-tail calls to global functions use a single CallGlobal instruction that combines LoadGlobal + Call, using call_vm_closure_direct to set up the call frame without needing the function value on the stack.

Specialized Local Access

Slots 0–3 have dedicated zero-operand opcodes (LoadLocal0..LoadLocal3, StoreLocal0..StoreLocal3), saving 2 bytes per access to the most common local variable slots.

Rejected Optimizations

Not everything we tried worked:

| Approach | Result | Why |
|---|---|---|
| HashMap for Env | Slower | BTreeMap is faster for the very small maps (1–3 entries) typical of let scopes. HashMap's hashing overhead exceeds BTreeMap's few integer comparisons at that size. |
| im-rc / rpds (persistent collections) | Slower | Structural sharing fights the COW optimization — the whole point is to avoid sharing and mutate in place when refcount is 1. |
| bumpalo / typed-arena | Incompatible | Values need to escape the arena (returned from functions, stored in environments). Arena allocation only works for temporaries. |
| compact_str / smol_str | Redundant | Once symbols/keywords are interned as Spur, small-string optimization for them is pointless. String values are still Rc<String> but they're not in the hot path for dispatch. |

Note: "Full evaluator callback" was previously listed here as rejected (4x slower than mini-eval). It is now the current architecture — the ~2.7× overhead vs the mini-eval is accepted as the cost of architectural correctness. The bytecode VM (--vm) is the path to eliminate this overhead entirely.

Architecture Diagram

The hot path for file/fold-lines with the current callback architecture (tree-walker):

file/fold-lines
  ├── BufReader (256KB buffer, reused line_buf)
  └── Per-line loop:
        ├── read_line → reused buffer (no alloc)
        ├── call_callback → real evaluator (eval_value)
        │     ├── self-evaluating fast path (ints, floats, strings skip depth tracking)
        │     ├── NativeFn inline dispatch (direct call, no callback indirection)
        │     ├── apply_lambda → fresh Env per call (no env reuse)
        │     ├── string/split → std str::split (no SIMD fast path)
        │     ├── string->number → std parse::<i64> / parse::<f64>
        │     └── assoc → COW in-place mutation (Rc refcount == 1)
        ├── thread-local EvalContext (shared, not per-call)
        └── acc moved, not cloned → preserves refcount == 1

When --vm is used, the bytecode VM bypasses the tree-walker entirely. Instead of call_callback → eval_value → trampoline, the VM compiles the lambda body to bytecode once and executes it in a tight instruction dispatch loop, eliminating trampoline overhead, per-call span tracking, and repeated AST traversal.