Bytecode VM
The Evaluator
Sema compiles to bytecode and runs on the VM. Every entry point — the CLI, the REPL, the embedding API, eval, import/load, macros, and async/await — compiles to bytecode and runs on the VM.
Overview
Sema's evaluator is a bytecode VM. The VM compiles Sema source code into stack-based bytecode for fast execution. On compute-heavy workloads it is fast — 500 iterations of the TAK benchmark (tak 18 12 6) run in roughly 1.9 s (≈3.7 ms/iteration) in a plain release build on a modern laptop. The PGO-optimized release binaries shipped via cargo-dist / Homebrew (v1.19.2+) run this benchmark ~30% faster again; see Performance Internals.
Macro expansion and dynamic eval are handled by sema-eval, which expands macros to a Value AST and then feeds them through the same compile-to-bytecode pipeline.
Compilation Pipeline
Source text
→ Reader (tokenize + parse → Value AST)
→ Macro expand (sema-eval expands macros)
→ Lower (Value AST → CoreExpr IR)
→ Optimize (constant folding + simplification on CoreExpr)
→ Resolve (CoreExpr → ResolvedExpr with slot/upvalue/global analysis)
→ Compile (ResolvedExpr → bytecode Chunks)
→ VM execution (dispatch loop)Phase 1: Lowering (Value → CoreExpr)
The lowering pass converts the Value AST into CoreExpr, a desugared intermediate representation. All ~40 special forms are lowered to ~35 CoreExpr variants. Several forms desugar into simpler ones:
| Source Form | Lowers To |
|---|---|
cond | Nested If |
when | If with nil else |
unless | If with swapped branches |
case | Let + nested If with Or comparisons |
defun | Define + Lambda |
Named let | Letrec + Lambda |
Tail position analysis happens during lowering. The Call node carries a tail: bool flag, set based on position:
- Tail: last expression in
lambdabody,begin,let/let*/letrecbody,ifbranches,condclauses,and/orlast operand - Not tail:
trybody (handler must be reachable),doloop body, non-last expressions
Phase 2: Variable Resolution (CoreExpr → ResolvedExpr)
The resolver walks the CoreExpr tree and classifies every variable reference as one of:
| Resolution | Opcode | Description |
|---|---|---|
Local { slot } | LoadLocal / StoreLocal | Variable in the current function's frame |
Upvalue { index } | LoadUpvalue / StoreUpvalue | Captured from an enclosing function |
Global { spur } | LoadGlobal / StoreGlobal | Module-level binding |
This is a key optimization: instead of hash-based environment chain lookup (O(scope depth) per access), variables are accessed by direct slot index (O(1)).
Upvalue Capture
Closures use the Lua/Steel upvalue model. When a lambda references a variable from an enclosing function:
- The resolver marks the outer local as captured
- An
UpvalueDescentry is added to the inner lambda:ParentLocal(slot)if capturing from the immediate parent,ParentUpvalue(index)if capturing through an intermediate function
(lambda (x) ; x = Local slot 0
(lambda () ; captures x: UpvalueDesc::ParentLocal(0)
(lambda () ; captures through chain: UpvalueDesc::ParentUpvalue(0)
x))) ; resolves to Upvalue { index: 0 }Phase 3: Bytecode Compilation (ResolvedExpr → Chunk)
The instruction format echoes the IBM 704 (1955)
The 704's Type A instruction packed four fields into a single 36-bit word: prefix (opcode), decrement (constant parameter), tag (register selector), and address (operand location). Sema's bytecode uses a strikingly similar structure — each instruction is an opcode byte followed by inline operands (constant indices, slot numbers, jump offsets). The 704 also had a CAS (Compare Accumulator with Storage) instruction that performed a 3-way branch in a single operation: skip 0, 1, or 2 instructions depending on less-than, equal, or greater-than. This is pattern matching as a hardware primitive — the ancestor of the conditional jump patterns Sema's compiler generates for cond and match.
The compiler (compiler.rs) transforms ResolvedExpr into bytecode Chunks. The Compiler struct wraps an Emitter (bytecode builder) and collects Function templates for inner lambdas.
Compilation strategies:
- Constants:
Nil,True,Falseget dedicated opcodes. All other constants useConst+ constant pool. - Variables:
LoadLocal/StoreLocalfor locals,LoadUpvalue/StoreUpvaluefor captures,LoadGlobal/StoreGlobal/DefineGlobalfor globals. - Control flow:
ifusesJumpIfFalse+Jumpfor short-circuit.and/oruseDup+ conditional jumps to preserve the last truthy/falsy value. - Lambdas: compiled to separate
Functiontemplates, referenced byMakeClosureinstruction with upvalue descriptors. doloops: compile to backwardJumpwithJumpIfTruefor exit test.try/catch: adds entries to the chunk's exception table, no inline opcodes.- Named let: desugared to
letrec+lambdaduring lowering — the loop body becomes a closure compiled viaMakeClosure.
Runtime-delegated forms — forms that can't be compiled to pure bytecode are compiled as calls to __vm-* global functions registered by sema-eval:
| Source Form | Delegate |
|---|---|
eval | __vm-eval |
import | __vm-import |
load | __vm-load |
defmacro | __vm-defmacro-form (passes entire form as quoted constant) |
define-record-type | __vm-define-record-type |
delay | __vm-delay (body wrapped in a zero-arg lambda thunk) |
force | __vm-force |
prompt, message, deftool, defagent, macroexpand | Corresponding __vm-* delegates |
Public API: compile(exprs, n_locals, known_natives) returns CompileResult { chunk, functions, native_table }. When known_natives is provided, calls to those globals emit CallNative for direct dispatch.
Compiler Optimizations
- Intrinsic recognition: Known builtins are compiled to inline opcodes instead of function calls, eliminating global lookup,
Rcdowncast, argumentVecallocation, and function pointer dispatch. Arithmetic/comparison:+,-,*,/,<,>,<=,>=,=,not. List/predicates:car/first,cdr/rest,cons,null?,pair?,list?,number?,string?,symbol?,length. Collections:append(2-arg),get,contains?,nth,mod/modulo. - Peephole:
(if (not X) ...): The pattern(if (not X) A B)compiles toJumpIfTrueinstead ofNot+JumpIfFalse, eliminating one instruction. - Fused
CallGlobal: Non-tail calls to global functions use a fusedCallGlobalinstruction that combinesLoadGlobal+Callinto a single opcode with(u32 spur, u16 argc, u16 cache_slot)operands. - Specialized
LoadLocal/StoreLocal: Slots 0–3 have dedicated zero-operand opcodes (LoadLocal0..LoadLocal3,StoreLocal0..StoreLocal3), saving 2 bytes per instruction for the most frequently accessed locals.
Phase 4: VM Execution
The VM (vm.rs) is a stack-based dispatch loop.
Core structs:
VM { stack: Vec<Value>, frames: Vec<CallFrame>, globals: Rc<Env>, functions: Rc<Vec<Rc<Function>>>, inline_cache: Vec<(u32, u64, Value)>, native_fns: Vec<Rc<NativeFn>> }
CallFrame { closure: Rc<Closure>, pc: usize, base: usize, open_upvalues: Option<Vec<Option<Rc<UpvalueCell>>>>, cache_base: usize }Key design points:
- Unsafe hot path: The dispatch loop uses
unsafeunchecked stack operations (pop_unchecked) and raw pointer bytecode reads viaread_u16!/read_i32!/read_u32!macros for performance. Opcodes are dispatched by matching the raw byte againstu8constants (theopmodule), avoiding decode overhead;std::mem::transmuteis used only to reconstructSpurhandles fromu32operands. Debug builds retain bounds checks viadebug_assert!. - Closure interop: VM closures are wrapped as
Value::NativeFnvalues so code outside the VM can call them. Each NativeFn carries anRc<dyn Any>payload containingVmClosurePayload(closure + function table), and the VM usesraw_tag()+downcast_refto avoidRcrefcount bumps on the hot path. When called from outside the VM (e.g., stdlib higher-order-function callbacks), the NativeFn wrapper creates a fresh VM instance to execute the closure's bytecode; in-VM calls unwrap the payload and run in the same VM. - Upvalue cells: Lua-style open upvalues.
UpvalueCellholds aRefCell<UpvalueState>—Open { frame_base, slot }points into the VM stack while the defining frame is alive;Closed(Value)owns the value after the frame exits. Locals are read and written directly on the stack (no cell indirection); cells are closed when a frame returns, tail-calls, unwinds — and before any non-VM call (see Current Limitations). - Exception handling:
Throwopcode triggers handler search via the chunk's exception table. Stack is restored to saved depth, error value pushed, PC jumps to handler.
Entry points: VM::execute() takes a closure and EvalContext. compile_program() is the pipeline for normal compilation: Value AST → lower → optimize → resolve → compile → CompiledProgram. compile_program_with_spans() adds span/source-file support for debug (DAP breakpoints).
VM Optimizations
- Two-level dispatch loop: An outer loop caches frame-local state (code pointer, constants pointer, base offset) into local variables. The inner loop dispatches opcodes without re-fetching frame data. Frame state is only reloaded when control flow changes frames (
Call,TailCall,Return, exceptions). - NaN-boxed int fast paths:
AddInt/SubInt/MulInt/LtInt/EqIntoperate directly on raw NaN-boxed bits — sign-extending the payload, performing the arithmetic, and re-boxing, without ever constructing aValue. - Per-instruction global inline cache: Every
LoadGlobal/CallGlobalinstruction carries au16cache-slot operand indexing into a per-VMVec<(spur, env_version, Value)>. A hit (matching spur + env version) skips theEnvlookup entirely; entries are invalidated by env version mismatch when a global is redefined. - Raw pointer bytecode reads:
read_u16!,read_i32!, andread_u32!macros read operands via raw pointer arithmetic on the code buffer, avoiding bounds checks in release builds. - Unsafe unchecked stack operations:
pop_uncheckedskips length checks (the compiler guarantees stack correctness).debug_assert!guards catch violations in debug builds. - Cold path factoring: The
handle_err!macro factors exception handling out of the hot instruction sequence, keeping the fast path compact for better instruction-cache behavior.
Opcode Set
The VM uses a stack-based instruction set with variable-length encoding. Each opcode is one byte, followed by operands (u16, u32, or i32).
Constants & Stack
| Opcode | Operands | Description |
|---|---|---|
Const | u16 index | Push constants[index] |
Nil | — | Push nil |
True | — | Push #t |
False | — | Push #f |
Pop | — | Discard top of stack |
Dup | — | Duplicate top of stack |
Variable Access
| Opcode | Operands | Description |
|---|---|---|
LoadLocal | u16 slot | Push locals[slot] |
StoreLocal | u16 slot | locals[slot] = pop |
LoadUpvalue | u16 index | Push upvalues[index].get() |
StoreUpvalue | u16 index | upvalues[index].set(pop) |
LoadGlobal | u32 spur, u16 cache_slot | Push globals[spur] (inline-cached) |
StoreGlobal | u32 spur | globals[spur] = pop |
DefineGlobal | u32 spur | Define new global binding |
LoadLocal0..LoadLocal3 | — | Push locals[0..3] (zero-operand fast path) |
StoreLocal0..StoreLocal3 | — | locals[0..3] = pop (zero-operand fast path) |
Control Flow
| Opcode | Operands | Description |
|---|---|---|
Jump | i32 offset | Unconditional relative jump |
JumpIfFalse | i32 offset | Pop, jump if falsy |
JumpIfTrue | i32 offset | Pop, jump if truthy |
Functions
| Opcode | Operands | Description |
|---|---|---|
Call | u16 argc | Call function with argc args |
TailCall | u16 argc | Tail call (reuse frame for TCO) |
Return | — | Return top of stack |
MakeClosure | u16 func_id, u16 n_upvalues, ... | Create closure from function template |
CallNative | u16 native_id, u16 argc | Direct native function call (no env lookup) |
CallGlobal | u32 spur, u16 argc, u16 cache_slot | Fused global lookup + call (inline-cached) |
Data Constructors
| Opcode | Operands | Description |
|---|---|---|
MakeList | u16 n | Pop n values, push list |
MakeVector | u16 n | Pop n values, push vector |
MakeMap | u16 n_pairs | Pop 2n values, push sorted map |
MakeHashMap | u16 n_pairs | Pop 2n values, push hash map |
Arithmetic & Comparison
| Opcode | Description |
|---|---|
Add, Sub, Mul, Div | Generic arithmetic (int/float/string) |
Negate, Not | Unary operators |
Eq, Lt, Gt, Le, Ge | Generic comparison |
AddInt, SubInt, MulInt | Specialized int fast paths |
LtInt, EqInt | Specialized int comparison |
Inline Intrinsics
Zero-operand opcodes emitted by intrinsic recognition (bypass CallGlobal overhead):
| Opcode | Description |
|---|---|
Car, Cdr, Cons | List operations |
IsNull, IsPair, IsList, IsNumber, IsString, IsSymbol | Type predicates |
Length, Append, Get, ContainsQ, Nth | Collection operations |
Mod | Integer modulo fast path |
Exception Handling
| Opcode | Description |
|---|---|
Throw | Pop value, raise as exception |
Exception handling uses an exception table on the Chunk rather than inline opcodes. Each entry specifies a PC range, handler address, and stack depth to restore.
Crate Structure
The bytecode VM lives in the sema-vm crate, which sits between sema-reader and sema-eval in the dependency graph:
sema-core ← sema-reader ← sema-vm ← sema-evalsema-vm depends on sema-core (for Value, Spur, SemaError) and sema-reader (for parsing in test helpers). It does not depend on sema-eval — the evaluator will depend on the VM, not the other way around.
Source Files
| File | Purpose |
|---|---|
opcodes.rs | Op enum — 66 bytecode opcodes |
chunk.rs | Chunk (bytecode + constants + spans), Function, UpvalueDesc |
emit.rs | Emitter — bytecode builder with jump backpatching |
disasm.rs | Human-readable bytecode disassembler |
core_expr.rs | CoreExpr and ResolvedExpr IR enums |
lower.rs | Value AST → CoreExpr lowering pass |
resolve.rs | Variable resolution (local/upvalue/global analysis) |
compiler.rs | Bytecode compiler (ResolvedExpr → Chunk) |
vm.rs | VM dispatch loop, call frames, closures, exception handling |
optimize.rs | Constant folding and simplification on CoreExpr IR |
serialize.rs | Bytecode serialization/deserialization for .semac file format |
scheduler.rs | Cooperative async task scheduler (VM-per-task, yield signals) |
debug.rs | VM debug hooks for DAP (breakpoints, stepping, state queries) |
Async Execution (VM-Only)
Async/await and channels are implemented entirely in the VM.
The model is VM-per-task with cooperative scheduling:
- Each
async/spawncreates a new VM instance that shares the parent's globalEnv(Rc<Env>) and function table (Rc<Vec<Rc<Function>>>). Tasks are cheap: no threads, no work stealing — everything stays single-threaded. - A cooperative scheduler in
scheduler.rsruns tasks round-robin. A task runs until it yields (e.g.awaiton a pending promise, channel operations,async/sleep). - Yielding is signaled via a thread-local yield signal (
sema-core/src/async_signal.rs), not an error variant. The VM checks the signal after every native call (CallNative,CallGlobal). - On yield, the VM leaves a
nilplaceholder on the stack and advances the PC past the call. On resume, the scheduler swaps the placeholder for the wake value (replace_stack_top), so from bytecode's perspective the call simply returned.
This replaced an earlier replay-based design that re-executed entire task bodies on resume (corrupting side effects). Promises support cancellation (PromiseState::Cancelled), and task wake-ups preserve FIFO order.
Yield-aware native functions must work on both closure paths (in-VM and the fresh-VM fallback described under Current Limitations) — see vm_async_test.rs for the VM-only test suite.
Current Limitations
- The compiler emits inline opcodes for common builtins (
+,-,*,/,<,>,<=,>=,=,not,car/first,cdr/rest,cons,null?,pair?,list?,number?,string?,symbol?,length,append,get,contains?,nth,mod/modulo) via intrinsic recognition. Redefining one of these names in the same program suppresses the intrinsic for that program, but a redefinition from a separate compilation unit (e.g., an earlier REPL entry) does not — the intrinsic still fires. CallNativeoptimization requires passingknown_nativesat compile time (done automatically byeval_str_compiled); without it, all global calls useCallGlobalset!to a captured local is silently lost when the closure is invoked through a stdlib higher-order function (map,filter,for-each, …) — upvalue cells are closed to snapshots before every non-VM call, so the callback mutates a detached copy. Globals and in-VM calls are unaffected. Usefoldlwith explicit accumulator threading as a workaround.
CLI Usage
The bytecode VM runs everything — no flag is required.
# Run a file
sema examples/hello.sema
# Start the REPL
semaDesign Decisions
Why Not Delete CoreExpr After Resolution?
The pipeline uses two IR types: CoreExpr (variables as names) and ResolvedExpr (variables as slots). This provides type-level safety — the compiler can only receive resolved expressions, preventing accidental use of unresolved variable references.