When the Compiler Is the Oracle

[Image: a blacksmith and his family gathered around a glowing piece of iron on an anvil, their faces lit by the forge]
An Iron Forge, Joseph Wright of Derby, 1772

I have been building Concrete for a while now. Something happened this week that I did not plan for, and it may end up mattering more than the things I set out to build on purpose.

I let an AI agent improve a Concrete program using only the compiler’s reports as feedback. No profiler. No benchmarks. The agent read what the compiler knew about the program, tried refactorings, checked if the compiler’s answers improved, and kept or reverted. It worked better than I expected. The compiler had made the search space clean enough that the agent did not have to wander through benchmark fog.

That is not just an AI trick. It points at why Concrete is useful in the first place. A language that makes authority, allocation, trust boundaries, and proof surface explicit is easier to audit, easier to optimize, and easier to automate against. You do not have to reconstruct the truth from profiler traces, stale docs, and reviewer intuition. The compiler can tell you what is true about the program, and that changes how you build systems software. To explain why, I need to start with what Concrete is and what makes it different.

#What Concrete is

Concrete is a systems language. It compiles to native code through LLVM, has explicit memory layout, manual ownership, FFI to C, and no garbage collector. It sits in the same space as Rust and Zig. But it makes a set of design bets that those languages do not make, and those bets are what made this week’s experiment possible.

The grammar is LL(1). The entire language parses with one token of lookahead. No ambiguous constructs, no context-dependent parsing. I chose this constraint on purpose. It means every program has one parse, every tool sees the same structure, and the syntactic surface is small. You can fit the grammar in your head.
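To make the one-token claim concrete, here is a toy LL(1) parser in Python. The grammar is invented for illustration, not Concrete's: the point is that every production is chosen by peeking at a single token, with no backtracking.

```python
# Toy LL(1) grammar (illustrative, not Concrete's):
#   expr := NUMBER | '(' expr '+' expr ')'
def parse_expr(tokens, i=0):
    tok = tokens[i]                      # the single token of lookahead
    if tok == '(':                       # one peek decides the production
        left, i = parse_expr(tokens, i + 1)
        assert tokens[i] == '+'
        right, i = parse_expr(tokens, i + 1)
        assert tokens[i] == ')'
        return left + right, i + 1
    return int(tok), i + 1               # otherwise it must be a NUMBER

value, _ = parse_expr(['(', '1', '+', '(', '2', '+', '3', ')', ')'])
print(value)  # 6
```

Because each choice is fixed by one peek, two tools reading the same program cannot disagree about its structure, which is the property the rest of this post leans on.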

Ownership is linear, not affine. In Rust, ownership is affine: values can be used at most once, but you can also forget about them and the compiler will insert a Drop call behind the scenes. In Concrete, owned values must be consumed once. If you forget to clean up a resource, the program does not compile. Cleanup is explicit with destroy and defer. This is stricter than Rust and more annoying to write, but it means the compiler can reason about resource lifetimes in full. No invisible cleanup code executes behind your back.

Capabilities replace the ambient authority model. This is the big one. In most systems languages, a function can do anything the process is allowed to do. A function with no special annotations can open files, make network connections, allocate unbounded memory, and call into C. You find out what a function does by reading its implementation, or by trusting its documentation.

In Concrete, every function declares what it is allowed to do:

fn sha256(data: &Bytes) with(Alloc) -> String { ... }
fn serve(port: u16) with(Network, Alloc, Console) -> Int { ... }
fn allow_command(cmd: Command, policy: Policy) -> Decision { ... }

sha256 can allocate but cannot touch the filesystem or network. The compiler proves this from the type system. serve can use the network, allocate, and write to the console, but cannot read files. allow_command has no capability annotation at all. It is pure. The compiler enforces this: a function can only call functions whose capabilities are a subset of its own.

There are nine capabilities: File, Network, Clock, Env, Random, Process, Console, Alloc, and Unsafe. If a function does not declare one, it cannot transitively reach any function that uses it. The type system enforces this, not convention.
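The subset rule is simple enough to sketch. Here is a hypothetical Python model of the check; the function names, capability sets, and call edges are illustrative, not Concrete's real checker:

```python
# Hedged sketch of "a function can only call functions whose
# capabilities are a subset of its own". All data is illustrative.
CAPS = {
    "sha256":        {"Alloc"},
    "serve":         {"Network", "Alloc", "Console"},
    "allow_command": set(),            # no annotation: pure
    "read_config":   {"File", "Alloc"},
}
CALLS = {
    "serve":  ["sha256"],              # ok: {Alloc} is a subset of serve's caps
    "sha256": ["read_config"],         # error: callee needs File, caller lacks it
}

def check(fn):
    errors = []
    for callee in CALLS.get(fn, []):
        missing = CAPS[callee] - CAPS[fn]
        if missing:
            errors.append(f"{fn} calls {callee} but lacks {sorted(missing)}")
    return errors

print(check("serve"))    # []
print(check("sha256"))   # ["sha256 calls read_config but lacks ['File']"]
```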

Trust is split three ways. Rust has one keyword for everything the compiler cannot check: unsafe. In Concrete, that splits into three distinct, narrower mechanisms, each visible in the function signature.

When you see trusted fn load_policy(path: &String) with(File, Alloc), you know: it does pointer work internally, it reads files and allocates, and it cannot touch the network. In Rust you would see unsafe and have to read the body to learn any of that.

The compiler is written in Lean 4. I chose Lean because it is a good language, and it created a structural opportunity: every Concrete program becomes a Lean data structure (Core IR) during compilation. That same data structure is what Lean’s theorem prover can reason about. The path from “systems code that runs on hardware” to “theorem about that code’s behavior” goes through one codebase, not two separate tools. We have 17 proven theorems over a pure Core fragment today. Narrow, but the architecture is designed to grow.

No hidden behavior. No implicit destructors at scope exit. No exception unwinding. No operator overloading. No closures. No trait objects. No Deref coercions. No macro system. The code you read is the code that executes. This makes the language less convenient than Rust in many ways, and I think that is the right trade for the domain Concrete targets: firmware, security boundaries, cryptographic policy, and safety-critical components.

Rust is excellent at enforcing safety properties. Concrete is better at turning the compiler into an instrument panel instead of a check-engine light.

That is the practical claim behind the language. If you are building firmware, security-sensitive infrastructure, cryptographic software, or anything else where hidden behavior is expensive, Concrete gives you a compiler that can answer higher-level questions directly: which functions can allocate, which modules gained authority, which code is pure enough to prove, which trust boundaries moved. In most languages, you piece those answers together from code review, profiling, documentation, and static-analysis tools that only partially agree. In Concrete, they come from the language semantics themselves.

#The compiler as an oracle

All of these design choices produce the same structural consequence: the compiler knows a great deal about what every function in your program does, and it knows it from the type system, not from heuristics.

Most compilers throw this knowledge away after producing a binary. Concrete keeps it and exposes it. The compiler has eight report modes; the three that carry this story are the proof report, the authority report, and the alloc report.

These are structured facts derived from the same semantic analysis that type-checks the code. Run the same report on the same code and you get the same result. Deterministic.

The proof report on our JSON parser looks like this (abridged):

=== Proof Eligibility Report ===

module Types:
  ✓ mk_null
  ✓ mk_bool
  ✓ mk_number
  ✓ mk_string
  ✓ mk_array
  ✓ mk_object
  ✓ mk_error
  ✓ is_error

module StringPool:
  ✗ store  (requires capabilities: Alloc)

module Lex:
  ✓ is_ws
  ✓ is_digit
  ✓ skip_ws
  ✓ match_keyword

module Parser:
  ✓ err
  ✗ parse_string  (requires capabilities: Alloc)
  ✓ parse_number
  ✗ parse_value  (requires capabilities: Alloc)
  ✗ parse_array  (requires capabilities: Alloc)
  ✗ parse_object  (requires capabilities: Alloc)
  ✗ parse_kv  (requires capabilities: Alloc)

Totals: 27 functions, 14 eligible for ProofCore, 13 excluded

Every function gets a verdict and a reason. The authority report shows the transitive call chains:

capability Alloc (13 functions):
  pub store  <- store -> vec_push
      parse_string  <- parse_string -> store
  pub parse_value  <- parse_value -> parse_string
      parse_array  <- parse_array -> vec_push
      parse_object  <- parse_object -> vec_push
      parse_kv  <- parse_kv -> parse_string

parse_value needs Alloc because it calls parse_string, which calls store, which calls vec_push. The compiler tells you the full path. A human reviewer reads this and knows which dependency to cut. An agent reads this and has a target.
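The chain itself is recomputable from the call graph alone. A Python sketch, using the function names from the report above (the data structures and traversal are illustrative, not the compiler's):

```python
# Illustrative call graph mirroring the authority report.
CALLS = {
    "parse_value":  ["parse_string"],
    "parse_string": ["store"],
    "store":        ["vec_push"],
    "vec_push":     [],
}
DIRECT = {"vec_push": {"Alloc"}}       # only vec_push allocates directly

def chain_to(fn, cap):
    """Return one call path from fn to a function that directly uses cap."""
    if cap in DIRECT.get(fn, set()):
        return [fn]
    for callee in CALLS.get(fn, []):
        rest = chain_to(callee, cap)
        if rest:
            return [fn] + rest
    return None

print(" -> ".join(chain_to("parse_value", "Alloc")))
# parse_value -> parse_string -> store -> vec_push
```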

#The experiment

I pointed an AI agent at Concrete’s JSON parser. 719 lines, recursive-descent, handles the full JSON spec minus floats and unicode escapes. Real code that has been pressure-tested. The agent’s only tools were the compiler’s reports and the ability to edit code and run the program.

Starting point: 14/27 functions proof-eligible (51.9%).

The question: can the agent, guided only by structured compiler reports, improve this program step by step?

#Round 1: extracting pure logic

The agent read the proof report and noticed that several excluded functions had pure decision logic buried inside them.

parse_string handles escape sequences. The mapping from \n to character code 10, \" to 34, \\ to 92, and so on is pure decision logic. It does not allocate. It does not do I/O. But it was inlined inside an allocating function, so the compiler could not see it as a separate proof target.

The agent extracted it into its own function:

pub fn map_escape(esc: i32) -> i32 {
    if esc == 34 { return 34; }   // \"
    if esc == 92 { return 92; }   // \\
    if esc == 47 { return 47; }   // \/
    if esc == 110 { return 10; }  // \n
    if esc == 116 { return 9; }   // \t
    if esc == 114 { return 13; }  // \r
    if esc == 98 { return 8; }    // \b
    if esc == 102 { return 12; }  // \f
    return 0 - 1;
}

Same pattern for character classification in parse_value (“is this character the start of a string, number, array, or object?”), value comparison in the test harness, and trailing-content validation in parse_json.

The loop after each extraction was simple:

  1. Run --report proof. Did the count go up?
  2. Run the program. Does it still pass?
  3. Yes to both? Keep the change.
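That loop is small enough to write down. A Python sketch, where measure and check are hypothetical stand-ins for running --report proof and re-running the program; in the toy demo, a "program" is just a set of pure function names:

```python
# Hedged sketch of the keep/revert loop. `measure` and `check` are
# injected stand-ins for the compiler report and the test run.
def autoloop(program, patches, measure, check):
    score = measure(program)
    for patch in patches:
        candidate = patch(program)
        new_score = measure(candidate)
        if new_score > score and check(candidate):
            program, score = candidate, new_score   # keep the change
        # otherwise the candidate is simply dropped: an implicit revert
    return program, score

base = frozenset({"mk_null", "is_ws"})          # two pure functions to start
patches = [
    lambda p: p | {"map_escape"},               # extraction adds a pure fn
    lambda p: p - {"is_ws"},                    # a bad patch loses one
]
final, score = autoloop(base, patches, measure=len, check=lambda p: True)
print(score)  # 3: the good patch is kept, the bad one reverted
```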

After this round: 19/32 proof-eligible (59.4%). Five new pure functions, all extracted from effectful code without changing behavior.

The agent did not need to understand what the JSON parser does. It did not need to know what JSON is. It read a structured report, identified functions excluded for a specific reason, found pure logic mixed with effectful logic inside those functions, and separated them. The report was the guide. The compiler confirmed the result.

#Round 2: eliminating unnecessary allocation

Then the agent looked at --report alloc.

parse_value is the recursive core of the parser. Each JSON value passes through it. And each call was heap-allocating three strings to check keywords:

let kw_true: String = "true";
defer drop_string(kw_true);
if match_keyword(s, p, &kw_true) {
    return ParseResult { val: mk_bool(1), pos: p + 4 };
}

let kw_false: String = "false";
defer drop_string(kw_false);
// ...same for "null"

Three malloc/free pairs per call to parse_value, to compare a handful of characters. On nested JSON, parse_value is recursive. An object like {"a": {"b": {"c": true}}} calls it four times, producing twelve unnecessary heap allocations for one small document. On a large JSON file with thousands of values, that is thousands of wasted malloc/free cycles.

The fix was obvious once the report pointed it out: match keywords by comparing characters directly, no heap allocation needed:

pub fn match_true(s: &String, pos: i32) -> bool {
    let slen: i32 = string_length(s) as i32;
    if pos + 4 > slen { return false; }
    return string_char_at(s, pos as Int) as i32 == 116       // t
        && string_char_at(s, (pos + 1) as Int) as i32 == 114 // r
        && string_char_at(s, (pos + 2) as Int) as i32 == 117 // u
        && string_char_at(s, (pos + 3) as Int) as i32 == 101;// e
}

After this change, parse_value dropped out of the alloc report.

Final result: 22/35 proof-eligible (62.9%), and parse_value no longer allocates to check keywords.

That was the point where this stopped feeling like a parlor trick.

#Doing less is doing more

Both improvements came from the same refactoring. Extracting pure logic and eliminating unnecessary allocation were the same move.

The allocation-free keyword matchers are also pure functions. They show up in both reports: gone from the alloc report, added to the proof report. Better architecture and better performance, confirmed by two different compiler reports, from one structural change.

In most codebases, performance optimization and code quality feel like separate concerns. You optimize for speed in one pass, refactor for clarity in another, and sometimes they conflict. The faster version is harder to read, the cleaner version is slower. The compiler made visible something that was always true but hard to see: the simplest version of a function, the one that does only what it needs to do with no incidental allocation and no mixed concerns, is the fastest, the most auditable, and the most provable at the same time.

This follows from how Concrete’s reports work. A function that does less needs fewer capabilities. Fewer capabilities means more likely pure. Pure means proof-eligible. No incidental allocation means faster. These are the same property, simplicity, measured from different angles. The reports make it visible.

An agent optimizing for any one of these axes tends to improve the others. That is a useful property for an automated loop. The search space is not adversarial. You are not trading off proof eligibility against performance or auditability against speed. You push in one direction and get improvements across the board.

#Why this does not work in other languages

In Rust, finding unnecessary allocations means profiling. You run benchmarks, fire up a heap profiler, stare at flamegraphs, try to figure out which allocations are incidental versus essential, refactor, re-profile, and hope the numbers improved. The compiler tells you nothing about allocation patterns. It says pass or fail.

An AI agent doing this in Rust would have to generate benchmark code, run benchmarks (noisy, depends on system load), parse profiler output (tool-specific, often visual), guess which allocations are unnecessary, refactor, re-benchmark, and hope the noise did not mask the signal.

In Concrete, the agent read --report alloc, saw three cleanup sites in parse_value, read --report authority to understand the call chain, replaced the allocations with direct character comparisons, and confirmed with --report alloc that parse_value dropped out of the report. No profiler. No benchmark noise. Deterministic signal.

Rust’s compiler does not model these properties the same way. Rust does not track allocation as a first-class semantic property. It has no capability system. It does not distinguish between “this function allocates because it needs to” and “this function allocates because of a convenience pattern.” The information is not there for the compiler to report.

This is a language design question. You cannot bolt this onto an existing language after the fact. The tracking has to be in the type system and the function signatures. If with(Alloc) is not in the language, the compiler cannot compute the transitive authority chain for allocation. If capabilities are not in function signatures, the compiler cannot determine which functions are pure. The reports exist because the language was designed to make them possible.

#The feedback signal problem

Andrej Karpathy has been talking about “autoresearch”: automated loops where an agent tries changes and uses some signal to decide whether to keep them. The bottleneck is always the fitness function.

Benchmark numbers are noisy. A 2% improvement might be measurement error. A 5% regression might be system load. Test suites are binary: pass or fail, no gradient. Static analysis tools produce warnings that may or may not matter. None of these give an agent a clean signal to optimize against. Most of the time you are navigating by weather.

Compiler reports are deterministic. “Is parse_value in the alloc report?” has one answer. “How many functions are proof-eligible?” is a number that does not change between runs. “The transitive authority chain for this capability” is a structured fact the agent can parse and reason about. These are exact properties of the program’s semantic structure.

This turns program optimization into a search problem with a reliable fitness function. That is what makes it tractable for automated agents.

#The research directions, and why they all become fitness functions

The experiment used three reports. But we have been developing research directions for Concrete, each of which adds a new axis:

Authority budgets. Capabilities today are per-function. Authority budgets extend them to modules and packages. Imagine you depend on a JSON parsing library. You declare:

#[authority(Alloc)]
import json_parser;

That is a contract: json_parser may only use Alloc. If the maintainer’s next release adds a logging call, even buried three layers deep in a helper function, your build breaks. The dependency violated its authority budget. You did not need to read the changelog or audit the source. The compiler checked the transitive capability set, which it already computes, and found it exceeded the budget.

This is supply-chain security enforced by the compiler. In Rust or Go, a dependency can add network access, file system reads, or environment variable lookups without you knowing unless you audit the code or catch it in review. In Concrete, the authority budget makes this a compile error. For an agent: restructure code until a module’s capability set fits its declared budget.
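The check itself rides on a fixpoint the compiler already computes. A hedged Python sketch, with an invented call graph in which a logging helper quietly reaches the network:

```python
# Illustrative authority-budget check: the declared budget must contain
# the module's transitive capability set. All data here is invented.
def transitive_caps(roots, calls, direct_caps):
    caps, seen, stack = set(), set(), list(roots)
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue
        seen.add(fn)
        caps |= direct_caps.get(fn, set())
        stack.extend(calls.get(fn, []))
    return caps

calls = {"parse": ["store"], "store": ["vec_push", "log"], "log": ["send"]}
direct = {"vec_push": {"Alloc"}, "send": {"Network"}}  # buried logging call

used = transitive_caps(["parse"], calls, direct)
budget = {"Alloc"}                    # the #[authority(Alloc)] contract
print(sorted(used - budget))          # ['Network']: the budget is violated
```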

Allocation budgets. with(Alloc) is currently binary. The proposal classifies functions as NoAlloc, Bounded (allocates but provably bounded), or Unbounded. The compiler walks the call graph to classify. In safety-critical code for medical devices, avionics, or industrial control, you often need to prove that a function cannot allocate unboundedly. Today that is a manual audit. With allocation budgets, the compiler classifies it. For an agent: push functions from Unbounded to Bounded to NoAlloc, with the compiler confirming each step.
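A sketch of that classification as a join over the call graph. The three-level lattice is from the proposal; the example data and the cycle handling are illustrative simplifications:

```python
# Each function's verdict is the worst of its own allocation behavior
# and its callees' verdicts. NoAlloc < Bounded < Unbounded.
ORDER = {"NoAlloc": 0, "Bounded": 1, "Unbounded": 2}

def classify(fn, direct, calls, memo=None):
    if memo is None:
        memo = {}
    if fn in memo:
        return memo[fn]
    memo[fn] = direct.get(fn, "NoAlloc")    # provisional entry breaks cycles
    worst = memo[fn]
    for callee in calls.get(fn, []):
        verdict = classify(callee, direct, calls, memo)
        if ORDER[verdict] > ORDER[worst]:
            worst = verdict
    memo[fn] = worst
    return worst

calls = {"parse": ["grow"], "grow": ["vec_push"]}
direct = {"vec_push": "Bounded", "grow": "Unbounded"}  # grow loops on input
print(classify("parse", direct, calls))  # Unbounded
```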

Execution cost tracking. Structural classification of functions: does it loop? Is the loop bounded? Is it recursive? What is its maximum static call depth? Concrete is tractable for this because there is no dynamic dispatch, no closures, no hidden allocation, and the SSA CFG is clean. For bounded functions, you can compute abstract instruction counts. Combined with allocation budgets: “this function runs in bounded time with bounded allocation” is a statement you can make to a safety certifier. For an agent: identify unbounded loops and try to make them bounded, with the compiler verifying the classification changed.

Semantic diff and trust drift. Today, code review means reading source diffs. You see that someone changed 200 lines across four files. You try to figure out if anything important changed about the program’s behavior. You might miss that a helper function now transitively reaches the network, or that a previously pure function started allocating.

Semantic diff replaces this with structured comparison of the compiler’s reports between two versions. The output would look something like:

authority changes:
  + process_request now requires Network (via log_to_server)
  - validate_input no longer requires File

proof surface:
  - parse_header dropped from ProofCore (now requires Alloc)

trust boundaries:
  + new trusted function: fast_copy

The output is not “the lines that changed.” It is “the trust properties that changed.” A reviewer reads this and knows what to scrutinize. A CI gate reads this and blocks a PR that adds Network authority to a module that had none before, because a semantic policy was violated. For an agent: flag trust regressions and try to fix them before they reach code review.
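The authority half of such a diff reduces to a per-function set comparison. A Python sketch over invented report data shaped like the example above:

```python
# Illustrative semantic diff over two authority reports, each a mapping
# from function name to its required capability set.
def authority_diff(old, new):
    lines = []
    for fn in sorted(set(old) | set(new)):
        gained = new.get(fn, set()) - old.get(fn, set())
        lost = old.get(fn, set()) - new.get(fn, set())
        for cap in sorted(gained):
            lines.append(f"+ {fn} now requires {cap}")
        for cap in sorted(lost):
            lines.append(f"- {fn} no longer requires {cap}")
    return lines

old = {"process_request": {"Alloc"}, "validate_input": {"File"}}
new = {"process_request": {"Alloc", "Network"}, "validate_input": set()}
for line in authority_diff(old, new):
    print(line)
# + process_request now requires Network
# - validate_input no longer requires File
```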

Proof addon architecture. The compiler produces stable artifacts (ValidatedCore, ProofCore). Proof tooling (SMT solvers, symbolic execution, Lean export) is a separate consumer, not fused into compilation. Proof failure is not compile failure. You can ship code that compiles and grow proof coverage over time. ProofCore is the pure subset: functions with no capabilities, not trusted, no extern calls. The autoresearch experiment, extracting pure logic to grow ProofCore, is the workflow this architecture enables.

Each of these was designed for human consumption. Each also produces a fitness function an agent can optimize against. If the compiler can state a fact clearly enough for a human to act on, it is clear enough for an agent to act on too.

#Auditability and optimizability are the same thing

I designed the report system for human auditors who need to understand what a program does without reading every line. I wanted a security reviewer to ask “what authority does this module have?” and get an answer from the compiler, not from documentation that might be stale. I built the reports for auditing. I did not expect them to become an interface for optimization.

Auditability and machine optimization are the same property viewed from two angles. The properties that make a program auditable (explicit capabilities, visible trust boundaries, trackable allocation) are the same properties that give an agent a reliable fitness function. Making the language honest about what code does, for the benefit of human reviewers, made it optimizable by machines as a side effect.

An auditable program is one where the compiler can state facts about behavior without running the code. An optimizable program is one where the compiler can confirm that a change improved some property without running benchmarks. These are close to the same requirement. If the compiler can say “this function needs Network because of this call chain,” a human can audit it and an agent can try to eliminate it.

I was thinking about a human reviewing a firmware update tool and wanting to know “can this function access the network?” The fact that the same answer is useful to an automated optimizer says something about what compilers should do.

#The bet

The JSON parser was one program. Concrete’s MAL interpreter has 60 functions with 24 excluded from ProofCore. The same loop should work there, and on every program the compiler can report on. The technique generalizes because the reports generalize.

The vision is a compiler that participates in the development loop as more than a gatekeeper. Most compilers say pass or fail. With structured reports, the compiler says what is true about your program. With authority budgets, it says what you promised would be true, and where reality diverges. With semantic diff, it says what changed about your program’s trust properties since the last version. Each layer gives both humans and agents more to work with, and each layer compounds with the others.

The compiler becomes an oracle: a source of structured knowledge about program behavior. The agent becomes an optimizer that tries thousands of refactorings and keeps the ones that move those answers in the right direction. The programmer sets the goals, picks which properties matter and what budgets to enforce, and reviews the results. The payoff is not abstract elegance. It is tighter audits, clearer security boundaries, less performance guesswork, and a development loop where both humans and machines are steering from the same map.

That is the bet behind Concrete. A language the compiler can explain, and that machines can improve, because it was designed to be honest about what code does.