Building a Tiny Theorem Prover in Python

The Anatomy Lesson of Dr. Nicolaes Tulp, Rembrandt, 1632 (a group of surgeons gathered around a cadaver as Dr. Tulp demonstrates the musculature of the arm)

The first article in this series explained the Curry-Howard correspondence: propositions are types, proofs are programs. That tells you why theorem proving fits so naturally with programming languages. It does not yet tell you what the machine looks like.

A theorem prover, concretely, is smaller than most people expect. A tiny theorem prover is just:

  1. a language for terms
  2. a language for types / propositions
  3. a checker that decides whether a term has a type
  4. a tiny trusted kernel that defines the legal moves

This article builds that architecture in plain Python. We are not abusing Python’s own type system. Python is just the implementation language. The prover we build has its own terms, its own propositions, and its own checker.

That distinction matters. The goal here is not to show off host-language cleverness. I want the moving parts to be impossible to miss.

#The Smallest Useful Core

If you strip theorem proving down to the bone, you do not need arithmetic, tactics, automation, or even a parser. You need a handful of term forms and a handful of typing rules.

For a first prover, the right target is:

  1. atomic propositions, represented as named types like A and B
  2. implication, represented as the arrow type A -> B, as the only connective
  3. terms built from variables, lambdas, and function application

This is enough to represent the simply typed lambda calculus, which under Curry-Howard corresponds to intuitionistic propositional logic.

That means we can already represent and verify proofs of statements like:

  1. A -> A
  2. A -> B -> A
  3. (A -> B) -> (B -> C) -> A -> C

These are not toy programming examples. They are logical theorems.

#Terms, Propositions, and Contexts

We need three datatypes:

  1. types, which play the role of propositions
  2. terms, which play the role of proofs
  3. contexts, which record the assumptions currently in scope

In Python:

from dataclasses import dataclass


class Type:
    pass


@dataclass(frozen=True)
class Atom(Type):
    name: str


@dataclass(frozen=True)
class Arrow(Type):
    left: Type
    right: Type


class Term:
    pass


@dataclass(frozen=True)
class Var(Term):
    name: str


@dataclass(frozen=True)
class Lam(Term):
    param: str
    param_type: Type
    body: Term


@dataclass(frozen=True)
class App(Term):
    fn: Term
    arg: Term

The context is just a mapping from variable names to types:

Context = dict[str, Type]

That mapping is the entire assumption mechanism: a variable in scope is an assumption you are licensed to use.

Why these constructors?

  1. Var uses an assumption already in scope.
  2. Lam introduces an assumption, proves something under it, and discharges it as an implication.
  3. App applies a proof of A -> B to a proof of A, yielding a proof of B.

That is exactly how implication works in natural deduction.

#The Checker Is the Kernel

Now we write the only thing that really matters: the checker.

def infer(ctx: Context, term: Term) -> Type:
    # Variable: use an assumption from the context.
    if isinstance(term, Var):
        if term.name not in ctx:
            raise TypeError(f"unbound variable: {term.name}")
        return ctx[term.name]

    # Lambda: implication introduction.
    if isinstance(term, Lam):
        new_ctx = dict(ctx)
        new_ctx[term.param] = term.param_type
        body_type = infer(new_ctx, term.body)
        return Arrow(term.param_type, body_type)

    # Application: implication elimination (modus ponens).
    if isinstance(term, App):
        fn_type = infer(ctx, term.fn)
        arg_type = infer(ctx, term.arg)

        if not isinstance(fn_type, Arrow):
            raise TypeError(f"attempted to apply non-function: {fn_type}")

        if fn_type.left != arg_type:
            raise TypeError(
                f"function expected {fn_type.left}, got {arg_type}"
            )

        return fn_type.right

    raise TypeError(f"unknown term: {term}")

This is the trusted kernel.

If infer is correct, then any term it accepts is a valid proof of the type it returns. If infer is wrong, the system is unsound. Theorem provers target tiny kernels because every line of this function is code you have to trust.
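One way to build trust in those lines is to watch them refuse a non-proof. The sketch below restates the datatypes and checker in compact form so it runs on its own, then hands infer a term that tries to apply a proof of A as if it were a function:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom: name: str

@dataclass(frozen=True)
class Arrow: left: object; right: object

@dataclass(frozen=True)
class Var: name: str

@dataclass(frozen=True)
class Lam: param: str; param_type: object; body: object

@dataclass(frozen=True)
class App: fn: object; arg: object

def infer(ctx, term):
    if isinstance(term, Var):
        if term.name not in ctx:
            raise TypeError(f"unbound variable: {term.name}")
        return ctx[term.name]
    if isinstance(term, Lam):
        return Arrow(term.param_type,
                     infer({**ctx, term.param: term.param_type}, term.body))
    if isinstance(term, App):
        fn_type, arg_type = infer(ctx, term.fn), infer(ctx, term.arg)
        if not isinstance(fn_type, Arrow) or fn_type.left != arg_type:
            raise TypeError(f"bad application: {fn_type} to {arg_type}")
        return fn_type.right
    raise TypeError(f"unknown term: {term}")

A = Atom("A")

# x proves A, but A is not an implication, so x cannot be applied
bogus = Lam("x", A, App(Var("x"), Var("x")))
try:
    infer({}, bogus)
except TypeError as err:
    print("rejected:", err)
```

A term the kernel rejects is simply not a proof of anything; there is no appeal.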

#What the Rules Mean Logically

Each branch of infer is one proof rule.
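Written as natural-deduction inference rules, with Γ standing for the context, the three branches are:

```latex
% assumption (Var)
\frac{x : A \in \Gamma}{\Gamma \vdash x : A}
\qquad
% implication introduction (Lam)
\frac{\Gamma,\, x : A \vdash e : B}{\Gamma \vdash (\lambda x{:}A.\ e) : A \to B}
\qquad
% implication elimination / modus ponens (App)
\frac{\Gamma \vdash f : A \to B \qquad \Gamma \vdash a : A}{\Gamma \vdash f\ a : B}
```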

#Variables

infer(ctx, Var("x")) == ctx["x"]

Logically, this means: if A is among the assumptions, then A holds.

That is the rule of assumption.

#Lambdas

infer(ctx, Lam("x", A, body)) == Arrow(A, body_type)

Logically: if assuming A lets you derive B, then A -> B holds.

That is implication introduction.

#Applications

infer(ctx, App(f, a))

Logically: if A -> B holds and A holds, then B holds.

That is implication elimination, better known as modus ponens.

The checker enforces logical inference rules directly.
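To see modus ponens fire concretely, seed the context with two assumptions and let the checker discharge the implication. A cut-down sketch with just the two rules involved, restated so it runs on its own:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom: name: str

@dataclass(frozen=True)
class Arrow: left: object; right: object

@dataclass(frozen=True)
class Var: name: str

@dataclass(frozen=True)
class App: fn: object; arg: object

def infer(ctx, term):
    # assumption rule: look the hypothesis up in the context
    if isinstance(term, Var):
        return ctx[term.name]
    # modus ponens: from A -> B and A, conclude B
    fn_type, arg_type = infer(ctx, term.fn), infer(ctx, term.arg)
    if not isinstance(fn_type, Arrow) or fn_type.left != arg_type:
        raise TypeError("bad application")
    return fn_type.right

A, B = Atom("A"), Atom("B")

# assume f : A -> B and a : A; then f a proves B
ctx = {"f": Arrow(A, B), "a": A}
print(infer(ctx, App(Var("f"), Var("a"))))  # Atom(name='B')
```

The nonempty context is doing the work here: it is a set of hypotheses, and the checker verifies a derivation from them.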

#First Proof: A -> A

The identity function is the simplest proof in the system:

A = Atom("A")

identity = Lam("x", A, Var("x"))
print(infer({}, identity))

The inferred type is:

Arrow(Atom("A"), Atom("A"))

Under Curry-Howard, that means the term proves A -> A.

fun x => x is both:

  1. the identity function on values of type A
  2. a proof of the proposition A -> A
The same term inhabits both interpretations.

#A Slightly Less Trivial Proof: A -> B -> A

Now prove that if A holds, then B implies A.

A = Atom("A")
B = Atom("B")

proof = Lam("x", A, Lam("y", B, Var("x")))
print(infer({}, proof))

The inferred type is:

Arrow(A, Arrow(B, A))

Logically, that is:

A -> B -> A

The proof is just “ignore the second assumption and return the first one.”

Ordinary functional programming and formal proof are the same structure seen two ways.

#Composition: (A -> B) -> (B -> C) -> A -> C

Now something that feels like actual reasoning:

A = Atom("A")
B = Atom("B")
C = Atom("C")

compose = Lam(
    "f", Arrow(A, B),
    Lam(
        "g", Arrow(B, C),
        Lam(
            "x", A,
            App(Var("g"), App(Var("f"), Var("x")))
        )
    )
)

print(infer({}, compose))

The checker returns:

Arrow(Arrow(A, B), Arrow(Arrow(B, C), Arrow(A, C)))

This is proof as composition of evidence: a proof that A yields B, chained with a proof that B yields C, gives a proof that A yields C.

In logic, that is a theorem. In programming, it is function composition.

The same term carries both readings.

#What This Tiny Prover Cannot Yet Do

At this point we have a real prover, but it is extremely small.

It cannot yet express:

  1. conjunction (A and B) or disjunction (A or B)
  2. falsity, and with it negation
  3. quantifiers, so nothing of the form "for all x, P(x)"

That is fine. The first implementation exists to make the kernel architecture legible.

Even in this tiny form, you can already see the whole pattern:

  1. propositions are represented as types
  2. proofs are represented as terms
  3. a small trusted checker decides which terms prove which propositions

Everything more advanced is an extension of that base.

#The Next Useful Extension: Pairs

If you wanted to grow this prover one step, the best next move would be conjunction.

Add:

  1. a product type Prod(A, B), representing the conjunction A ∧ B
  2. a pair constructor Pair(a, b), which introduces it
  3. projections Fst(p) and Snd(p), which eliminate it

Then the checker gets three more rules:

  1. a pair of proofs of A and B proves Prod(A, B)
  2. Fst turns a proof of Prod(A, B) into a proof of A
  3. Snd turns a proof of Prod(A, B) into a proof of B

That would let you prove statements like:

  1. A -> B -> A ∧ B
  2. A ∧ B -> A

The architecture would not change. You would just add more term forms and more rules.

That is the recurring theme in theorem provers: small, explicit rules compose into powerful systems.
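Here is one way that extension could look, sketched end to end. The constructor names (Prod, Pair, Fst, Snd) are my own choices, not fixed conventions, and the base kernel is restated compactly so the snippet runs on its own:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom: name: str

@dataclass(frozen=True)
class Arrow: left: object; right: object

@dataclass(frozen=True)
class Prod: left: object; right: object   # the conjunction A ∧ B

@dataclass(frozen=True)
class Var: name: str

@dataclass(frozen=True)
class Lam: param: str; param_type: object; body: object

@dataclass(frozen=True)
class App: fn: object; arg: object

@dataclass(frozen=True)
class Pair: first: object; second: object  # conjunction introduction

@dataclass(frozen=True)
class Fst: pair: object                    # left elimination

@dataclass(frozen=True)
class Snd: pair: object                    # right elimination

def infer(ctx, term):
    if isinstance(term, Var):
        return ctx[term.name]
    if isinstance(term, Lam):
        return Arrow(term.param_type,
                     infer({**ctx, term.param: term.param_type}, term.body))
    if isinstance(term, App):
        fn_type, arg_type = infer(ctx, term.fn), infer(ctx, term.arg)
        if not isinstance(fn_type, Arrow) or fn_type.left != arg_type:
            raise TypeError("bad application")
        return fn_type.right
    # the three new rules, one per new term form
    if isinstance(term, Pair):
        return Prod(infer(ctx, term.first), infer(ctx, term.second))
    if isinstance(term, (Fst, Snd)):
        pair_type = infer(ctx, term.pair)
        if not isinstance(pair_type, Prod):
            raise TypeError(f"projection from non-pair: {pair_type}")
        return pair_type.left if isinstance(term, Fst) else pair_type.right
    raise TypeError(f"unknown term: {term}")

A, B = Atom("A"), Atom("B")

# A -> B -> A ∧ B: pair up the two assumptions
print(infer({}, Lam("x", A, Lam("y", B, Pair(Var("x"), Var("y"))))))

# A ∧ B -> A: project the first component
print(infer({}, Lam("p", Prod(A, B), Fst(Var("p")))))
```

Note how little changed: the kernel grew by three branches, and each branch is one logical rule.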

#What Lean Adds Beyond This

The toy checker here and Lean are the same kind of machine, but Lean adds several layers whose absence you feel the moment you try to prove anything nontrivial:

  1. dependent types, so propositions can mention values
  2. inductive types, which give you numbers, lists, equality, and induction
  3. an elaborator that fills in the details you leave implicit
  4. tactics, which build proof terms interactively instead of by hand
  5. a large library of already-proved mathematics

But none of that changes the central picture. There is still a kernel. There are still terms. There are still types. There is still a checker.

That is why building even a tiny prover is so clarifying. It separates the parts that make the logic work from the parts that make the system pleasant to use.

#Why Python Is the Right Language for This Article

Python’s type system is too weak for the Julia-style trick of making the host language do the proving. That is exactly why it is useful here.

In this article, the prover is explicit:

  1. the terms are ordinary Python values
  2. the propositions are ordinary Python values
  3. the checker is an ordinary Python function you can read in one sitting

Nothing is hidden in the host language. That makes Python a good teaching language for the architecture, even though it would be a bad language for host-type-system metaprogramming.

#The Real Lesson

The distance from ordinary interpreter code to a theorem prover kernel is much smaller than most programmers think. You start with:

  1. a datatype for terms
  2. a datatype for types
  3. a recursive function that checks one against the other

That is enough to get a real notion of proof.

Lean is the same architecture, pushed further and engineered better.

The next article shows the complementary perspective: instead of implementing the prover explicitly, it embeds a tiny one inside Julia’s own type system. Same ideas, different lesson.