Building a Tiny Theorem Prover in Python

The Anatomy Lesson of Dr. Nicolaes Tulp, Rembrandt, 1632 (a group of surgeons gathered around a cadaver as Dr. Tulp demonstrates the musculature of the arm)

The first article in this series explained the Curry-Howard correspondence: propositions are types, proofs are programs. That tells you why theorem proving fits so naturally with programming languages. It does not yet tell you what the machine looks like.

A theorem prover, concretely, is smaller than most people expect. A tiny theorem prover is just:

  1. a language for terms
  2. a language for types / propositions
  3. a checker that decides whether a term has a type
  4. a tiny trusted kernel that defines the legal moves

This article builds that architecture in plain Python. We are not abusing Python’s own type system. Python is just the implementation language. The prover we build has its own terms, its own propositions, and its own checker.

That distinction matters. The goal here is not to show off host-language cleverness. I want the moving parts to be impossible to miss.

#The Smallest Useful Core

If you strip theorem proving down to the bone, you do not need arithmetic, tactics, automation, or even a parser. You need a handful of term forms and a handful of typing rules.

For a first prover, the right target is:

  1. atomic propositions, represented as named types like A and B
  2. implication, represented as the arrow type A -> B, as the only connective
  3. terms built from variables, lambdas, and function application

This is enough to represent the simply typed lambda calculus, which under Curry-Howard corresponds to intuitionistic propositional logic.

That means we can already represent and verify proofs of statements like:

  1. A -> A
  2. A -> B -> A
  3. (A -> B) -> (B -> C) -> A -> C

These are not toy programming examples. They are logical theorems.

#Terms, Propositions, and Contexts

We need three datatypes:

  1. types, which play the role of propositions
  2. terms, which play the role of proofs
  3. contexts, which record the assumptions currently in scope

In Python:

from dataclasses import dataclass


class Type:
    pass


@dataclass(frozen=True)
class Atom(Type):
    name: str


@dataclass(frozen=True)
class Arrow(Type):
    left: Type
    right: Type


class Term:
    pass


@dataclass(frozen=True)
class Var(Term):
    name: str


@dataclass(frozen=True)
class Lam(Term):
    param: str
    param_type: Type
    body: Term


@dataclass(frozen=True)
class App(Term):
    fn: Term
    arg: Term

The context is just a mapping from variable names to types:

Context = dict[str, Type]

That mapping is the entire assumption mechanism: a variable in scope is an assumption you are licensed to use.

Why these constructors?

  1. Var uses an assumption already in scope.
  2. Lam introduces an assumption, proves something under it, and discharges it as an implication.
  3. App applies a proof of A -> B to a proof of A, yielding a proof of B.

That is exactly how implication works in natural deduction.

#The Checker Is the Kernel

Now we write the only thing that really matters: the checker.

def infer(ctx: Context, term: Term) -> Type:
    # Variable: use an assumption from the context.
    if isinstance(term, Var):
        if term.name not in ctx:
            raise TypeError(f"unbound variable: {term.name}")
        return ctx[term.name]

    # Lambda: implication introduction.
    if isinstance(term, Lam):
        new_ctx = dict(ctx)
        new_ctx[term.param] = term.param_type
        body_type = infer(new_ctx, term.body)
        return Arrow(term.param_type, body_type)

    # Application: implication elimination (modus ponens).
    if isinstance(term, App):
        fn_type = infer(ctx, term.fn)
        arg_type = infer(ctx, term.arg)

        if not isinstance(fn_type, Arrow):
            raise TypeError(f"attempted to apply non-function: {fn_type}")

        if fn_type.left != arg_type:
            raise TypeError(
                f"function expected {fn_type.left}, got {arg_type}"
            )

        return fn_type.right

    raise TypeError(f"unknown term: {term}")

This is the trusted kernel.

If infer is correct, then any term it accepts is a valid proof of the type it returns. If infer is wrong, the system is unsound. Theorem provers target tiny kernels because every line of this function is code you have to trust.
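One way to build trust in those lines is to watch them refuse a non-proof. The sketch below restates the datatypes and checker in compact form so it runs on its own, then hands infer a term that tries to apply a proof of A as if it were a function:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom: name: str

@dataclass(frozen=True)
class Arrow: left: object; right: object

@dataclass(frozen=True)
class Var: name: str

@dataclass(frozen=True)
class Lam: param: str; param_type: object; body: object

@dataclass(frozen=True)
class App: fn: object; arg: object

def infer(ctx, term):
    if isinstance(term, Var):
        if term.name not in ctx:
            raise TypeError(f"unbound variable: {term.name}")
        return ctx[term.name]
    if isinstance(term, Lam):
        return Arrow(term.param_type,
                     infer({**ctx, term.param: term.param_type}, term.body))
    if isinstance(term, App):
        fn_type, arg_type = infer(ctx, term.fn), infer(ctx, term.arg)
        if not isinstance(fn_type, Arrow) or fn_type.left != arg_type:
            raise TypeError(f"bad application: {fn_type} to {arg_type}")
        return fn_type.right
    raise TypeError(f"unknown term: {term}")

A = Atom("A")

# x proves A, but A is not an implication, so x cannot be applied
bogus = Lam("x", A, App(Var("x"), Var("x")))
try:
    infer({}, bogus)
except TypeError as err:
    print("rejected:", err)
```

A term the kernel rejects is simply not a proof of anything; there is no appeal.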

#What the Rules Mean Logically

Each branch of infer is one proof rule.
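Written as natural-deduction inference rules, with Γ standing for the context, the three branches are:

```latex
% assumption (Var)
\frac{x : A \in \Gamma}{\Gamma \vdash x : A}
\qquad
% implication introduction (Lam)
\frac{\Gamma,\, x : A \vdash e : B}{\Gamma \vdash (\lambda x{:}A.\ e) : A \to B}
\qquad
% implication elimination / modus ponens (App)
\frac{\Gamma \vdash f : A \to B \qquad \Gamma \vdash a : A}{\Gamma \vdash f\ a : B}
```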

#Variables

infer(ctx, Var("x")) == ctx["x"]

Logically, this means: if A is among the assumptions, then A holds.

That is the rule of assumption.

#Lambdas

infer(ctx, Lam("x", A, body)) == Arrow(A, body_type)

Logically: if assuming A lets you derive B, then A -> B holds.

That is implication introduction.

#Applications

infer(ctx, App(f, a))

Logically: if A -> B holds and A holds, then B holds.

That is implication elimination, better known as modus ponens.

The checker enforces logical inference rules directly.
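To see modus ponens fire concretely, seed the context with two assumptions and let the checker discharge the implication. A cut-down sketch with just the two rules involved, restated so it runs on its own:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom: name: str

@dataclass(frozen=True)
class Arrow: left: object; right: object

@dataclass(frozen=True)
class Var: name: str

@dataclass(frozen=True)
class App: fn: object; arg: object

def infer(ctx, term):
    # assumption rule: look the hypothesis up in the context
    if isinstance(term, Var):
        return ctx[term.name]
    # modus ponens: from A -> B and A, conclude B
    fn_type, arg_type = infer(ctx, term.fn), infer(ctx, term.arg)
    if not isinstance(fn_type, Arrow) or fn_type.left != arg_type:
        raise TypeError("bad application")
    return fn_type.right

A, B = Atom("A"), Atom("B")

# assume f : A -> B and a : A; then f a proves B
ctx = {"f": Arrow(A, B), "a": A}
print(infer(ctx, App(Var("f"), Var("a"))))  # Atom(name='B')
```

The nonempty context is doing the work here: it is a set of hypotheses, and the checker verifies a derivation from them.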

#First Proof: A -> A

The identity function is the simplest proof in the system:

A = Atom("A")

identity = Lam("x", A, Var("x"))
print(infer({}, identity))

The inferred type is:

Arrow(Atom("A"), Atom("A"))

Under Curry-Howard, that means the term proves A -> A.

fun x => x is both:

  1. the identity function on values of type A
  2. a proof of the proposition A -> A
The same term inhabits both interpretations.

#A Slightly Less Trivial Proof: A -> B -> A

Now prove that if A holds, then B implies A.

A = Atom("A")
B = Atom("B")

proof = Lam("x", A, Lam("y", B, Var("x")))
print(infer({}, proof))

The inferred type is:

Arrow(A, Arrow(B, A))

Logically, that is:

A -> B -> A

The proof is just “ignore the second assumption and return the first one.”

Ordinary functional programming and formal proof are the same structure seen two ways.

#Composition: (A -> B) -> (B -> C) -> A -> C

Now something that feels like actual reasoning:

A = Atom("A")
B = Atom("B")
C = Atom("C")

compose = Lam(
    "f", Arrow(A, B),
    Lam(
        "g", Arrow(B, C),
        Lam(
            "x", A,
            App(Var("g"), App(Var("f"), Var("x")))
        )
    )
)

print(infer({}, compose))

The checker returns:

Arrow(Arrow(A, B), Arrow(Arrow(B, C), Arrow(A, C)))

This is proof as composition of evidence: a proof that A yields B, chained with a proof that B yields C, gives a proof that A yields C.

In logic, that is a theorem. In programming, it is function composition.

The same term carries both readings.

#What This Tiny Prover Cannot Yet Do

At this point we have a real prover, but it is extremely small.

It cannot yet express:

  1. conjunction (A and B) or disjunction (A or B)
  2. falsity, and with it negation
  3. quantifiers, so nothing of the form "for all x, P(x)"

That is fine. The first implementation exists to make the kernel architecture legible.

Even in this tiny form, you can already see the whole pattern:

  1. propositions are represented as types
  2. proofs are represented as terms
  3. a small trusted checker decides which terms prove which propositions

Everything more advanced is an extension of that base.

#The Next Useful Extension: Pairs

If you wanted to grow this prover one step, the best next move would be conjunction.

Add:

  1. a product type Prod(A, B), representing the conjunction A ∧ B
  2. a pair constructor Pair(a, b), which introduces it
  3. projections Fst(p) and Snd(p), which eliminate it

Then the checker gets three more rules:

  1. a pair of proofs of A and B proves Prod(A, B)
  2. Fst turns a proof of Prod(A, B) into a proof of A
  3. Snd turns a proof of Prod(A, B) into a proof of B

That would let you prove statements like:

  1. A -> B -> A ∧ B
  2. A ∧ B -> A

The architecture would not change. You would just add more term forms and more rules.

That is the recurring theme in theorem provers: small, explicit rules compose into powerful systems.
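Here is one way that extension could look, sketched end to end. The constructor names (Prod, Pair, Fst, Snd) are my own choices, not fixed conventions, and the base kernel is restated compactly so the snippet runs on its own:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom: name: str

@dataclass(frozen=True)
class Arrow: left: object; right: object

@dataclass(frozen=True)
class Prod: left: object; right: object   # the conjunction A ∧ B

@dataclass(frozen=True)
class Var: name: str

@dataclass(frozen=True)
class Lam: param: str; param_type: object; body: object

@dataclass(frozen=True)
class App: fn: object; arg: object

@dataclass(frozen=True)
class Pair: first: object; second: object  # conjunction introduction

@dataclass(frozen=True)
class Fst: pair: object                    # left elimination

@dataclass(frozen=True)
class Snd: pair: object                    # right elimination

def infer(ctx, term):
    if isinstance(term, Var):
        return ctx[term.name]
    if isinstance(term, Lam):
        return Arrow(term.param_type,
                     infer({**ctx, term.param: term.param_type}, term.body))
    if isinstance(term, App):
        fn_type, arg_type = infer(ctx, term.fn), infer(ctx, term.arg)
        if not isinstance(fn_type, Arrow) or fn_type.left != arg_type:
            raise TypeError("bad application")
        return fn_type.right
    # the three new rules, one per new term form
    if isinstance(term, Pair):
        return Prod(infer(ctx, term.first), infer(ctx, term.second))
    if isinstance(term, (Fst, Snd)):
        pair_type = infer(ctx, term.pair)
        if not isinstance(pair_type, Prod):
            raise TypeError(f"projection from non-pair: {pair_type}")
        return pair_type.left if isinstance(term, Fst) else pair_type.right
    raise TypeError(f"unknown term: {term}")

A, B = Atom("A"), Atom("B")

# A -> B -> A ∧ B: pair up the two assumptions
print(infer({}, Lam("x", A, Lam("y", B, Pair(Var("x"), Var("y"))))))

# A ∧ B -> A: project the first component
print(infer({}, Lam("p", Prod(A, B), Fst(Var("p")))))
```

Note how little changed: the kernel grew by three branches, and each branch is one logical rule.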

#What Lean Adds Beyond This

The toy checker here and Lean are the same kind of machine, but Lean adds several layers whose absence you feel the moment you try to prove anything nontrivial:

  1. dependent types, so propositions can mention values
  2. inductive types, which give you numbers, lists, equality, and induction
  3. an elaborator that fills in the details you leave implicit
  4. tactics, which build proof terms interactively instead of by hand
  5. a large library of already-proved mathematics

But none of that changes the central picture. There is still a kernel. There are still terms. There are still types. There is still a checker.

That is why building even a tiny prover is so clarifying. It separates the parts that make the logic work from the parts that make the system pleasant to use.

#Why Python Is the Right Language for This Article

Python’s type system is too weak for the Julia-style trick of making the host language do the proving. That is exactly why it is useful here.

In this article, the prover is explicit:

  1. the terms are ordinary Python values
  2. the propositions are ordinary Python values
  3. the checker is an ordinary Python function you can read in one sitting

Nothing is hidden in the host language. That makes Python a good teaching language for the architecture, even though it would be a bad language for host-type-system metaprogramming.

#The Real Lesson

The distance from ordinary interpreter code to a theorem prover kernel is much smaller than most programmers think. You start with:

  1. a datatype for terms
  2. a datatype for types
  3. a recursive function that checks one against the other

That is enough to get a real notion of proof.

Lean is the same architecture, pushed further and engineered better.

The next article shows the complementary perspective: instead of implementing the prover explicitly, it embeds a tiny one inside Julia’s own type system. Same ideas, different lesson.