March 21, 2026 · 12 min read

Exploratory Benchmarking: From Goldbach to Z3 Solver Stacks

Developing a robust autonomous research pipeline requires rigorous evaluation against established mathematical challenges. Our benchmarking strategy has progressed from initial attempts at the Goldbach conjecture to formalized solver stacks that can settle questions about bounds on Van der Waerden numbers.

Case Study: The Goldbach Conjecture and Sieve Parity

The Goldbach conjecture served as an early stress test for our orchestration loops. The system was tasked with bridging unconnected domains, specifically evaluating whether hyperbolic geometry could control prime sums via the prime geodesic theorem. During this phase, the engine ran into the multiplicative-additive obstruction: the classic sieve parity problem that has challenged analytic number theory for a century.

This benchmark revealed critical failure modes in LLM-based reasoning, including the generation of vacuous proofs. Our response was to implement a rigorous audit loop that requires the system to correct types and bind quantities to actual Selberg coefficients, ultimately reducing the conjecture to a single open axiom: the positivity of the Rankin-Selberg convolution.

Van der Waerden Bounds and Stochastic Stagnation

Subsequent benchmarking focused on Van der Waerden numbers, specifically on finding partitions that witness lower bounds for $W(k, l)$. In these searches, our Simulated Annealing (SA) engines frequently hit energy plateaus: regions around local minima where incremental mutations failed to reduce the number of violations. This led to the development of our "strategic escapement" protocols, including supercritical reheats and stochastic restarts.
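
To make the plateau problem concrete, here is a minimal sketch of annealing with a reheat, assuming a coloring of {0, ..., n-1} where the energy is the number of monochromatic k-term arithmetic progressions. The function names, the cooling schedule, and the `patience` threshold are illustrative assumptions, not the production engine or its actual escapement protocol.

```python
import math
import random

def progressions(n, k):
    """All k-term arithmetic progressions inside {0, ..., n-1}, as index tuples."""
    return [tuple(a + i * d for i in range(k))
            for d in range(1, n)
            for a in range(n - (k - 1) * d)]

def energy(coloring, aps):
    """Number of monochromatic progressions, i.e. violations."""
    return sum(1 for ap in aps if len({coloring[i] for i in ap}) == 1)

def anneal(n, k, colors=2, steps=20000, t0=1.0, cooling=0.999,
           patience=500, seed=0):
    """Simulated annealing with a reheat when the search stalls (hypothetical parameters)."""
    rng = random.Random(seed)
    aps = progressions(n, k)
    state = [rng.randrange(colors) for _ in range(n)]
    e = energy(state, aps)
    best, best_e = state[:], e
    t, stale = t0, 0
    for _ in range(steps):
        if best_e == 0:
            break  # found a coloring with no violations
        i = rng.randrange(n)
        old = state[i]
        state[i] = rng.randrange(colors)  # may re-pick the same color (a no-op move)
        new_e = energy(state, aps)
        # Metropolis rule: always accept improvements, sometimes accept worse moves
        if new_e <= e or rng.random() < math.exp((e - new_e) / t):
            e = new_e
        else:
            state[i] = old
        if e < best_e:
            best, best_e, stale = state[:], e, 0
        else:
            stale += 1
        t = max(t * cooling, 1e-3)
        if stale >= patience:
            t, stale = t0, 0  # reheat: restore the starting temperature
    return best, best_e

best, best_e = anneal(8, 3)
print(best_e, best)
```

A stochastic restart would differ from the reheat shown here by re-randomizing `state` instead of only resetting the temperature. Neither guarantees escape: if no zero-energy coloring exists, the loop simply runs out of steps.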

The Z3 'UNSAT' Solver Stack Integration

A pivotal milestone in our architectural development occurred during a search for $W(3, 4)$ bounds. When the SA search stagnated at energy $E=12$, we integrated an SMT-based resolution node using the Z3 Theorem Prover. By translating the combinatorial constraint into a Boolean satisfiability problem, the system determined within seconds that the target configuration was UNSAT (unsatisfiable).
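
The translation into Boolean satisfiability can be illustrated on a much smaller instance than the one above. The sketch below encodes "2-color 1..n with no monochromatic k-term arithmetic progression" as CNF clauses and checks them with a tiny backtracking search. That search is a dependency-free stand-in for Z3, and the names `vdw_clauses` and `solve` are hypothetical. The instance uses the known fact that the two-color Van der Waerden number for 3-term progressions is 9, so n = 8 is satisfiable and n = 9 is not.

```python
def vdw_clauses(n, k):
    """CNF clauses forbidding monochromatic k-term APs in a 2-coloring of 1..n.

    Variable i is True when integer i gets the first color.
    Literals are DIMACS style: +i means variable i, -i means its negation.
    """
    clauses = []
    for d in range(1, n):
        for a in range(1, n - (k - 1) * d + 1):
            ap = [a + i * d for i in range(k)]
            clauses.append([-v for v in ap])  # not all in the first color
            clauses.append(ap[:])             # not all in the second color
    return clauses

def solve(clauses, n, assignment=None):
    """Tiny backtracking search; returns a satisfying dict, or None for UNSAT."""
    assignment = assignment or {}
    for clause in clauses:
        # A clause is falsified when every literal is assigned and false.
        if all(abs(l) in assignment and assignment[abs(l)] != (l > 0)
               for l in clause):
            return None
    if len(assignment) == n:
        return assignment  # complete assignment with no falsified clause
    var = len(assignment) + 1  # variables are assigned in order 1..n
    for value in (True, False):
        result = solve(clauses, n, {**assignment, var: value})
        if result is not None:
            return result
    return None

print(solve(vdw_clauses(8, 3), 8) is not None)  # True: a valid coloring of 1..8 exists
print(solve(vdw_clauses(9, 3), 9))              # None: no valid coloring of 1..9 exists
```

With Z3's Python bindings, each clause would instead become an `Or` over `Bool` variables added to a `Solver`, and `check()` would report `unsat` for the n = 9 case. The value of the exact answer is the same either way: an UNSAT verdict means no amount of further annealing could have reached zero energy.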

This realization transformed our approach to search. Instead of indefinitely tuning hyperparameters for a failing search, the Architect now utilizes a multi-tiered solver stack:

1. High-Throughput Heuristic Search: Rapidly identifies candidate witnesses in large configuration spaces.
2. SMT/SAT Resolution: Formally proves the non-existence of solutions in sub-regions of the search space, providing an early exit for unpromising avenues.
3. Formal Verification (Lean 4): Provides the final machine-checked proof for successful witnesses identified by the heuristic layer.
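
The control flow between the three tiers can be sketched as follows. Everything here is an assumption made for illustration: random sampling stands in for the heuristic engine, exhaustive enumeration stands in for the SMT/SAT node, and a plain Python re-check stands in for the Lean 4 proof. The point is only the ordering and the early exit, not the real components.

```python
import itertools
import random

def is_valid(coloring, k):
    """True if the 2-coloring has no monochromatic k-term arithmetic progression."""
    n = len(coloring)
    for d in range(1, n):
        for a in range(n - (k - 1) * d):
            if len({coloring[a + i * d] for i in range(k)}) == 1:
                return False
    return True

def heuristic_tier(n, k, tries=200, seed=0):
    """Cheap stand-in for heuristic search: sample random colorings."""
    rng = random.Random(seed)
    for _ in range(tries):
        candidate = tuple(rng.randrange(2) for _ in range(n))
        if is_valid(candidate, k):
            return candidate
    return None  # inconclusive: a failure here proves nothing

def exact_tier(n, k):
    """Stand-in for SMT/SAT: exhaustive, so None means no solution exists."""
    for candidate in itertools.product((0, 1), repeat=n):
        if is_valid(candidate, k):
            return candidate
    return None

def verify_tier(witness, k):
    """Stand-in for formal verification: re-check the witness before reporting it."""
    if not is_valid(witness, k):
        raise ValueError("witness failed verification")
    return witness

def run_stack(n, k):
    """Try the heuristic first, fall back to the exact tier, verify any witness."""
    witness = heuristic_tier(n, k)
    if witness is None:
        witness = exact_tier(n, k)
        if witness is None:
            return ("UNSAT", None)  # early exit: stop searching this region
    return ("WITNESS", verify_tier(witness, k))

print(run_stack(8, 3)[0])  # WITNESS
print(run_stack(9, 3))     # ('UNSAT', None)
```

The asymmetry between the tiers is the design point: a heuristic miss is inconclusive, whereas a miss from the exact tier is a proof of non-existence and justifies abandoning the region.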

Empirical Calibration

These benchmarks have allowed us to calibrate the system's "strategic temperature," ensuring that compute resources are allocated efficiently between empirical exploration and formal resolution. The integration of the Z3 solver stack, in particular, has formalized the transition from heuristic "guessing" to systematic mathematical search.
