MathScy: Mixture-of-Experts for Mathematical Reasoning

Paper: MathScy: Domain-Specialized Mixture-of-Experts for AI-Assisted Mathematical Reasoning | Code: github.com/ai4he/moexp | Proved Conjectures: NLP | Lean 4

MathScy is a domain-specialized Mixture-of-Experts (MoE) system built on DeepSeek-Math-7B-Base for mathematical conjecture generation, theorem proving, and formal verification. It uses 7 domain-specialized QLoRA adapters plus a shared cross-domain expert, combined through a learned sigmoid router that selects the top-2 most relevant experts per input.

Key Results

  • 153 novel conjectures generated across 7 mathematical domains
  • 9 conjectures proved, 6 flagged as publishable by independent evaluation
  • Top quality score: 0.980 (algebra: weakly Artinian module characterization)
  • Router accuracy: 76.3% top-1, 88.9% top-2 with near-uniform load balancing
  • Full system fits in ~9 GB and runs on a single consumer GPU

Model Architecture

DeepSeek-Math-7B-Base (6.9B params, 4-bit NF4 quantized, ~4 GB)
├── Expert 1: Algebraic Geometry    (QLoRA adapter, ~572 MB)
├── Expert 2: Discrete Math         (QLoRA adapter, ~572 MB)
├── Expert 3: Number Theory         (QLoRA adapter, ~572 MB)
├── Expert 4: Analysis              (QLoRA adapter, ~572 MB)
├── Expert 5: Algebra               (QLoRA adapter, ~572 MB)
├── Expert 6: Geometry & Topology   (QLoRA adapter, ~572 MB)
├── Expert 7: Probability & Stats   (QLoRA adapter, ~572 MB)
├── Shared:   Cross-domain expert   (QLoRA adapter, ~572 MB)
└── Router:   Sigmoid gating        (117 KB)
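A back-of-the-envelope tally of the component sizes listed above reproduces the "~9 GB" headline figure (actual VRAM use is somewhat higher once activations and the KV cache are counted):

```python
# Rough memory budget from the component sizes listed above
base_gb = 4.0                # 4-bit NF4 base model
adapter_gb = 0.572           # each QLoRA adapter
n_adapters = 8               # 7 domain experts + 1 shared expert
router_gb = 117 / 1024**2    # 117 KB router, negligible

total_gb = base_gb + n_adapters * adapter_gb + router_gb
print(f"{total_gb:.2f} GB")  # ~8.58 GB, i.e. the "~9 GB" figure
```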
Component Details

| Component | Value |
|---|---|
| Base Model | DeepSeek-Math-7B-Base (6.9B parameters) |
| Quantization | 4-bit NF4 (QLoRA) |
| LoRA Rank | r=64, alpha=128, dropout=0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable Parameters | ~150M per expert (2.1% of base) |
| Gating | Sigmoid router with auxiliary-loss-free load balancing |
| Top-k | 2 domain experts activated per input (+ shared always active) |
| Training Method | Branch-Train-Mix (BTX) with QLoRA |
| Training Hardware | Single NVIDIA A100 40GB PCIe |
| Total Training Time | ~18.7 hours (sequential) |
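The LoRA settings above map directly onto a PEFT configuration. A sketch of what each expert's adapter config plausibly looks like (the exact training code lives in the repo; this is illustrative):

```python
from peft import LoraConfig

# Hypothetical reconstruction of each expert's adapter config from the table above
expert_lora_config = LoraConfig(
    r=64,                    # LoRA rank
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```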

Domain Experts

| Expert | Training Examples | Steps | Final Loss | Description |
|---|---|---|---|---|
| Algebraic Geometry | 5,751 | 540 | 0.83 | Schemes, sheaves, cohomology, intersection theory |
| Discrete Math | 6,201 | 582 | 0.77 | Combinatorics, graph theory, counting arguments |
| Number Theory | 7,560 | 711 | 0.85 | Primes, Diophantine equations, p-adic analysis |
| Analysis | 26,501 | 2,487 | 0.61 | Real, complex, and functional analysis |
| Algebra | 26,218 | 2,460 | 0.81 | Commutative algebra, ring and module theory |
| Geometry & Topology | 49,782 | 4,668 | 0.58 | Manifolds, algebraic topology, differential geometry |
| Probability & Statistics | 10,492 | 984 | 0.86 | Measure theory, concentration inequalities |
| Shared (all domains) | 14,000 | 1,314 | 1.05 | Balanced cross-domain (2K per domain) |
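The step counts above are exactly what the configuration in the Training Details section (effective batch 32, 3 epochs) predicts if the last partial batch of each epoch is rounded up. A quick sanity check, not the actual training code:

```python
import math

def expected_steps(n_examples, epochs=3, per_device_batch=4, grad_accum=8):
    """Optimizer steps: ceil(examples / effective batch) per epoch, times epochs."""
    effective_batch = per_device_batch * grad_accum  # 4 * 8 = 32
    return math.ceil(n_examples / effective_batch) * epochs

for name, n in [("algebra", 26_218), ("geometry_topology", 49_782), ("shared", 14_000)]:
    print(name, expected_steps(n))  # 2460, 4668, 1314 -- matching the table
```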

Installation

pip install torch transformers peft bitsandbytes accelerate safetensors

Quick Start: Load a Single Expert

The simplest usage is loading the base model with one domain expert adapter:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-math-7b-base",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-math-7b-base")

# Load the algebra expert
model = PeftModel.from_pretrained(
    base_model,
    "ai4he/mathscy-moe-7b/expert_algebra/final"
)
model.eval()

# Generate a proof
prompt = """Prove that every finitely generated module over a PID is a direct sum of cyclic modules.

Proof:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Using Different Domain Experts

Each expert is specialized for its mathematical domain. Choose the expert that best matches your problem:

# Number theory expert: primes, modular arithmetic, p-adic analysis
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_number_theory/final")
prompt = "State and prove Fermat's theorem on sums of two squares."

# Algebraic geometry expert: schemes, sheaves, varieties
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_algebraic_geometry/final")
prompt = "Describe the Hilbert scheme of points on a smooth surface and its key properties."

# Probability expert: measure theory, concentration inequalities
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_probability_statistics/final")
prompt = "Prove a sub-Gaussian tail bound for the sample mean of bounded random variables."

# Geometry & topology expert: manifolds, homology, homotopy
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_geometry_topology/final")
prompt = "Explain the relationship between de Rham cohomology and singular cohomology."

# Analysis expert: real, complex, and functional analysis
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_analysis/final")
prompt = "Prove that every bounded sequence in a Hilbert space has a weakly convergent subsequence."

# Discrete math expert: combinatorics, graph theory
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_discrete_math/final")
prompt = "Prove the Erdős–Ko–Rado theorem for intersecting families of k-element subsets."

# Shared expert: cross-domain problems
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_shared/final")
prompt = "Describe connections between algebraic topology and combinatorics."

Advanced: Loading the Router for Automatic Expert Selection

The sigmoid router can automatically select the best experts for a given input:

import torch
from huggingface_hub import hf_hub_download

# torch.load cannot read directly from the Hub, so download router.pt first
router_path = hf_hub_download("ai4he/mathscy-moe-7b", "moe_assembled/router.pt")
router_state = torch.load(router_path, map_location="cpu")

# The router maps hidden states to expert scores
# router_state contains 'weight' and 'bias' for a linear layer
weight = router_state["weight"]  # Shape: [7, hidden_dim]
bias = router_state["bias"]      # Shape: [7]

# Domain order matches the expert registry
domains = [
    "algebraic_geometry", "discrete_math", "number_theory",
    "analysis", "algebra", "geometry_topology", "probability_statistics"
]

def route_input(hidden_states):
    """Select top-2 experts given mean-pooled hidden states."""
    scores = torch.sigmoid(hidden_states @ weight.T + bias)
    top2 = torch.topk(scores, k=2)
    selected = [(domains[i], s.item()) for i, s in zip(top2.indices[0], top2.values[0])]
    return selected

# Example: get hidden states from the base model and route
inputs = tokenizer("Prove that the ideal class group of a number field is finite.", return_tensors="pt")
with torch.no_grad():
    outputs = base_model(**inputs.to(base_model.device), output_hidden_states=True)
    # Mean-pool the last hidden state
    hidden = outputs.hidden_states[-1].mean(dim=1)

selected_experts = route_input(hidden.cpu())
print(f"Selected experts: {selected_experts}")
# Example output: [('number_theory', 0.92), ('algebra', 0.78)]

Advanced: Full MoE Inference with Expert Merging

For the complete MoE experience, load multiple experts and merge their outputs:

from peft import PeftModel
import torch

# Load base model (as above)
# ...

def moe_generate(prompt, top_k=2):
    """Generate using the full MoE pipeline with router-selected experts."""

    # Step 1: Get hidden states for routing
    inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)
    with torch.no_grad():
        base_outputs = base_model(**inputs, output_hidden_states=True)
        hidden = base_outputs.hidden_states[-1].mean(dim=1)

    # Step 2: Route to top-k experts (fetch router.pt from the Hub; torch.load needs a local path)
    from huggingface_hub import hf_hub_download
    router_path = hf_hub_download("ai4he/mathscy-moe-7b", "moe_assembled/router.pt")
    router_state = torch.load(router_path, map_location="cpu")
    scores = torch.sigmoid(hidden.cpu() @ router_state["weight"].T + router_state["bias"])
    topk = torch.topk(scores, k=top_k)

    domains = ["algebraic_geometry", "discrete_math", "number_theory",
               "analysis", "algebra", "geometry_topology", "probability_statistics"]

    # Step 3: Generate from each selected expert + shared
    expert_paths = [f"ai4he/mathscy-moe-7b/expert_{domains[i]}/final" for i in topk.indices[0]]
    expert_paths.append("ai4he/mathscy-moe-7b/expert_shared/final")  # Always include shared

    results = []
    for path in expert_paths:
        expert_model = PeftModel.from_pretrained(base_model, path)
        expert_model.eval()
        with torch.no_grad():
            outputs = expert_model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
        results.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
        expert_model.unload()  # Strip this adapter from the base model before loading the next

    return results

# Usage
outputs = moe_generate("Prove that every maximal ideal in a polynomial ring over a field is generated by irreducible polynomials.")
for i, output in enumerate(outputs):
    print(f"--- Expert {i+1} ---")
    print(output[:500])
    print()
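The pipeline above returns one completion per expert. A common alternative, sketched here under the assumption of router-score weighting (not necessarily the paper's exact mixing rule), is to combine experts at the logit level instead:

```python
import torch

def merge_expert_logits(expert_logits, router_scores):
    """Weighted average of per-expert next-token logits using normalized router scores.

    expert_logits: list of [vocab_size] tensors, one per selected expert
    router_scores: matching list of raw sigmoid router scores
    """
    w = torch.tensor(router_scores, dtype=torch.float32)
    w = w / w.sum()                         # normalize over the selected experts
    stacked = torch.stack(expert_logits)    # [n_experts, vocab_size]
    return (w.unsqueeze(-1) * stacked).sum(dim=0)

# Toy check: equal scores reduce to a plain average
merged = merge_expert_logits([torch.zeros(5), torch.ones(5)], [0.8, 0.8])
```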

Conjecture Generation Results

The Self-Play Theorem Prover (STP) loop generated 153 conjectures across all 7 domains:

| Metric | Value |
|---|---|
| Total conjectures | 153 |
| Proved | 9 |
| Disproved | 1 |
| Unknown/open | 5 |
| Partially proved | 41 |
| Average quality score | 0.693 |
| STP average quality | 0.735 |
| Top quality score | 0.980 |
| Flagged publishable | 6 |

Per-Domain STP Results

| Domain | Total | Proved | Avg Quality |
|---|---|---|---|
| Algebra | 8 | 5 | 0.850 |
| Number Theory | 8 | 1 | 0.755 |
| Geometry & Topology | 8 | 0 | 0.745 |
| Algebraic Geometry | 8 | 0 | 0.739 |
| Discrete Math | 8 | 0 | 0.713 |
| Probability & Statistics | 8 | 3 | 0.696 |
| Analysis | 8 | 0 | 0.645 |
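The reported STP average quality of 0.735 is consistent with the unweighted mean of the per-domain averages above (each domain contributed 8 conjectures, so the weighting is uniform):

```python
per_domain_avg = [0.850, 0.755, 0.745, 0.739, 0.713, 0.696, 0.645]
stp_avg = sum(per_domain_avg) / len(per_domain_avg)
print(round(stp_avg, 3))  # 0.735
```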

Highlighted Proved Conjectures

  1. Weakly Artinian iff Finite Length at Associated Primes (Quality: 0.88, Algebra)

    Let R be a commutative Noetherian ring and M a finitely generated R-module. Then M is weakly Artinian iff for every prime p in Ass(M), the localization M_p has finite length.

  2. Cohen-Macaulay from Depth Condition (Quality: 0.88, Algebra)

    If Ass(M) is finite and depth(M_p) = dim(R_p) for all associated primes p, then M is Cohen-Macaulay.

  3. Conditional Independence Implies Marginal Independence (Quality: 0.78, Probability)

    Under strict positivity, conditional independence of X and Z given Y implies marginal independence of X and Z.

Full proved conjectures with proofs and Lean 4 formalizations: proved_conjectures.md

Router Performance

| Domain | Top-1 Accuracy | Top-2 Accuracy |
|---|---|---|
| Algebraic Geometry | 76.5% | 90.0% |
| Discrete Math | 82.0% | 90.0% |
| Number Theory | 83.0% | 91.0% |
| Analysis | 77.5% | 90.0% |
| Algebra | 72.5% | 90.0% |
| Geometry & Topology | 66.5% | 80.0% |
| Probability & Statistics | 76.0% | 91.5% |
| Overall | 76.3% | 88.9% |

Load balance ratio: 1.26 (near-uniform, target: 1.0)
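Two sanity checks on the numbers above. Assuming equal per-domain test sizes, the overall row is the per-domain mean; and one plausible definition of the load balance ratio is max per-expert top-1 load over mean load (1.0 = perfectly uniform):

```python
import torch

top1 = [76.5, 82.0, 83.0, 77.5, 72.5, 66.5, 76.0]
top2 = [90.0, 90.0, 91.0, 90.0, 90.0, 80.0, 91.5]
print(round(sum(top1) / len(top1), 1))  # 76.3
print(round(sum(top2) / len(top2), 1))  # 88.9

def load_balance_ratio(assignments, n_experts=7):
    """Max-to-mean ratio of per-expert top-1 assignment counts."""
    counts = torch.bincount(assignments, minlength=n_experts).float()
    return (counts.max() / counts.mean()).item()
```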

File Structure

ai4he/mathscy-moe-7b/
├── expert_algebraic_geometry/final/    # LoRA adapter + tokenizer
│   ├── adapter_config.json
│   ├── adapter_model.safetensors       # ~572 MB
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── expert_discrete_math/final/
├── expert_number_theory/final/
├── expert_analysis/final/
├── expert_algebra/final/
├── expert_geometry_topology/final/
├── expert_probability_statistics/final/
├── expert_shared/final/                # Always-active shared expert
├── moe_assembled/
│   └── router.pt                       # Sigmoid router weights (117 KB)
├── expert_registry.json                # Expert-to-domain mapping
└── README.md                           # This file

Training Details

  • Method: Branch-Train-Mix (BTX) with QLoRA
  • Hardware: NVIDIA A100 40GB PCIe (single GPU, sequential training)
  • Total Training Time: ~18.7 hours
  • Optimizer: 8-bit AdamW, lr=1e-4, cosine schedule with 3% warmup
  • Batch Size: 4 per GPU, 8 gradient accumulation steps (effective batch 32)
  • Max Sequence Length: 4,096 tokens
  • Epochs: 3 per expert
  • Dataset: 241,586 training examples extracted from 6.5M mathematical statements (ArXiv)
  • Data sources: LLM-structured knowledge extraction (Gemini 2.0 Flash) + LaTeX theorem mining (117K theorems from 3.2K papers)
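The cosine schedule with 3% warmup behaves as follows; a minimal re-implementation for illustration (the actual runs would use the training framework's scheduler):

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_frac=0.03):
    """Linear warmup for the first 3% of steps, then cosine decay to zero."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

# E.g. over the algebra expert's 2,460 steps: 0 at start, peak after warmup, ~0 at the end
print(lr_at(0, 2460), lr_at(73, 2460), lr_at(2460, 2460))
```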

Citation

@article{mathscy2026,
  title={MathScy: Domain-Specialized Mixture-of-Experts for Mathematical Conjecture Generation and Theorem Proving},
  year={2026},
  url={https://github.com/ai4he/moexp}
}

License

Apache 2.0
