MathScy: Mixture-of-Experts for Mathematical Reasoning

Paper: MathScy: Domain-Specialized Mixture-of-Experts for AI-Assisted Mathematical Reasoning | Code: github.com/ai4he/moexp | Proved Conjectures: NLP | Lean 4

MathScy is a domain-specialized Mixture-of-Experts (MoE) system built on DeepSeek-Math-7B-Base for mathematical conjecture generation, theorem proving, and formal verification. It uses 7 domain-specialized QLoRA adapters plus a shared cross-domain expert, combined through a learned sigmoid router that selects the top-2 most relevant experts per input.

Key Results

  • 153 novel conjectures generated across 7 mathematical domains
  • 9 conjectures proved, 6 flagged as publishable by independent evaluation
  • Top quality score: 0.980 (algebra: weakly Artinian module characterization)
  • Router accuracy: 76.3% top-1, 88.9% top-2 with near-uniform load balancing
  • Full system fits in ~9 GB and runs on a single consumer GPU

Model Architecture

DeepSeek-Math-7B-Base (6.9B params, 4-bit NF4 quantized, ~4 GB)
├── Expert 1: Algebraic Geometry    (QLoRA adapter, ~572 MB)
├── Expert 2: Discrete Math         (QLoRA adapter, ~572 MB)
├── Expert 3: Number Theory         (QLoRA adapter, ~572 MB)
├── Expert 4: Analysis              (QLoRA adapter, ~572 MB)
├── Expert 5: Algebra               (QLoRA adapter, ~572 MB)
├── Expert 6: Geometry & Topology   (QLoRA adapter, ~572 MB)
├── Expert 7: Probability & Stats   (QLoRA adapter, ~572 MB)
├── Shared:   Cross-domain expert   (QLoRA adapter, ~572 MB)
└── Router:   Sigmoid gating        (117 KB)
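A back-of-the-envelope tally of the component sizes listed above reproduces the "~9 GB" headline figure (actual VRAM use is somewhat higher once activations and the KV cache are counted):

```python
# Rough memory budget from the component sizes listed above
base_gb = 4.0                # 4-bit NF4 base model
adapter_gb = 0.572           # each QLoRA adapter
n_adapters = 8               # 7 domain experts + 1 shared expert
router_gb = 117 / 1024**2    # 117 KB router, negligible

total_gb = base_gb + n_adapters * adapter_gb + router_gb
print(f"{total_gb:.2f} GB")  # ~8.58 GB, i.e. the "~9 GB" figure
```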
Component Details

| Component | Value |
|---|---|
| Base Model | DeepSeek-Math-7B-Base (6.9B parameters) |
| Quantization | 4-bit NF4 (QLoRA) |
| LoRA Rank | r=64, alpha=128, dropout=0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable Parameters | ~150M per expert (2.1% of base) |
| Gating | Sigmoid router with auxiliary-loss-free load balancing |
| Top-k | 2 domain experts activated per input (+ shared always active) |
| Training Method | Branch-Train-Mix (BTX) with QLoRA |
| Training Hardware | Single NVIDIA A100 40GB PCIe |
| Total Training Time | ~18.7 hours (sequential) |
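The LoRA settings above map directly onto a PEFT configuration. A sketch of what each expert's adapter config plausibly looks like (the exact training code lives in the repo; this is illustrative):

```python
from peft import LoraConfig

# Hypothetical reconstruction of each expert's adapter config from the table above
expert_lora_config = LoraConfig(
    r=64,                    # LoRA rank
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```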

Domain Experts

| Expert | Training Examples | Steps | Final Loss | Description |
|---|---|---|---|---|
| Algebraic Geometry | 5,751 | 540 | 0.83 | Schemes, sheaves, cohomology, intersection theory |
| Discrete Math | 6,201 | 582 | 0.77 | Combinatorics, graph theory, counting arguments |
| Number Theory | 7,560 | 711 | 0.85 | Primes, Diophantine equations, p-adic analysis |
| Analysis | 26,501 | 2,487 | 0.61 | Real, complex, and functional analysis |
| Algebra | 26,218 | 2,460 | 0.81 | Commutative algebra, ring and module theory |
| Geometry & Topology | 49,782 | 4,668 | 0.58 | Manifolds, algebraic topology, differential geometry |
| Probability & Statistics | 10,492 | 984 | 0.86 | Measure theory, concentration inequalities |
| Shared (all domains) | 14,000 | 1,314 | 1.05 | Balanced cross-domain (2K per domain) |
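The step counts above are exactly what the configuration in the Training Details section (effective batch 32, 3 epochs) predicts if the last partial batch of each epoch is rounded up. A quick sanity check, not the actual training code:

```python
import math

def expected_steps(n_examples, epochs=3, per_device_batch=4, grad_accum=8):
    """Optimizer steps: ceil(examples / effective batch) per epoch, times epochs."""
    effective_batch = per_device_batch * grad_accum  # 4 * 8 = 32
    return math.ceil(n_examples / effective_batch) * epochs

for name, n in [("algebra", 26_218), ("geometry_topology", 49_782), ("shared", 14_000)]:
    print(name, expected_steps(n))  # 2460, 4668, 1314 -- matching the table
```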

Installation

pip install torch transformers peft bitsandbytes accelerate safetensors

Quick Start: Load a Single Expert

The simplest usage is loading the base model with one domain expert adapter:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-math-7b-base",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-math-7b-base")

# Load the algebra expert
model = PeftModel.from_pretrained(
    base_model,
    "ai4he/mathscy-moe-7b/expert_algebra/final"
)
model.eval()

# Generate a proof
prompt = """Prove that every finitely generated module over a PID is a direct sum of cyclic modules.

Proof:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Using Different Domain Experts

Each expert is specialized for its mathematical domain. Choose the expert that best matches your problem:

# Number theory expert: primes, modular arithmetic, p-adic analysis
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_number_theory/final")
prompt = "State and prove Fermat's theorem on sums of two squares."

# Algebraic geometry expert: schemes, sheaves, varieties
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_algebraic_geometry/final")
prompt = "Describe the Hilbert scheme of points on a smooth surface and its key properties."

# Probability expert: measure theory, concentration inequalities
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_probability_statistics/final")
prompt = "Prove a sub-Gaussian tail bound for the sample mean of bounded random variables."

# Geometry & topology expert: manifolds, homology, homotopy
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_geometry_topology/final")
prompt = "Explain the relationship between de Rham cohomology and singular cohomology."

# Analysis expert: real, complex, and functional analysis
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_analysis/final")
prompt = "Prove that every bounded sequence in a Hilbert space has a weakly convergent subsequence."

# Discrete math expert: combinatorics, graph theory
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_discrete_math/final")
prompt = "Prove the Erdős–Ko–Rado theorem for intersecting families of k-element subsets."

# Shared expert: cross-domain problems
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b/expert_shared/final")
prompt = "Describe connections between algebraic topology and combinatorics."

Advanced: Loading the Router for Automatic Expert Selection

The sigmoid router can automatically select the best experts for a given input:

import torch
from huggingface_hub import hf_hub_download

# torch.load cannot read directly from the Hub, so download router.pt first
router_path = hf_hub_download("ai4he/mathscy-moe-7b", "moe_assembled/router.pt")
router_state = torch.load(router_path, map_location="cpu")

# The router maps hidden states to expert scores
# router_state contains 'weight' and 'bias' for a linear layer
weight = router_state["weight"]  # Shape: [7, hidden_dim]
bias = router_state["bias"]      # Shape: [7]

# Domain order matches the expert registry
domains = [
    "algebraic_geometry", "discrete_math", "number_theory",
    "analysis", "algebra", "geometry_topology", "probability_statistics"
]

def route_input(hidden_states):
    """Select top-2 experts given mean-pooled hidden states."""
    scores = torch.sigmoid(hidden_states @ weight.T + bias)
    top2 = torch.topk(scores, k=2)
    selected = [(domains[i], s.item()) for i, s in zip(top2.indices[0], top2.values[0])]
    return selected

# Example: get hidden states from the base model and route
inputs = tokenizer("Prove that the ideal class group of a number field is finite.", return_tensors="pt")
with torch.no_grad():
    outputs = base_model(**inputs.to(base_model.device), output_hidden_states=True)
    # Mean-pool the last hidden state
    hidden = outputs.hidden_states[-1].mean(dim=1)

selected_experts = route_input(hidden.cpu())
print(f"Selected experts: {selected_experts}")
# Example output: [('number_theory', 0.92), ('algebra', 0.78)]

Advanced: Full MoE Inference with Expert Merging

For the complete MoE experience, load multiple experts and merge their outputs:

from peft import PeftModel
import torch

# Load base model (as above)
# ...

def moe_generate(prompt, top_k=2):
    """Generate using the full MoE pipeline with router-selected experts."""

    # Step 1: Get hidden states for routing
    inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)
    with torch.no_grad():
        base_outputs = base_model(**inputs, output_hidden_states=True)
        hidden = base_outputs.hidden_states[-1].mean(dim=1)

    # Step 2: Route to top-k experts (fetch router.pt from the Hub; torch.load needs a local path)
    from huggingface_hub import hf_hub_download
    router_path = hf_hub_download("ai4he/mathscy-moe-7b", "moe_assembled/router.pt")
    router_state = torch.load(router_path, map_location="cpu")
    scores = torch.sigmoid(hidden.cpu() @ router_state["weight"].T + router_state["bias"])
    topk = torch.topk(scores, k=top_k)

    domains = ["algebraic_geometry", "discrete_math", "number_theory",
               "analysis", "algebra", "geometry_topology", "probability_statistics"]

    # Step 3: Generate from each selected expert + shared
    expert_paths = [f"ai4he/mathscy-moe-7b/expert_{domains[i]}/final" for i in topk.indices[0]]
    expert_paths.append("ai4he/mathscy-moe-7b/expert_shared/final")  # Always include shared

    results = []
    for path in expert_paths:
        expert_model = PeftModel.from_pretrained(base_model, path)
        expert_model.eval()
        with torch.no_grad():
            outputs = expert_model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
        results.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
        expert_model.unload()  # Strip this adapter from the base model before loading the next

    return results

# Usage
outputs = moe_generate("Prove that every maximal ideal in a polynomial ring over a field is generated by irreducible polynomials.")
for i, output in enumerate(outputs):
    print(f"--- Expert {i+1} ---")
    print(output[:500])
    print()
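The pipeline above returns one completion per expert. A common alternative, sketched here under the assumption of router-score weighting (not necessarily the paper's exact mixing rule), is to combine experts at the logit level instead:

```python
import torch

def merge_expert_logits(expert_logits, router_scores):
    """Weighted average of per-expert next-token logits using normalized router scores.

    expert_logits: list of [vocab_size] tensors, one per selected expert
    router_scores: matching list of raw sigmoid router scores
    """
    w = torch.tensor(router_scores, dtype=torch.float32)
    w = w / w.sum()                         # normalize over the selected experts
    stacked = torch.stack(expert_logits)    # [n_experts, vocab_size]
    return (w.unsqueeze(-1) * stacked).sum(dim=0)

# Toy check: equal scores reduce to a plain average
merged = merge_expert_logits([torch.zeros(5), torch.ones(5)], [0.8, 0.8])
```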

Conjecture Generation Results

The Self-Play Theorem Prover (STP) loop generated 153 conjectures across all 7 domains:

| Metric | Value |
|---|---|
| Total conjectures | 153 |
| Proved | 9 |
| Disproved | 1 |
| Unknown/open | 5 |
| Partially proved | 41 |
| Average quality score | 0.693 |
| STP average quality | 0.735 |
| Top quality score | 0.980 |
| Flagged publishable | 6 |

Per-Domain STP Results

| Domain | Total | Proved | Avg Quality |
|---|---|---|---|
| Algebra | 8 | 5 | 0.850 |
| Number Theory | 8 | 1 | 0.755 |
| Geometry & Topology | 8 | 0 | 0.745 |
| Algebraic Geometry | 8 | 0 | 0.739 |
| Discrete Math | 8 | 0 | 0.713 |
| Probability & Statistics | 8 | 3 | 0.696 |
| Analysis | 8 | 0 | 0.645 |
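The reported STP average quality of 0.735 is consistent with the unweighted mean of the per-domain averages above (each domain contributed 8 conjectures, so the weighting is uniform):

```python
per_domain_avg = [0.850, 0.755, 0.745, 0.739, 0.713, 0.696, 0.645]
stp_avg = sum(per_domain_avg) / len(per_domain_avg)
print(round(stp_avg, 3))  # 0.735
```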

Highlighted Proved Conjectures

  1. Weakly Artinian iff Finite Length at Associated Primes (Quality: 0.88, Algebra)

    Let R be a commutative Noetherian ring and M a finitely generated R-module. Then M is weakly Artinian iff for every prime p in Ass(M), the localization M_p has finite length.

  2. Cohen-Macaulay from Depth Condition (Quality: 0.88, Algebra)

    If Ass(M) is finite and depth(M_p) = dim(R_p) for all associated primes p, then M is Cohen-Macaulay.

  3. Conditional Independence Implies Marginal Independence (Quality: 0.78, Probability)

    Under strict positivity, conditional independence of X and Z given Y implies marginal independence of X and Z.

Full proved conjectures with proofs and Lean 4 formalizations: proved_conjectures.md

Router Performance

| Domain | Top-1 Accuracy | Top-2 Accuracy |
|---|---|---|
| Algebraic Geometry | 76.5% | 90.0% |
| Discrete Math | 82.0% | 90.0% |
| Number Theory | 83.0% | 91.0% |
| Analysis | 77.5% | 90.0% |
| Algebra | 72.5% | 90.0% |
| Geometry & Topology | 66.5% | 80.0% |
| Probability & Statistics | 76.0% | 91.5% |
| Overall | 76.3% | 88.9% |

Load balance ratio: 1.26 (near-uniform, target: 1.0)
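Two sanity checks on the numbers above. Assuming equal per-domain test sizes, the overall row is the per-domain mean; and one plausible definition of the load balance ratio is max per-expert top-1 load over mean load (1.0 = perfectly uniform):

```python
import torch

top1 = [76.5, 82.0, 83.0, 77.5, 72.5, 66.5, 76.0]
top2 = [90.0, 90.0, 91.0, 90.0, 90.0, 80.0, 91.5]
print(round(sum(top1) / len(top1), 1))  # 76.3
print(round(sum(top2) / len(top2), 1))  # 88.9

def load_balance_ratio(assignments, n_experts=7):
    """Max-to-mean ratio of per-expert top-1 assignment counts."""
    counts = torch.bincount(assignments, minlength=n_experts).float()
    return (counts.max() / counts.mean()).item()
```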

File Structure

ai4he/mathscy-moe-7b/
├── expert_algebraic_geometry/final/    # LoRA adapter + tokenizer
│   ├── adapter_config.json
│   ├── adapter_model.safetensors       # ~572 MB
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── expert_discrete_math/final/
├── expert_number_theory/final/
├── expert_analysis/final/
├── expert_algebra/final/
├── expert_geometry_topology/final/
├── expert_probability_statistics/final/
├── expert_shared/final/                # Always-active shared expert
├── moe_assembled/
│   └── router.pt                       # Sigmoid router weights (117 KB)
├── expert_registry.json                # Expert-to-domain mapping
└── README.md                           # This file

Training Details

  • Method: Branch-Train-Mix (BTX) with QLoRA
  • Hardware: NVIDIA A100 40GB PCIe (single GPU, sequential training)
  • Total Training Time: ~18.7 hours
  • Optimizer: 8-bit AdamW, lr=1e-4, cosine schedule with 3% warmup
  • Batch Size: 4 per GPU, 8 gradient accumulation steps (effective batch 32)
  • Max Sequence Length: 4,096 tokens
  • Epochs: 3 per expert
  • Dataset: 241,586 training examples extracted from 6.5M mathematical statements (ArXiv)
  • Data sources: LLM-structured knowledge extraction (Gemini 2.0 Flash) + LaTeX theorem mining (117K theorems from 3.2K papers)
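The cosine schedule with 3% warmup behaves as follows; a minimal re-implementation for illustration (the actual runs would use the training framework's scheduler):

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_frac=0.03):
    """Linear warmup for the first 3% of steps, then cosine decay to zero."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

# E.g. over the algebra expert's 2,460 steps: 0 at start, peak after warmup, ~0 at the end
print(lr_at(0, 2460), lr_at(73, 2460), lr_at(2460, 2460))
```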

Citation

@article{mathscy2026,
  title={MathScy: Domain-Specialized Mixture-of-Experts for Mathematical Conjecture Generation and Theorem Proving},
  year={2026},
  url={https://github.com/ai4he/moexp}
}

License

Apache 2.0
