MathScy: Mixture-of-Experts for Mathematical Reasoning
Paper: MathScy: Domain-Specialized Mixture-of-Experts for AI-Assisted Mathematical Reasoning
Code: github.com/ai4he/moexp
Proved Conjectures: NLP | Lean 4
MathScy is a domain-specialized Mixture-of-Experts (MoE) system built on DeepSeek-Math-7B-Base for mathematical conjecture generation, theorem proving, and formal verification. It uses 7 domain-specialized QLoRA adapters plus a shared cross-domain expert, combined through a learned sigmoid router that selects the top-2 most relevant experts per input.
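As an illustrative sketch of the gating described above (random stand-in weights, not the shipped router; the hidden size and domain order are assumptions), top-2 sigmoid routing amounts to scoring each domain expert independently and keeping the two highest:

```python
import torch

domains = ["algebraic_geometry", "discrete_math", "number_theory",
           "analysis", "algebra", "geometry_topology", "probability_statistics"]

hidden_dim = 4096  # assumed hidden size of the base model
torch.manual_seed(0)
weight = torch.randn(len(domains), hidden_dim) * 0.02  # stand-in for the trained router
bias = torch.zeros(len(domains))

def select_experts(pooled_hidden, k=2):
    """Sigmoid-score each domain expert and keep the top-k (the shared expert is always added separately)."""
    scores = torch.sigmoid(pooled_hidden @ weight.T + bias)
    top = torch.topk(scores, k=k)
    return [(domains[i], scores[i].item()) for i in top.indices.tolist()]

pooled = torch.randn(hidden_dim)  # stand-in for a mean-pooled hidden state
print(select_experts(pooled))     # e.g. [("analysis", 0.6...), ("algebra", 0.5...)]
```

Unlike softmax gating, sigmoid scores do not compete for probability mass, which is what makes the auxiliary-loss-free load balancing below possible.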
Key Results
- 153 novel conjectures generated across 7 mathematical domains
- 9 conjectures proved, 6 flagged as publishable by independent evaluation
- Top quality score: 0.980 (algebra: weakly Artinian module characterization)
- Router accuracy: 76.3% top-1, 88.9% top-2 with near-uniform load balancing
- Full system fits in ~9 GB and runs on a single consumer GPU
Model Architecture
DeepSeek-Math-7B-Base (6.9B params, 4-bit NF4 quantized, ~4 GB)
├── Expert 1: Algebraic Geometry (QLoRA adapter, ~572 MB)
├── Expert 2: Discrete Math (QLoRA adapter, ~572 MB)
├── Expert 3: Number Theory (QLoRA adapter, ~572 MB)
├── Expert 4: Analysis (QLoRA adapter, ~572 MB)
├── Expert 5: Algebra (QLoRA adapter, ~572 MB)
├── Expert 6: Geometry & Topology (QLoRA adapter, ~572 MB)
├── Expert 7: Probability & Stats (QLoRA adapter, ~572 MB)
├── Shared: Cross-domain expert (QLoRA adapter, ~572 MB)
└── Router: Sigmoid gating (117 KB)
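As a quick sanity check on the ~9 GB figure in Key Results, the component sizes listed above sum as follows:

```python
base_gb = 4.0                # 4-bit NF4 base model
adapter_gb = 0.572           # per QLoRA adapter
n_adapters = 8               # 7 domain experts + 1 shared
router_gb = 117 / 1024**2    # 117 KB, negligible

total_gb = base_gb + n_adapters * adapter_gb + router_gb
print(f"~{total_gb:.1f} GB")  # ~8.6 GB, within the ~9 GB budget
```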
| Component | Details |
|---|---|
| Base Model | DeepSeek-Math-7B-Base (6.9B parameters) |
| Quantization | 4-bit NF4 (QLoRA) |
| LoRA Rank | r=64, alpha=128, dropout=0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable Parameters | ~150M per expert (2.1% of base) |
| Gating | Sigmoid router with auxiliary-loss-free load balancing |
| Top-k | 2 domain experts activated per input (+ shared always active) |
| Training Method | Branch-Train-Mix (BTX) with QLoRA |
| Training Hardware | Single NVIDIA A100 40GB PCIe |
| Total Training Time | ~18.7 hours (sequential) |
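The ~150M trainable parameters per expert follow from the LoRA configuration above: each adapted weight matrix W (d_out × d_in) gains low-rank factors A (r × d_in) and B (d_out × r), i.e. r · (d_in + d_out) parameters. A back-of-the-envelope check (the layer count and hidden/intermediate sizes here are assumed from the base model's config, not stated in this card):

```python
r = 64
hidden, intermediate, layers = 4096, 11008, 30  # assumed DeepSeek-Math-7B dimensions

per_layer = (
    4 * r * (hidden + hidden)          # q_proj, k_proj, v_proj, o_proj
    + 2 * r * (hidden + intermediate)  # gate_proj, up_proj
    + 1 * r * (intermediate + hidden)  # down_proj
)
total = per_layer * layers
print(f"{total / 1e6:.0f}M trainable parameters")  # ~150M, matching the table
```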
Domain Experts
| Expert | Training Examples | Steps | Final Loss | Description |
|---|---|---|---|---|
| Algebraic Geometry | 5,751 | 540 | 0.83 | Schemes, sheaves, cohomology, intersection theory |
| Discrete Math | 6,201 | 582 | 0.77 | Combinatorics, graph theory, counting arguments |
| Number Theory | 7,560 | 711 | 0.85 | Primes, Diophantine equations, p-adic analysis |
| Analysis | 26,501 | 2,487 | 0.61 | Real, complex, and functional analysis |
| Algebra | 26,218 | 2,460 | 0.81 | Commutative algebra, ring and module theory |
| Geometry & Topology | 49,782 | 4,668 | 0.58 | Manifolds, algebraic topology, differential geometry |
| Probability & Statistics | 10,492 | 984 | 0.86 | Measure theory, concentration inequalities |
| Shared (all domains) | 14,000 | 1,314 | 1.05 | Balanced cross-domain (2K per domain) |
Installation
pip install torch transformers peft bitsandbytes accelerate safetensors
Quick Start: Load a Single Expert
The simplest usage is loading the base model with one domain expert adapter:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-math-7b-base",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-math-7b-base")

# Load the algebra expert (adapters live in subfolders of the repo)
model = PeftModel.from_pretrained(
    base_model,
    "ai4he/mathscy-moe-7b",
    subfolder="expert_algebra/final",
)
model.eval()

# Generate a proof
prompt = """Prove that every finitely generated module over a PID is a direct sum of cyclic modules.
Proof:"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Using Different Domain Experts
Each expert is specialized for its mathematical domain. Choose the expert that best matches your problem:
# Number theory expert: primes, modular arithmetic, p-adic analysis
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b", subfolder="expert_number_theory/final")
prompt = "State and prove Fermat's theorem on sums of two squares."

# Algebraic geometry expert: schemes, sheaves, varieties
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b", subfolder="expert_algebraic_geometry/final")
prompt = "Describe the Hilbert scheme of points on a smooth surface and its key properties."

# Probability expert: measure theory, concentration inequalities
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b", subfolder="expert_probability_statistics/final")
prompt = "Prove a sub-Gaussian tail bound for the sample mean of bounded random variables."

# Geometry & topology expert: manifolds, homology, homotopy
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b", subfolder="expert_geometry_topology/final")
prompt = "Explain the relationship between de Rham cohomology and singular cohomology."

# Analysis expert: real, complex, and functional analysis
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b", subfolder="expert_analysis/final")
prompt = "Prove that every bounded sequence in a Hilbert space has a weakly convergent subsequence."

# Discrete math expert: combinatorics, graph theory
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b", subfolder="expert_discrete_math/final")
prompt = "Prove the Erdős–Ko–Rado theorem for intersecting families of k-element subsets."

# Shared expert: cross-domain problems
model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b", subfolder="expert_shared/final")
prompt = "Describe connections between algebraic topology and combinatorics."
Advanced: Loading the Router for Automatic Expert Selection
The sigmoid router can automatically select the best experts for a given input:
import torch
from huggingface_hub import hf_hub_download

# Download the router checkpoint from the Hub, then load it locally
router_path = hf_hub_download("ai4he/mathscy-moe-7b", "moe_assembled/router.pt")
router_state = torch.load(router_path, map_location="cpu")
# The router maps hidden states to expert scores
# router_state contains 'weight' and 'bias' for a linear layer
weight = router_state["weight"] # Shape: [7, hidden_dim]
bias = router_state["bias"] # Shape: [7]
# Domain order matches the expert registry
domains = [
"algebraic_geometry", "discrete_math", "number_theory",
"analysis", "algebra", "geometry_topology", "probability_statistics"
]
def route_input(hidden_states):
    """Select top-2 experts given mean-pooled hidden states."""
    # Cast to float32 so bfloat16 activations and the router weights match
    scores = torch.sigmoid(hidden_states.float() @ weight.float().T + bias.float())
    top2 = torch.topk(scores, k=2)
    return [(domains[i], s.item()) for i, s in zip(top2.indices[0].tolist(), top2.values[0])]
# Example: get hidden states from the base model and route
inputs = tokenizer("Prove that the ideal class group of a number field is finite.", return_tensors="pt")
with torch.no_grad():
    outputs = base_model(**inputs.to(base_model.device), output_hidden_states=True)
# Mean-pool the last hidden state
hidden = outputs.hidden_states[-1].mean(dim=1)
selected_experts = route_input(hidden.cpu())
print(f"Selected experts: {selected_experts}")
# Example output: [('number_theory', 0.92), ('algebra', 0.78)]
Advanced: Full MoE Inference with Expert Merging
For the complete MoE experience, load multiple experts and merge their outputs:
from peft import PeftModel
import torch
# Load base model (as above)
# ...
from huggingface_hub import hf_hub_download

def moe_generate(prompt, top_k=2):
    """Generate using the full MoE pipeline with router-selected experts."""
    # Step 1: Get hidden states for routing
    inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)
    with torch.no_grad():
        base_outputs = base_model(**inputs, output_hidden_states=True)
    hidden = base_outputs.hidden_states[-1].mean(dim=1)

    # Step 2: Route to top-k experts (fetch the router checkpoint from the Hub)
    router_path = hf_hub_download("ai4he/mathscy-moe-7b", "moe_assembled/router.pt")
    router_state = torch.load(router_path, map_location="cpu")
    scores = torch.sigmoid(
        hidden.float().cpu() @ router_state["weight"].float().T + router_state["bias"].float()
    )
    topk = torch.topk(scores, k=top_k)
    domains = ["algebraic_geometry", "discrete_math", "number_theory",
               "analysis", "algebra", "geometry_topology", "probability_statistics"]

    # Step 3: Generate from each selected expert + shared
    expert_subfolders = [f"expert_{domains[i]}/final" for i in topk.indices[0].tolist()]
    expert_subfolders.append("expert_shared/final")  # Always include shared
    results = []
    for subfolder in expert_subfolders:
        expert_model = PeftModel.from_pretrained(base_model, "ai4he/mathscy-moe-7b", subfolder=subfolder)
        expert_model.eval()
        with torch.no_grad():
            outputs = expert_model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
        results.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
        del expert_model  # Free memory
    return results

# Usage
outputs = moe_generate("Prove that every maximal ideal in a polynomial ring over a field is generated by irreducible polynomials.")
for i, output in enumerate(outputs):
    print(f"--- Expert {i+1} ---")
    print(output[:500])
    print()
Conjecture Generation Results
The Self-Play Theorem Prover (STP) loop generated 153 conjectures across all 7 domains:
| Metric | Value |
|---|---|
| Total conjectures | 153 |
| Proved | 9 |
| Disproved | 1 |
| Unknown/open | 5 |
| Partially proved | 41 |
| Average quality score | 0.693 |
| STP average quality | 0.735 |
| Top quality score | 0.980 |
| Flagged publishable | 6 |
Per-Domain STP Results
| Domain | Total | Proved | Avg Quality |
|---|---|---|---|
| Algebra | 8 | 5 | 0.850 |
| Number Theory | 8 | 1 | 0.755 |
| Geometry & Topology | 8 | 0 | 0.745 |
| Algebraic Geometry | 8 | 0 | 0.739 |
| Discrete Math | 8 | 0 | 0.713 |
| Probability & Statistics | 8 | 3 | 0.696 |
| Analysis | 8 | 0 | 0.645 |
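The per-domain averages above are consistent with the overall STP average quality of 0.735 reported earlier (each domain contributes 8 conjectures, so an unweighted mean suffices):

```python
avg_quality = {
    "Algebra": 0.850, "Number Theory": 0.755, "Geometry & Topology": 0.745,
    "Algebraic Geometry": 0.739, "Discrete Math": 0.713,
    "Probability & Statistics": 0.696, "Analysis": 0.645,
}
overall = sum(avg_quality.values()) / len(avg_quality)
print(f"{overall:.3f}")  # 0.735
```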
Highlighted Proved Conjectures
Weakly Artinian iff Finite Length at Associated Primes (Quality: 0.88, Algebra)
Let R be a commutative Noetherian ring and M a finitely generated R-module. Then M is weakly Artinian iff for every prime p in Ass(M), the localization M_p has finite length.
Cohen-Macaulay from Depth Condition (Quality: 0.88, Algebra)
If Ass(M) is finite and depth(M_p) = dim(R_p) for all associated primes p, then M is Cohen-Macaulay.
Conditional Independence Implies Marginal Independence (Quality: 0.78, Probability)
Under strict positivity, conditional independence of X and Z given Y implies marginal independence of X and Z.
Full proved conjectures with proofs and Lean 4 formalizations: proved_conjectures.md
Router Performance
| Domain | Top-1 Accuracy | Top-2 Accuracy |
|---|---|---|
| Algebraic Geometry | 76.5% | 90.0% |
| Discrete Math | 82.0% | 90.0% |
| Number Theory | 83.0% | 91.0% |
| Analysis | 77.5% | 90.0% |
| Algebra | 72.5% | 90.0% |
| Geometry & Topology | 66.5% | 80.0% |
| Probability & Statistics | 76.0% | 91.5% |
| Overall | 76.3% | 88.9% |
Load balance ratio: 1.26 (near-uniform, target: 1.0)
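The overall figures follow from averaging the per-domain rows with equal weight, which matches the reported totals:

```python
top1 = [76.5, 82.0, 83.0, 77.5, 72.5, 66.5, 76.0]
top2 = [90.0, 90.0, 91.0, 90.0, 90.0, 80.0, 91.5]
print(f"top-1: {sum(top1) / len(top1):.1f}%")  # top-1: 76.3%
print(f"top-2: {sum(top2) / len(top2):.1f}%")  # top-2: 88.9%
```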
File Structure
ai4he/mathscy-moe-7b/
├── expert_algebraic_geometry/final/   # LoRA adapter + tokenizer
│   ├── adapter_config.json
│   ├── adapter_model.safetensors      # ~572 MB
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── expert_discrete_math/final/
├── expert_number_theory/final/
├── expert_analysis/final/
├── expert_algebra/final/
├── expert_geometry_topology/final/
├── expert_probability_statistics/final/
├── expert_shared/final/               # Always-active shared expert
├── moe_assembled/
│   └── router.pt                      # Sigmoid router weights (117 KB)
├── expert_registry.json               # Expert-to-domain mapping
└── README.md                          # This file
Training Details
- Method: Branch-Train-Mix (BTX) with QLoRA
- Hardware: NVIDIA A100 40GB PCIe (single GPU, sequential training)
- Total Training Time: ~18.7 hours
- Optimizer: 8-bit AdamW, lr=1e-4, cosine schedule with 3% warmup
- Batch Size: 4 per GPU, 8 gradient accumulation steps (effective batch 32)
- Max Sequence Length: 4,096 tokens
- Epochs: 3 per expert
- Dataset: 241,586 training examples extracted from 6.5M mathematical statements (ArXiv)
- Data sources: LLM-structured knowledge extraction (Gemini 2.0 Flash) + LaTeX theorem mining (117K theorems from 3.2K papers)
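The per-expert step counts in the Domain Experts table follow directly from these hyperparameters: with an effective batch of 32 and 3 epochs, steps = 3 · ceil(examples / 32):

```python
import math

effective_batch = 4 * 8  # per-GPU batch size * gradient accumulation steps
epochs = 3

def total_steps(n_examples):
    """Training steps for one expert: epochs * batches per epoch."""
    return epochs * math.ceil(n_examples / effective_batch)

print(total_steps(26_501))  # 2487 (Analysis)
print(total_steps(49_782))  # 4668 (Geometry & Topology)
```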
Citation
@article{mathscy2026,
title={MathScy: Domain-Specialized Mixture-of-Experts for Mathematical Conjecture Generation and Theorem Proving},
year={2026},
url={https://github.com/ai4he/moexp}
}
License
Apache 2.0