v4.2: update model card with balanced eval scores (75% overall, std=8.4)

README.md CHANGED
@@ -186,29 +186,29 @@ When a tool call returns an error (e.g., `"Finding not found: nonexistent-findin

| Category | Score | N | Description |
|----------|-------|---|-------------|
-| synthesis | 84% | 5 | Synthesizing research directions from data |
| mcp_decision | 80% | 2 | When to call tools vs. answer from knowledge |
| identity | 77% | 5 | Self-identification and platform knowledge |
-| factual_recall | 74% | 10 | Exact computational findings from bigcompute.science |
-| chain_of_thought | 73% | 3 | Multi-step mathematical reasoning |
| conjecture_depth | 73% | 6 | Deep reasoning about unsolved problems |
| cuda_code_generation | 71% | 8 | Writing correct CUDA kernels (nvcc compilation-tested) |
| student_guidance | 60% | 2 | Actionable advice for new contributors |
-| error_recovery | 47% | 3 | Graceful handling of tool failures |
-| **Overall** | **76%** | **103** | **Across all 20 categories** |

-Scores are from automated rubric evaluation. CUDA code generation scores include real `nvcc` compilation testing and anti-pattern detection. The model …

### Standard Benchmarks (Alignment Tax)

@@ -228,27 +228,29 @@ Math capabilities improved or preserved. General reasoning has a 6% tax — an a
|-----------|-------|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | QLoRA (4-bit NF4, double quantization) |
-| LoRA rank | … |
-| LoRA alpha | … |
| LoRA dropout | 0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Epochs | 2 |
-| Learning rate | … |
| Batch size | 2 (× 4 gradient accumulation = effective 8) |
| Max sequence length | 4096 |
| Optimizer | AdamW 8-bit |
| NEFTune noise | alpha = 5 |
-| Training entries | 5,… |
| Hardware | NVIDIA RTX 5090 (32GB) |

### Training Data Composition

-- **Curated domain blocks** (~1,…
- **Synthetic CoT (Qwen2.5-Math-72B)** (~3,100 entries): deep mathematical reasoning generated on NVIDIA H200
- **Synthetic reasoning (Gemma-4-26B)** (~1,200 entries): creative synthesis and experiment design
- **External (Hermes FC dataset)** (300 entries): diverse tool-calling patterns from NousResearch

## The Research Flywheel


| Category | Score | N | Description |
|----------|-------|---|-------------|
+| standard_math | 88% | 8 | BK theorem, Hausdorff dimension, Kronecker, Ramsey |
+| chain_of_thought | 87% | 3 | Multi-step mathematical reasoning |
+| factual_recall | 85% | 10 | Exact computational findings from bigcompute.science |
+| agentic_tool_use | 82% | 12 | Correct tool-call format and JSON |
+| cross_domain | 80% | 2 | Connecting findings across mathematical domains |
+| gpu_architecture | 80% | 3 | NVIDIA architecture knowledge (sm_86–sm_120) |
| mcp_decision | 80% | 2 | When to call tools vs. answer from knowledge |
+| proof_strategy | 80% | 2 | Proof strategies and sketch generation |
+| multi_turn_react | 79% | 3 | Full ReAct loops with tool chaining |
+| paper_comprehension | 77% | 6 | Understanding published papers (BK, Shkredov, etc.) |
| identity | 77% | 5 | Self-identification and platform knowledge |
| conjecture_depth | 73% | 6 | Deep reasoning about unsolved problems |
| cuda_code_generation | 71% | 8 | Writing correct CUDA kernels (nvcc compilation-tested) |
+| theoretical_frontier | 70% | 6 | Frontier knowledge of open conjectures |
+| synthesis | 68% | 5 | Synthesizing research directions from data |
+| results_to_kernel | 68% | 6 | Interpreting findings and designing CUDA experiments |
+| experiment_suggestion | 64% | 5 | Proposing novel GPU experiments |
+| novel_synthesis | 63% | 6 | Synthesizing novel research directions from data |
+| error_recovery | 60% | 3 | Graceful handling of tool failures |
| student_guidance | 60% | 2 | Actionable advice for new contributors |
+| **Overall** | **75%** | **103** | **Across all 20 categories** |

+Scores are from automated rubric evaluation with balanced category coverage (std = 8.4 across categories). CUDA code generation scores include real `nvcc` compilation testing and anti-pattern detection. The model is designed to work within agentic ReAct loops with the bigcompute.science MCP server.
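As a sanity check on the new aggregates, the overall score and the cross-category spread can be recomputed from the table rows alone. The sketch below is illustrative and is not part of the model card's evaluation harness; it transcribes the twenty (score, N) pairs from the table above. With the rounded table values the spread comes out near 8.5, close to the reported 8.4 (which presumably reflects unrounded per-category scores).

```python
from statistics import pstdev

# (score %, N) pairs transcribed from the v4.2 evaluation table.
categories = {
    "standard_math": (88, 8),        "chain_of_thought": (87, 3),
    "factual_recall": (85, 10),      "agentic_tool_use": (82, 12),
    "cross_domain": (80, 2),         "gpu_architecture": (80, 3),
    "mcp_decision": (80, 2),         "proof_strategy": (80, 2),
    "multi_turn_react": (79, 3),     "paper_comprehension": (77, 6),
    "identity": (77, 5),             "conjecture_depth": (73, 6),
    "cuda_code_generation": (71, 8), "theoretical_frontier": (70, 6),
    "synthesis": (68, 5),            "results_to_kernel": (68, 6),
    "experiment_suggestion": (64, 5), "novel_synthesis": (63, 6),
    "error_recovery": (60, 3),       "student_guidance": (60, 2),
}

scores = [s for s, _ in categories.values()]
total_n = sum(n for _, n in categories.values())    # total eval items: 103
weighted = sum(s * n for s, n in categories.values()) / total_n
unweighted = sum(scores) / len(scores)
spread = pstdev(scores)                             # balance across categories

print(total_n, round(weighted), round(unweighted), round(spread, 1))
# prints: 103 75 75 8.5
```

Both the N-weighted and the unweighted category means round to the reported 75% overall.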

### Standard Benchmarks (Alignment Tax)

…

|-----------|-------|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | QLoRA (4-bit NF4, double quantization) |
+| LoRA rank | 64 |
+| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Epochs | 2 |
+| Learning rate | 1.5e-4 (cosine schedule) |
| Batch size | 2 (× 4 gradient accumulation = effective 8) |
| Max sequence length | 4096 |
| Optimizer | AdamW 8-bit |
| NEFTune noise | alpha = 5 |
+| Training entries | 5,799 |
| Hardware | NVIDIA RTX 5090 (32GB) |
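For orientation, here is how the table's hyperparameters map onto the common bitsandbytes + PEFT + TRL stack. This is a sketch of that mapping only, assuming such a stack was used; it is not the project's training script, the output path is hypothetical, and the bf16 compute dtype is an assumption.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# Method row: QLoRA with 4-bit NF4 quantization and double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 compute
)

# LoRA rows: rank 64, alpha 128, dropout 0.05, on all attention/MLP projections.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Trainer rows: 2 epochs, 1.5e-4 cosine LR, batch 2 x 4 accumulation (effective 8),
# 4096-token sequences, 8-bit AdamW, NEFTune noise alpha 5.
training_args = SFTConfig(
    output_dir="convergent-7b-qlora",  # hypothetical path
    num_train_epochs=2,
    learning_rate=1.5e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_seq_length=4096,               # renamed to max_length in newer TRL releases
    optim="adamw_bnb_8bit",
    neftune_noise_alpha=5,
    bf16=True,
)
```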

### Training Data Composition

+- **Curated domain blocks** (~1,150 entries): 40+ modular blocks covering identity, tool calling (23 real MCP tools), nvcc-validated CUDA kernels, number theory, error recovery, paper comprehension, student guidance
- **Synthetic CoT (Qwen2.5-Math-72B)** (~3,100 entries): deep mathematical reasoning generated on NVIDIA H200
- **Synthetic reasoning (Gemma-4-26B)** (~1,200 entries): creative synthesis and experiment design
- **External (Hermes FC dataset)** (300 entries): diverse tool-calling patterns from NousResearch

+Data has been cleaned of off-topic entries and deduplicated (97 entries removed from raw merge).
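A deduplication pass like the one described can be as simple as hashing a canonical serialization of each entry. A minimal generic sketch, not the repo's actual cleaning script (`dedup_entries` is a name invented here):

```python
import hashlib
import json

def dedup_entries(entries):
    """Drop exact duplicates, keyed on a hash of each entry's canonical JSON."""
    seen, kept = set(), []
    for entry in entries:
        # sort_keys gives a canonical form, so key order cannot hide duplicates.
        key = hashlib.sha256(
            json.dumps(entry, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept

data = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
print(len(dedup_entries(data)))  # -> 2
```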

+See the [dataset card](https://huggingface.co/datasets/cahlen/Convergent-7B-data) for full composition details and [DATA_SOURCES.md](https://github.com/cahlen/convergent/blob/main/DATA_SOURCES.md) for source documentation.

## The Research Flywheel