v4.2: update model card with balanced eval scores (75% overall, std=8.4)

README.md CHANGED
@@ -186,29 +186,29 @@ When a tool call returns an error (e.g., `"Finding not found: nonexistent-findin

| Category | Score | N | Description |
|----------|-------|---|-------------|
-| synthesis | 84% | 5 | Synthesizing research directions from data |
| mcp_decision | 80% | 2 | When to call tools vs. answer from knowledge |
| identity | 77% | 5 | Self-identification and platform knowledge |
-| factual_recall | 74% | 10 | Exact computational findings from bigcompute.science |
-| chain_of_thought | 73% | 3 | Multi-step mathematical reasoning |
| conjecture_depth | 73% | 6 | Deep reasoning about unsolved problems |
| cuda_code_generation | 71% | 8 | Writing correct CUDA kernels (nvcc compilation-tested) |
| student_guidance | 60% | 2 | Actionable advice for new contributors |
-| error_recovery | 47% | 3 | Graceful handling of tool failures |
-| **Overall** | **76%** | **103** | **Across all 20 categories** |

-Scores are from automated rubric evaluation. CUDA code generation scores include real `nvcc` compilation testing and anti-pattern detection. The model …

### Standard Benchmarks (Alignment Tax)

@@ -228,27 +228,29 @@ Math capabilities improved or preserved. General reasoning has a 6% tax — an a
|-----------|-------|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | QLoRA (4-bit NF4, double quantization) |
-| LoRA rank | … |
-| LoRA alpha | … |
| LoRA dropout | 0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Epochs | 2 |
-| Learning rate | … |
| Batch size | 2 (× 4 gradient accumulation = effective 8) |
| Max sequence length | 4096 |
| Optimizer | AdamW 8-bit |
| NEFTune noise | alpha = 5 |
-| Training entries | 5,… |
| Hardware | NVIDIA RTX 5090 (32GB) |

### Training Data Composition

-- **Curated domain blocks** (~1,…
- **Synthetic CoT (Qwen2.5-Math-72B)** (~3,100 entries): deep mathematical reasoning generated on NVIDIA H200
- **Synthetic reasoning (Gemma-4-26B)** (~1,200 entries): creative synthesis and experiment design
- **External (Hermes FC dataset)** (300 entries): diverse tool-calling patterns from NousResearch

## The Research Flywheel


| Category | Score | N | Description |
|----------|-------|---|-------------|
+| standard_math | 88% | 8 | BK theorem, Hausdorff dimension, Kronecker, Ramsey |
+| chain_of_thought | 87% | 3 | Multi-step mathematical reasoning |
+| factual_recall | 85% | 10 | Exact computational findings from bigcompute.science |
+| agentic_tool_use | 82% | 12 | Correct tool-call format and JSON |
+| cross_domain | 80% | 2 | Connecting findings across mathematical domains |
+| gpu_architecture | 80% | 3 | NVIDIA architecture knowledge (sm_86–sm_120) |
| mcp_decision | 80% | 2 | When to call tools vs. answer from knowledge |
+| proof_strategy | 80% | 2 | Proof strategies and sketch generation |
+| multi_turn_react | 79% | 3 | Full ReAct loops with tool chaining |
+| paper_comprehension | 77% | 6 | Understanding published papers (BK, Shkredov, etc.) |
| identity | 77% | 5 | Self-identification and platform knowledge |
| conjecture_depth | 73% | 6 | Deep reasoning about unsolved problems |
| cuda_code_generation | 71% | 8 | Writing correct CUDA kernels (nvcc compilation-tested) |
+| theoretical_frontier | 70% | 6 | Frontier knowledge of open conjectures |
+| synthesis | 68% | 5 | Synthesizing research directions from data |
+| results_to_kernel | 68% | 6 | Interpreting findings and designing CUDA experiments |
+| experiment_suggestion | 64% | 5 | Proposing novel GPU experiments |
+| novel_synthesis | 63% | 6 | Synthesizing novel research directions from data |
+| error_recovery | 60% | 3 | Graceful handling of tool failures |
| student_guidance | 60% | 2 | Actionable advice for new contributors |
+| **Overall** | **75%** | **103** | **Across all 20 categories** |

+Scores are from automated rubric evaluation with balanced category coverage (std = 8.4 across categories). CUDA code generation scores include real `nvcc` compilation testing and anti-pattern detection. The model is designed to work within agentic ReAct loops with the bigcompute.science MCP server.
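As a sanity check on the new aggregates, the overall score and the cross-category spread can be recomputed from the table rows alone. The sketch below is illustrative and is not part of the model card's evaluation harness; it transcribes the twenty (score, N) pairs from the table above. With the rounded table values the spread comes out near 8.5, close to the reported 8.4 (which presumably reflects unrounded per-category scores).

```python
from statistics import pstdev

# (score %, N) pairs transcribed from the v4.2 evaluation table.
categories = {
    "standard_math": (88, 8),        "chain_of_thought": (87, 3),
    "factual_recall": (85, 10),      "agentic_tool_use": (82, 12),
    "cross_domain": (80, 2),         "gpu_architecture": (80, 3),
    "mcp_decision": (80, 2),         "proof_strategy": (80, 2),
    "multi_turn_react": (79, 3),     "paper_comprehension": (77, 6),
    "identity": (77, 5),             "conjecture_depth": (73, 6),
    "cuda_code_generation": (71, 8), "theoretical_frontier": (70, 6),
    "synthesis": (68, 5),            "results_to_kernel": (68, 6),
    "experiment_suggestion": (64, 5), "novel_synthesis": (63, 6),
    "error_recovery": (60, 3),       "student_guidance": (60, 2),
}

scores = [s for s, _ in categories.values()]
total_n = sum(n for _, n in categories.values())    # total eval items: 103
weighted = sum(s * n for s, n in categories.values()) / total_n
unweighted = sum(scores) / len(scores)
spread = pstdev(scores)                             # balance across categories

print(total_n, round(weighted), round(unweighted), round(spread, 1))
# prints: 103 75 75 8.5
```

Both the N-weighted and the unweighted category means round to the reported 75% overall.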

### Standard Benchmarks (Alignment Tax)

…

|-----------|-------|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | QLoRA (4-bit NF4, double quantization) |
+| LoRA rank | 64 |
+| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Epochs | 2 |
+| Learning rate | 1.5e-4 (cosine schedule) |
| Batch size | 2 (× 4 gradient accumulation = effective 8) |
| Max sequence length | 4096 |
| Optimizer | AdamW 8-bit |
| NEFTune noise | alpha = 5 |
+| Training entries | 5,799 |
| Hardware | NVIDIA RTX 5090 (32GB) |
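For orientation, here is how the table's hyperparameters map onto the common bitsandbytes + PEFT + TRL stack. This is a sketch of that mapping only, assuming such a stack was used; it is not the project's training script, the output path is hypothetical, and the bf16 compute dtype is an assumption.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# Method row: QLoRA with 4-bit NF4 quantization and double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 compute
)

# LoRA rows: rank 64, alpha 128, dropout 0.05, on all attention/MLP projections.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Trainer rows: 2 epochs, 1.5e-4 cosine LR, batch 2 x 4 accumulation (effective 8),
# 4096-token sequences, 8-bit AdamW, NEFTune noise alpha 5.
training_args = SFTConfig(
    output_dir="convergent-7b-qlora",  # hypothetical path
    num_train_epochs=2,
    learning_rate=1.5e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_seq_length=4096,               # renamed to max_length in newer TRL releases
    optim="adamw_bnb_8bit",
    neftune_noise_alpha=5,
    bf16=True,
)
```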

### Training Data Composition

+- **Curated domain blocks** (~1,150 entries): 40+ modular blocks covering identity, tool calling (23 real MCP tools), nvcc-validated CUDA kernels, number theory, error recovery, paper comprehension, student guidance
- **Synthetic CoT (Qwen2.5-Math-72B)** (~3,100 entries): deep mathematical reasoning generated on NVIDIA H200
- **Synthetic reasoning (Gemma-4-26B)** (~1,200 entries): creative synthesis and experiment design
- **External (Hermes FC dataset)** (300 entries): diverse tool-calling patterns from NousResearch

+Data has been cleaned of off-topic entries and deduplicated (97 entries removed from raw merge).
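A deduplication pass like the one described can be as simple as hashing a canonical serialization of each entry. A minimal generic sketch, not the repo's actual cleaning script (`dedup_entries` is a name invented here):

```python
import hashlib
import json

def dedup_entries(entries):
    """Drop exact duplicates, keyed on a hash of each entry's canonical JSON."""
    seen, kept = set(), []
    for entry in entries:
        # sort_keys gives a canonical form, so key order cannot hide duplicates.
        key = hashlib.sha256(
            json.dumps(entry, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept

data = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
print(len(dedup_entries(data)))  # -> 2
```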

+See the [dataset card](https://huggingface.co/datasets/cahlen/Convergent-7B-data) for full composition details and [DATA_SOURCES.md](https://github.com/cahlen/convergent/blob/main/DATA_SOURCES.md) for source documentation.

## The Research Flywheel