cahlen committed
Commit 7f6d08d · verified · Parent: b8a2b74

v4.2: update model card with balanced eval scores (75% overall, std=8.4)

Files changed (1):
  1. README.md +25 -23

README.md CHANGED
@@ -186,29 +186,29 @@ When a tool call returns an error (e.g., `"Finding not found: nonexistent-findin

 | Category | Score | N | Description |
 |----------|-------|---|-------------|
- | multi_turn_react | 91% | 3 | Full ReAct loops with tool chaining |
- | cross_domain | 90% | 2 | Connecting findings across mathematical domains |
- | agentic_tool_use | 88% | 12 | Correct tool-call format and JSON |
- | novel_synthesis | 87% | 6 | Synthesizing novel research directions from data |
- | theoretical_frontier | 87% | 6 | Frontier knowledge of open conjectures |
- | experiment_suggestion | 84% | 5 | Proposing novel GPU experiments |
- | synthesis | 84% | 5 | Synthesizing research directions from data |
 | mcp_decision | 80% | 2 | When to call tools vs. answer from knowledge |
- | standard_math | 80% | 8 | BK theorem, Hausdorff dimension, Kronecker, Ramsey |
 | identity | 77% | 5 | Self-identification and platform knowledge |
- | factual_recall | 74% | 10 | Exact computational findings from bigcompute.science |
- | chain_of_thought | 73% | 3 | Multi-step mathematical reasoning |
 | conjecture_depth | 73% | 6 | Deep reasoning about unsolved problems |
 | cuda_code_generation | 71% | 8 | Writing correct CUDA kernels (nvcc compilation-tested) |
- | paper_comprehension | 70% | 6 | Understanding published papers (BK, Shkredov, etc.) |
- | proof_strategy | 70% | 2 | Proof strategies and sketch generation |
- | gpu_architecture | 67% | 3 | NVIDIA architecture knowledge (sm_86–sm_120) |
 | student_guidance | 60% | 2 | Actionable advice for new contributors |
- | results_to_kernel | 50% | 6 | Interpreting findings and designing CUDA experiments |
- | error_recovery | 47% | 3 | Graceful handling of tool failures |
- | **Overall** | **76%** | **103** | **Across all 20 categories** |

- Scores are from automated rubric evaluation. CUDA code generation scores include real `nvcc` compilation testing and anti-pattern detection. The model performs well on structured tasks (tool calling, math reasoning, synthesis) and is designed to work within agentic ReAct loops with the bigcompute.science MCP server.

 ### Standard Benchmarks (Alignment Tax)

@@ -228,27 +228,29 @@ Math capabilities improved or preserved. General reasoning has a 6% tax — an a
 |-----------|-------|
 | Base model | Qwen/Qwen2.5-7B-Instruct |
 | Method | QLoRA (4-bit NF4, double quantization) |
- | LoRA rank | 128 |
- | LoRA alpha | 256 |
 | LoRA dropout | 0.05 |
 | Target modules | q, k, v, o, gate, up, down projections |
 | Epochs | 2 |
- | Learning rate | 2e-4 (cosine schedule) |
 | Batch size | 2 (× 4 gradient accumulation = effective 8) |
 | Max sequence length | 4096 |
 | Optimizer | AdamW 8-bit |
 | NEFTune noise | alpha = 5 |
- | Training entries | 5,783 |
 | Hardware | NVIDIA RTX 5090 (32GB) |

 ### Training Data Composition

- - **Curated domain blocks** (~1,100 entries): 40 modular blocks covering identity, tool calling (23 real MCP tools), CUDA kernels, number theory, error recovery, student guidance
 - **Synthetic CoT (Qwen2.5-Math-72B)** (~3,100 entries): deep mathematical reasoning generated on NVIDIA H200
 - **Synthetic reasoning (Gemma-4-26B)** (~1,200 entries): creative synthesis and experiment design
 - **External (Hermes FC dataset)** (300 entries): diverse tool-calling patterns from NousResearch

- Full data source documentation: [DATA_SOURCES.md](https://github.com/cahlen/convergent/blob/main/DATA_SOURCES.md)

 ## The Research Flywheel

@@ -186,29 +186,29 @@ When a tool call returns an error (e.g., `"Finding not found: nonexistent-findin

 | Category | Score | N | Description |
 |----------|-------|---|-------------|
+ | standard_math | 88% | 8 | BK theorem, Hausdorff dimension, Kronecker, Ramsey |
+ | chain_of_thought | 87% | 3 | Multi-step mathematical reasoning |
+ | factual_recall | 85% | 10 | Exact computational findings from bigcompute.science |
+ | agentic_tool_use | 82% | 12 | Correct tool-call format and JSON |
+ | cross_domain | 80% | 2 | Connecting findings across mathematical domains |
+ | gpu_architecture | 80% | 3 | NVIDIA architecture knowledge (sm_86–sm_120) |
 | mcp_decision | 80% | 2 | When to call tools vs. answer from knowledge |
+ | proof_strategy | 80% | 2 | Proof strategies and sketch generation |
+ | multi_turn_react | 79% | 3 | Full ReAct loops with tool chaining |
+ | paper_comprehension | 77% | 6 | Understanding published papers (BK, Shkredov, etc.) |
 | identity | 77% | 5 | Self-identification and platform knowledge |
 | conjecture_depth | 73% | 6 | Deep reasoning about unsolved problems |
 | cuda_code_generation | 71% | 8 | Writing correct CUDA kernels (nvcc compilation-tested) |
+ | theoretical_frontier | 70% | 6 | Frontier knowledge of open conjectures |
+ | synthesis | 68% | 5 | Synthesizing research directions from data |
+ | results_to_kernel | 68% | 6 | Interpreting findings and designing CUDA experiments |
+ | experiment_suggestion | 64% | 5 | Proposing novel GPU experiments |
+ | novel_synthesis | 63% | 6 | Synthesizing novel research directions from data |
+ | error_recovery | 60% | 3 | Graceful handling of tool failures |
 | student_guidance | 60% | 2 | Actionable advice for new contributors |
+ | **Overall** | **75%** | **103** | **Across all 20 categories** |

+ Scores are from automated rubric evaluation with balanced category coverage (std = 8.4 across categories). CUDA code generation scores include real `nvcc` compilation testing and anti-pattern detection. The model is designed to work within agentic ReAct loops with the bigcompute.science MCP server.
 
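A quick way to sanity-check the headline numbers is to recompute them from the new table above. This is an illustrative sketch, not the project's evaluation code: the item-weighted overall lands on the reported 75%, and the per-category spread comes out close to the reported 8.4 (the exact std definition used upstream is an assumption).

```python
# Recompute the headline eval numbers from the category table above.
# (Sketch only; the upstream aggregation method is an assumption.)
from statistics import pstdev

# (score %, number of eval items N) per category, copied from the new table
categories = {
    "standard_math": (88, 8), "chain_of_thought": (87, 3),
    "factual_recall": (85, 10), "agentic_tool_use": (82, 12),
    "cross_domain": (80, 2), "gpu_architecture": (80, 3),
    "mcp_decision": (80, 2), "proof_strategy": (80, 2),
    "multi_turn_react": (79, 3), "paper_comprehension": (77, 6),
    "identity": (77, 5), "conjecture_depth": (73, 6),
    "cuda_code_generation": (71, 8), "theoretical_frontier": (70, 6),
    "synthesis": (68, 5), "results_to_kernel": (68, 6),
    "experiment_suggestion": (64, 5), "novel_synthesis": (63, 6),
    "error_recovery": (60, 3), "student_guidance": (60, 2),
}

n_total = sum(n for _, n in categories.values())           # total eval items
overall = sum(s * n for s, n in categories.values()) / n_total
spread = pstdev(s for s, _ in categories.values())         # std across categories

print(f"N={n_total}, overall≈{overall:.0f}%, category std≈{spread:.1f}")
```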
 ### Standard Benchmarks (Alignment Tax)

@@ -228,27 +228,29 @@ Math capabilities improved or preserved. General reasoning has a 6% tax — an a
 |-----------|-------|
 | Base model | Qwen/Qwen2.5-7B-Instruct |
 | Method | QLoRA (4-bit NF4, double quantization) |
+ | LoRA rank | 64 |
+ | LoRA alpha | 128 |
 | LoRA dropout | 0.05 |
 | Target modules | q, k, v, o, gate, up, down projections |
 | Epochs | 2 |
+ | Learning rate | 1.5e-4 (cosine schedule) |
 | Batch size | 2 (× 4 gradient accumulation = effective 8) |
 | Max sequence length | 4096 |
 | Optimizer | AdamW 8-bit |
 | NEFTune noise | alpha = 5 |
+ | Training entries | 5,799 |
 | Hardware | NVIDIA RTX 5090 (32GB) |
 
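For readers wanting to reproduce a comparable run, the hyperparameters above map naturally onto common `peft`/TRL configuration fields. The card does not publish the actual training script, so this is an illustrative sketch: values are copied from the table, field names follow common conventions, and the 8-bit optimizer name is an assumption.

```python
# Sketch: the training hyperparameters from the table above as a plain
# config dict. Field names follow common peft/TRL conventions; the actual
# training script is not shown in this card, so this is illustrative only.
qlora_config = {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",           # 4-bit NF4
    "bnb_4bit_use_double_quant": True,      # double quantization
    "lora_r": 64,
    "lora_alpha": 128,                      # alpha/r = 2.0 effective scaling
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "num_train_epochs": 2,
    "learning_rate": 1.5e-4,
    "lr_scheduler_type": "cosine",
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "max_seq_length": 4096,
    "optim": "adamw_bnb_8bit",              # "AdamW 8-bit"; exact name assumed
    "neftune_noise_alpha": 5,
}

# Effective batch size = per-device batch × gradient accumulation steps.
effective_batch = (qlora_config["per_device_train_batch_size"]
                   * qlora_config["gradient_accumulation_steps"])
print(effective_batch)  # 2 × 4 = 8, matching the table
```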
 ### Training Data Composition

+ - **Curated domain blocks** (~1,150 entries): 40+ modular blocks covering identity, tool calling (23 real MCP tools), nvcc-validated CUDA kernels, number theory, error recovery, paper comprehension, student guidance
 - **Synthetic CoT (Qwen2.5-Math-72B)** (~3,100 entries): deep mathematical reasoning generated on NVIDIA H200
 - **Synthetic reasoning (Gemma-4-26B)** (~1,200 entries): creative synthesis and experiment design
 - **External (Hermes FC dataset)** (300 entries): diverse tool-calling patterns from NousResearch
 
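The "nvcc-validated" claim above can be approximated with a small harness: write a candidate kernel to a temp file and attempt compilation to an object file (no GPU required, only the CUDA toolkit). This is a sketch, not the project's real validation harness; the flags and sample kernel are illustrative.

```python
# Sketch of an nvcc compile-check for generated CUDA kernels.
# (Illustrative stand-in; not the project's actual validation harness.)
import os
import shutil
import subprocess
import tempfile
from typing import Optional

def compiles_with_nvcc(cuda_source: str) -> Optional[bool]:
    """Return True/False for compile success, or None if nvcc is unavailable."""
    if shutil.which("nvcc") is None:
        return None  # no CUDA toolkit on this machine
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "kernel.cu")
        obj = os.path.join(tmp, "kernel.o")
        with open(src, "w") as f:
            f.write(cuda_source)
        # Compile to an object file only: tests syntax/semantics, needs no GPU.
        result = subprocess.run(["nvcc", "-c", src, "-o", obj],
                                capture_output=True, text=True)
        return result.returncode == 0

# A minimal well-formed kernel to exercise the check.
saxpy = """
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
"""
status = compiles_with_nvcc(saxpy)
```

A real harness would add anti-pattern detection (as the eval description above mentions) on top of the compile gate.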
+ Data has been cleaned of off-topic entries and deduplicated (97 entries removed from the raw merge).
+
+ See the [dataset card](https://huggingface.co/datasets/cahlen/Convergent-7B-data) for full composition details and [DATA_SOURCES.md](https://github.com/cahlen/convergent/blob/main/DATA_SOURCES.md) for source documentation.

 ## The Research Flywheel