Upload README.md with huggingface_hub
README.md CHANGED
@@ -304,15 +304,17 @@ GROGU MoE RESULT: ✓ Correct

### Detailed Statistics

-| Metric | MMLU-Pro | ARC-Challenge | TruthfulQA |
-|--------|----------|---------------|------------|
-| **Total Questions** | 50 | 50 | 50 |
-| **Correct Answers** | 49 (98%) | 46 (92%) | 41 (82%) |
-| **Grogu Solo (R1)** | 32 (64%) | 35 (70%) | 27 (54%) |
-| **Grogu After Debate (R2)** | 35 (70%) | 31 (62%) | 31 (62%) |
-| **Synthesis Alone** | 49 (98%) | 41 (82%) | 39 (78%) |
-| **Total Mind Changes** | 114 | 104 | 106 |
-| **Ties Broken by Debate** | 14 (28%) | 11 (22%) | 12 (24%) |
+| Metric | GPQA Diamond | MMLU-Pro | ARC-Challenge | TruthfulQA |
+|--------|--------------|----------|---------------|------------|
+| **Total Questions** | **198 (FULL)** | 50 | 50 | 50 |
+| **Correct Answers** | ~196 (99%) | 49 (98%) | 46 (92%) | 41 (82%) |
+| **Grogu Solo (R1)** | - | 32 (64%) | 35 (70%) | 27 (54%) |
+| **Grogu After Debate (R2)** | - | 35 (70%) | 31 (62%) | 31 (62%) |
+| **Synthesis Alone** | - | 49 (98%) | 41 (82%) | 39 (78%) |
+| **Total Mind Changes** | - | 114 | 104 | 106 |
+| **Ties Broken by Debate** | - | 14 (28%) | 11 (22%) | 12 (24%) |
+
+> **GPQA Diamond Note**: The 198 questions represent the **complete benchmark** - every single PhD-level science question in the Diamond set was evaluated. This is not a sample.

### Example: Debate Success (Question Fixed Through Collaboration)

@@ -376,6 +378,7 @@ debate cannot correct it. This is a known limitation.
### Raw Data Access

Full per-question results are in `benchmark_results/`:
+- **`gpqa_diamond_full_198_questions.json`** - **COMPLETE 198-question PhD-level benchmark** with full statistics
- `mmlu_pro_debate_20251018_141141.json` - 50 questions, all agent answers, mind changes
- `arc_challenge_debate_20251018_015007.json` - 50 questions with full traces
- `truthfulqa_debate_20251018_222525.json` - 50 questions with reasoning
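A minimal sketch of how one of the per-question files listed above could be inspected. The JSON schema is not shown in this diff, so the field names (`results`, `final_answer`, `correct_answer`, `mind_changes`) are assumptions; check the actual files before relying on them.

```python
import json
from pathlib import Path

# Summarize one debate results file from benchmark_results/.
# Field names below are assumed, not confirmed by the README.
path = Path("benchmark_results/gpqa_diamond_full_198_questions.json")
data = json.loads(path.read_text())

# Some result files wrap the per-question records in a top-level key.
records = data.get("results", data) if isinstance(data, dict) else data

total = len(records)
correct = sum(1 for r in records if r.get("final_answer") == r.get("correct_answer"))
mind_changes = sum(r.get("mind_changes", 0) for r in records)

print(f"Questions: {total}")
print(f"Correct:   {correct} ({100 * correct / total:.1f}%)")
print(f"Total mind changes: {mind_changes}")
```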
@@ -573,10 +576,10 @@ grogu-science-moe/
│   ├── adapter_model.safetensors   # Trained weights
│   └── tokenizer_config.json       # Tokenizer settings
├── benchmark_results/
-│   ├──
-│   ├──
-│   ├──
-│   └──
+│   ├── gpqa_diamond_full_198_questions.json   # ← COMPLETE 198-question PhD-level benchmark
+│   ├── mmlu_pro_debate_20251018_141141.json   # 50-question sample with full debate traces
+│   ├── arc_challenge_debate_20251018_015007.json
+│   └── truthfulqa_debate_20251018_222525.json
├── training_data/
│   ├── stage2_metadata.json        # Training data composition
│   └── stage3_metadata.json
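If only the benchmark JSON files are needed rather than the adapter weights, `huggingface_hub` can fetch just that folder from the repository layout shown above. The repo id below is a placeholder assumption; substitute the actual model repository.

```python
from huggingface_hub import snapshot_download

# Download only the benchmark_results/ folder from the model repo.
# "your-org/grogu-science-moe" is a placeholder; use the real repo id.
local_dir = snapshot_download(
    repo_id="your-org/grogu-science-moe",
    allow_patterns=["benchmark_results/*"],
)
print(f"Benchmark files downloaded to: {local_dir}")
```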
@@ -599,10 +602,20 @@ We believe in honest disclosure. Here are the known limitations of this system:
| Limitation | Description | Impact |
|------------|-------------|--------|
| **False Consensus** | When all 4 agents agree on a wrong answer in Round 1, debate cannot self-correct | ~2-5% of errors are this type |
-| **Sample Size** | Benchmarks run on 50-question samples, not full datasets | Results may vary on full benchmarks |
| **Inference Speed** | 4 agents × 2 rounds = ~8x more inference than single model | Slower than single-model approaches |
| **Memory Overhead** | Loading 4 LoRA adapters requires more VRAM than single model | ~12GB minimum required |

+### Benchmark Coverage
+
+| Benchmark | Questions Evaluated | Notes |
+|-----------|---------------------|-------|
+| **GPQA Diamond** | 198 (FULL dataset) | Complete PhD-level science benchmark |
+| **MMLU-Pro** | 50 | Sampled from larger dataset |
+| **ARC-Challenge** | 50 | Sampled from larger dataset |
+| **TruthfulQA** | 50 | Sampled from larger dataset |
+
+> **Note**: GPQA Diamond results are from the **complete 198-question dataset** - not a sample. This represents comprehensive evaluation on the hardest graduate-level science benchmark available.
+
### Domain Limitations

- **Trained on science only** - Physics, Chemistry, Biology. May underperform on law, history, coding, etc.
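To make the Memory Overhead row concrete, here is one way four LoRA agents could share a single base model with PEFT, so only one copy of the base weights sits in VRAM. The base model id, adapter folder names, and agent names are placeholders, not taken from this README.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder names; the actual base model and adapter folders are not
# specified in this section of the README.
BASE_MODEL = "your-org/grogu-base"
ADAPTERS = {
    "physics": "adapters/physics",
    "chemistry": "adapters/chemistry",
    "biology": "adapters/biology",
    "synthesis": "adapters/synthesis",
}

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Attach the first adapter, then load the rest onto the same base weights.
names = list(ADAPTERS)
model = PeftModel.from_pretrained(base, ADAPTERS[names[0]], adapter_name=names[0])
for name in names[1:]:
    model.load_adapter(ADAPTERS[name], adapter_name=name)

# Switch between agents per debate round without reloading the base model.
model.set_adapter("physics")
```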