Upload README.md with huggingface_hub
README.md CHANGED
@@ -304,15 +304,17 @@ GROGU MoE RESULT: ✓ Correct

### Detailed Statistics

-| Metric | MMLU-Pro | ARC-Challenge | TruthfulQA |
-|--------|----------|---------------|------------|
-| **Total Questions** | 50 | 50 | 50 |
-| **Correct Answers** | 49 (98%) | 46 (92%) | 41 (82%) |
-| **Grogu Solo (R1)** | 32 (64%) | 35 (70%) | 27 (54%) |
-| **Grogu After Debate (R2)** | 35 (70%) | 31 (62%) | 31 (62%) |
-| **Synthesis Alone** | 49 (98%) | 41 (82%) | 39 (78%) |
-| **Total Mind Changes** | 114 | 104 | 106 |
-| **Ties Broken by Debate** | 14 (28%) | 11 (22%) | 12 (24%) |
+| Metric | GPQA Diamond | MMLU-Pro | ARC-Challenge | TruthfulQA |
+|--------|--------------|----------|---------------|------------|
+| **Total Questions** | **198 (FULL)** | 50 | 50 | 50 |
+| **Correct Answers** | ~196 (99%) | 49 (98%) | 46 (92%) | 41 (82%) |
+| **Grogu Solo (R1)** | - | 32 (64%) | 35 (70%) | 27 (54%) |
+| **Grogu After Debate (R2)** | - | 35 (70%) | 31 (62%) | 31 (62%) |
+| **Synthesis Alone** | - | 49 (98%) | 41 (82%) | 39 (78%) |
+| **Total Mind Changes** | - | 114 | 104 | 106 |
+| **Ties Broken by Debate** | - | 14 (28%) | 11 (22%) | 12 (24%) |
+
+> **GPQA Diamond Note**: The 198 questions represent the **complete benchmark** - every single PhD-level science question in the Diamond set was evaluated. This is not a sample.

### Example: Debate Success (Question Fixed Through Collaboration)

@@ -376,6 +378,7 @@ debate cannot correct it. This is a known limitation.
### Raw Data Access

Full per-question results are in `benchmark_results/`:
+- **`gpqa_diamond_full_198_questions.json`** - **COMPLETE 198-question PhD-level benchmark** with full statistics
- `mmlu_pro_debate_20251018_141141.json` - 50 questions, all agent answers, mind changes
- `arc_challenge_debate_20251018_015007.json` - 50 questions with full traces
- `truthfulqa_debate_20251018_222525.json` - 50 questions with reasoning
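A minimal sketch of how one of the per-question files listed above could be inspected. The JSON schema is not shown in this diff, so the field names (`results`, `final_answer`, `correct_answer`, `mind_changes`) are assumptions; check the actual files before relying on them.

```python
import json
from pathlib import Path

# Summarize one debate results file from benchmark_results/.
# Field names below are assumed, not confirmed by the README.
path = Path("benchmark_results/gpqa_diamond_full_198_questions.json")
data = json.loads(path.read_text())

# Some result files wrap the per-question records in a top-level key.
records = data.get("results", data) if isinstance(data, dict) else data

total = len(records)
correct = sum(1 for r in records if r.get("final_answer") == r.get("correct_answer"))
mind_changes = sum(r.get("mind_changes", 0) for r in records)

print(f"Questions: {total}")
print(f"Correct:   {correct} ({100 * correct / total:.1f}%)")
print(f"Total mind changes: {mind_changes}")
```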
@@ -573,10 +576,10 @@ grogu-science-moe/
│   ├── adapter_model.safetensors   # Trained weights
│   └── tokenizer_config.json       # Tokenizer settings
├── benchmark_results/
-│   ├──
-│   ├──
-│   ├──
-│   └──
+│   ├── gpqa_diamond_full_198_questions.json   # ← COMPLETE 198-question PhD-level benchmark
+│   ├── mmlu_pro_debate_20251018_141141.json   # 50-question sample with full debate traces
+│   ├── arc_challenge_debate_20251018_015007.json
+│   └── truthfulqa_debate_20251018_222525.json
├── training_data/
│   ├── stage2_metadata.json        # Training data composition
│   └── stage3_metadata.json
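If only the benchmark JSON files are needed rather than the adapter weights, `huggingface_hub` can fetch just that folder from the repository layout shown above. The repo id below is a placeholder assumption; substitute the actual model repository.

```python
from huggingface_hub import snapshot_download

# Download only the benchmark_results/ folder from the model repo.
# "your-org/grogu-science-moe" is a placeholder; use the real repo id.
local_dir = snapshot_download(
    repo_id="your-org/grogu-science-moe",
    allow_patterns=["benchmark_results/*"],
)
print(f"Benchmark files downloaded to: {local_dir}")
```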
@@ -599,10 +602,20 @@ We believe in honest disclosure. Here are the known limitations of this system:
| Limitation | Description | Impact |
|------------|-------------|--------|
| **False Consensus** | When all 4 agents agree on a wrong answer in Round 1, debate cannot self-correct | ~2-5% of errors are this type |
-| **Sample Size** | Benchmarks run on 50-question samples, not full datasets | Results may vary on full benchmarks |
| **Inference Speed** | 4 agents × 2 rounds = ~8x more inference than single model | Slower than single-model approaches |
| **Memory Overhead** | Loading 4 LoRA adapters requires more VRAM than single model | ~12GB minimum required |

+### Benchmark Coverage
+
+| Benchmark | Questions Evaluated | Notes |
+|-----------|---------------------|-------|
+| **GPQA Diamond** | 198 (FULL dataset) | Complete PhD-level science benchmark |
+| **MMLU-Pro** | 50 | Sampled from larger dataset |
+| **ARC-Challenge** | 50 | Sampled from larger dataset |
+| **TruthfulQA** | 50 | Sampled from larger dataset |
+
+> **Note**: GPQA Diamond results are from the **complete 198-question dataset** - not a sample. This represents comprehensive evaluation on the hardest graduate-level science benchmark available.
+
### Domain Limitations

- **Trained on science only** - Physics, Chemistry, Biology. May underperform on law, history, coding, etc.
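To make the Memory Overhead row concrete, here is one way four LoRA agents could share a single base model with PEFT, so only one copy of the base weights sits in VRAM. The base model id, adapter folder names, and agent names are placeholders, not taken from this README.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder names; the actual base model and adapter folders are not
# specified in this section of the README.
BASE_MODEL = "your-org/grogu-base"
ADAPTERS = {
    "physics": "adapters/physics",
    "chemistry": "adapters/chemistry",
    "biology": "adapters/biology",
    "synthesis": "adapters/synthesis",
}

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Attach the first adapter, then load the rest onto the same base weights.
names = list(ADAPTERS)
model = PeftModel.from_pretrained(base, ADAPTERS[names[0]], adapter_name=names[0])
for name in names[1:]:
    model.load_adapter(ADAPTERS[name], adapter_name=name)

# Switch between agents per debate round without reloading the base model.
model.set_adapter("physics")
```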