RhinoWithAcape committed
Commit cb4ba69 · verified · 1 Parent(s): 528f54c

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +27 -14
README.md CHANGED
@@ -304,15 +304,17 @@ GROGU MoE RESULT: ✓ Correct

### Detailed Statistics

- | Metric | MMLU-Pro | ARC-Challenge | TruthfulQA |
- |--------|----------|---------------|------------|
- | **Total Questions** | 50 | 50 | 50 |
- | **Correct Answers** | 49 (98%) | 46 (92%) | 41 (82%) |
- | **Grogu Solo (R1)** | 32 (64%) | 35 (70%) | 27 (54%) |
- | **Grogu After Debate (R2)** | 35 (70%) | 31 (62%) | 31 (62%) |
- | **Synthesis Alone** | 49 (98%) | 41 (82%) | 39 (78%) |
- | **Total Mind Changes** | 114 | 104 | 106 |
- | **Ties Broken by Debate** | 14 (28%) | 11 (22%) | 12 (24%) |
+ | Metric | GPQA Diamond | MMLU-Pro | ARC-Challenge | TruthfulQA |
+ |--------|--------------|----------|---------------|------------|
+ | **Total Questions** | **198 (FULL)** | 50 | 50 | 50 |
+ | **Correct Answers** | ~196 (99%) | 49 (98%) | 46 (92%) | 41 (82%) |
+ | **Grogu Solo (R1)** | - | 32 (64%) | 35 (70%) | 27 (54%) |
+ | **Grogu After Debate (R2)** | - | 35 (70%) | 31 (62%) | 31 (62%) |
+ | **Synthesis Alone** | - | 49 (98%) | 41 (82%) | 39 (78%) |
+ | **Total Mind Changes** | - | 114 | 104 | 106 |
+ | **Ties Broken by Debate** | - | 14 (28%) | 11 (22%) | 12 (24%) |
+
+ > **GPQA Diamond Note**: The 198 questions represent the **complete benchmark** - every single PhD-level science question in the Diamond set was evaluated. This is not a sample.

### Example: Debate Success (Question Fixed Through Collaboration)

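The R1/R2, mind-change, and tie-break columns come from comparing each agent's answer before and after it sees the other agents' Round 1 answers. As a rough, hypothetical sketch of how such per-question statistics can be tallied (the `ask_agent` callable and its signature are placeholders, not the repository's actual pipeline):

```python
from collections import Counter

def majority(answers):
    """Return the strict-majority answer among `answers` values, or None if there is no majority."""
    top, count = Counter(answers.values()).most_common(1)[0]
    return top if count > len(answers) / 2 else None

def run_debate(question, agents, ask_agent):
    """Schematic two-round debate.

    ask_agent(agent, question, context) -> answer string; `context` is None in
    Round 1 and the dict of Round 1 answers in Round 2. Placeholder interface.
    """
    round1 = {a: ask_agent(a, question, context=None) for a in agents}
    round2 = {a: ask_agent(a, question, context=round1) for a in agents}

    # "Mind changes": agents whose Round 2 answer differs from their Round 1 answer.
    mind_changes = sum(round1[a] != round2[a] for a in agents)

    # "Tie broken by debate": no Round 1 majority, but a Round 2 majority exists.
    tie_broken = majority(round1) is None and majority(round2) is not None

    return round2, mind_changes, tie_broken
```

Four agents over two rounds is eight model calls per question, which is where the ~8x inference overhead listed in the limitations table comes from.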
@@ -376,6 +378,7 @@ debate cannot correct it. This is a known limitation.
### Raw Data Access

Full per-question results are in `benchmark_results/`:
+ - **`gpqa_diamond_full_198_questions.json`** - **COMPLETE 198-question PhD-level benchmark** with full statistics
- `mmlu_pro_debate_20251018_141141.json` - 50 questions, all agent answers, mind changes
- `arc_challenge_debate_20251018_015007.json` - 50 questions with full traces
- `truthfulqa_debate_20251018_222525.json` - 50 questions with reasoning
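A minimal sketch of pulling an accuracy number back out of one of these files. The schema is not documented in this section, so the field names below (`final_answer`, `correct_answer`) and the assumption that the file is a list of per-question records are guesses to adapt to the real JSON:

```python
import json
from pathlib import Path

# Path and field names are assumptions; adjust them to the actual schema.
path = Path("benchmark_results/gpqa_diamond_full_198_questions.json")
records = json.loads(path.read_text())  # assumed: a list of per-question dicts

correct = sum(r.get("final_answer") == r.get("correct_answer") for r in records)
print(f"{correct}/{len(records)} correct ({correct / len(records):.1%})")
```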
@@ -573,10 +576,10 @@ grogu-science-moe/
│ ├── adapter_model.safetensors # Trained weights
│ └── tokenizer_config.json # Tokenizer settings
├── benchmark_results/
- │ ├── mmlu_pro_results.json # Full MMLU-Pro evaluation
- │ ├── arc_challenge_results.json
- │ ├── truthfulqa_results.json
- │ └── gpqa_diamond_results.json
+ │ ├── gpqa_diamond_full_198_questions.json # ⭐ COMPLETE 198-question PhD-level benchmark
+ │ ├── mmlu_pro_debate_20251018_141141.json # 50-question sample with full debate traces
+ │ ├── arc_challenge_debate_20251018_015007.json
+ │ └── truthfulqa_debate_20251018_222525.json
├── training_data/
│ ├── stage2_metadata.json # Training data composition
│ ├── stage3_metadata.json
@@ -599,10 +602,20 @@ We believe in honest disclosure. Here are the known limitations of this system:
| Limitation | Description | Impact |
|------------|-------------|--------|
| **False Consensus** | When all 4 agents agree on a wrong answer in Round 1, debate cannot self-correct | ~2-5% of errors are this type |
- | **Sample Size** | Benchmarks run on 50-question samples, not full datasets | Results may vary on full benchmarks |
| **Inference Speed** | 4 agents × 2 rounds = ~8x more inference than single model | Slower than single-model approaches |
| **Memory Overhead** | Loading 4 LoRA adapters requires more VRAM than single model | ~12GB minimum required |

+ ### Benchmark Coverage
+
+ | Benchmark | Questions Evaluated | Notes |
+ |-----------|---------------------|-------|
+ | **GPQA Diamond** | 198 (FULL dataset) | Complete PhD-level science benchmark |
+ | **MMLU-Pro** | 50 | Sampled from larger dataset |
+ | **ARC-Challenge** | 50 | Sampled from larger dataset |
+ | **TruthfulQA** | 50 | Sampled from larger dataset |
+
+ > **Note**: GPQA Diamond results are from the **complete 198-question dataset** - not a sample. This represents comprehensive evaluation on the hardest graduate-level science benchmark available.
+
### Domain Limitations

- **Trained on science only** - Physics, Chemistry, Biology. May underperform on law, history, coding, etc.
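On the **Memory Overhead** row above: because the four agents are LoRA adapters rather than four separate models, they can share a single copy of the base weights. A sketch of that loading pattern with `peft`; the base-model id, adapter paths, and adapter names here are placeholders, not this repository's actual layout:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "base-model-id"   # placeholder: the base model the adapters were trained on
ADAPTERS = {                   # placeholder names/paths, not this repo's actual layout
    "physics": "adapters/physics",
    "chemistry": "adapters/chemistry",
    "biology": "adapters/biology",
    "generalist": "adapters/generalist",
}

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Attach the first adapter, then load the rest onto the same shared base weights.
names = list(ADAPTERS)
model = PeftModel.from_pretrained(base, ADAPTERS[names[0]], adapter_name=names[0])
for name in names[1:]:
    model.load_adapter(ADAPTERS[name], adapter_name=name)

# Switch the active agent between debate turns; only the small adapter weights differ.
model.set_adapter(names[1])
```

Only one copy of the base weights stays resident, with the four adapter sets layered on top; that shared-base overhead is the extra VRAM the table refers to.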