| # Concrete Example: Validation Scoring with 8 Environments | |
| ## Scenario Setup | |
| Let's say we have 8 validation environments: | |
| 1. WEBSHOP | |
| 2. ALFWORLD | |
| 3. BABYAI | |
| 4. SCIWORLD | |
| 5. TEXTCRAFT | |
| 6. SAT | |
| 7. DED | |
| 8. ABD | |
| And we're evaluating a model's performance across these environments. | |
| ## Validation Layer Breakdown | |
| ### Layer 3 (3-environment combinations) | |
| - **Total subsets**: C(8,3) = 56 combinations | |
| - **Examples**: | |
| - {WEBSHOP, ALFWORLD, BABYAI} | |
| - {SCIWORLD, TEXTCRAFT, SAT} | |
| - {DED, ABD, WEBSHOP} | |
| - ... (53 more combinations) | |
| - **Total weight for layer**: 2³ = 8.0 | |
| - **Weight per subset**: 8.0 / 56 = **0.143** | |
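To make the subset count concrete, the Layer 3 subsets can be enumerated directly. This is a minimal sketch using Python's standard `itertools.combinations`; the names `ENVIRONMENTS` and `layer_subsets` are illustrative and not taken from any particular codebase.

```python
from itertools import combinations

# The 8 validation environments from the scenario setup.
ENVIRONMENTS = ["WEBSHOP", "ALFWORLD", "BABYAI", "SCIWORLD",
                "TEXTCRAFT", "SAT", "DED", "ABD"]


def layer_subsets(environments, k):
    """Return every k-environment subset as a tuple (hypothetical helper)."""
    return list(combinations(environments, k))


layer3 = layer_subsets(ENVIRONMENTS, 3)
print(len(layer3))  # 56
print(layer3[0])    # ('WEBSHOP', 'ALFWORLD', 'BABYAI')
```

The same helper with `k=4` through `k=8` yields the subsets for the remaining layers.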
| ### Layer 4 (4-environment combinations) | |
| - **Total subsets**: C(8,4) = 70 combinations | |
| - **Examples**: | |
| - {WEBSHOP, ALFWORLD, BABYAI, SCIWORLD} | |
| - {TEXTCRAFT, SAT, DED, ABD} | |
| - ... (68 more combinations) | |
| - **Total weight for layer**: 2⁴ = 16.0 | |
| - **Weight per subset**: 16.0 / 70 = **0.229** | |
| ### Layer 5 (5-environment combinations) | |
| - **Total subsets**: C(8,5) = 56 combinations | |
| - **Total weight for layer**: 2⁵ = 32.0 | |
| - **Weight per subset**: 32.0 / 56 = **0.571** | |
| ### Layer 6 (6-environment combinations) | |
| - **Total subsets**: C(8,6) = 28 combinations | |
| - **Total weight for layer**: 2⁶ = 64.0 | |
| - **Weight per subset**: 64.0 / 28 = **2.286** | |
| ### Layer 7 (7-environment combinations) | |
| - **Total subsets**: C(8,7) = 8 combinations | |
| - **Examples**: | |
| - All environments except WEBSHOP | |
| - All environments except ALFWORLD | |
| - ... (6 more combinations) | |
| - **Total weight for layer**: 2⁷ = 128.0 | |
| - **Weight per subset**: 128.0 / 8 = **16.0** | |
| ### Layer 8 (All 8 environments) | |
| - **Total subsets**: C(8,8) = 1 combination | |
| - **The subset**: {WEBSHOP, ALFWORLD, BABYAI, SCIWORLD, TEXTCRAFT, SAT, DED, ABD} | |
| - **Total weight for layer**: 2⁸ = 256.0 | |
| - **Weight per subset**: 256.0 / 1 = **256.0** (highest reward!) | |
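The per-layer numbers above all follow the same rule: a layer of size k has C(8, k) subsets sharing a total weight of 2^k. The sketch below reproduces the breakdown under that rule; `layer_weights` is an illustrative name, not an existing API.

```python
from math import comb

N_ENVS = 8          # number of validation environments
LAYERS = range(3, 9)  # Layers 3 through 8, as in the breakdown above


def layer_weights(n_envs, k):
    """Return (subset count, total layer weight, weight per subset) for layer k."""
    n_subsets = comb(n_envs, k)
    total = 2.0 ** k
    return n_subsets, total, total / n_subsets


for k in LAYERS:
    n_subsets, total, per_subset = layer_weights(N_ENVS, k)
    print(f"Layer {k}: {n_subsets:>2} subsets, "
          f"total {total:>5.1f}, per subset {per_subset:.3f}")
```

Note that the per-subset weight grows much faster than the layer total, because the number of subsets shrinks after Layer 4 while the total keeps doubling.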
| ## Scoring Example | |
| Let's say Model A performs well on: | |
| - Layer 3: 10 out of 56 subsets (wins on 10 different 3-environment combinations) | |
| - Layer 4: 5 out of 70 subsets | |
| - Layer 5: 2 out of 56 subsets | |
| - Layer 6: 1 out of 28 subsets | |
| - Layer 7: 0 out of 8 subsets | |
| - Layer 8: 0 out of 1 subset (doesn't perform well on all 8 simultaneously) | |
| **Model A's total score** (simplified, assuming each winning subset earns its full per-subset weight): | |
| ``` | |
| = (10 × 0.143) + (5 × 0.229) + (2 × 0.571) + (1 × 2.286) + (0 × 16.0) + (0 × 256.0) | |
| = 1.43 + 1.145 + 1.142 + 2.286 + 0 + 0 | |
| = 6.003 | |
| ``` | |
| Now let's say Model B performs well on: | |
| - Layer 3: 5 out of 56 subsets | |
| - Layer 4: 3 out of 70 subsets | |
| - Layer 5: 1 out of 56 subsets | |
| - Layer 6: 0 out of 28 subsets | |
| - Layer 7: 0 out of 8 subsets | |
| - Layer 8: **1 out of 1 subset** (performs well on ALL 8 environments!) | |
| **Model B's total score**: | |
| ``` | |
| = (5 × 0.143) + (3 × 0.229) + (1 × 0.571) + (0 × 2.286) + (0 × 16.0) + (1 × 256.0) | |
| = 0.715 + 0.687 + 0.571 + 0 + 0 + 256.0 | |
| = 257.973 | |
| ``` | |
| **Result**: Model B wins decisively because it performs well across all 8 environments simultaneously, earning the massive Layer 8 reward of 256.0! | |
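The two totals can be checked with a short script. This is a sketch of the simplified scoring used in this example only (each winning subset earns its full per-subset weight); `subset_weight`, `total_score`, and the win-count dictionaries are hypothetical names introduced for illustration.

```python
from math import comb


def subset_weight(n_envs, k):
    """Weight of one k-environment subset: layer total 2^k split evenly."""
    return 2.0 ** k / comb(n_envs, k)


def total_score(wins_by_layer, n_envs=8):
    """Sum per-subset weights over the subsets a model wins in each layer."""
    return sum(wins * subset_weight(n_envs, k)
               for k, wins in wins_by_layer.items())


# Winning-subset counts per layer, copied from the scenario above.
model_a = {3: 10, 4: 5, 5: 2, 6: 1, 7: 0, 8: 0}
model_b = {3: 5, 4: 3, 5: 1, 6: 0, 7: 0, 8: 1}

print(round(total_score(model_a), 3))  # 6.0
print(round(total_score(model_b), 3))  # 257.971
```

The hand calculations use weights rounded to three decimals, which is why they come out at 6.003 and 257.973 rather than 6.000 and 257.971; the conclusion is unaffected.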
| ## Key Takeaways | |
| 1. **Exponential Rewards**: Each layer carries twice the total weight of the previous layer (from 2³ = 8 for Layer 3 up to 2⁸ = 256 for Layer 8) | |
| 2. **Comprehensive Performance Matters**: The single Layer 8 subset (all 8 environments) is worth 256.0, about 1,792× the 0.143 weight of a single 3-environment subset and 32× the total weight of all of Layer 3 | |
| 3. **Distributed Weight**: Within each layer, weight is evenly distributed, so winning more subsets in a layer increases score | |
| 4. **Top-Layer Focus**: Only Layers 3-8 are scored, so single environments and pairs earn no weight on their own; the scheme rewards multi-environment capability | |
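The takeaways above can be summarized in a few formulas. The symbols are introduced here for illustration and are assumptions, not notation from the original scoring system: $n$ is the number of environments, $W_k$ the total weight of Layer $k$, $w_k$ the weight of one subset in Layer $k$, and $m_k$ the number of Layer $k$ subsets a model wins.

$$
W_k = 2^k, \qquad
w_k = \frac{2^k}{\binom{n}{k}}, \qquad
S = \sum_{k=3}^{n} m_k \, w_k
$$

With $n = 8$:

$$
\frac{W_{k+1}}{W_k} = 2, \qquad
\frac{w_8}{w_3} = \frac{256}{8/56} = 1792, \qquad
\frac{W_8}{W_3} = \frac{256}{8} = 32
$$

Under the simplified scoring in this example, a model that wins only the Layer 8 subset scores 256, while winning every subset in Layers 3 through 7 would total $8 + 16 + 32 + 64 + 128 = 248$.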
| ## How This Relates to the 36 Transformer Layers | |
| The 36 transformer layers in the model work together to: | |
| 1. Process input from any of the 8 environments | |
| 2. Generate appropriate responses for each task type | |
| 3. Learn representations that generalize across environments | |
| The validation scoring system then: | |
| 1. Tests the model on all 8 environments | |
| 2. Rewards models that perform well across multiple environments | |
| 3. Uses combinatoric layers to incentivize comprehensive ability | |
| The two uses of "layer" are unrelated. The 36 transformer layers are the model's **capacity** (how it processes information), while the 8 environments and the combinatoric validation layers are the **evaluation framework** (how we measure and reward performance). | |