| # BABYAI Environment: Validation Layer Analysis | |
| ## Overview | |
| BABYAI is one of the 8 validation environments. This document shows which validation scoring layers include BABYAI and how it contributes to the overall scoring system. | |
| ## BABYAI Environment Details | |
| - **Environment Name**: `agentgym:babyai` | |
| - **Type**: AgentGym environment | |
| - **Dataset Size**: 500 tasks | |
| - **Daily Sampling Rate**: 120/day (fast environment) | |
| - **Task Type**: Grid-world navigation and instruction following | |
| - **Max Rounds**: 10 | |
| - **Timeout**: 1200 seconds | |
| ## Validation Layers That Include BABYAI | |
| BABYAI appears in validation layers 3-8 as part of various environment combinations. Here's the breakdown: | |
| ### Layer 3 (3-environment combinations) | |
| **Total subsets with BABYAI**: C(7,2) = 21 combinations | |
| BABYAI appears in 21 out of 56 total Layer 3 subsets. Examples: | |
| - {BABYAI, WEBSHOP, ALFWORLD} | |
| - {BABYAI, SCIWORLD, TEXTCRAFT} | |
| - {BABYAI, SAT, DED} | |
| - {BABYAI, ABD, WEBSHOP} | |
| - ... (17 more combinations) | |
| **Weight per subset**: 8.0 / 56 ≈ 0.143 | |
| **Total potential weight from BABYAI Layer 3 subsets**: 21 × 8.0 / 56 = 3.0 | |
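The Layer 3 counts can be checked by brute-force enumeration. The following is a minimal sketch, not the validator's actual code: the helper name `layer_stats` is hypothetical, the environment names are taken from the Layer 8 subset listed later in this document, and the layer weight of 8.0 is assumed to be split evenly across all subsets of the layer.

```python
from fractions import Fraction
from itertools import combinations

# Environment names as they appear in the Layer 8 subset of this document.
ENVS = ["BABYAI", "WEBSHOP", "ALFWORLD", "SCIWORLD",
        "TEXTCRAFT", "SAT", "DED", "ABD"]


def layer_stats(k, layer_weight, target="BABYAI"):
    """Return (total subsets, subsets containing target,
    weight per subset, total weight of subsets containing target)."""
    subsets = list(combinations(ENVS, k))
    with_target = [s for s in subsets if target in s]
    # Assumption: the layer weight is divided evenly among its subsets.
    per_subset = Fraction(layer_weight) / len(subsets)
    return len(subsets), len(with_target), per_subset, per_subset * len(with_target)


total, hits, per_subset, babyai_weight = layer_stats(3, 8)
print(total, hits, round(float(per_subset), 3), float(babyai_weight))
```

Using exact fractions avoids the small drift that appears when the rounded per-subset weight (0.143) is multiplied by 21.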
| ### Layer 4 (4-environment combinations) | |
| **Total subsets with BABYAI**: C(7,3) = 35 combinations | |
| BABYAI appears in 35 out of 70 total Layer 4 subsets. Examples: | |
| - {BABYAI, WEBSHOP, ALFWORLD, SCIWORLD} | |
| - {BABYAI, TEXTCRAFT, SAT, DED} | |
| - {BABYAI, ABD, WEBSHOP, ALFWORLD} | |
| - ... (32 more combinations) | |
| **Weight per subset**: 16.0 / 70 ≈ 0.229 | |
| **Total potential weight from BABYAI Layer 4 subsets**: 35 × 16.0 / 70 = 8.0 | |
| ### Layer 5 (5-environment combinations) | |
| **Total subsets with BABYAI**: C(7,4) = 35 combinations | |
| BABYAI appears in 35 out of 56 total Layer 5 subsets. Examples: | |
| - {BABYAI, WEBSHOP, ALFWORLD, SCIWORLD, TEXTCRAFT} | |
| - {BABYAI, SAT, DED, ABD, WEBSHOP} | |
| - ... (33 more combinations) | |
| **Weight per subset**: 32.0 / 56 ≈ 0.571 | |
| **Total potential weight from BABYAI Layer 5 subsets**: 35 × 32.0 / 56 = 20.0 | |
| ### Layer 6 (6-environment combinations) | |
| **Total subsets with BABYAI**: C(7,5) = 21 combinations | |
| BABYAI appears in 21 out of 28 total Layer 6 subsets. Examples: | |
| - {BABYAI, WEBSHOP, ALFWORLD, SCIWORLD, TEXTCRAFT, SAT} | |
| - {BABYAI, DED, ABD, WEBSHOP, ALFWORLD, SCIWORLD} | |
| - ... (19 more combinations) | |
| **Weight per subset**: 64.0 / 28 ≈ 2.286 | |
| **Total potential weight from BABYAI Layer 6 subsets**: 21 × 64.0 / 28 = 48.0 | |
| ### Layer 7 (7-environment combinations) | |
| **Total subsets with BABYAI**: C(7,6) = 7 combinations | |
| BABYAI appears in 7 out of 8 total Layer 7 subsets. All seven are listed here, each omitting one of the other environments: | |
| - {BABYAI, WEBSHOP, ALFWORLD, SCIWORLD, TEXTCRAFT, SAT, DED} | |
| - {BABYAI, WEBSHOP, ALFWORLD, SCIWORLD, TEXTCRAFT, SAT, ABD} | |
| - {BABYAI, WEBSHOP, ALFWORLD, SCIWORLD, TEXTCRAFT, DED, ABD} | |
| - {BABYAI, WEBSHOP, ALFWORLD, SCIWORLD, SAT, DED, ABD} | |
| - {BABYAI, WEBSHOP, ALFWORLD, TEXTCRAFT, SAT, DED, ABD} | |
| - {BABYAI, WEBSHOP, SCIWORLD, TEXTCRAFT, SAT, DED, ABD} | |
| - {BABYAI, ALFWORLD, SCIWORLD, TEXTCRAFT, SAT, DED, ABD} | |
| **Weight per subset**: 128.0 / 8 = 16.0 | |
| **Total potential weight from BABYAI Layer 7 subsets**: 7 × 16.0 = 112.0 | |
| ### Layer 8 (All 8 environments) | |
| **Total subsets with BABYAI**: C(7,7) = 1 combination | |
| BABYAI appears in the single Layer 8 subset: | |
| - {BABYAI, WEBSHOP, ALFWORLD, SCIWORLD, TEXTCRAFT, SAT, DED, ABD} | |
| **Weight per subset**: 256.0 / 1 = 256.0 | |
| **Total potential weight from BABYAI Layer 8 subset**: 1 × 256.0 = 256.0 | |
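The per-layer figures above follow a simple closed form. As a sketch, assume (as the listed weights 8.0, 16.0, ..., 256.0 suggest) that Layer $k$ carries a total weight of $2^k$ split evenly over its $\binom{8}{k}$ subsets. The symbol $W_{\text{BABYAI}}(k)$ is introduced here only for illustration.

$$
\frac{\binom{7}{k-1}}{\binom{8}{k}}
= \frac{7!}{(k-1)!\,(8-k)!}\cdot\frac{k!\,(8-k)!}{8!}
= \frac{k}{8},
\qquad
W_{\text{BABYAI}}(k) = \frac{k}{8}\cdot 2^{k}.
$$

For $k = 3, \dots, 8$ this gives $3, 8, 20, 48, 112, 256$, which sum to $447$.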
| ## Summary Table | |
| | Layer | Total Subsets | Subsets with BABYAI | Weight per Subset | Total BABYAI Weight | | |
| |-------|---------------|---------------------|-------------------|-------------------| | |
| | 3 | 56 | 21 | 0.143 | 3.0 | |
| | 4 | 70 | 35 | 0.229 | 8.0 | |
| | 5 | 56 | 35 | 0.571 | 20.0 | |
| | 6 | 28 | 21 | 2.286 | 48.0 | |
| | 7 | 8 | 7 | 16.0 | 112.0 | |
| | 8 | 1 | 1 | 256.0 | 256.0 | |
| | **Total** | **219** | **120** | - | **447.0** | |
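The whole table can be regenerated from binomial coefficients. This is a sketch under one assumption inferred from the numbers in this document: Layer k has a total weight of 2**k. The function name `summary_rows` is hypothetical.

```python
from fractions import Fraction
from math import comb

N_ENVS = 8  # number of validation environments stated in the overview


def summary_rows(layers=range(3, 9)):
    """Build (layer, total subsets, subsets with BABYAI,
    weight per subset, total BABYAI weight) for each layer."""
    rows = []
    for k in layers:
        total = comb(N_ENVS, k)
        # Fix BABYAI, then choose the remaining k-1 from the other 7.
        with_babyai = comb(N_ENVS - 1, k - 1)
        layer_weight = 2 ** k  # assumption: matches 8.0, 16.0, ..., 256.0
        per_subset = Fraction(layer_weight, total)
        rows.append((k, total, with_babyai, per_subset, per_subset * with_babyai))
    return rows


rows = summary_rows()
for k, total, hits, per_subset, weight in rows:
    print(f"{k} | {total} | {hits} | {float(per_subset):.3f} | {float(weight):.1f}")

total_subsets = sum(r[1] for r in rows)
total_hits = sum(r[2] for r in rows)
total_weight = sum(r[4] for r in rows)
print(f"Total | {total_subsets} | {total_hits} | - | {float(total_weight):.1f}")
```

Computed exactly, the per-layer BABYAI weights are whole numbers; fractional tails such as .003 or .985 only arise from multiplying the rounded per-subset weight.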
| ## Key Insights | |
| 1. **BABYAI Coverage**: BABYAI appears in 120 out of 219 total evaluated subsets (54.8% of all subsets) | |
| 2. **Exponential Importance**: As layers increase, BABYAI's potential contribution grows exponentially: | |
| - Layer 3: 3.0 total weight | |
| - Layer 8: 256.0 total weight (about 85× more) | |
| 3. **Comprehensive Performance Matters**: To maximize BABYAI-related rewards, a model must: | |
| - Perform well on BABYAI itself (note that Layers 1-2, the single-environment and pair subsets, are not evaluated) | |
| - Perform well on BABYAI + 2 other environments (Layer 3) | |
| - Perform well on BABYAI + 3 other environments (Layer 4) | |
| - ... | |
| - Perform well on ALL 8 environments including BABYAI (Layer 8) - **highest reward!** | |
| 4. **Layer 8 Dominance**: The single Layer 8 subset (all environments) contributes 256.0 weight, which is more than all other BABYAI-related subsets combined (191.0). | |
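The coverage, growth, and dominance figures in these insights can be checked with a few lines of arithmetic. This sketch hard-codes the exact per-layer BABYAI weights (each equal to subsets with BABYAI × layer weight / total subsets); the variable names are illustrative only.

```python
# Exact per-layer BABYAI weights, keyed by layer number.
babyai_weight = {3: 3, 4: 8, 5: 20, 6: 48, 7: 112, 8: 256}
subsets_with_babyai = 120
evaluated_subsets = 219

# Insight 1: share of evaluated subsets that contain BABYAI.
coverage = subsets_with_babyai / evaluated_subsets

# Insight 2: how much more Layer 8 is worth than Layer 3.
layer8_vs_layer3 = babyai_weight[8] / babyai_weight[3]

# Insight 4: Layer 8 compared with Layers 3-7 combined.
lower_layers = sum(w for k, w in babyai_weight.items() if k < 8)

print(f"coverage: {coverage:.1%}")
print(f"Layer 8 / Layer 3: {layer8_vs_layer3:.1f}x")
print(f"Layers 3-7 combined: {lower_layers}, Layer 8: {babyai_weight[8]}")
```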
| ## Relationship to 36 Transformer Layers | |
| The 36 transformer layers in the model architecture are **not directly mapped** to BABYAI. Instead: | |
| 1. **All 36 layers work together** to process BABYAI tasks | |
| 2. **BABYAI performance** is scored as one of the 8 environments, never in isolation | |
| 3. **Validation scoring layers** (3-8) reward models that perform well on BABYAI in combination with other environments | |
| However, based on the codebase documentation (`BABYAI_SPECIFIC_IMPROVEMENTS.md`), there's evidence that: | |
| - **Late layers (24-35)** may be more important for BABYAI-specific improvements | |
| - BABYAI tasks (navigation/instruction-following) may benefit more from higher-level reasoning in later transformer layers | |
| ## Scoring Example | |
| If a model performs well on BABYAI: | |
| **Scenario A**: Model excels on BABYAI + 2 other environments (wins 5 Layer 3 subsets) | |
| - Reward: 5 × 8.0 / 56 ≈ 0.714 | |
| **Scenario B**: Model excels on BABYAI + 6 other environments (wins 1 Layer 7 subset) | |
| - Reward: 1 × 16.0 = 16.0 | |
| **Scenario C**: Model excels on ALL 8 environments including BABYAI (wins Layer 8) | |
| - Reward: 1 × 256.0 = 256.0 | |
| **Result**: Comprehensive performance (Scenario C) gives about 358× more reward than partial performance (Scenario A). | |
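The three scenarios can be reproduced with a small reward calculator. This is an illustrative sketch, not the validator's scoring code: `subset_weight` and `reward` are hypothetical helpers, and they assume that Layer k has weight 2**k split evenly over its subsets and that a model's reward is the sum of the weights of the subsets it wins.

```python
from fractions import Fraction
from math import comb


def subset_weight(k, n_envs=8):
    """Weight of one size-k subset, assuming layer weight 2**k split evenly."""
    return Fraction(2 ** k, comb(n_envs, k))


def reward(wins):
    """Sum subset weights; `wins` maps layer number -> subsets won."""
    return sum(count * subset_weight(k) for k, count in wins.items())


scenario_a = reward({3: 5})  # wins 5 Layer 3 subsets
scenario_b = reward({7: 1})  # wins 1 Layer 7 subset
scenario_c = reward({8: 1})  # wins the single Layer 8 subset

print(float(scenario_a), float(scenario_b), float(scenario_c))
print(f"C / A = {float(scenario_c / scenario_a):.1f}")
```

Each scenario counts only the subsets named in it, as in the text above.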
| ## Conclusion | |
| BABYAI is a critical component of the validation system, appearing in: | |
| - **120 out of 219 evaluated subsets** (54.8%) | |
| - **All validation layers 3-8** | |
| - **Maximum reward potential of 447.0** (if the model wins all BABYAI-related subsets) | |
| To maximize BABYAI-related rewards, models must demonstrate comprehensive ability across multiple environments, with the highest reward (256.0) coming from performing well on all 8 environments simultaneously. | |