# O3 Model Comparison: openai/o:latest vs openai/o3

## Summary

Both `openai/o:latest` and `openai/o3` route to the **identical** underlying model deployment in CBORG, with **no configuration differences** detected.

## Technical Details

### 1. Underlying Model

- **openai/o:latest** → `azure/o3-2025-04-16`
- **openai/o3** → `azure/o3-2025-04-16`
- ✅ **Same** base model

### 2. Configuration Parameters

Tested with explicit parameters:

```python
temperature=1.0
top_p=1.0
max_tokens=10
```

**Result**: Both models respond identically:

- Same token usage for the same prompts
- Same response ID format
- Same provider-specific fields: `{'content_filter_results': {}}`
- No system fingerprint differences (both return `None`)
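The explicit-parameter calls can be sketched as request construction (a minimal sketch; the `build_request` helper and the "ping" prompt are illustrative assumptions, not the actual test harness):

```python
# Sketch: identical request kwargs sent to both aliases, using the parameter
# values from the test above. Only the model alias differs between the two.
def build_request(model: str) -> dict:
    """Build the chat-completion kwargs used for either alias."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],  # placeholder prompt
        "temperature": 1.0,
        "top_p": 1.0,
        "max_tokens": 10,
    }

req_a = build_request("openai/o:latest")
req_b = build_request("openai/o3")

# Every tuning parameter is identical; only the alias string differs.
print({k for k in req_a if req_a[k] != req_b[k]})  # {'model'}
```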
### 3. API Response Comparison

Multiple test calls (3 per alias) showed:

- Identical response structure
- Same routing backend
- No detectable configuration differences
- No temperature/top_p/frequency_penalty differences
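The field-by-field comparison can be sketched as follows (a minimal sketch: the response dicts are hypothetical fixtures shaped like OpenAI chat responses with the fields listed above, not captured CBORG output):

```python
# Compare the fields that distinguish deployments rather than individual
# completions. In the real test these dicts came from live calls to each alias;
# here they are hard-coded fixtures for illustration.
FIELDS = ("model", "system_fingerprint", "provider_fields")

resp_latest = {
    "model": "azure/o3-2025-04-16",
    "system_fingerprint": None,
    "provider_fields": {"content_filter_results": {}},
}
resp_o3 = dict(resp_latest)  # identical in our tests

def same_deployment(a: dict, b: dict) -> bool:
    """True when every deployment-level field matches."""
    return all(a.get(f) == b.get(f) for f in FIELDS)

print(same_deployment(resp_latest, resp_o3))  # True
```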
## Performance After Merging

After merging both experimental runs, the combined statistics are:

| Step | Success Rate   | Trials |
|------|----------------|--------|
| 1    | 95.0% (19/20)  | 20     |
| 2    | 60.0% (12/20)  | 20     |
| 3    | 20.0% (4/20)   | 20     |
| 4    | 100.0% (20/20) | 20     |
| 5    | 65.0% (13/20)  | 20     |

**Total records**: 100 (50 from `openai/o:latest` + 50 from `openai/o3`)

The merged data provides:

- ✅ More robust statistics (doubled sample size)
- ✅ Average performance across both experimental runs
- ✅ Reduced variance in the estimates
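The merge arithmetic can be sketched directly. The per-run splits for steps 3 and 5 come from the pre-merge numbers reported below (1/10 vs 3/10, and 5/10 vs 8/10); the splits for steps 1, 2, and 4 are assumptions chosen only to be consistent with the merged totals:

```python
# Per-step success counts for each 10-trial run. Steps 3 and 5 match the
# pre-merge figures in this report; the other splits are illustrative.
run_latest = {1: 10, 2: 6, 3: 1, 4: 10, 5: 5}   # openai/o:latest successes
run_o3     = {1: 9,  2: 6, 3: 3, 4: 10, 5: 8}   # openai/o3 successes
trials_per_run = 10

merged = {
    step: (run_latest[step] + run_o3[step], 2 * trials_per_run)
    for step in run_latest
}

for step, (successes, trials) in sorted(merged.items()):
    print(f"Step {step}: {100 * successes / trials:.1f}% ({successes}/{trials})")
```

Running this reproduces the merged table above (e.g. step 3 → 20.0% (4/20), step 5 → 65.0% (13/20)).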
## Why Were There Performance Differences Before Merging?

The separate experimental runs showed different performance:

- Step 3: 10% vs 30% success (a 20-percentage-point difference)
- Step 5: 50% vs 80% success (a 30-percentage-point difference)

These differences were **not due to model configuration**, but rather to:

1. **Different experimental runs**
   - Trials were conducted at different timestamps
   - Separate experimental sessions
2. **Natural model variability**
   - O3 models are reasoning models with inherent variability
   - Even at the same temperature, outputs can differ significantly
   - Non-deterministic reasoning processes
3. **Small-sample effects**
   - Only 10 trials per step in each run
   - Random variation can appear as a systematic difference
   - Merging to 20 trials gives more stable estimates
4. **Temporal factors**
   - The models may have been tested at different times
   - Backend infrastructure state could differ
   - Load balancing or deployment variations

By merging, we get a more representative average of the model's actual performance.
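The small-sample point can be checked with basic binomial arithmetic: if the true step-3 success rate is 20% (the merged estimate), then observing 1/10 in one run and 3/10 in the other is entirely plausible by chance. A minimal sketch:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.20  # 10 trials per run; merged step-3 estimate as the true rate

p_one   = binom_pmf(1, n, p)  # probability of observing exactly 1/10 (10%)
p_three = binom_pmf(3, n, p)  # probability of observing exactly 3/10 (30%)

print(f"P(1/10 successes) = {p_one:.3f}")    # ~0.268
print(f"P(3/10 successes) = {p_three:.3f}")  # ~0.201
```

Each individual outcome has roughly a 20-27% chance under the same underlying rate, so a 20-point gap between two 10-trial runs is unremarkable.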
## Recommendation

**Merge both models in plots** because:

1. ✅ They are technically identical (same model, same configuration)
2. ✅ Performance differences are due to experimental variability, not model differences
3. ✅ Merging provides more robust statistics (20 trials per step instead of 10)
4. ✅ It reduces clutter in visualizations while preserving all data

**Display names** (updated):

- `openai/o:latest` → **"O3 (2025-04-16)"**
- `openai/o3` → **"O3 (2025-04-16)"**

This naming makes it clear that:

- Both use the same base model (2025-04-16)
- Data from both variants is combined under a single label
- Total: 100 records (50 + 50) across 5 steps = 20 trials per step
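The relabeling step can be sketched as a simple alias-to-label mapping applied before plotting (a minimal sketch; the `records` list is a synthetic stand-in for the real per-trial data):

```python
from collections import Counter

# Collapse both aliases onto the single display label used in the plots.
DISPLAY_NAME = {
    "openai/o:latest": "O3 (2025-04-16)",
    "openai/o3": "O3 (2025-04-16)",
}

# Synthetic stand-in for the 100 per-trial records: 50 per alias, 5 steps.
records = (
    [{"model": "openai/o:latest", "step": i % 5 + 1} for i in range(50)]
    + [{"model": "openai/o3", "step": i % 5 + 1} for i in range(50)]
)

counts = Counter(DISPLAY_NAME[r["model"]] for r in records)
print(counts)  # all 100 records fall under the one merged label
```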
## CBORG Routing Behavior

From our testing, CBORG treats both aliases as:

- **Functionally identical** at the API level
- **The same deployment** (`azure/o3-2025-04-16`)
- **No configuration override** based on the alias name

The alias `openai/o:latest` is simply a pointer to `openai/o3` at the CBORG routing layer, but the experiments treated them as separate model selections, which produced separate trial data.

## Conclusion

`openai/o:latest` and `openai/o3` are technically the same model with the same configuration. They have been **merged in the plots** under the single label **"O3 (2025-04-16)"** to:

- Provide more robust statistics (20 trials per step)
- Reduce visualization clutter
- Average out experimental variability
- Present a clearer picture of the model's typical performance

The merged dataset combines 100 total records (50 + 50) across all 5 steps, providing better statistical reliability than either run alone.