# O3 Model Comparison: openai/o:latest vs openai/o3
## Summary
Both openai/o:latest and openai/o3 route to the same underlying model deployment in CBORG, with no configuration differences detected.
## Technical Details
### 1. Underlying Model
- openai/o:latest → azure/o3-2025-04-16
- openai/o3 → azure/o3-2025-04-16
- ✅ SAME base model
### 2. Configuration Parameters
Tested with explicit parameters:
- temperature=1.0
- top_p=1.0
- max_tokens=10
Result: both models responded identically:
- Same token usage for the same prompts
- Same response ID format
- Same provider-specific fields: {'content_filter_results': {}}
- No system fingerprint differences (both return None)
### 3. API Response Comparison
Multiple test calls (3 each) showed:
- Identical response structure
- Same routing backend
- No detectable configuration differences
- No temperature/top_p/frequency_penalty differences
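The field-by-field comparison above can be sketched as a small helper. The payload keys used here (`id`, `model`, `system_fingerprint`) are illustrative stand-ins, not the exact CBORG response schema:

```python
def diff_responses(resp_a: dict, resp_b: dict,
                   ignore: frozenset = frozenset({"id", "created"})) -> dict:
    """Return the fields whose values differ between two response payloads,
    skipping per-call fields such as the response ID and timestamp."""
    keys = (resp_a.keys() | resp_b.keys()) - ignore
    return {k: (resp_a.get(k), resp_b.get(k)) for k in keys
            if resp_a.get(k) != resp_b.get(k)}

# Two hypothetical payloads from the two aliases: same deployment,
# same fingerprint, differing only in the per-call response ID.
a = {"id": "x1", "model": "azure/o3-2025-04-16", "system_fingerprint": None}
b = {"id": "x2", "model": "azure/o3-2025-04-16", "system_fingerprint": None}
# diff_responses(a, b) is empty -> no detectable configuration difference
```

An empty diff across repeated calls is what "no detectable configuration differences" means here; any genuinely different setting would surface as a key in the returned dict.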
## Performance After Merging
After merging both experimental runs, the combined statistics are:
| Step | Success Rate | Trials |
|---|---|---|
| 1 | 95.0% (19/20) | 20 |
| 2 | 60.0% (12/20) | 20 |
| 3 | 20.0% (4/20) | 20 |
| 4 | 100.0% (20/20) | 20 |
| 5 | 65.0% (13/20) | 20 |
Total records: 100 (50 from openai/o:latest + 50 from openai/o3)
The merged data provides:
- ✅ More robust statistics (doubled sample size)
- ✅ Average performance across both experimental runs
- ✅ Reduced variance in the estimates
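The merge itself is simple count addition per step. A minimal sketch, using the per-run numbers reported below for steps 3 and 5 (10% vs 30% and 50% vs 80% over 10 trials each):

```python
def merge_runs(run_a: dict, run_b: dict) -> dict:
    """Combine per-step (successes, trials) pairs from two runs and
    return (successes, trials, rate) for each step."""
    merged = {}
    for step in run_a.keys() | run_b.keys():
        s = run_a.get(step, (0, 0))[0] + run_b.get(step, (0, 0))[0]
        n = run_a.get(step, (0, 0))[1] + run_b.get(step, (0, 0))[1]
        merged[step] = (s, n, s / n)
    return merged

run_latest = {3: (1, 10), 5: (5, 10)}   # openai/o:latest run
run_o3     = {3: (3, 10), 5: (8, 10)}   # openai/o3 run
merged = merge_runs(run_latest, run_o3)
# step 3: 4/20 = 20.0%, step 5: 13/20 = 65.0%, matching the table
```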
## Why Were There Performance Differences Before Merging?
The separate experimental runs showed different performance:
- Step 3: 10% vs 30% success (20 percentage point difference)
- Step 5: 50% vs 80% success (30 percentage point difference)
These differences were NOT due to model configuration, but rather:
**Different Experimental Runs**
- Different timestamps when trials were conducted
- Separate experimental sessions
**Natural Model Variability**
- O3 models are reasoning models with inherent variability
- Even with the same temperature, outputs can differ significantly
- Non-deterministic reasoning processes
**Small Sample Size Effects**
- Only 10 trials per step in each run
- Random variation can appear as systematic differences
- Merging to 20 trials provides more stable estimates
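The small-sample effect can be made concrete with the binomial standard error sqrt(p(1-p)/n): doubling the trial count shrinks the error by a factor of √2. A quick check at the merged step-3 rate of 20%:

```python
import math

def binomial_se(p: float, n: int) -> float:
    """Standard error of an observed success rate p over n trials."""
    return math.sqrt(p * (1 - p) / n)

se_10 = binomial_se(0.20, 10)   # single run of 10 trials, ~0.126
se_20 = binomial_se(0.20, 20)   # merged 20 trials, ~0.089
# se_10 / se_20 == sqrt(2): merging halves the variance of the estimate.
```

With a standard error near 13 percentage points at n=10, observed per-run gaps of 20-30 points are well within ordinary sampling noise.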
**Temporal Factors**
- Models might have been tested at different times
- Backend infrastructure state could differ
- Load balancing or deployment variations
By merging, we get a more representative average of the model's actual performance.
## Recommendation
Merge both models in plots because:
- ✅ They are technically identical (same model, same configuration)
- ✅ Performance differences are due to experimental variability, not model differences
- ✅ Merging provides more robust statistics (20 trials per step instead of 10)
- ✅ Reduces clutter in visualizations while preserving all data
Display names (updated):
- openai/o:latest → "O3 (2025-04-16)"
- openai/o3 → "O3 (2025-04-16)"
This naming makes it clear:
- Both use the same base model (2025-04-16)
- Data from both variants is combined under a single label
- Total: 100 records (50 + 50) across 5 steps = 20 trials per step
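In plotting code, this relabeling amounts to a two-entry alias map; the record structure below is a hypothetical stand-in for the actual trial records:

```python
# Alias-to-display-name map: both aliases collapse to one plot label.
DISPLAY_NAMES = {
    "openai/o:latest": "O3 (2025-04-16)",
    "openai/o3": "O3 (2025-04-16)",
}

def count_by_label(records: list) -> dict:
    """Count trial records per display label after collapsing aliases."""
    counts = {}
    for r in records:
        label = DISPLAY_NAMES[r["model"]]
        counts[label] = counts.get(label, 0) + 1
    return counts

# 50 records per alias, as in the merged dataset:
records = [{"model": "openai/o:latest"}] * 50 + [{"model": "openai/o3"}] * 50
# count_by_label(records) -> {"O3 (2025-04-16)": 100}
```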
## CBORG Routing Behavior
From our testing, CBORG treats both aliases as:
- Functionally identical at the API level
- Same deployment (azure/o3-2025-04-16)
- No configuration override based on alias name
The alias openai/o:latest is simply a pointer to openai/o3 at the CBORG routing layer. The experiments, however, treated them as separate model selections, which produced separate sets of trial data.
## Conclusion
openai/o:latest and openai/o3 are technically the same model with the same configuration. They have been merged in the plots under the single label "O3 (2025-04-16)" to:
- Provide more robust statistics (20 trials per step)
- Reduce visualization clutter
- Average out experimental variability
- Present a clearer picture of the model's typical performance
The merged dataset combines 100 total records (50 + 50) across all 5 steps, providing better statistical reliability than either run alone.