# O3 Model Comparison: openai/o:latest vs openai/o3
## Summary
Both `openai/o:latest` and `openai/o3` route to the **identical** underlying model deployment in CBORG with **no configuration differences** detected.
## Technical Details
### 1. Underlying Model
- **openai/o:latest** β†’ `azure/o3-2025-04-16`
- **openai/o3** β†’ `azure/o3-2025-04-16`
- ✓ **SAME** base model
### 2. Configuration Parameters
Tested with explicit parameters:
```python
temperature=1.0
top_p=1.0
max_tokens=10
```
**Result**: Both models respond identically
- Same token usage for same prompts
- Same response ID format
- Same provider-specific fields: `{'content_filter_results': {}}`
- No system fingerprint differences (both return `None`)
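The comparison above can be sketched as a small field-by-field check. This is a minimal sketch, not the actual test harness: the response dicts below are illustrative stand-ins shaped like OpenAI-style chat-completion responses, with the values the text reports (same backing model, `system_fingerprint` of `None`, matching usage).

```python
# Hedged sketch: compare metadata from two chat-completion responses to see
# whether two model aliases behave identically. The dicts are illustrative
# stand-ins, not real API output.

def compare_responses(resp_a: dict, resp_b: dict) -> dict:
    """Return a field-by-field match report for the metadata fields of interest."""
    fields = ["model", "system_fingerprint", "usage"]
    return {f: resp_a.get(f) == resp_b.get(f) for f in fields}

# Illustrative responses mimicking what both aliases returned in testing.
resp_latest = {"model": "azure/o3-2025-04-16", "system_fingerprint": None,
               "usage": {"prompt_tokens": 12, "completion_tokens": 10}}
resp_o3 = {"model": "azure/o3-2025-04-16", "system_fingerprint": None,
           "usage": {"prompt_tokens": 12, "completion_tokens": 10}}

report = compare_responses(resp_latest, resp_o3)
print(report)  # every field matches → no detectable configuration differences
```

If any field differed between the aliases, the corresponding entry in `report` would be `False`, flagging a real configuration difference rather than run-to-run noise.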
### 3. API Response Comparison
Multiple test calls (3 per alias) showed:
- Identical response structure
- Same routing backend
- No detectable configuration differences
- No temperature/top_p/frequency_penalty differences
## Performance After Merging
After merging both experimental runs, the combined statistics are:
| Step | Success Rate | Trials |
|------|-------------|--------|
| 1 | 95.0% (19/20) | 20 |
| 2 | 60.0% (12/20) | 20 |
| 3 | 20.0% (4/20) | 20 |
| 4 | 100.0% (20/20)| 20 |
| 5 | 65.0% (13/20) | 20 |
**Total records**: 100 (50 from `openai/o:latest` + 50 from `openai/o3`)
The merged data provides:
- ✓ More robust statistics (doubled sample size)
- ✓ Average performance across both experimental runs
- ✓ Reduced variance in the estimates
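The merge arithmetic is just pooled counts. A minimal sketch, using the per-run rates the text actually quotes (step 3: 10% vs 30%; step 5: 50% vs 80%, with 10 trials per run):

```python
# Sketch of the merge: pool per-run success counts into 20-trial estimates.
# Per-run counts come from the rates quoted in the text; n = 10 trials/run.

def merged_rate(successes_a: int, successes_b: int, n_per_run: int = 10):
    """Pool two runs and return (total successes, total trials, percent)."""
    total = successes_a + successes_b
    trials = 2 * n_per_run
    return total, trials, 100.0 * total / trials

print(merged_rate(1, 3))  # step 3: 1/10 and 3/10 → (4, 20, 20.0)
print(merged_rate(5, 8))  # step 5: 5/10 and 8/10 → (13, 20, 65.0)
```

These pooled values match the 20.0% and 65.0% rows in the table above.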
## Why Were There Performance Differences Before Merging?
The separate experimental runs showed different performance:
- Step 3: 10% vs 30% success (20 percentage point difference)
- Step 5: 50% vs 80% success (30 percentage point difference)
These differences were **NOT due to model configuration**, but rather:
1. **Different Experimental Runs**
- Different timestamps when trials were conducted
- Separate experimental sessions
2. **Natural Model Variability**
- O3 models are reasoning models with inherent variability
   - Even with the same temperature, outputs can differ significantly
- Non-deterministic reasoning processes
3. **Small Sample Size Effects**
- Only 10 trials per step in each run
- Random variation can appear as systematic differences
- Merging to 20 trials provides more stable estimates
4. **Temporal Factors**
- Models might have been tested at different times
- Backend infrastructure state could differ
- Load balancing or deployment variations
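The small-sample point can be made quantitative. A sketch using only stdlib binomial arithmetic: assume both runs share the *same* true success rate (0.20, matching step 3's pooled estimate, an assumption for illustration) and ask how often two independent 10-trial runs still differ by 20 or more percentage points purely by chance.

```python
# Sketch: with 10 trials per run and a shared true rate p = 0.20, how often do
# two runs differ by >= 20 percentage points (i.e. >= 2 successes) by chance?
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.20
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# P(|rate_a - rate_b| >= 20 points) = P(|k_a - k_b| >= 2) for n = 10
prob = sum(pmf[a] * pmf[b]
           for a in range(n + 1) for b in range(n + 1)
           if abs(a - b) >= 2)
print(f"P(gap >= 20 points by chance) = {prob:.2f}")
```

Under these assumptions the probability is substantial (on the order of several tens of percent), so a 20-point gap between two 10-trial runs is entirely consistent with identical models.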
By merging, we get a more representative average of the model's actual performance.
## Recommendation
**Merge both models in plots** because:
1. ✓ They are technically identical (same model, same configuration)
2. ✓ Performance differences are due to experimental variability, not model differences
3. ✓ Merging provides more robust statistics (20 trials per step instead of 10)
4. ✓ Reduces clutter in visualizations while preserving all data
**Display names** (updated):
- `openai/o:latest` β†’ **"O3 (2025-04-16)"**
- `openai/o3` β†’ **"O3 (2025-04-16)"**
This naming makes it clear:
- Both use the same base model (2025-04-16)
- Data from both variants is combined under a single label
- Total: 100 records (50 + 50) across 5 steps = 20 trials per step
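The relabeling step amounts to mapping both aliases to one display name before plotting. A minimal sketch, assuming records are plain dicts with `model` and `step` keys (the record shape is illustrative, not the actual data schema):

```python
# Sketch: map both aliases to one display name so their records pool under a
# single plot label. 50 records per alias (5 steps x 10 trials), as in the text.

DISPLAY_NAME = {
    "openai/o:latest": "O3 (2025-04-16)",
    "openai/o3": "O3 (2025-04-16)",
}

records = (
    [{"model": "openai/o:latest", "step": s} for s in range(1, 6) for _ in range(10)]
    + [{"model": "openai/o3", "step": s} for s in range(1, 6) for _ in range(10)]
)
for r in records:
    r["label"] = DISPLAY_NAME[r["model"]]

labels = {r["label"] for r in records}
print(labels, len(records))  # one label, 100 records, 20 trials per step
```

Because both aliases resolve to the same label, any group-by-label plotting code automatically produces the pooled 20-trials-per-step statistics.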
## CBORG Routing Behavior
From our testing, CBORG treats both aliases as:
- **Functionally identical** at the API level
- **Same deployment** (azure/o3-2025-04-16)
- **No configuration override** based on alias name
The alias `openai/o:latest` is simply a pointer to `openai/o3` at the CBORG routing layer, but the experiments treated them as separate model selections, leading to different trial data.
## Conclusion
`openai/o:latest` and `openai/o3` are technically the same model with the same configuration. They have been **merged in the plots** under the single label **"O3 (2025-04-16)"** to:
- Provide more robust statistics (20 trials per step)
- Reduce visualization clutter
- Average out experimental variability
- Present a clearer picture of the model's typical performance
The merged dataset combines 100 total records (50 + 50) across all 5 steps, providing better statistical reliability than either run alone.