# O3 Model Comparison: openai/o:latest vs openai/o3
## Summary
Both `openai/o:latest` and `openai/o3` route to the **identical** underlying model deployment in CBORG with **no configuration differences** detected.
## Technical Details
### 1. Underlying Model
- **openai/o:latest** β†’ `azure/o3-2025-04-16`
- **openai/o3** β†’ `azure/o3-2025-04-16`
- ✓ **SAME** base model
### 2. Configuration Parameters
Tested with explicit parameters:
```python
temperature=1.0
top_p=1.0
max_tokens=10
```
**Result**: Both models respond identically
- Same token usage for same prompts
- Same response ID format
- Same provider-specific fields: `{'content_filter_results': {}}`
- No system fingerprint differences (both return `None`)
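The comparison above can be sketched as a small field-by-field check. This is a minimal sketch, not the actual test harness: the response dicts below are illustrative stand-ins shaped like OpenAI-style chat-completion responses, with the values the text reports (same backing model, `system_fingerprint` of `None`, matching usage).

```python
# Hedged sketch: compare metadata from two chat-completion responses to see
# whether two model aliases behave identically. The dicts are illustrative
# stand-ins, not real API output.

def compare_responses(resp_a: dict, resp_b: dict) -> dict:
    """Return a field-by-field match report for the metadata fields of interest."""
    fields = ["model", "system_fingerprint", "usage"]
    return {f: resp_a.get(f) == resp_b.get(f) for f in fields}

# Illustrative responses mimicking what both aliases returned in testing.
resp_latest = {"model": "azure/o3-2025-04-16", "system_fingerprint": None,
               "usage": {"prompt_tokens": 12, "completion_tokens": 10}}
resp_o3 = {"model": "azure/o3-2025-04-16", "system_fingerprint": None,
           "usage": {"prompt_tokens": 12, "completion_tokens": 10}}

report = compare_responses(resp_latest, resp_o3)
print(report)  # every field matches → no detectable configuration differences
```

If any field differed between the aliases, the corresponding entry in `report` would be `False`, flagging a real configuration difference rather than run-to-run noise.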
### 3. API Response Comparison
Multiple test calls (3 per alias) showed:
- Identical response structure
- Same routing backend
- No detectable configuration differences
- No temperature/top_p/frequency_penalty differences
## Performance After Merging
After merging both experimental runs, the combined statistics are:
| Step | Success Rate | Trials |
|------|-------------|--------|
| 1 | 95.0% (19/20) | 20 |
| 2 | 60.0% (12/20) | 20 |
| 3 | 20.0% (4/20) | 20 |
| 4 | 100.0% (20/20)| 20 |
| 5 | 65.0% (13/20) | 20 |
**Total records**: 100 (50 from `openai/o:latest` + 50 from `openai/o3`)
The merged data provides:
- ✓ More robust statistics (doubled sample size)
- ✓ Average performance across both experimental runs
- ✓ Reduced variance in the estimates
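The merge arithmetic is just pooled counts. A minimal sketch, using the per-run rates the text actually quotes (step 3: 10% vs 30%; step 5: 50% vs 80%, with 10 trials per run):

```python
# Sketch of the merge: pool per-run success counts into 20-trial estimates.
# Per-run counts come from the rates quoted in the text; n = 10 trials/run.

def merged_rate(successes_a: int, successes_b: int, n_per_run: int = 10):
    """Pool two runs and return (total successes, total trials, percent)."""
    total = successes_a + successes_b
    trials = 2 * n_per_run
    return total, trials, 100.0 * total / trials

print(merged_rate(1, 3))  # step 3: 1/10 and 3/10 → (4, 20, 20.0)
print(merged_rate(5, 8))  # step 5: 5/10 and 8/10 → (13, 20, 65.0)
```

These pooled values match the 20.0% and 65.0% rows in the table above.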
## Why Were There Performance Differences Before Merging?
The separate experimental runs showed different performance:
- Step 3: 10% vs 30% success (20 percentage point difference)
- Step 5: 50% vs 80% success (30 percentage point difference)
These differences were **NOT due to model configuration**, but rather:
1. **Different Experimental Runs**
- Different timestamps when trials were conducted
- Separate experimental sessions
2. **Natural Model Variability**
- O3 models are reasoning models with inherent variability
   - Even with the same temperature, outputs can differ significantly
- Non-deterministic reasoning processes
3. **Small Sample Size Effects**
- Only 10 trials per step in each run
- Random variation can appear as systematic differences
- Merging to 20 trials provides more stable estimates
4. **Temporal Factors**
- Models might have been tested at different times
- Backend infrastructure state could differ
- Load balancing or deployment variations
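The small-sample point can be made quantitative. A sketch using only stdlib binomial arithmetic: assume both runs share the *same* true success rate (0.20, matching step 3's pooled estimate, an assumption for illustration) and ask how often two independent 10-trial runs still differ by 20 or more percentage points purely by chance.

```python
# Sketch: with 10 trials per run and a shared true rate p = 0.20, how often do
# two runs differ by >= 20 percentage points (i.e. >= 2 successes) by chance?
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.20
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# P(|rate_a - rate_b| >= 20 points) = P(|k_a - k_b| >= 2) for n = 10
prob = sum(pmf[a] * pmf[b]
           for a in range(n + 1) for b in range(n + 1)
           if abs(a - b) >= 2)
print(f"P(gap >= 20 points by chance) = {prob:.2f}")
```

Under these assumptions the probability is substantial (on the order of several tens of percent), so a 20-point gap between two 10-trial runs is entirely consistent with identical models.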
By merging, we get a more representative average of the model's actual performance.
## Recommendation
**Merge both models in plots** because:
1. ✓ They are technically identical (same model, same configuration)
2. ✓ Performance differences are due to experimental variability, not model differences
3. ✓ Merging provides more robust statistics (20 trials per step instead of 10)
4. ✓ Reduces clutter in visualizations while preserving all data
**Display names** (updated):
- `openai/o:latest` β†’ **"O3 (2025-04-16)"**
- `openai/o3` β†’ **"O3 (2025-04-16)"**
This naming makes it clear:
- Both use the same base model (2025-04-16)
- Data from both variants is combined under a single label
- Total: 100 records (50 + 50) across 5 steps = 20 trials per step
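The relabeling step amounts to mapping both aliases to one display name before plotting. A minimal sketch, assuming records are plain dicts with `model` and `step` keys (the record shape is illustrative, not the actual data schema):

```python
# Sketch: map both aliases to one display name so their records pool under a
# single plot label. 50 records per alias (5 steps x 10 trials), as in the text.

DISPLAY_NAME = {
    "openai/o:latest": "O3 (2025-04-16)",
    "openai/o3": "O3 (2025-04-16)",
}

records = (
    [{"model": "openai/o:latest", "step": s} for s in range(1, 6) for _ in range(10)]
    + [{"model": "openai/o3", "step": s} for s in range(1, 6) for _ in range(10)]
)
for r in records:
    r["label"] = DISPLAY_NAME[r["model"]]

labels = {r["label"] for r in records}
print(labels, len(records))  # one label, 100 records, 20 trials per step
```

Because both aliases resolve to the same label, any group-by-label plotting code automatically produces the pooled 20-trials-per-step statistics.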
## CBORG Routing Behavior
From our testing, CBORG treats both aliases as:
- **Functionally identical** at the API level
- **Same deployment** (azure/o3-2025-04-16)
- **No configuration override** based on alias name
The alias `openai/o:latest` is simply a pointer to `openai/o3` at the CBORG routing layer, but the experiments treated them as separate model selections, leading to different trial data.
## Conclusion
`openai/o:latest` and `openai/o3` are technically the same model with the same configuration. They have been **merged in the plots** under the single label **"O3 (2025-04-16)"** to:
- Provide more robust statistics (20 trials per step)
- Reduce visualization clutter
- Average out experimental variability
- Present a clearer picture of the model's typical performance
The merged dataset combines 100 total records (50 + 50) across all 5 steps, providing better statistical reliability than either run alone.