
O3 Model Comparison: openai/o:latest vs openai/o3

Summary

Both openai/o:latest and openai/o3 route to the same underlying model deployment in CBORG; no configuration differences were detected.

Technical Details

1. Underlying Model

  • openai/o:latest β†’ azure/o3-2025-04-16
  • openai/o3 β†’ azure/o3-2025-04-16
  • βœ“ SAME base model

2. Configuration Parameters

Tested with explicit parameters:

```
temperature=1.0
top_p=1.0
max_tokens=10
```

Result: Both aliases behave identically at the API level

  • Same token usage for same prompts
  • Same response IDs format
  • Same provider-specific fields: {'content_filter_results': {}}
  • No system fingerprint differences (both return None)

3. API Response Comparison

Multiple test calls (three per alias) showed:

  • Identical response structure
  • Same routing backend
  • No detectable configuration differences
  • No temperature/top_p/frequency_penalty differences

Performance After Merging

After merging both experimental runs, the combined statistics are:

| Step | Success Rate  | Trials |
|------|---------------|--------|
| 1    | 95.0% (19/20) | 20     |
| 2    | 60.0% (12/20) | 20     |
| 3    | 20.0% (4/20)  | 20     |
| 4    | 100.0% (20/20)| 20     |
| 5    | 65.0% (13/20) | 20     |

Total records: 100 (50 from openai/o:latest + 50 from openai/o3)

The merged data provides:

  • βœ“ More robust statistics (doubled sample size)
  • βœ“ Average performance across both experimental runs
  • βœ“ Reduced variance in the estimates

Why Were There Performance Differences Before Merging?

The separate experimental runs showed different performance:

  • Step 3: 10% vs 30% success (20 percentage point difference)
  • Step 5: 50% vs 80% success (30 percentage point difference)

These differences were NOT due to model configuration, but rather:

  1. Different Experimental Runs

    • Different timestamps when trials were conducted
    • Separate experimental sessions
  2. Natural Model Variability

    • O3 models are reasoning models with inherent variability
    • Even with same temperature, outputs can differ significantly
    • Non-deterministic reasoning processes
  3. Small Sample Size Effects

    • Only 10 trials per step in each run
    • Random variation can appear as systematic differences
    • Merging to 20 trials provides more stable estimates
  4. Temporal Factors

    • Models might have been tested at different times
    • Backend infrastructure state could differ
    • Load balancing or deployment variations

By merging, we get a more representative average of the model's actual performance.
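The small-sample effect in point 3 is easy to quantify: the standard error of a success-rate estimate scales as 1/√n, so doubling the trials from 10 to 20 shrinks it by a factor of about 1.4. For a true 20% rate, a ±12.6 pp standard error at n=10 makes the observed 10% vs 30% split across the two runs unsurprising.

```python
import math

def stderr(p: float, n: int) -> float:
    """Standard error of a binomial proportion estimate."""
    return math.sqrt(p * (1 - p) / n)

p = 0.20  # merged step-3 success rate
print(f"n=10: +/-{stderr(p, 10):.3f}")  # ~0.126
print(f"n=20: +/-{stderr(p, 20):.3f}")  # ~0.089
```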

Recommendation

Merge both models in plots because:

  1. ✓ They are technically identical (same model, same configuration)
  2. ✓ Performance differences are due to experimental variability, not model differences
  3. ✓ Merging provides more robust statistics (20 trials per step instead of 10)
  4. ✓ Reduces clutter in visualizations while preserving all data

Display names (updated):

  • openai/o:latest β†’ "O3 (2025-04-16)"
  • openai/o3 β†’ "O3 (2025-04-16)"

This naming makes it clear:

  • Both use the same base model (2025-04-16)
  • Data from both variants is combined under a single label
  • Total: 100 records (50 + 50) across 5 steps = 20 trials per step

CBORG Routing Behavior

From our testing, CBORG treats both aliases as:

  • Functionally identical at the API level
  • Same deployment (azure/o3-2025-04-16)
  • No configuration override based on alias name

The alias openai/o:latest is simply a pointer to openai/o3 at the CBORG routing layer, but the experiments treated them as separate model selections, which is why they produced separate trial data.

Conclusion

openai/o:latest and openai/o3 are technically the same model with the same configuration. They have been merged in the plots under the single label "O3 (2025-04-16)" to:

  • Provide more robust statistics (20 trials per step)
  • Reduce visualization clutter
  • Average out experimental variability
  • Present a clearer picture of the model's typical performance

The merged dataset combines 100 total records (50 + 50) across all 5 steps, providing better statistical reliability than either run alone.