Update README.md
README.md
CHANGED
@@ -203,128 +203,3 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]

[More Information Needed]

# 🧪 ZeroEval Benchmark Report for `e1-Phi4-FT-v2`

## 📘 Model Overview

- **Model Name:** `e1-Phi4-FT-v2`
- **Evaluation Benchmark:** [ZeroEval](https://github.com/WildEval/ZeroEval) - Zebra Grid subset
- **Total Puzzles Evaluated:** 1000
- **Evaluation Mode:** Greedy decoding
- **Prompt Type:** Chain-of-Thought (CoT) logic puzzles with grid constraints

---

## 🔢 Quantitative Performance Summary

| Metric                     | Value   |
|----------------------------|---------|
| **Puzzle Accuracy**        | 33.70%  |
| **Cell Accuracy**          | 47.59%  |
| **No Answer Rate**         | 11.40%  |
| **Easy Puzzle Accuracy**   | 78.21%  |
| **Hard Puzzle Accuracy**   | 16.39%  |
| **Small Puzzle Accuracy**  | 77.19%  |
| **Medium Puzzle Accuracy** | 30.00%  |
| **Large Puzzle Accuracy**  | 3.00%   |
| **XL Puzzle Accuracy**     | 0.00%   |
| **Reason Lens**            | 433.30  |
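
For context, here is a minimal sketch of how these aggregate metrics could be recomputed from per-puzzle outputs. The record layout and field names (`predicted_cells`, `solution_cells`) are assumptions for illustration, not the actual ZeroEval implementation:

```python
def summarize_zebra_results(records: list[dict]) -> dict:
    """Aggregate per-puzzle results into the summary metrics above.

    Each record is assumed to look like:
      {"predicted_cells": {(row, col): value, ...} or None,  # None = no parsable answer
       "solution_cells":  {(row, col): value, ...}}
    """
    n_puzzles = len(records)
    solved = 0          # puzzles where every cell matches the solution
    correct_cells = 0   # matching cells across all puzzles
    total_cells = 0
    no_answer = 0       # puzzles with no parsable answer

    for rec in records:
        gold = rec["solution_cells"]
        pred = rec.get("predicted_cells")
        total_cells += len(gold)
        if not pred:
            no_answer += 1
            continue
        hits = sum(1 for cell, value in gold.items() if pred.get(cell) == value)
        correct_cells += hits
        if hits == len(gold):
            solved += 1

    return {
        "puzzle_accuracy": 100.0 * solved / n_puzzles,
        "cell_accuracy": 100.0 * correct_cells / total_cells,
        "no_answer_rate": 100.0 * no_answer / n_puzzles,
    }
```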

---

## 🔍 Qualitative Performance

### ✅ Strengths

- Strong performance on **easy and small puzzles** (77–78% accuracy).
- Reasoning is often **structured, well-formed**, and aligned with the prompt format.
- Handles simple logic and direct clue application correctly.

### ❌ Weaknesses

- **Fails to generalize to complex, larger puzzles** (e.g., 0% on XL puzzles).
- High **no-answer rate** (~11%), suggesting reasoning breakdowns.
- Tends to **violate global constraints** in 5x6 or 6x4 Zebra puzzles despite localized correctness.
- **Lacks chain-consistency checking**, leading to globally inconsistent solutions.

---

## 🧪 Example Evaluation (Puzzle ID: `lgp-test-5x6-16`)

- The model produced a full solution and reasoning trace.
- Correctly matched some constraints (e.g., nationality and color).
- **Violated at least 3 constraints**, such as:
  - Incorrect relative placement of grilled cheese and Norwegian.
  - Invalid logical sequences (e.g., Dog → Fish → Sci-Fi).
- Output JSON was valid and interpretable but logically inconsistent.
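
This failure mode (a well-formed grid that nonetheless breaks constraints) can be caught mechanically. Below is a minimal, hypothetical post-hoc checker for a simple positional constraint; the grid layout, helper names, and constraint encoding are illustrative assumptions and not part of ZeroEval:

```python
from typing import Dict, Optional, Tuple

# Grid: house index (0-based) -> {attribute: value}, e.g.
# {0: {"Nationality": "Norwegian", "Food": "grilled cheese"}, ...}
Grid = Dict[int, Dict[str, str]]

def house_of(grid: Grid, attr: str, value: str) -> Optional[int]:
    """Return the index of the house where `attr` equals `value`, if any."""
    for idx, attrs in grid.items():
        if attrs.get(attr) == value:
            return idx
    return None

def check_left_of(grid: Grid, a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    """Check a clue of the form: the house matching `a` is somewhere left of the house matching `b`."""
    ia, ib = house_of(grid, *a), house_of(grid, *b)
    return ia is not None and ib is not None and ia < ib

# Hypothetical usage against a predicted grid:
# ok = check_left_of(pred_grid, ("Food", "grilled cheese"), ("Nationality", "Norwegian"))
```

Running every clue of a puzzle through checks like this before emitting the final JSON would flag the kind of inconsistencies described above.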

---

## 📊 Full Comparative Ranking Table (Zebra Grid - Puzzle Accuracy)

| 🥇 Rank | Model Name                        | Puzzle Acc (%) | Cell Acc (%) | No Answer (%) | Reason Lens |
|--------|------------------------------------|----------------|--------------|---------------|-------------|
| 1      | grok-3-mini-fast-beta-high         | 92.60          | 94.63        | 1.00          | 782.25      |
| 2      | o3-mini-2025-01-31-high            | 91.70          | 95.70        | 0.30          | 1983.34     |
| 3      | o3-mini-2025-01-31-medium          | 88.90          | 90.41        | 0.10          | 2067.98     |
| 4      | o1-2024-12-17                      | 81.00          | 78.74        | 0.20          | 1197.51     |
| 5      | grok-3-mini-fast-beta-low          | 80.70          | 84.22        | 0.00          | 874.09      |
| 6      | deepseek-R1                        | 78.70          | 80.54        | 0.00          | 586.33      |
| 7      | o3-mini-2025-01-31-low             | 74.80          | 72.60        | 1.60          | 2080.78     |
| 8      | o1-preview-2024-09-12              | 71.40          | 75.14        | 0.30          | 1565.88     |
| 9      | o1-preview-2024-09-12-v2           | 70.40          | 74.18        | 0.40          | 1559.71     |
| 10     | o1-mini-2024-09-12-v3              | 59.70          | 70.32        | 1.00          | 1166.38     |
| 11     | grok-3-fast-beta                   | 57.90          | 67.73        | 0.00          | 3973.16     |
| 12     | o1-mini-2024-09-12-v2              | 56.80          | 69.87        | 1.30          | 1164.95     |
| 13     | o1-mini-2024-09-12                 | 52.60          | 52.29        | 0.80          | 993.28      |
| 14     | deepseek-v3                        | 42.10          | 42.04        | 27.90         | 2158.00     |
| 15     | claude-3-5-sonnet-20241022         | 36.20          | 54.27        | 0.00          | 861.18      |
| 16     | **e1-Phi4-FT-v2**                  | **33.70**      | 47.59        | 11.40         | 433.30      |
| 17     | claude-3-5-sonnet-20240620         | 33.40          | 54.34        | 0.00          | 1141.94     |
| 18     | Llama-3.1-405B-Inst-fp8@together   | 32.60          | 45.80        | 12.50         | 314.66      |
| 19     | gpt-4o-2024-08-06                  | 31.70          | 50.34        | 3.60          | 1106.51     |
| 20     | gemini-1.5-pro-exp-0827            | 30.50          | 50.84        | 0.80          | 1594.47     |
| 21     | Llama-3.1-405B-Inst@sambanova      | 30.10          | 39.06        | 24.70         | 2001.12     |
| 22     | chatgpt-4o-latest-24-09-07         | 29.90          | 48.83        | 4.20          | 1539.99     |
| 23     | Mistral-Large-2                    | 29.00          | 47.64        | 1.70          | 1592.39     |
| 24     | gpt-4-turbo-2024-04-09             | 28.40          | 47.90        | 0.10          | 1148.46     |
| 25     | gpt-4o-2024-05-13                  | 28.20          | 38.72        | 19.30         | 1643.51     |
| 26     | grok-2-1212                        | 27.70          | 48.16        | 3.50          | 974.00      |

---

## ⚙️ Model Configuration

- **Inference Engine:** vLLM
- **Temperature:** 0.0
- **Top-p:** 1.0
- **Max Tokens:** 4096
- **Repetition Penalty:** 1.0
- **Generation Strategy:** Greedy decoding (deterministic)
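
For reference, these settings roughly correspond to the following vLLM call; the model path and `prompt` variable are placeholders, and the exact evaluation harness code may differ:

```python
from vllm import LLM, SamplingParams

# Greedy, deterministic decoding matching the configuration above.
sampling = SamplingParams(
    temperature=0.0,        # greedy decoding
    top_p=1.0,
    max_tokens=4096,
    repetition_penalty=1.0,
)

llm = LLM(model="path/to/e1-Phi4-FT-v2")  # placeholder model path
prompt = "..."  # a single CoT Zebra Grid puzzle prompt
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```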

---

## 🧠 Recommendations for Improvement

1. **Architecture Improvements**
   - Integrate constraint solvers or logic programming modules.
   - Add symbolic reasoning or intermediate variable tracking.

2. **Training Enhancements**
   - Fine-tune on larger structured reasoning datasets.
   - Use curriculum learning from 3x3 to 6x6 puzzles.

3. **Prompt & Inference Strategy**
   - Use few-shot CoT prompting.
   - Explore self-consistency sampling (majority vote over multiple sampled reasoning chains); see the sketch below.
   - Try symbolic supervision or logic rule augmentation.
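
A minimal sketch of the self-consistency idea from item 3, assuming a vLLM backend: sample several reasoning chains at a non-zero temperature and majority-vote their final answers. The `extract_final_grid` parser below is a crude, hypothetical placeholder:

```python
import json
import re
from collections import Counter
from typing import Optional

from vllm import LLM, SamplingParams

def extract_final_grid(text: str) -> Optional[str]:
    """Hypothetical parser: grab the last {...} block and canonicalize it as JSON."""
    matches = re.findall(r"\{.*\}", text, flags=re.DOTALL)
    if not matches:
        return None
    try:
        return json.dumps(json.loads(matches[-1]), sort_keys=True)
    except json.JSONDecodeError:
        return None

def self_consistent_answer(llm: LLM, prompt: str, n_samples: int = 8) -> str:
    """Sample several reasoning chains and majority-vote their final answers."""
    params = SamplingParams(temperature=0.7, top_p=0.95,
                            max_tokens=4096, n=n_samples)
    request = llm.generate([prompt], params)[0]
    answers = [extract_final_grid(out.text) for out in request.outputs]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else ""
```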

---

## ✅ Conclusion

The `e1-Phi4-FT-v2` model shows solid fundamentals in logic puzzle reasoning, especially on small and easy instances. However, its **limited generalization and constraint tracking** hinder performance on complex Zebra puzzles. With further improvements to its reasoning architecture and training, the model has the potential to move into higher tiers on the ZeroEval leaderboard.

---

*Report generated using the `ZeroEval` Zebra Grid benchmark data and model output logs.*