Update README.md
README.md
CHANGED
@@ -203,128 +203,3 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]

[More Information Needed]

# 🧪 ZeroEval Benchmark Report for `e1-Phi4-FT-v2`

## 📘 Model Overview

- **Model Name:** `e1-Phi4-FT-v2`
- **Evaluation Benchmark:** [ZeroEval](https://github.com/WildEval/ZeroEval) - Zebra Grid subset
- **Total Puzzles Evaluated:** 1000
- **Evaluation Mode:** Greedy decoding
- **Prompt Type:** Chain-of-Thought (CoT) logic puzzles with grid constraints

---

## 🔢 Quantitative Performance Summary

| Metric                     | Value   |
|----------------------------|---------|
| **Puzzle Accuracy**        | 33.70%  |
| **Cell Accuracy**          | 47.59%  |
| **No Answer Rate**         | 11.40%  |
| **Easy Puzzle Accuracy**   | 78.21%  |
| **Hard Puzzle Accuracy**   | 16.39%  |
| **Small Puzzle Accuracy**  | 77.19%  |
| **Medium Puzzle Accuracy** | 30.00%  |
| **Large Puzzle Accuracy**  | 3.00%   |
| **XL Puzzle Accuracy**     | 0.00%   |
| **Reason Lens**            | 433.30  |
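
For context, here is a minimal sketch of how these aggregate metrics could be recomputed from per-puzzle outputs. The record layout and field names (`predicted_cells`, `solution_cells`) are assumptions for illustration, not the actual ZeroEval implementation:

```python
def summarize_zebra_results(records: list[dict]) -> dict:
    """Aggregate per-puzzle results into the summary metrics above.

    Each record is assumed to look like:
      {"predicted_cells": {(row, col): value, ...} or None,  # None = no parsable answer
       "solution_cells":  {(row, col): value, ...}}
    """
    n_puzzles = len(records)
    solved = 0          # puzzles where every cell matches the solution
    correct_cells = 0   # matching cells across all puzzles
    total_cells = 0
    no_answer = 0       # puzzles with no parsable answer

    for rec in records:
        gold = rec["solution_cells"]
        pred = rec.get("predicted_cells")
        total_cells += len(gold)
        if not pred:
            no_answer += 1
            continue
        hits = sum(1 for cell, value in gold.items() if pred.get(cell) == value)
        correct_cells += hits
        if hits == len(gold):
            solved += 1

    return {
        "puzzle_accuracy": 100.0 * solved / n_puzzles,
        "cell_accuracy": 100.0 * correct_cells / total_cells,
        "no_answer_rate": 100.0 * no_answer / n_puzzles,
    }
```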

---

## 🔍 Qualitative Performance

### ✅ Strengths

- Strong performance on **easy and small puzzles** (77–78% accuracy).
- Reasoning is often **structured, well-formed**, and aligned with the prompt format.
- Handles simple logic and direct clue application correctly.

### ❌ Weaknesses

- **Fails to generalize to complex, larger puzzles** (e.g., 0% on XL puzzles).
- High **no-answer rate** (~11%), suggesting reasoning breakdowns.
- Tends to **violate global constraints** in 5x6 or 6x4 Zebra puzzles despite localized correctness.
- **Lacks chain-consistency checking**, leading to globally inconsistent solutions.

---

## 🧪 Example Evaluation (Puzzle ID: `lgp-test-5x6-16`)

- The model produced a full solution and reasoning trace.
- Correctly matched some constraints (e.g., nationality and color).
- **Violated at least 3 constraints**, such as:
  - Incorrect relative placement of grilled cheese and Norwegian.
  - Invalid logical sequences (e.g., Dog → Fish → Sci-Fi).
- Output JSON was valid and interpretable but logically inconsistent.
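
This failure mode (a well-formed grid that nonetheless breaks constraints) can be caught mechanically. Below is a minimal, hypothetical post-hoc checker for a simple positional constraint; the grid layout, helper names, and constraint encoding are illustrative assumptions and not part of ZeroEval:

```python
from typing import Dict, Optional, Tuple

# Grid: house index (0-based) -> {attribute: value}, e.g.
# {0: {"Nationality": "Norwegian", "Food": "grilled cheese"}, ...}
Grid = Dict[int, Dict[str, str]]

def house_of(grid: Grid, attr: str, value: str) -> Optional[int]:
    """Return the index of the house where `attr` equals `value`, if any."""
    for idx, attrs in grid.items():
        if attrs.get(attr) == value:
            return idx
    return None

def check_left_of(grid: Grid, a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    """Check a clue of the form: the house matching `a` is somewhere left of the house matching `b`."""
    ia, ib = house_of(grid, *a), house_of(grid, *b)
    return ia is not None and ib is not None and ia < ib

# Hypothetical usage against a predicted grid:
# ok = check_left_of(pred_grid, ("Food", "grilled cheese"), ("Nationality", "Norwegian"))
```

Running every clue of a puzzle through checks like this before emitting the final JSON would flag the kind of inconsistencies described above.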

---

## 📊 Full Comparative Ranking Table (Zebra Grid - Puzzle Accuracy)

| 🥇 Rank | Model Name                        | Puzzle Acc (%) | Cell Acc (%) | No Answer (%) | Reason Lens |
|--------|------------------------------------|----------------|--------------|---------------|-------------|
| 1      | grok-3-mini-fast-beta-high         | 92.60          | 94.63        | 1.00          | 782.25      |
| 2      | o3-mini-2025-01-31-high            | 91.70          | 95.70        | 0.30          | 1983.34     |
| 3      | o3-mini-2025-01-31-medium          | 88.90          | 90.41        | 0.10          | 2067.98     |
| 4      | o1-2024-12-17                      | 81.00          | 78.74        | 0.20          | 1197.51     |
| 5      | grok-3-mini-fast-beta-low          | 80.70          | 84.22        | 0.00          | 874.09      |
| 6      | deepseek-R1                        | 78.70          | 80.54        | 0.00          | 586.33      |
| 7      | o3-mini-2025-01-31-low             | 74.80          | 72.60        | 1.60          | 2080.78     |
| 8      | o1-preview-2024-09-12              | 71.40          | 75.14        | 0.30          | 1565.88     |
| 9      | o1-preview-2024-09-12-v2           | 70.40          | 74.18        | 0.40          | 1559.71     |
| 10     | o1-mini-2024-09-12-v3              | 59.70          | 70.32        | 1.00          | 1166.38     |
| 11     | grok-3-fast-beta                   | 57.90          | 67.73        | 0.00          | 3973.16     |
| 12     | o1-mini-2024-09-12-v2              | 56.80          | 69.87        | 1.30          | 1164.95     |
| 13     | o1-mini-2024-09-12                 | 52.60          | 52.29        | 0.80          | 993.28      |
| 14     | deepseek-v3                        | 42.10          | 42.04        | 27.90         | 2158.00     |
| 15     | claude-3-5-sonnet-20241022         | 36.20          | 54.27        | 0.00          | 861.18      |
| 16     | **e1-Phi4-FT-v2**                  | **33.70**      | 47.59        | 11.40         | 433.30      |
| 17     | claude-3-5-sonnet-20240620         | 33.40          | 54.34        | 0.00          | 1141.94     |
| 18     | Llama-3.1-405B-Inst-fp8@together   | 32.60          | 45.80        | 12.50         | 314.66      |
| 19     | gpt-4o-2024-08-06                  | 31.70          | 50.34        | 3.60          | 1106.51     |
| 20     | gemini-1.5-pro-exp-0827            | 30.50          | 50.84        | 0.80          | 1594.47     |
| 21     | Llama-3.1-405B-Inst@sambanova      | 30.10          | 39.06        | 24.70         | 2001.12     |
| 22     | chatgpt-4o-latest-24-09-07         | 29.90          | 48.83        | 4.20          | 1539.99     |
| 23     | Mistral-Large-2                    | 29.00          | 47.64        | 1.70          | 1592.39     |
| 24     | gpt-4-turbo-2024-04-09             | 28.40          | 47.90        | 0.10          | 1148.46     |
| 25     | gpt-4o-2024-05-13                  | 28.20          | 38.72        | 19.30         | 1643.51     |
| 26     | grok-2-1212                        | 27.70          | 48.16        | 3.50          | 974.00      |

---

## ⚙️ Model Configuration

- **Inference Engine:** vLLM
- **Temperature:** 0.0
- **Top-p:** 1.0
- **Max Tokens:** 4096
- **Repetition Penalty:** 1.0
- **Generation Strategy:** Greedy decoding (deterministic)
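
For reference, these settings roughly correspond to the following vLLM call; the model path and `prompt` variable are placeholders, and the exact evaluation harness code may differ:

```python
from vllm import LLM, SamplingParams

# Greedy, deterministic decoding matching the configuration above.
sampling = SamplingParams(
    temperature=0.0,        # greedy decoding
    top_p=1.0,
    max_tokens=4096,
    repetition_penalty=1.0,
)

llm = LLM(model="path/to/e1-Phi4-FT-v2")  # placeholder model path
prompt = "..."  # a single CoT Zebra Grid puzzle prompt
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```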

---

## 🧠 Recommendations for Improvement

1. **Architecture Improvements**
   - Integrate constraint solvers or logic programming modules.
   - Add symbolic reasoning or intermediate variable tracking.

2. **Training Enhancements**
   - Fine-tune on larger structured reasoning datasets.
   - Use curriculum learning from 3x3 to 6x6 puzzles.

3. **Prompt & Inference Strategy**
   - Use few-shot CoT prompting.
   - Explore self-consistency sampling (majority vote over multiple sampled reasoning chains); see the sketch below.
   - Try symbolic supervision or logic rule augmentation.
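
A minimal sketch of the self-consistency idea from item 3, assuming a vLLM backend: sample several reasoning chains at a non-zero temperature and majority-vote their final answers. The `extract_final_grid` parser below is a crude, hypothetical placeholder:

```python
import json
import re
from collections import Counter
from typing import Optional

from vllm import LLM, SamplingParams

def extract_final_grid(text: str) -> Optional[str]:
    """Hypothetical parser: grab the last {...} block and canonicalize it as JSON."""
    matches = re.findall(r"\{.*\}", text, flags=re.DOTALL)
    if not matches:
        return None
    try:
        return json.dumps(json.loads(matches[-1]), sort_keys=True)
    except json.JSONDecodeError:
        return None

def self_consistent_answer(llm: LLM, prompt: str, n_samples: int = 8) -> str:
    """Sample several reasoning chains and majority-vote their final answers."""
    params = SamplingParams(temperature=0.7, top_p=0.95,
                            max_tokens=4096, n=n_samples)
    request = llm.generate([prompt], params)[0]
    answers = [extract_final_grid(out.text) for out in request.outputs]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else ""
```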

---

## ✅ Conclusion

The `e1-Phi4-FT-v2` model shows solid fundamentals in logic puzzle reasoning, especially on small and easy instances. However, its **limited generalization and constraint tracking** hinder performance on complex Zebra puzzles. With further improvements to its reasoning architecture and training, the model has the potential to move into higher tiers on the ZeroEval leaderboard.

---

*Report generated using the `ZeroEval` Zebra Grid benchmark data and model output logs.*