sp-embraceable committed
Commit 5a39286 · verified · 1 Parent(s): 56d6289

Update README.md

Files changed (1)
  1. README.md +0 -125
README.md CHANGED
@@ -203,128 +203,3 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
 
 [More Information Needed]
 
- # 🧪 ZeroEval Benchmark Report for `e1-Phi4-FT-v2`
-
- ## 📘 Model Overview
-
- - **Model Name:** `e1-Phi4-FT-v2`
- - **Evaluation Benchmark:** [ZeroEval](https://github.com/WildEval/ZeroEval) - Zebra Grid subset
- - **Total Puzzles Evaluated:** 1000
- - **Evaluation Mode:** Greedy decoding
- - **Prompt Type:** Chain-of-Thought (CoT) logic puzzles with grid constraints
-
- ---
-
- ## 🔢 Quantitative Performance Summary
-
- | Metric | Value |
- |---------------------------|-----------|
- | **Puzzle Accuracy** | 33.70% |
- | **Cell Accuracy** | 47.59% |
- | **No Answer Rate** | 11.40% |
- | **Easy Puzzle Accuracy** | 78.21% |
- | **Hard Puzzle Accuracy** | 16.39% |
- | **Small Puzzle Accuracy** | 77.19% |
- | **Medium Puzzle Accuracy** | 30.00% |
- | **Large Puzzle Accuracy** | 3.00% |
- | **XL Puzzle Accuracy** | 0.00% |
- | **Reason Lens** | 433.30 |
-
- ---
-
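The two headline metrics differ in granularity: puzzle accuracy credits a puzzle only when every cell of the grid is correct, while cell accuracy pools partially correct grids. A minimal sketch of that distinction (the per-puzzle result format below is a hypothetical illustration, not ZeroEval's actual log schema):

```python
def puzzle_and_cell_accuracy(results):
    """Compute puzzle-level and cell-level accuracy.

    `results` is a list of (correct_cells, total_cells) pairs,
    one pair per evaluated puzzle (hypothetical format).
    """
    # A puzzle counts as solved only if every cell is correct.
    puzzle_acc = sum(1 for c, t in results if c == t) / len(results)
    # Cell accuracy pools all cells across all puzzles.
    cell_acc = sum(c for c, _ in results) / sum(t for _, t in results)
    return puzzle_acc, cell_acc
```

This is why the model's cell accuracy (47.59%) sits well above its puzzle accuracy (33.70%): many grids are mostly right but spoiled by a few wrong cells.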
- ## 🔍 Qualitative Performance
-
- ### ✅ Strengths
-
- - Strong performance on **easy and small puzzles** (77–78% accuracy).
- - Reasoning is often **structured, well-formed**, and aligned to the prompt format.
- - Handles simple logic and direct clue application correctly.
-
- ### ❌ Weaknesses
-
- - **Fails to generalize to complex, larger puzzles** (e.g., 0% on XL puzzles).
- - High **"no answer" rate** (~11%), suggesting reasoning breakdowns.
- - Tends to **violate global constraints** in 5x6 or 6x4 Zebra puzzles despite localized correctness.
- - **Lacks chain-consistency checking**, leading to unsatisfiable global solutions.
-
- ---
-
- ## 🧪 Example Evaluation (Puzzle ID: `lgp-test-5x6-16`)
-
- - The model produced a full solution and reasoning trace.
- - Correctly matched some constraints (e.g., nationality and color).
- - **Violated at least 3 constraints**, such as:
-   - Incorrect relative placement of grilled cheese and Norwegian.
-   - Invalid logical sequences (e.g., Dog → Fish → Sci-Fi).
- - Output JSON was valid and interpretable but logically inconsistent.
-
- ---
-
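Failures like these are mechanically detectable: once the output JSON is parsed, a candidate grid can be checked against every clue after decoding. A minimal sketch of such a post-hoc verifier (the constraint encoding below is a hypothetical illustration, not ZeroEval's format):

```python
def violated_constraints(solution, constraints):
    """Return the names of constraints a candidate solution fails.

    `solution` maps an attribute to its house index; each constraint
    is a (name, predicate) pair over that mapping (hypothetical encoding).
    """
    return [name for name, pred in constraints if not pred(solution)]

# Toy clue: the Norwegian sits immediately left of the grilled-cheese eater.
constraints = [
    ("norwegian_left_of_grilled_cheese",
     lambda s: s["norwegian"] == s["grilled_cheese"] - 1),
]
```

A verifier like this would have flagged the solution above as unsatisfiable before it was emitted, which is the "chain-consistency checking" the model currently lacks.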
- ## 📊 Full Comparative Ranking Table (Zebra Grid - Puzzle Accuracy)
-
- | 🥇 Rank | Model Name | Puzzle Acc (%) | Cell Acc (%) | No Answer (%) | Reason Lens |
- |--------|--------------------------------------|----------------|---------------|----------------|--------------|
- | 1 | grok-3-mini-fast-beta-high | 92.60 | 94.63 | 1.00 | 782.25 |
- | 2 | o3-mini-2025-01-31-high | 91.70 | 95.70 | 0.30 | 1983.34 |
- | 3 | o3-mini-2025-01-31-medium | 88.90 | 90.41 | 0.10 | 2067.98 |
- | 4 | o1-2024-12-17 | 81.00 | 78.74 | 0.20 | 1197.51 |
- | 5 | grok-3-mini-fast-beta-low | 80.70 | 84.22 | 0.00 | 874.09 |
- | 6 | deepseek-R1 | 78.70 | 80.54 | 0.00 | 586.33 |
- | 7 | o3-mini-2025-01-31-low | 74.80 | 72.60 | 1.60 | 2080.78 |
- | 8 | o1-preview-2024-09-12 | 71.40 | 75.14 | 0.30 | 1565.88 |
- | 9 | o1-preview-2024-09-12-v2 | 70.40 | 74.18 | 0.40 | 1559.71 |
- | 10 | o1-mini-2024-09-12-v3 | 59.70 | 70.32 | 1.00 | 1166.38 |
- | 11 | grok-3-fast-beta | 57.90 | 67.73 | 0.00 | 3973.16 |
- | 12 | o1-mini-2024-09-12-v2 | 56.80 | 69.87 | 1.30 | 1164.95 |
- | 13 | o1-mini-2024-09-12 | 52.60 | 52.29 | 0.80 | 993.28 |
- | 14 | deepseek-v3 | 42.10 | 42.04 | 27.90 | 2158.00 |
- | 15 | claude-3-5-sonnet-20241022 | 36.20 | 54.27 | 0.00 | 861.18 |
- | 16 | **e1-Phi4-FT-v2** | **33.70** | 47.59 | 11.40 | 433.30 |
- | 17 | claude-3-5-sonnet-20240620 | 33.40 | 54.34 | 0.00 | 1141.94 |
- | 18 | Llama-3.1-405B-Inst-fp8@together | 32.60 | 45.80 | 12.50 | 314.66 |
- | 19 | gpt-4o-2024-08-06 | 31.70 | 50.34 | 3.60 | 1106.51 |
- | 20 | gemini-1.5-pro-exp-0827 | 30.50 | 50.84 | 0.80 | 1594.47 |
- | 21 | Llama-3.1-405B-Inst@sambanova | 30.10 | 39.06 | 24.70 | 2001.12 |
- | 22 | chatgpt-4o-latest-24-09-07 | 29.90 | 48.83 | 4.20 | 1539.99 |
- | 23 | Mistral-Large-2 | 29.00 | 47.64 | 1.70 | 1592.39 |
- | 24 | gpt-4-turbo-2024-04-09 | 28.40 | 47.90 | 0.10 | 1148.46 |
- | 25 | gpt-4o-2024-05-13 | 28.20 | 38.72 | 19.30 | 1643.51 |
- | 26 | grok-2-1212 | 27.70 | 48.16 | 3.50 | 974.00 |
-
- ---
-
- ## ⚙️ Model Configuration
-
- - **Inference Engine:** vLLM
- - **Temperature:** 0.0
- - **Top-p:** 1.0
- - **Max Tokens:** 4096
- - **Repetition Penalty:** 1.0
- - **Generation Strategy:** Greedy decoding (deterministic)
-
- ---
-
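The settings above describe plain greedy decoding. As a sketch, they can be captured in a config dict whose keys mirror vLLM's sampling parameters (a minimal illustration, not the actual evaluation harness):

```python
# Decoding configuration used for the evaluation; keys mirror
# vLLM's SamplingParams fields (illustrative, not the harness code).
SAMPLING_CONFIG = {
    "temperature": 0.0,         # 0.0 => greedy (argmax) decoding
    "top_p": 1.0,               # no nucleus truncation
    "max_tokens": 4096,         # generation budget per puzzle
    "repetition_penalty": 1.0,  # penalty disabled
}

def is_greedy(cfg):
    # Greedy decoding is deterministic: temperature 0 makes every
    # step an argmax, so reruns reproduce the same output.
    return cfg["temperature"] == 0.0
```

Determinism matters for the report: with this setup, each accuracy figure reflects a single reproducible generation per puzzle rather than an average over samples.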
- ## 🧠 Recommendations for Improvement
-
- 1. **Architecture Improvements**
-    - Integrate constraint solvers or logic programming modules.
-    - Add symbolic reasoning or intermediate variable tracking.
-
- 2. **Training Enhancements**
-    - Fine-tune on larger structured reasoning datasets.
-    - Use curriculum learning from 3x3 to 6x6 puzzles.
-
- 3. **Prompt & Inference Strategy**
-    - Use few-shot CoT prompting.
-    - Explore self-consistency sampling (e.g., majority vote from top-k outputs).
-    - Try symbolic supervision or logic rule augmentation.
-
- ---
-
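The self-consistency idea in item 3 can be sketched in a few lines: sample several reasoning chains at a non-zero temperature, extract each chain's final answer, and return the majority vote (a minimal sketch; answer extraction from the chains is assumed to happen upstream):

```python
from collections import Counter

def self_consistency_vote(answers):
    """Majority vote over final answers extracted from independently
    sampled reasoning chains (self-consistency decoding)."""
    votes = Counter(answers)
    answer, _count = votes.most_common(1)[0]
    return answer
```

Since the current evaluation uses greedy decoding (one chain per puzzle), adopting this would trade the determinism noted above for robustness to single-chain reasoning slips.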
- ## ✅ Conclusion
-
- The `e1-Phi4-FT-v2` model shows fundamental capabilities in logic puzzle reasoning, especially on smaller and easier instances. However, its **limited generalization and constraint tracking** hinder performance on complex Zebra puzzles. With further improvements in reasoning architecture and training, the model could move into higher tiers on the ZeroEval leaderboard.
-
- ---
-
- *Report generated using the `ZeroEval` Zebra Grid benchmark data and model output logs.*
 