⸻
VERANTYX Evaluation & Benchmark Guide
How to Evaluate Reasoning — and How to Record Failure
⸻
0. Why Evaluation Is Different in VERANTYX
VERANTYX is not optimized for accuracy metrics.
It is optimized for auditable reasoning behavior.
Therefore, traditional evaluation questions like:
• “How many did it get right?”
• “What is the accuracy@k?”
are insufficient and sometimes misleading.
In VERANTYX, failure is not a bug — unexplained success is.
⸻
1. What VERANTYX Is Actually Evaluated On
VERANTYX evaluation focuses on process integrity, not output fluency.
Core evaluation axes:
1. Evidence grounding
2. Correct refusal
3. Noise resistance
4. Assumption sensitivity
5. Reproducibility
6. Audit clarity
⸻
2. Primary Outcome Categories (Do Not Collapse These)
Every evaluation result must fall into one of four explicit buckets (see the sketch after this list):
1. Proven
  • Supported by DB rules
  • Assumptions satisfied
  • No refutation hits
2. Refuted
  • One or more counterexample rules fired
  • Or necessity checks failed
3. Insufficient Evidence
  • No applicable proof or refutation
  • Phase 1 (and optionally Phase 2) found nothing reliable
4. Provisional (Mined)
  • Depends on Phase 2 rules
  • Explicitly labeled as non-final
⚠️ Never merge these into “correct / incorrect”.
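As a minimal sketch, the four buckets can be modeled as a closed enumeration so scoring code cannot silently collapse them. All names here are illustrative assumptions, not the actual VERANTYX API:

```python
from enum import Enum

class Outcome(Enum):
    """The four VERANTYX outcome buckets (hypothetical names)."""
    PROVEN = "proven"                  # supported by DB rules, assumptions satisfied
    REFUTED = "refuted"                # a counterexample rule fired or necessity failed
    INSUFFICIENT = "insufficient"      # no applicable proof or refutation
    PROVISIONAL = "provisional_mined"  # depends on Phase 2 (mined) rules, non-final

def describe(outcome: Outcome) -> str:
    # Exhaustive mapping: collapsing buckets into "correct/incorrect" is a bug.
    return {
        Outcome.PROVEN: "justified resolution",
        Outcome.REFUTED: "justified resolution",
        Outcome.INSUFFICIENT: "refusal (correct if the DB truly lacks coverage)",
        Outcome.PROVISIONAL: "non-final; must be visibly labeled as mined",
    }[outcome]
```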
⸻
3. Benchmark Design Principles
3.1 Prefer “Trap-Rich” Problems
Good VERANTYX benchmarks include:
• Missing assumptions
• Overgeneralized language (“always”, “all”)
• Vacuity cases
• Necessary vs. sufficient confusion
Bad benchmarks:
• Straightforward textbook exercises
• Problems solvable by surface pattern matching
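A sketch of what a trap-rich item might look like as data (the field names are illustrative, not the actual VERANTYX schema):

```python
# Illustrative trap item: an overgeneralization that a counterexample
# rule should refute. All keys are hypothetical.
trap_item = {
    "id": "trap-001",
    "statement": "Every continuous function is differentiable.",
    "expected_outcome": "refuted",            # e.g. |x| at 0 is a counterexample
    "trap_kind": "overgeneralized_language",  # unqualified "every"
}
```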
⸻
3.2 Single-Answer Accuracy Is Not Enough
Each benchmark item should be evaluated on:
• Was the selected outcome category correct?
• Did the system refuse when it should?
• Did it avoid hallucinated certainty?
A correct refusal scores higher than an unjustified answer.
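One way to make that ordering concrete, as a sketch (the weights are assumptions, not VERANTYX constants):

```python
def item_score(expected: str, actual: str) -> float:
    """Per-item scoring sketch: a correct refusal outranks an
    unjustified answer. Outcome strings follow the four buckets."""
    if actual == expected:
        return 1.0   # correct bucket, full credit
    if actual == "insufficient":
        return 0.5   # over-cautious refusal: partial credit
    return 0.0       # confident but unjustified: worst case
```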
⸻
4. Measuring Success (VERANTYX-style Metrics)
4.1 Core Metrics
Instead of accuracy, track:
• Justified Resolution Rate
  • % of answers that are Proven or Refuted with evidence
• Correct Refusal Rate
  • % of Insufficient Evidence verdicts issued when the DB truly lacks coverage
• False Confidence Rate (Critical)
  • % of answers given without a valid proof or refutation
Goal: drive the False Confidence Rate to near zero.
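A hedged computation sketch (the input triples and counting conventions are assumptions; adapt them to your logging schema):

```python
from collections import Counter

def core_metrics(results):
    """results: iterable of (outcome, db_has_coverage, has_evidence)
    triples. Field conventions here are illustrative assumptions."""
    results = list(results)
    n = max(1, len(results))  # guard against an empty run
    refusals = Counter(o for o, _, _ in results)["insufficient"]

    justified = sum(1 for o, _, ev in results
                    if o in ("proven", "refuted") and ev)
    correct_refusals = sum(1 for o, cov, _ in results
                           if o == "insufficient" and not cov)
    false_confidence = sum(1 for o, _, ev in results
                           if o in ("proven", "refuted") and not ev)

    return {
        "justified_resolution_rate": justified / n,
        "correct_refusal_rate": correct_refusals / max(1, refusals),
        "false_confidence_rate": false_confidence / n,  # drive toward zero
    }
```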
⸻
4.2 Phase Sensitivity Metrics
Track separately:
• Phase 1 success rate
• Phase 2 activation rate
• Phase 2 dependency rate
A growing DB should:
• Increase Phase 1 coverage
• Decrease Phase 2 usage
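That expectation can be checked mechanically across DB snapshots; a sketch, assuming a simple snapshot format:

```python
def phase_trend_ok(snapshots: list[dict]) -> bool:
    """snapshots: chronological metric dicts, e.g.
    {"phase1_success": 0.62, "phase2_activation": 0.21}
    (illustrative keys). A growing DB should push Phase 1
    coverage up and Phase 2 usage down."""
    first, last = snapshots[0], snapshots[-1]
    return (last["phase1_success"] >= first["phase1_success"]
            and last["phase2_activation"] <= first["phase2_activation"])
```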
⸻
5. Failure Is a First-Class Artifact
5.1 What Counts as a “Good Failure”
A good failure:
• Is labeled “Insufficient Evidence”
• Explains why no rule applied
• Produces zero noise
A bad failure:
• Selects a choice on a weak lexical match
• Uses mined rules silently
• Looks confident but is unjustified
⸻
5.2 Mandatory Failure Logging
For every failure, record:
• Input problem
• Extracted assumptions
• Which rules were skipped (and why)
• Which rules almost matched
• Phase reached (1 or 2)
Failures are data, not errors.
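A minimal sketch of such a record (field names are hypothetical, not the actual log schema):

```python
from dataclasses import dataclass, field

@dataclass
class FailureRecord:
    """One failure captured as data. All names are illustrative."""
    problem: str                     # input problem text
    assumptions: list[str]           # extracted assumptions
    skipped_rules: dict[str, str] = field(default_factory=dict)  # rule id -> reason skipped
    near_misses: list[str] = field(default_factory=list)         # rules that almost matched
    phase_reached: int = 1           # 1 or 2
```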
⸻
6. Regression Testing with Benchmarks
Every DB change should be evaluated against:
• A fixed benchmark suite
• A failure expectation list
Regression is not only:
• “Did it break a correct answer?”
but also:
• “Did it stop refusing where it should?”
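Both directions can be flagged from per-item outcome maps; a sketch under assumed conventions (items keyed by id, outcomes as the four bucket strings):

```python
def regressions(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Flag classic regressions (a justified answer broke) and refusal
    regressions (an expected refusal turned into an answer)."""
    flagged = []
    for item_id, old in before.items():
        new = after.get(item_id, "insufficient")
        if old in ("proven", "refuted") and new not in ("proven", "refuted"):
            flagged.append(f"{item_id}: lost justified resolution ({old} -> {new})")
        if old == "insufficient" and new in ("proven", "refuted"):
            # Answering where it used to refuse needs review: it is only
            # legitimate if new DB rules actually justify the answer.
            flagged.append(f"{item_id}: stopped refusing ({old} -> {new})")
    return flagged
```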
⸻
7. Comparing VERANTYX to LLMs (Correctly)
Do not compare:
• Raw accuracy
• Speed
• Fluency
Do compare:
• Hallucination rate
• Refusal correctness
• Sensitivity to missing assumptions
• Stability across re-runs
VERANTYX should lose on fluency — and win on discipline.
⸻
8. Human-in-the-Loop Evaluation
Recommended practice: a periodic human audit of:
• Refusal cases
• Provisional (mined) answers
• High-evidence proofs
Humans should ask:
“Would I trust this answer in a paper?”
If not, the system should not either.
⸻
9. Benchmark Suite Composition (Recommended)
A balanced VERANTYX benchmark includes:
• 30–40% unsolvable (DB-limited) problems
• 30% trap problems
• 20% clean axiomatic problems
• 10–20% exploratory / edge cases
If everything is solvable, the benchmark is broken.
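A sketch that validates a suite against these bands (the category tags are assumptions about how items are labeled):

```python
from collections import Counter

# Target shares from the recommended composition above.
BANDS = {
    "unsolvable": (0.30, 0.40),
    "trap": (0.30, 0.30),
    "axiomatic": (0.20, 0.20),
    "exploratory": (0.10, 0.20),
}

def composition_ok(tags: list[str], tolerance: float = 0.05) -> bool:
    """tags: one category label per benchmark item. Point targets
    (trap, axiomatic) are checked with a small tolerance."""
    if not tags:
        return False
    n = len(tags)
    shares = Counter(tags)
    return all(lo - tolerance <= shares[cat] / n <= hi + tolerance
               for cat, (lo, hi) in BANDS.items())
```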
⸻
10. Evaluation Anti-Patterns (Avoid These)
❌ Scoring by majority vote
❌ Treating refusal as failure
❌ Hiding Phase 2 usage
❌ Optimizing prompts to “pass benchmarks”
Benchmarks exist to expose limits, not hide them.
⸻
Closing Note
VERANTYX is successful when:
• It says “I don’t know” often
• And explains why
A benchmark that makes VERANTYX look perfect is a benchmark that teaches nothing.
⸻