⸻

VERANTYX Evaluation & Benchmark Guide
How to Evaluate Reasoning — and How to Record Failure

⸻

0. Why Evaluation Is Different in VERANTYX

VERANTYX is not optimized for accuracy metrics. It is optimized for auditable reasoning behavior.

Traditional evaluation questions such as:
• “How many did it get right?”
• “What is the accuracy@k?”
are therefore insufficient and sometimes misleading.

In VERANTYX, failure is not a bug — unexplained success is.

⸻

1. What VERANTYX Is Actually Evaluated On

VERANTYX evaluation focuses on process integrity, not output fluency.

Core evaluation axes:
1. Evidence grounding
2. Correct refusal
3. Noise resistance
4. Assumption sensitivity
5. Reproducibility
6. Audit clarity

⸻

2. Primary Outcome Categories (Do Not Collapse These)

Every evaluation result must fall into exactly one of four explicit buckets:

1. Proven
• Supported by DB rules
• Assumptions satisfied
• No refutation hits

2. Refuted
• One or more counterexample rules fired
• Or necessity checks failed

3. Insufficient Evidence
• No applicable proof or refutation
• Phase 1 (and optionally Phase 2) found nothing reliable

4. Provisional (Mined)
• Depends on Phase 2 rules
• Explicitly labeled as non-final

⚠️ Never merge these into “correct / incorrect”.

⸻

3. Benchmark Design Principles

3.1 Prefer “Trap-Rich” Problems

Good VERANTYX benchmarks include:
• Missing assumptions
• Overgeneralized language (“always”, “all”)
• Vacuity cases
• Confusion between necessary and sufficient conditions

Bad benchmarks consist of:
• Straightforward textbook exercises
• Problems solvable by surface pattern matching

⸻

3.2 Single-Answer Accuracy Is Not Enough

Each benchmark item should be evaluated on:
• Was the selected outcome category correct?
• Did the system refuse when it should have?
• Did it avoid hallucinated certainty?

A correct refusal scores higher than an unjustified answer.

⸻

4. Measuring Success (VERANTYX-Style Metrics)

4.1 Core Metrics

Instead of accuracy, track:
• Justified Resolution Rate: the % of answers that are Proven or Refuted with evidence
• Correct Refusal Rate: the % of Insufficient Evidence verdicts issued when the DB truly lacks coverage
• False Confidence Rate (critical): the % of answers given without a valid proof or refutation

Goal: drive the False Confidence Rate to near zero. A scoring sketch appears after Section 7.

⸻

4.2 Phase Sensitivity Metrics

Track separately:
• Phase 1 success rate
• Phase 2 activation rate
• Phase 2 dependency rate

A growing DB should:
• Increase Phase 1 coverage
• Decrease Phase 2 usage

⸻

5. Failure Is a First-Class Artifact

5.1 What Counts as a “Good Failure”

A good failure:
• Is labeled “insufficient evidence”
• Explains why no rule applied
• Produces zero noise

A bad failure:
• Selects a choice based on a weak lexical match
• Uses mined rules silently
• Looks confident but is unjustified

⸻

5.2 Mandatory Failure Logging

For every failure, record:
• The input problem
• The extracted assumptions
• Which rules were skipped (and why)
• Which rules almost matched
• The phase reached (1 or 2)

Failures are data, not errors.

⸻

6. Regression Testing with Benchmarks

Every DB change should be evaluated against:
• A fixed benchmark suite
• A failure expectation list

Regression is not only:
• “Did it break a correct answer?”
but also:
• “Did it stop refusing where it should?”

⸻

7. Comparing VERANTYX to LLMs (Correctly)

Do not compare:
• Raw accuracy
• Speed
• Fluency

Do compare:
• Hallucination rate
• Refusal correctness
• Sensitivity to missing assumptions
• Stability across re-runs

VERANTYX should lose on fluency — and win on discipline.
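⸻

To make the outcome categories of Section 2 and the metrics of Sections 4.1–4.2 concrete, here is a minimal scoring sketch in Python. It is illustrative only: the record fields, type names, and function name are assumptions made for this guide, not part of any VERANTYX implementation, and normalizing every rate by suite size is just one possible design choice.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    # The four outcome categories of Section 2, never collapsed into correct/incorrect.
    PROVEN = auto()
    REFUTED = auto()
    INSUFFICIENT_EVIDENCE = auto()
    PROVISIONAL = auto()  # mined, Phase-2 dependent, explicitly non-final


@dataclass
class EvalRecord:
    # Hypothetical per-item record; field names are illustrative assumptions.
    outcome: Outcome
    has_valid_evidence: bool  # the proof/refutation is backed by applicable DB rules
    db_lacks_coverage: bool   # ground truth: the DB genuinely cannot decide this item
    phase_reached: int        # 1 or 2


def verantyx_style_metrics(records: list[EvalRecord]) -> dict[str, float]:
    """Section 4.1 and 4.2 metrics, each normalized by suite size."""
    n = len(records)
    resolved = [r for r in records if r.outcome in (Outcome.PROVEN, Outcome.REFUTED)]
    refused = [r for r in records if r.outcome is Outcome.INSUFFICIENT_EVIDENCE]
    return {
        # Proven/Refuted answers that carry valid evidence.
        "justified_resolution_rate": sum(r.has_valid_evidence for r in resolved) / n,
        # Insufficient Evidence verdicts issued where the DB truly lacks coverage.
        "correct_refusal_rate": sum(r.db_lacks_coverage for r in refused) / n,
        # Proven/Refuted answers without valid evidence: drive this toward zero.
        "false_confidence_rate": sum(not r.has_valid_evidence for r in resolved) / n,
        # How often Phase 2 was reached at all.
        "phase2_activation_rate": sum(r.phase_reached == 2 for r in records) / n,
    }
```

Whether each rate is normalized by the whole suite or only by its own subset (for example, correct refusals over all DB-limited items) is a design decision; the sketch uses the whole suite throughout.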
⸻

8. Human-in-the-Loop Evaluation

Recommended practice: periodic human audits of:
• Refusal cases
• Provisional (mined) answers
• High-evidence proofs

Humans should ask: “Would I trust this answer in a paper?”
If not, the system should not trust it either.

⸻

9. Benchmark Suite Composition (Recommended)

A balanced VERANTYX benchmark includes:
• 30–40% unsolvable (DB-limited) problems
• 30% trap problems
• 20% clean axiomatic problems
• 10–20% exploratory / edge cases

If everything is solvable, the benchmark is broken. A composition-check sketch appears at the end of this guide.

⸻

10. Evaluation Anti-Patterns (Avoid These)

❌ Scoring by majority vote
❌ Treating refusal as failure
❌ Hiding Phase 2 usage
❌ Optimizing prompts to “pass benchmarks”

Benchmarks exist to expose limits, not to hide them.

⸻

Closing Note (English)

VERANTYX is successful when:
• It says “I don’t know” often
• And explains why

A benchmark that makes VERANTYX look perfect is a benchmark that teaches nothing.

⸻

VERANTYX Evaluation & Benchmark Guide (Japanese Version)
How to Evaluate, and How to Record Failure

⸻

0. Premises of Evaluation in VERANTYX

VERANTYX is not a system optimized for answer accuracy. What is evaluated is its reasoning behavior.

Measuring it only by:
• accuracy
• top-k accuracy
is therefore dangerous.

In VERANTYX, success that cannot be explained is itself failure.

⸻

1. Evaluate the Process, Not the Result

VERANTYX evaluation axes:
1. Explicitness of evidence
2. Correct refusal
3. Noise resistance
4. Sensitivity to assumptions
5. Reproducibility
6. Auditability

⸻

2. Always Handle Output in Four Categories

At evaluation time, every result is one of:
1. Proven
2. Refuted
3. Insufficient evidence
4. Provisional answer (mined)

⚠️ Do not collapse these into correct/incorrect.

⸻

3. Benchmark Design Guidelines

3.1 Prefer Problems with Traps

Good problems:
• Missing assumptions
• “Always” / “all”
• Vacuous truth
• Confusion of necessary and sufficient conditions

Bad problems:
• Formula-recall exercises
• Problems solvable by surface matching

⸻

3.2 Refusal Is Success

• A correct refusal > an unjustified answer

⸻

4. VERANTYX-Style Metrics

4.1 Core Indicators

• Justified resolution rate
• Correct refusal rate
• False confidence rate (most important)

Goal: drive false confidence as close to zero as possible.

⸻

4.2 Per-Phase Indicators

• Phase 1 success rate
• Phase 2 usage rate
• Phase 2 dependency rate

As the DB grows, Phase 2 usage should decrease.

⸻

5. Failure Is a Deliverable

5.1 A Good Failure

• Explicitly marked as insufficient evidence
• Makes clear why the problem could not be solved
• Zero noise

5.2 A Bad Failure

• A plausible-looking choice
• Silent use of provisional knowledge
• Full of confidence

⸻

5.3 Information Required in the Failure Log

• The problem statement
• The extracted assumptions
• The rejected rules and the reasons
• The phase reached

⸻

6. How to Think About Regression Testing

Check:
• Did a previously correct answer break?
• Does the system now answer where it should refuse?

⸻

7. How to Compare with LLMs

Compare:
• Hallucination rate
• Correctness of refusals
• Assumption sensitivity
• Reproducibility

Do not compare:
• Fluency
• Speed

⸻

8. Human Audit

Periodically, humans should review:
• Refusal cases
• Provisional answers
• High-evidence answers

The criterion: “Could I put this in a paper?”

⸻

9. Recommended Benchmark Composition

• Unsolvable problems: 30–40%
• Trap problems: 30%
• Pure axiomatic problems: 20%
• Edge cases: 10–20%

If everything gets solved, something is wrong.

⸻

10. Evaluation Practices to Avoid

❌ Majority voting
❌ Treating refusal as failure
❌ Hiding Phase 2
❌ Prompts tuned to pass the benchmark

⸻

Closing Note (Japanese)

VERANTYX is in a successful state when it:
• Often says “I don’t know”
• And makes clear why it does not know

A benchmark that looks perfect teaches nothing.
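⸻

The composition guidance in Section 9 can be checked mechanically. The sketch below is an illustration only: the suite format (a list of items tagged with a category string), the category names, and the tolerance bands placed around the single-point recommendations (30% trap, 20% axiomatic) are assumptions made for this guide, not a VERANTYX interface.

```python
from collections import Counter

# Recommended share of each category, widened into (low, high) bands where
# Section 9 gives a single point value.
COMPOSITION_BOUNDS = {
    "unsolvable":  (0.30, 0.40),  # DB-limited problems
    "trap":        (0.25, 0.35),  # around the recommended 30%
    "axiomatic":   (0.15, 0.25),  # around the recommended 20%
    "exploratory": (0.10, 0.20),  # edge cases
}


def check_composition(suite: list[dict]) -> list[str]:
    """Return a warning for every category whose share falls outside its band."""
    counts = Counter(item["category"] for item in suite)
    total = len(suite)
    warnings = []
    for category, (low, high) in COMPOSITION_BOUNDS.items():
        share = counts.get(category, 0) / total
        if not low <= share <= high:
            warnings.append(f"{category}: {share:.0%} is outside {low:.0%}–{high:.0%}")
    if counts.get("unsolvable", 0) == 0:
        warnings.append("no unsolvable items: if everything is solvable, the benchmark is broken")
    return warnings
```

A suite that produces no warnings is not thereby a good benchmark; it merely has the recommended shape.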