⸻
VERANTYX Evaluation & Benchmark Guide
How to Evaluate Reasoning — and How to Record Failure
⸻
Why Evaluation Is Different in VERANTYX
VERANTYX is not optimized for accuracy metrics. It is optimized for auditable reasoning behavior.
Therefore, traditional evaluation questions like:
• “How many did it get right?”
• “What is the accuracy@k?”
are insufficient and sometimes misleading.
In VERANTYX, failure is not a bug — unexplained success is.
⸻
1. What VERANTYX Is Actually Evaluated On
VERANTYX evaluation focuses on process integrity, not output fluency.
Core evaluation axes:
1. Evidence grounding
2. Correct refusal
3. Noise resistance
4. Assumption sensitivity
5. Reproducibility
6. Audit clarity
⸻
2. Primary Outcome Categories (Do Not Collapse These)
Every evaluation result must fall into one of four explicit buckets:
1. Proven
• Supported by DB rules
• Assumptions satisfied
• No refutation hits
2. Refuted
• One or more counterexample rules fired
• Or necessity checks failed
3. Insufficient Evidence
• No applicable proof or refutation
• Phase 1 (and optionally Phase 2) found nothing reliable
4. Provisional (Mined)
• Depends on Phase 2 rules
• Explicitly labeled as non-final
⚠️ Never merge these into “correct / incorrect”.
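One way to keep the four buckets from collapsing in evaluation tooling is to encode them as a closed enum. The sketch below is illustrative only; the names and the `is_final` helper are assumptions, not part of VERANTYX itself:

```python
from enum import Enum


class Outcome(Enum):
    """The four outcome buckets; collapsing them loses audit information."""
    PROVEN = "proven"              # supported by DB rules, assumptions satisfied
    REFUTED = "refuted"            # a counterexample rule fired, or necessity failed
    INSUFFICIENT = "insufficient"  # no applicable proof or refutation
    PROVISIONAL = "provisional"    # depends on Phase 2 (mined) rules; non-final


def is_final(outcome: Outcome) -> bool:
    """Only Proven and Refuted may be reported as final answers."""
    return outcome in (Outcome.PROVEN, Outcome.REFUTED)
```

Scoring scripts can then branch on `Outcome` directly instead of on a binary correct/incorrect flag.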
⸻
3. Benchmark Design Principles
3.1 Prefer “Trap-Rich” Problems
Good VERANTYX benchmarks include:
• Missing assumptions
• Overgeneralized language (“always”, “all”)
• Vacuity cases
• Necessary-vs-sufficient confusion

Bad benchmarks:
• Straightforward textbook exercises
• Problems solvable by surface pattern matching
⸻
3.2 Single-Answer Accuracy Is Not Enough
Each benchmark item should be evaluated on:
• Was the selected outcome category correct?
• Did the system refuse when it should?
• Did it avoid hallucinated certainty?
A correct refusal scores higher than an unjustified answer.
⸻
4. Measuring Success (VERANTYX-style Metrics)
4.1 Core Metrics
Instead of accuracy, track:
• Justified Resolution Rate: the % of answers that are Proven or Refuted with evidence
• Correct Refusal Rate: the % of Insufficient Evidence verdicts on items where the DB truly lacks coverage
• False Confidence Rate (critical): the % of answers given without a valid proof or refutation
The goal is to drive the False Confidence Rate to near zero.
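The three metrics can be computed from per-item records. A minimal sketch under assumed field names (`Result` and its attributes are illustrative; `db_has_coverage` stands for ground-truth knowledge of whether the DB could have decided the item):

```python
from dataclasses import dataclass


@dataclass
class Result:
    outcome: str           # "proven" | "refuted" | "insufficient" | "provisional"
    has_evidence: bool     # a valid proof/refutation trace was produced
    db_has_coverage: bool  # ground truth: the DB could have decided this item


def verantyx_metrics(results: list[Result]) -> dict[str, float]:
    n = len(results)
    # Justified resolutions: decisive outcomes backed by evidence.
    justified = sum(r.outcome in ("proven", "refuted") and r.has_evidence
                    for r in results)
    # Correct refusals: "insufficient" exactly on the items the DB cannot cover.
    refusals = [r for r in results if not r.db_has_coverage]
    correct_refusal = sum(r.outcome == "insufficient" for r in refusals)
    # False confidence: a decisive answer with no valid evidence. Drive to ~0.
    false_conf = sum(r.outcome in ("proven", "refuted") and not r.has_evidence
                     for r in results)
    return {
        "justified_resolution_rate": justified / n,
        "correct_refusal_rate": correct_refusal / len(refusals) if refusals else 1.0,
        "false_confidence_rate": false_conf / n,
    }
```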
⸻
4.2 Phase Sensitivity Metrics
Track separately:
• Phase 1 success rate
• Phase 2 activation rate
• Phase 2 dependency rate
A growing DB should:
• Increase Phase 1 coverage
• Decrease Phase 2 usage
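The phase-sensitivity trend can be checked mechanically across DB versions. A sketch with assumed record keys (none of these names come from VERANTYX):

```python
def phase_metrics(runs: list[dict]) -> dict[str, float]:
    """runs: one dict per item, e.g.
    {"solved_phase1": True, "phase2_used": False, "phase2_needed": False}."""
    n = len(runs)
    return {
        "phase1_success_rate": sum(r["solved_phase1"] for r in runs) / n,
        "phase2_activation_rate": sum(r["phase2_used"] for r in runs) / n,
        "phase2_dependency_rate": sum(r["phase2_needed"] for r in runs) / n,
    }


def db_growth_is_healthy(before: dict, after: dict) -> bool:
    """A growing DB should raise Phase 1 coverage and lower Phase 2 usage."""
    return (after["phase1_success_rate"] >= before["phase1_success_rate"]
            and after["phase2_activation_rate"] <= before["phase2_activation_rate"])
```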
⸻
5. Failure Is a First-Class Artifact
5.1 What Counts as a “Good Failure”
A good failure:
• Is labeled “Insufficient Evidence”
• Explains why no rule applied
• Produces zero noise

A bad failure:
• Selects a choice on a weak lexical match
• Uses mined rules silently
• Looks confident but is unjustified
⸻
5.2 Mandatory Failure Logging
For every failure, record:
• The input problem
• The extracted assumptions
• Which rules were skipped (and why)
• Which rules almost matched
• The phase reached (1 or 2)
Failures are data, not errors.
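The required fields map naturally onto one structured record per failure, serialized for later analysis. A sketch with illustrative field names and example values (the rule names shown are hypothetical):

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FailureLog:
    problem: str
    extracted_assumptions: list[str]
    skipped_rules: dict[str, str]  # rule id -> reason it was skipped
    near_misses: list[str]         # rules that almost matched
    phase_reached: int             # 1 or 2


# Hypothetical example record; rule names are illustrative only.
log = FailureLog(
    problem="Is every bounded sequence convergent?",
    extracted_assumptions=["sequence is bounded"],
    skipped_rules={"monotone_convergence": "monotonicity not established"},
    near_misses=["bolzano_weierstrass (yields a subsequence only)"],
    phase_reached=1,
)
print(json.dumps(asdict(log), indent=2))  # one JSON record per failure
```

Appending one such JSON object per failure keeps the log machine-queryable, so later audits can ask, for example, which rules are most often "almost matched".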
⸻
6. Regression Testing with Benchmarks
Every DB change should be evaluated against:
• A fixed benchmark suite
• A failure expectation list
Regression is not only “Did it break a correct answer?” but also “Did it stop refusing where it should?”
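Both failure modes can be flagged by diffing per-item outcome labels across two DB versions. A minimal sketch, assuming outcomes are recorded as strings keyed by item id:

```python
def regression_report(baseline: dict[str, str],
                      candidate: dict[str, str]) -> dict[str, list[str]]:
    """Compare outcome labels per item id across two DB versions.

    Flags both classic regressions (a decisive answer changed) and refusal
    regressions (the system now answers where it previously refused).
    """
    broken_answers, lost_refusals = [], []
    for item, old in sorted(baseline.items()):
        new = candidate.get(item, "insufficient")
        if old in ("proven", "refuted") and new != old:
            broken_answers.append(item)
        if old == "insufficient" and new in ("proven", "refuted"):
            # Not automatically wrong (the DB may have gained coverage),
            # but it must be checked against the failure expectation list.
            lost_refusals.append(item)
    return {"broken_answers": broken_answers, "lost_refusals": lost_refusals}
```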
⸻
7. Comparing VERANTYX to LLMs (Correctly)
Do not compare:
• Raw accuracy
• Speed
• Fluency

Do compare:
• Hallucination rate
• Refusal correctness
• Sensitivity to missing assumptions
• Stability across re-runs
VERANTYX should lose on fluency — and win on discipline.
⸻
8. Human-in-the-Loop Evaluation
Recommended practice: a periodic human audit of:
• Refusal cases
• Provisional (mined) answers
• High-evidence proofs
Humans should ask:
“Would I trust this answer in a paper?”
If not, the system should not either.
⸻
9. Benchmark Suite Composition (Recommended)
A balanced VERANTYX benchmark includes:
• 30–40% unsolvable (DB-limited) problems
• 30% trap problems
• 20% clean axiomatic problems
• 10–20% exploratory / edge cases
If everything is solvable, the benchmark is broken.
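A suite's mix can be validated automatically before a benchmark run. A sketch under assumed category names; the tolerance bands around the single-number targets (30%, 20%) are my assumption, not part of the guide:

```python
def composition_ok(counts: dict[str, int]) -> bool:
    """Check a suite against the recommended mix.

    Band endpoints for "trap" and "axiomatic" are assumed tolerances
    around the guide's ~30% and ~20% targets.
    """
    total = sum(counts.values())
    bands = {
        "unsolvable": (0.30, 0.40),   # DB-limited problems
        "trap": (0.25, 0.35),         # guide says ~30%
        "axiomatic": (0.15, 0.25),    # guide says ~20%
        "exploratory": (0.10, 0.20),  # edge cases
    }
    return all(lo <= counts.get(k, 0) / total <= hi
               for k, (lo, hi) in bands.items())
```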
⸻
10. Evaluation Anti-Patterns (Avoid These)
❌ Scoring by majority vote
❌ Treating refusal as failure
❌ Hiding Phase 2 usage
❌ Optimizing prompts to “pass benchmarks”
Benchmarks exist to expose limits, not hide them.
⸻
Closing Note (English)
VERANTYX is successful when:
• It says “I don’t know” often
• And explains why
A benchmark that makes VERANTYX look perfect is a benchmark that teaches nothing.
⸻
VERANTYX Evaluation & Benchmark Guide
How to Evaluate, and How to Record Failure
⸻
Evaluation Premises in VERANTYX
VERANTYX is not a system optimized for answer accuracy. What is evaluated is its reasoning behavior.
Measuring it only by:
• Accuracy
• Top-k accuracy
is therefore dangerous.
In VERANTYX, unexplained success is itself the failure.
⸻
1. Evaluate the Process, Not the Result
VERANTYX evaluation axes:
1. Explicit grounding of evidence
2. Correct refusal
3. Noise resistance
4. Sensitivity to assumptions
5. Reproducibility
6. Auditability
⸻
2. Always Handle Output in Four Categories
At evaluation time, every result is one of:
1. Proven
2. Refuted
3. Insufficient Evidence
4. Provisional (Mined)
⚠️ Never collapse these into correct/incorrect.
⸻
3. Benchmark Design Guidelines
3.1 Prefer Problems with Traps
Good problems:
• Missing assumptions
• “Always”, “all”
• Vacuous truths
• Necessary/sufficient confusion

Bad problems:
• Formula-recall exercises
• Problems solvable by surface matching
⸻
3.2 Refusal Is Success
• A correct refusal > an unjustified answer
⸻
4. VERANTYX-Style Metrics
4.1 Core metrics
• Justified resolution rate
• Correct refusal rate
• False confidence rate (most important)
Goal: drive false confidence as close to zero as possible.
⸻
4.2 Per-phase metrics
• Phase 1 success rate
• Phase 2 activation rate
• Phase 2 dependency rate
The more the DB grows, the less Phase 2 should be used.
⸻
5. Failure Is a Deliverable
5.1 A good failure
• Explicitly marked as Insufficient Evidence
• Makes clear why the problem could not be solved
• Zero noise

5.2 A bad failure
• A plausible-looking pick
• Silently uses provisional knowledge
• Brimming with confidence
⸻
5.3 Required contents of a failure log
• The problem statement
• The extracted assumptions
• The rejected rules and the reasons
• The phase reached
⸻
6. How to Think About Regression Testing
Check both:
• Did a correct answer break?
• Does it now answer where it should refuse?
⸻
7. How to Compare Against LLMs
Do compare:
• Hallucination rate
• Correctness of refusals
• Sensitivity to assumptions
• Reproducibility

Do not compare:
• Fluency
• Speed
⸻
8. Human Audit
Humans should periodically review:
• Refusal cases
• Provisional answers
• High-evidence answers
The criterion: “Could I put this in a paper?”
⸻
9. Recommended Benchmark Composition
• Unsolvable problems: 30–40%
• Trap problems: 30%
• Purely axiomatic problems: 20%
• Boundary cases: 10–20%
If everything can be solved, something is wrong.
⸻
10. Evaluation Practices to Avoid
❌ Majority voting
❌ Treating refusal as failure
❌ Hiding Phase 2 usage
❌ Benchmark-gaming prompts
⸻
Closing Note (Japanese)
VERANTYX is in a successful state when:
• It often says “I don’t know”
• And it is clear why it doesn’t know
A benchmark on which VERANTYX looks perfect teaches nothing.