⸻

VERANTYX Evaluation & Benchmark Guide
How to Evaluate Reasoning — and How to Record Failure

⸻

0. Why Evaluation Is Different in VERANTYX

VERANTYX is not optimized for accuracy metrics. It is optimized for auditable reasoning behavior.

Traditional evaluation questions such as:
• “How many did it get right?”
• “What is the accuracy@k?”
are therefore insufficient and sometimes misleading.

In VERANTYX, failure is not a bug — unexplained success is.

⸻

1. What VERANTYX Is Actually Evaluated On

VERANTYX evaluation focuses on process integrity, not output fluency.

Core evaluation axes:
1. Evidence grounding
2. Correct refusal
3. Noise resistance
4. Assumption sensitivity
5. Reproducibility
6. Audit clarity

⸻

2. Primary Outcome Categories (Do Not Collapse These)

Every evaluation result must fall into exactly one of four explicit buckets:

1. Proven
• Supported by DB rules
• Assumptions satisfied
• No refutation hits

2. Refuted
• One or more counterexample rules fired
• Or necessity checks failed

3. Insufficient Evidence
• No applicable proof or refutation
• Phase 1 (and optionally Phase 2) found nothing reliable

4. Provisional (Mined)
• Depends on Phase 2 rules
• Explicitly labeled as non-final

⚠️ Never merge these into “correct / incorrect”.

⸻

3. Benchmark Design Principles

3.1 Prefer “Trap-Rich” Problems

Good VERANTYX benchmarks include:
• Missing assumptions
• Overgeneralized language (“always”, “all”)
• Vacuity cases
• Confusion between necessary and sufficient conditions

Bad benchmarks consist of:
• Straightforward textbook exercises
• Problems solvable by surface pattern matching

⸻

3.2 Single-Answer Accuracy Is Not Enough

Each benchmark item should be evaluated on:
• Was the selected outcome category correct?
• Did the system refuse when it should have?
• Did it avoid hallucinated certainty?

A correct refusal scores higher than an unjustified answer.

⸻

4. Measuring Success (VERANTYX-Style Metrics)

4.1 Core Metrics

Instead of accuracy, track:
• Justified Resolution Rate: the % of answers that are Proven or Refuted with evidence
• Correct Refusal Rate: the % of Insufficient Evidence verdicts issued when the DB truly lacks coverage
• False Confidence Rate (critical): the % of answers given without a valid proof or refutation

Goal: drive the False Confidence Rate to near zero. A scoring sketch appears after Section 7.

⸻

4.2 Phase Sensitivity Metrics

Track separately:
• Phase 1 success rate
• Phase 2 activation rate
• Phase 2 dependency rate

A growing DB should:
• Increase Phase 1 coverage
• Decrease Phase 2 usage

⸻

5. Failure Is a First-Class Artifact

5.1 What Counts as a “Good Failure”

A good failure:
• Is labeled “insufficient evidence”
• Explains why no rule applied
• Produces zero noise

A bad failure:
• Selects a choice based on a weak lexical match
• Uses mined rules silently
• Looks confident but is unjustified

⸻

5.2 Mandatory Failure Logging

For every failure, record:
• The input problem
• The extracted assumptions
• Which rules were skipped (and why)
• Which rules almost matched
• The phase reached (1 or 2)

Failures are data, not errors.

⸻

6. Regression Testing with Benchmarks

Every DB change should be evaluated against:
• A fixed benchmark suite
• A failure expectation list

Regression is not only:
• “Did it break a correct answer?”
but also:
• “Did it stop refusing where it should?”

⸻

7. Comparing VERANTYX to LLMs (Correctly)

Do not compare:
• Raw accuracy
• Speed
• Fluency

Do compare:
• Hallucination rate
• Refusal correctness
• Sensitivity to missing assumptions
• Stability across re-runs

VERANTYX should lose on fluency — and win on discipline.
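⸻

To make the outcome categories of Section 2 and the metrics of Sections 4.1–4.2 concrete, here is a minimal scoring sketch in Python. It is illustrative only: the record fields, type names, and function name are assumptions made for this guide, not part of any VERANTYX implementation, and normalizing every rate by suite size is just one possible design choice.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    # The four outcome categories of Section 2, never collapsed into correct/incorrect.
    PROVEN = auto()
    REFUTED = auto()
    INSUFFICIENT_EVIDENCE = auto()
    PROVISIONAL = auto()  # mined, Phase-2 dependent, explicitly non-final


@dataclass
class EvalRecord:
    # Hypothetical per-item record; field names are illustrative assumptions.
    outcome: Outcome
    has_valid_evidence: bool  # the proof/refutation is backed by applicable DB rules
    db_lacks_coverage: bool   # ground truth: the DB genuinely cannot decide this item
    phase_reached: int        # 1 or 2


def verantyx_style_metrics(records: list[EvalRecord]) -> dict[str, float]:
    """Section 4.1 and 4.2 metrics, each normalized by suite size."""
    n = len(records)
    resolved = [r for r in records if r.outcome in (Outcome.PROVEN, Outcome.REFUTED)]
    refused = [r for r in records if r.outcome is Outcome.INSUFFICIENT_EVIDENCE]
    return {
        # Proven/Refuted answers that carry valid evidence.
        "justified_resolution_rate": sum(r.has_valid_evidence for r in resolved) / n,
        # Insufficient Evidence verdicts issued where the DB truly lacks coverage.
        "correct_refusal_rate": sum(r.db_lacks_coverage for r in refused) / n,
        # Proven/Refuted answers without valid evidence: drive this toward zero.
        "false_confidence_rate": sum(not r.has_valid_evidence for r in resolved) / n,
        # How often Phase 2 was reached at all.
        "phase2_activation_rate": sum(r.phase_reached == 2 for r in records) / n,
    }
```

Whether each rate is normalized by the whole suite or only by its own subset (for example, correct refusals over all DB-limited items) is a design decision; the sketch uses the whole suite throughout.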
⸻

8. Human-in-the-Loop Evaluation

Recommended practice: periodic human audits of:
• Refusal cases
• Provisional (mined) answers
• High-evidence proofs

Humans should ask: “Would I trust this answer in a paper?”
If not, the system should not trust it either.

⸻

9. Benchmark Suite Composition (Recommended)

A balanced VERANTYX benchmark includes:
• 30–40% unsolvable (DB-limited) problems
• 30% trap problems
• 20% clean axiomatic problems
• 10–20% exploratory / edge cases

If everything is solvable, the benchmark is broken. A composition-check sketch appears at the end of this guide.

⸻

10. Evaluation Anti-Patterns (Avoid These)

❌ Scoring by majority vote
❌ Treating refusal as failure
❌ Hiding Phase 2 usage
❌ Optimizing prompts to “pass benchmarks”

Benchmarks exist to expose limits, not to hide them.

⸻

Closing Note (English)

VERANTYX is successful when:
• It says “I don’t know” often
• And explains why

A benchmark that makes VERANTYX look perfect is a benchmark that teaches nothing.

⸻

VERANTYX Evaluation & Benchmark Guide (Japanese Version)
How to Evaluate, and How to Record Failure

⸻

0. Premises of Evaluation in VERANTYX

VERANTYX is not a system optimized for answer accuracy. What is evaluated is its reasoning behavior.

Measuring it only by:
• accuracy
• top-k accuracy
is therefore dangerous.

In VERANTYX, success that cannot be explained is itself failure.

⸻

1. Evaluate the Process, Not the Result

VERANTYX evaluation axes:
1. Explicitness of evidence
2. Correct refusal
3. Noise resistance
4. Sensitivity to assumptions
5. Reproducibility
6. Auditability

⸻

2. Always Handle Output in Four Categories

At evaluation time, every result is one of:
1. Proven
2. Refuted
3. Insufficient evidence
4. Provisional answer (mined)

⚠️ Do not collapse these into correct/incorrect.

⸻

3. Benchmark Design Guidelines

3.1 Prefer Problems with Traps

Good problems:
• Missing assumptions
• “Always” / “all”
• Vacuous truth
• Confusion of necessary and sufficient conditions

Bad problems:
• Formula-recall exercises
• Problems solvable by surface matching

⸻

3.2 Refusal Is Success

• A correct refusal > an unjustified answer

⸻

4. VERANTYX-Style Metrics

4.1 Core Indicators

• Justified resolution rate
• Correct refusal rate
• False confidence rate (most important)

Goal: drive false confidence as close to zero as possible.

⸻

4.2 Per-Phase Indicators

• Phase 1 success rate
• Phase 2 usage rate
• Phase 2 dependency rate

As the DB grows, Phase 2 usage should decrease.

⸻

5. Failure Is a Deliverable

5.1 A Good Failure

• Explicitly marked as insufficient evidence
• Makes clear why the problem could not be solved
• Zero noise

5.2 A Bad Failure

• A plausible-looking choice
• Silent use of provisional knowledge
• Full of confidence

⸻

5.3 Information Required in the Failure Log

• The problem statement
• The extracted assumptions
• The rejected rules and the reasons
• The phase reached

⸻

6. How to Think About Regression Testing

Check:
• Did a previously correct answer break?
• Does the system now answer where it should refuse?

⸻

7. How to Compare with LLMs

Compare:
• Hallucination rate
• Correctness of refusals
• Assumption sensitivity
• Reproducibility

Do not compare:
• Fluency
• Speed

⸻

8. Human Audit

Periodically, humans should review:
• Refusal cases
• Provisional answers
• High-evidence answers

The criterion: “Could I put this in a paper?”

⸻

9. Recommended Benchmark Composition

• Unsolvable problems: 30–40%
• Trap problems: 30%
• Pure axiomatic problems: 20%
• Edge cases: 10–20%

If everything gets solved, something is wrong.

⸻

10. Evaluation Practices to Avoid

❌ Majority voting
❌ Treating refusal as failure
❌ Hiding Phase 2
❌ Prompts tuned to pass the benchmark

⸻

Closing Note (Japanese)

VERANTYX is in a successful state when it:
• Often says “I don’t know”
• And makes clear why it does not know

A benchmark that looks perfect teaches nothing.
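⸻

The composition guidance in Section 9 can be checked mechanically. The sketch below is an illustration only: the suite format (a list of items tagged with a category string), the category names, and the tolerance bands placed around the single-point recommendations (30% trap, 20% axiomatic) are assumptions made for this guide, not a VERANTYX interface.

```python
from collections import Counter

# Recommended share of each category, widened into (low, high) bands where
# Section 9 gives a single point value.
COMPOSITION_BOUNDS = {
    "unsolvable":  (0.30, 0.40),  # DB-limited problems
    "trap":        (0.25, 0.35),  # around the recommended 30%
    "axiomatic":   (0.15, 0.25),  # around the recommended 20%
    "exploratory": (0.10, 0.20),  # edge cases
}


def check_composition(suite: list[dict]) -> list[str]:
    """Return a warning for every category whose share falls outside its band."""
    counts = Counter(item["category"] for item in suite)
    total = len(suite)
    warnings = []
    for category, (low, high) in COMPOSITION_BOUNDS.items():
        share = counts.get(category, 0) / total
        if not low <= share <= high:
            warnings.append(f"{category}: {share:.0%} is outside {low:.0%}–{high:.0%}")
    if counts.get("unsolvable", 0) == 0:
        warnings.append("no unsolvable items: if everything is solvable, the benchmark is broken")
    return warnings
```

A suite that produces no warnings is not thereby a good benchmark; it merely has the recommended shape.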