
# VERANTYX Evaluation & Benchmark Guide

How to Evaluate Reasoning — and How to Record Failure

## 0. Why Evaluation Is Different in VERANTYX

VERANTYX is not optimized for accuracy metrics. It is optimized for auditable reasoning behavior.

Therefore, traditional evaluation questions like:

- "How many did it get right?"
- "What is the accuracy@k?"

are insufficient and sometimes misleading.

In VERANTYX, failure is not a bug — unexplained success is.

## 1. What VERANTYX Is Actually Evaluated On

VERANTYX evaluation focuses on process integrity, not output fluency.

Core evaluation axes:

1. Evidence grounding
2. Correct refusal
3. Noise resistance
4. Assumption sensitivity
5. Reproducibility
6. Audit clarity

## 2. Primary Outcome Categories (Do Not Collapse These)

Every evaluation result must fall into one of four explicit buckets:

1. **Proven**
   - Supported by DB rules
   - Assumptions satisfied
   - No refutation hits
2. **Refuted**
   - One or more counterexample rules fired
   - Or necessity checks failed
3. **Insufficient Evidence**
   - No applicable proof or refutation
   - Phase 1 (and optionally Phase 2) found nothing reliable
4. **Provisional (Mined)**
   - Depends on Phase 2 rules
   - Explicitly labeled as non-final

⚠️ Never merge these into “correct / incorrect”.
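To keep the four buckets from silently collapsing downstream, it can help to encode them as a closed type. A minimal Python sketch (the `Outcome` name and its values are illustrative assumptions, not an official VERANTYX API):

```python
from enum import Enum

class Outcome(Enum):
    """The four outcome buckets. Names and values are illustrative, not a fixed schema."""
    PROVEN = "proven"              # supported by DB rules; assumptions satisfied; no refutation hits
    REFUTED = "refuted"            # a counterexample rule fired, or a necessity check failed
    INSUFFICIENT = "insufficient"  # no applicable proof or refutation was found
    PROVISIONAL = "provisional"    # depends on Phase 2 (mined) rules; explicitly non-final
```

Downstream scoring can then branch over these four members instead of a boolean, which makes the "correct / incorrect" collapse an explicit decision rather than a silent default.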

## 3. Benchmark Design Principles

### 3.1 Prefer "Trap-Rich" Problems

Good VERANTYX benchmarks include:

- Missing assumptions
- Overgeneralized language ("always", "all")
- Vacuity cases
- Necessary vs sufficient confusion

Bad benchmarks:

- Straightforward textbook exercises
- Problems solvable by surface pattern matching

### 3.2 Single-Answer Accuracy Is Not Enough

Each benchmark item should be evaluated on:

- Was the selected outcome category correct?
- Did the system refuse when it should?
- Did it avoid hallucinated certainty?

A correct refusal scores higher than an unjustified answer.
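One way to operationalize that ordering is a per-item scoring function. A sketch (the bucket strings and the partial-credit weight are illustrative assumptions, not VERANTYX constants):

```python
def score_item(predicted: str, expected: str) -> float:
    """Score one benchmark item by outcome bucket: "proven", "refuted",
    "insufficient", or "provisional". Weights are illustrative assumptions."""
    if predicted == expected:
        return 1.0    # right bucket, including a correct refusal
    if predicted == "insufficient":
        return 0.5    # honest refusal where an answer existed: partial credit
    return 0.0        # a wrong or unjustified confident answer earns nothing
```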

## 4. Measuring Success (VERANTYX-style Metrics)

### 4.1 Core Metrics

Instead of accuracy, track:

- **Justified Resolution Rate**: % of answers that are Proven or Refuted with evidence
- **Correct Refusal Rate**: % of Insufficient Evidence verdicts issued where the DB truly lacks coverage
- **False Confidence Rate** (critical): % of answers given without a valid proof or refutation

Goal: drive this to near zero.
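A sketch of how these rates could be computed from per-item result records; the field names `outcome`, `evidence`, and `db_covers` are assumptions about the log format, not a fixed VERANTYX schema:

```python
def verantyx_metrics(results: list[dict]) -> dict[str, float]:
    """results: one dict per item, e.g.
    {"outcome": "proven", "evidence": ["rule_17"], "db_covers": True}
    (field names are assumed for illustration)."""
    n = len(results)
    resolved = ("proven", "refuted")
    justified = sum(1 for r in results if r["outcome"] in resolved and r["evidence"])
    refusals_due = sum(1 for r in results if not r["db_covers"])
    correct_refusals = sum(1 for r in results
                           if r["outcome"] == "insufficient" and not r["db_covers"])
    false_confidence = sum(1 for r in results
                           if r["outcome"] in resolved and not r["evidence"])
    return {
        "justified_resolution_rate": justified / n,
        "correct_refusal_rate": correct_refusals / max(refusals_due, 1),
        "false_confidence_rate": false_confidence / n,  # target: ~0
    }
```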

### 4.2 Phase Sensitivity Metrics

Track separately:

- Phase 1 success rate
- Phase 2 activation rate
- Phase 2 dependency rate

A growing DB should:

- Increase Phase 1 coverage
- Decrease Phase 2 usage
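The phase rates can come from the same records, assuming each one also logs the phase reached (a hypothetical `phase` field); reading Provisional outcomes as Phase 2 dependency follows from their definition above:

```python
def phase_metrics(results: list[dict]) -> dict[str, float]:
    """Assumes each record also carries {"phase": 1 or 2}; "provisional"
    outcomes are counted as Phase 2 dependency, since they rest on mined rules."""
    n = len(results)
    return {
        "phase1_success_rate": sum(1 for r in results
                                   if r["phase"] == 1
                                   and r["outcome"] in ("proven", "refuted")) / n,
        "phase2_activation_rate": sum(1 for r in results if r["phase"] == 2) / n,
        "phase2_dependency_rate": sum(1 for r in results
                                      if r["outcome"] == "provisional") / n,
    }
```

Over successive DB versions, the first rate should rise and the other two should fall.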

## 5. Failure Is a First-Class Artifact

### 5.1 What Counts as a "Good Failure"

A good failure:

- Is labeled "insufficient evidence"
- Explains why no rule applied
- Produces zero noise

A bad failure:

- Selects a choice with weak lexical match
- Uses mined rules silently
- Looks confident but is unjustified

### 5.2 Mandatory Failure Logging

For every failure, record:

- Input problem
- Extracted assumptions
- Which rules were skipped (and why)
- Which rules almost matched
- Phase reached (1 or 2)

Failures are data, not errors.
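A minimal record type mirroring that checklist (field names are illustrative; adapt them to whatever the pipeline actually emits):

```python
from dataclasses import dataclass

@dataclass
class FailureRecord:
    """One logged failure; each field mirrors the checklist above."""
    problem: str                   # the input problem, verbatim
    assumptions: list[str]         # assumptions extracted from the problem
    skipped_rules: dict[str, str]  # rule id -> why it was skipped
    near_misses: list[str]         # rules that almost matched
    phase_reached: int             # 1 or 2
```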

## 6. Regression Testing with Benchmarks

Every DB change should be evaluated against:

- A fixed benchmark suite
- A failure expectation list

Regression is not only:

- "Did it break a correct answer?"

but also:

- "Did it stop refusing where it should?"
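A sketch of a regression check that reports both directions; the `baseline`/`current` interface (item id to outcome bucket over the fixed suite) is an assumption:

```python
def find_regressions(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Flag broken answers AND lost refusals; both need human review."""
    issues = []
    for item_id, old in baseline.items():
        new = current.get(item_id, "missing")
        if old in ("proven", "refuted") and new != old:
            issues.append(f"{item_id}: {old} -> {new} (broken answer)")
        elif old == "insufficient" and new in ("proven", "refuted"):
            issues.append(f"{item_id}: now answers where it refused (audit the new rule)")
    return issues
```

A lost refusal is not automatically wrong (the DB may have gained real coverage), which is why the check flags it for audit rather than failing outright.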

## 7. Comparing VERANTYX to LLMs (Correctly)

Do not compare:

- Raw accuracy
- Speed
- Fluency

Do compare:

- Hallucination rate
- Refusal correctness
- Sensitivity to missing assumptions
- Stability across re-runs

VERANTYX should lose on fluency — and win on discipline.
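Of these, stability across re-runs is the easiest to automate. A sketch, where `run` stands for any callable from problem text to outcome bucket (an assumed interface):

```python
def is_stable(run, problem: str, n: int = 5) -> bool:
    """Re-run the same problem n times; a rule-driven system should give
    the identical outcome bucket every time."""
    return len({run(problem) for _ in range(n)}) == 1
```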

## 8. Human-in-the-Loop Evaluation

Recommended practice:

- Periodic human audit of:
  - Refusal cases
  - Provisional (mined) answers
  - High-evidence proofs

Humans should ask:

“Would I trust this answer in a paper?”

If not, the system should not either.

## 9. Benchmark Suite Composition (Recommended)

A balanced VERANTYX benchmark includes:

- 30–40% unsolvable (DB-limited) problems
- 30% trap problems
- 20% clean axiomatic problems
- 10–20% exploratory / edge cases

If everything is solvable, the benchmark is broken.
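A quick composition check against these bands; the category keys and the small tolerance are illustrative assumptions:

```python
RECOMMENDED_SHARE = {            # category -> (min share, max share)
    "unsolvable": (0.30, 0.40),
    "trap": (0.30, 0.30),
    "clean_axiomatic": (0.20, 0.20),
    "exploratory": (0.10, 0.20),
}

def check_composition(counts: dict[str, int], tol: float = 0.03) -> list[str]:
    """Return complaints; an empty list means the mix is inside the bands (+/- tol)."""
    total = sum(counts.values())
    issues = []
    for cat, (lo, hi) in RECOMMENDED_SHARE.items():
        share = counts.get(cat, 0) / total
        if not (lo - tol <= share <= hi + tol):
            issues.append(f"{cat}: {share:.0%} is outside {lo:.0%}-{hi:.0%}")
    if counts.get("unsolvable", 0) == 0:
        issues.append("no unsolvable items: a benchmark everything passes teaches nothing")
    return issues
```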

## 10. Evaluation Anti-Patterns (Avoid These)

- ❌ Scoring by majority vote
- ❌ Treating refusal as failure
- ❌ Hiding Phase 2 usage
- ❌ Optimizing prompts to "pass benchmarks"

Benchmarks exist to expose limits, not hide them.

## Closing Note (English)

VERANTYX is successful when:

- It says "I don't know" often
- And explains why

A benchmark that makes VERANTYX look perfect is a benchmark that teaches nothing.

# VERANTYX Evaluation & Benchmark Guide (Japanese Edition)

How to Evaluate, and How to Record Failure

## 0. Premises of Evaluation in VERANTYX

VERANTYX is not a system optimized for accuracy. What is evaluated is its reasoning behavior.

Therefore, measuring only:

- Accuracy
- Top-k accuracy

is dangerous.

In VERANTYX, unexplained success is itself failure.

## 1. Evaluate the Process, Not the Result

VERANTYX evaluation axes:

1. Explicitness of evidence
2. Correct refusal
3. Noise resistance
4. Sensitivity to assumptions
5. Reproducibility
6. Auditability

## 2. Always Handle Output in Four Categories

At evaluation time, every result is one of:

1. Proven
2. Refuted
3. Insufficient Evidence
4. Provisional (Mined)

⚠️ Never collapse these into correct/incorrect.

## 3. Benchmark Design Guidelines

### 3.1 Prefer Problems with Traps

Good problems:

- Missing assumptions
- "Always", "all"
- Vacuous truths
- Necessary/sufficient confusion

Bad problems:

- Formula-recall problems
- Problems solvable by surface matching

### 3.2 Refusal Is Success

- A correct refusal > an unjustified answer

## 4. VERANTYX-Style Metrics

### 4.1 Core Indicators

- Justified Resolution Rate
- Correct Refusal Rate
- False Confidence Rate (most critical)

Goal: drive false confidence as close to zero as possible.

### 4.2 Per-Phase Indicators

- Phase 1 success rate
- Phase 2 usage rate
- Phase 2 dependency rate

The more the DB grows, the less Phase 2 should be used.

## 5. Failure Is a Deliverable

### 5.1 Good Failures

- Explicitly labeled as insufficient evidence
- Make clear why the problem could not be solved
- Zero noise

### 5.2 Bad Failures

- A plausible-looking pick
- Silent use of provisional knowledge
- Full of confidence

### 5.3 Required Fields in a Failure Log

- Problem text
- Extracted assumptions
- Rejected rules and the reasons
- Phase reached

## 6. How to Think About Regression Tests

What to check:

- Did a correct answer break?
- Did it answer where it should have refused?

## 7. How to Compare Against LLMs

Compare:

- Hallucination rate
- Refusal correctness
- Assumption sensitivity
- Reproducibility

Do not compare:

- Fluency
- Speed

## 8. Human Audit

Periodically, humans review:

- Refusal cases
- Provisional answers
- High-evidence answers

The criterion: "Could I write this in a paper?"

## 9. Recommended Benchmark Composition

- Unsolvable problems: 30–40%
- Trap problems: 30%
- Pure axiomatic problems: 20%
- Edge cases: 10–20%

If everything gets solved, be suspicious.

## 10. Evaluation Practices to Avoid

- ❌ Majority voting
- ❌ Treating refusal as failure
- ❌ Hiding Phase 2
- ❌ Prompts tuned to game the benchmark

## Closing Note (Japanese)

VERANTYX is succeeding when it:

- Often says "I don't know"
- And makes clear why it does not know

A benchmark that looks perfect teaches nothing.