
# VERANTYX Evaluation & Benchmark Guide

How to Evaluate Reasoning — and How to Record Failure

## 0. Why Evaluation Is Different in VERANTYX

VERANTYX is not optimized for accuracy metrics. It is optimized for auditable reasoning behavior.

Therefore, traditional evaluation questions like:

- "How many did it get right?"
- "What is the accuracy@k?"

are insufficient and sometimes misleading.

In VERANTYX, failure is not a bug — unexplained success is.

## 1. What VERANTYX Is Actually Evaluated On

VERANTYX evaluation focuses on process integrity, not output fluency.

Core evaluation axes:

1. Evidence grounding
2. Correct refusal
3. Noise resistance
4. Assumption sensitivity
5. Reproducibility
6. Audit clarity

## 2. Primary Outcome Categories (Do Not Collapse These)

Every evaluation result must fall into one of four explicit buckets:

1. **Proven**
   - Supported by DB rules
   - Assumptions satisfied
   - No refutation hits
2. **Refuted**
   - One or more counterexample rules fired
   - Or necessity checks failed
3. **Insufficient Evidence**
   - No applicable proof or refutation
   - Phase 1 (and optionally Phase 2) found nothing reliable
4. **Provisional (Mined)**
   - Depends on Phase 2 rules
   - Explicitly labeled as non-final

⚠️ Never merge these into “correct / incorrect”.
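To keep the four buckets from silently collapsing downstream, it can help to encode them as a closed type. A minimal Python sketch (the `Outcome` name and its values are illustrative assumptions, not an official VERANTYX API):

```python
from enum import Enum

class Outcome(Enum):
    """The four outcome buckets. Names and values are illustrative, not a fixed schema."""
    PROVEN = "proven"              # supported by DB rules; assumptions satisfied; no refutation hits
    REFUTED = "refuted"            # a counterexample rule fired, or a necessity check failed
    INSUFFICIENT = "insufficient"  # no applicable proof or refutation was found
    PROVISIONAL = "provisional"    # depends on Phase 2 (mined) rules; explicitly non-final
```

Downstream scoring can then branch over these four members instead of a boolean, which makes the "correct / incorrect" collapse an explicit decision rather than a silent default.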

## 3. Benchmark Design Principles

### 3.1 Prefer "Trap-Rich" Problems

Good VERANTYX benchmarks include:

- Missing assumptions
- Overgeneralized language ("always", "all")
- Vacuity cases
- Necessary vs sufficient confusion

Bad benchmarks:

- Straightforward textbook exercises
- Problems solvable by surface pattern matching

### 3.2 Single-Answer Accuracy Is Not Enough

Each benchmark item should be evaluated on:

- Was the selected outcome category correct?
- Did the system refuse when it should?
- Did it avoid hallucinated certainty?

A correct refusal scores higher than an unjustified answer.
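One way to operationalize that ordering is a per-item scoring function. A sketch (the bucket strings and the partial-credit weight are illustrative assumptions, not VERANTYX constants):

```python
def score_item(predicted: str, expected: str) -> float:
    """Score one benchmark item by outcome bucket: "proven", "refuted",
    "insufficient", or "provisional". Weights are illustrative assumptions."""
    if predicted == expected:
        return 1.0    # right bucket, including a correct refusal
    if predicted == "insufficient":
        return 0.5    # honest refusal where an answer existed: partial credit
    return 0.0        # a wrong or unjustified confident answer earns nothing
```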

## 4. Measuring Success (VERANTYX-style Metrics)

### 4.1 Core Metrics

Instead of accuracy, track:

- **Justified Resolution Rate**: % of answers that are Proven or Refuted with evidence
- **Correct Refusal Rate**: % of Insufficient Evidence verdicts issued where the DB truly lacks coverage
- **False Confidence Rate** (critical): % of answers given without a valid proof or refutation

Goal: drive this to near zero.
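A sketch of how these rates could be computed from per-item result records; the field names `outcome`, `evidence`, and `db_covers` are assumptions about the log format, not a fixed VERANTYX schema:

```python
def verantyx_metrics(results: list[dict]) -> dict[str, float]:
    """results: one dict per item, e.g.
    {"outcome": "proven", "evidence": ["rule_17"], "db_covers": True}
    (field names are assumed for illustration)."""
    n = len(results)
    resolved = ("proven", "refuted")
    justified = sum(1 for r in results if r["outcome"] in resolved and r["evidence"])
    refusals_due = sum(1 for r in results if not r["db_covers"])
    correct_refusals = sum(1 for r in results
                           if r["outcome"] == "insufficient" and not r["db_covers"])
    false_confidence = sum(1 for r in results
                           if r["outcome"] in resolved and not r["evidence"])
    return {
        "justified_resolution_rate": justified / n,
        "correct_refusal_rate": correct_refusals / max(refusals_due, 1),
        "false_confidence_rate": false_confidence / n,  # target: ~0
    }
```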

### 4.2 Phase Sensitivity Metrics

Track separately:

- Phase 1 success rate
- Phase 2 activation rate
- Phase 2 dependency rate

A growing DB should:

- Increase Phase 1 coverage
- Decrease Phase 2 usage
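The phase rates can come from the same records, assuming each one also logs the phase reached (a hypothetical `phase` field); reading Provisional outcomes as Phase 2 dependency follows from their definition above:

```python
def phase_metrics(results: list[dict]) -> dict[str, float]:
    """Assumes each record also carries {"phase": 1 or 2}; "provisional"
    outcomes are counted as Phase 2 dependency, since they rest on mined rules."""
    n = len(results)
    return {
        "phase1_success_rate": sum(1 for r in results
                                   if r["phase"] == 1
                                   and r["outcome"] in ("proven", "refuted")) / n,
        "phase2_activation_rate": sum(1 for r in results if r["phase"] == 2) / n,
        "phase2_dependency_rate": sum(1 for r in results
                                      if r["outcome"] == "provisional") / n,
    }
```

Over successive DB versions, the first rate should rise and the other two should fall.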

## 5. Failure Is a First-Class Artifact

### 5.1 What Counts as a "Good Failure"

A good failure:

- Is labeled "insufficient evidence"
- Explains why no rule applied
- Produces zero noise

A bad failure:

- Selects a choice with weak lexical match
- Uses mined rules silently
- Looks confident but is unjustified

### 5.2 Mandatory Failure Logging

For every failure, record:

- Input problem
- Extracted assumptions
- Which rules were skipped (and why)
- Which rules almost matched
- Phase reached (1 or 2)

Failures are data, not errors.
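A minimal record type mirroring that checklist (field names are illustrative; adapt them to whatever the pipeline actually emits):

```python
from dataclasses import dataclass

@dataclass
class FailureRecord:
    """One logged failure; each field mirrors the checklist above."""
    problem: str                   # the input problem, verbatim
    assumptions: list[str]         # assumptions extracted from the problem
    skipped_rules: dict[str, str]  # rule id -> why it was skipped
    near_misses: list[str]         # rules that almost matched
    phase_reached: int             # 1 or 2
```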

## 6. Regression Testing with Benchmarks

Every DB change should be evaluated against:

- A fixed benchmark suite
- A failure expectation list

Regression is not only:

- "Did it break a correct answer?"

but also:

- "Did it stop refusing where it should?"
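A sketch of a regression check that reports both directions; the `baseline`/`current` interface (item id to outcome bucket over the fixed suite) is an assumption:

```python
def find_regressions(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Flag broken answers AND lost refusals; both need human review."""
    issues = []
    for item_id, old in baseline.items():
        new = current.get(item_id, "missing")
        if old in ("proven", "refuted") and new != old:
            issues.append(f"{item_id}: {old} -> {new} (broken answer)")
        elif old == "insufficient" and new in ("proven", "refuted"):
            issues.append(f"{item_id}: now answers where it refused (audit the new rule)")
    return issues
```

A lost refusal is not automatically wrong (the DB may have gained real coverage), which is why the check flags it for audit rather than failing outright.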

## 7. Comparing VERANTYX to LLMs (Correctly)

Do not compare:

- Raw accuracy
- Speed
- Fluency

Do compare:

- Hallucination rate
- Refusal correctness
- Sensitivity to missing assumptions
- Stability across re-runs

VERANTYX should lose on fluency — and win on discipline.
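Of these, stability across re-runs is the easiest to automate. A sketch, where `run` stands for any callable from problem text to outcome bucket (an assumed interface):

```python
def is_stable(run, problem: str, n: int = 5) -> bool:
    """Re-run the same problem n times; a rule-driven system should give
    the identical outcome bucket every time."""
    return len({run(problem) for _ in range(n)}) == 1
```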

## 8. Human-in-the-Loop Evaluation

Recommended practice:

- Periodic human audit of:
  - Refusal cases
  - Provisional (mined) answers
  - High-evidence proofs

Humans should ask:

“Would I trust this answer in a paper?”

If not, the system should not either.

## 9. Benchmark Suite Composition (Recommended)

A balanced VERANTYX benchmark includes:

- 30–40% unsolvable (DB-limited) problems
- 30% trap problems
- 20% clean axiomatic problems
- 10–20% exploratory / edge cases

If everything is solvable, the benchmark is broken.
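A quick composition check against these bands; the category keys and the small tolerance are illustrative assumptions:

```python
RECOMMENDED_SHARE = {            # category -> (min share, max share)
    "unsolvable": (0.30, 0.40),
    "trap": (0.30, 0.30),
    "clean_axiomatic": (0.20, 0.20),
    "exploratory": (0.10, 0.20),
}

def check_composition(counts: dict[str, int], tol: float = 0.03) -> list[str]:
    """Return complaints; an empty list means the mix is inside the bands (+/- tol)."""
    total = sum(counts.values())
    issues = []
    for cat, (lo, hi) in RECOMMENDED_SHARE.items():
        share = counts.get(cat, 0) / total
        if not (lo - tol <= share <= hi + tol):
            issues.append(f"{cat}: {share:.0%} is outside {lo:.0%}-{hi:.0%}")
    if counts.get("unsolvable", 0) == 0:
        issues.append("no unsolvable items: a benchmark everything passes teaches nothing")
    return issues
```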

## 10. Evaluation Anti-Patterns (Avoid These)

- ❌ Scoring by majority vote
- ❌ Treating refusal as failure
- ❌ Hiding Phase 2 usage
- ❌ Optimizing prompts to "pass benchmarks"

Benchmarks exist to expose limits, not hide them.

## Closing Note (English)

VERANTYX is successful when:

- It says "I don't know" often
- And explains why

A benchmark that makes VERANTYX look perfect is a benchmark that teaches nothing.

# VERANTYX Evaluation & Benchmark Guide (Japanese Edition)

How to Evaluate, and How to Record Failure

## 0. Premises of Evaluation in VERANTYX

VERANTYX is not a system optimized for accuracy. What is evaluated is its reasoning behavior.

Therefore, measuring only:

- Accuracy
- Top-k accuracy

is dangerous.

In VERANTYX, unexplained success is itself failure.

## 1. Evaluate the Process, Not the Result

VERANTYX evaluation axes:

1. Explicitness of evidence
2. Correct refusal
3. Noise resistance
4. Sensitivity to assumptions
5. Reproducibility
6. Auditability

## 2. Always Handle Output in Four Categories

At evaluation time, every result is one of:

1. Proven
2. Refuted
3. Insufficient Evidence
4. Provisional (Mined)

⚠️ Never collapse these into correct/incorrect.

## 3. Benchmark Design Guidelines

### 3.1 Prefer Problems with Traps

Good problems:

- Missing assumptions
- "Always", "all"
- Vacuous truths
- Necessary/sufficient confusion

Bad problems:

- Formula-recall problems
- Problems solvable by surface matching

### 3.2 Refusal Is Success

- A correct refusal > an unjustified answer

## 4. VERANTYX-Style Metrics

### 4.1 Core Indicators

- Justified Resolution Rate
- Correct Refusal Rate
- False Confidence Rate (most critical)

Goal: drive false confidence as close to zero as possible.

### 4.2 Per-Phase Indicators

- Phase 1 success rate
- Phase 2 usage rate
- Phase 2 dependency rate

The more the DB grows, the less Phase 2 should be used.

## 5. Failure Is a Deliverable

### 5.1 Good Failures

- Explicitly labeled as insufficient evidence
- Make clear why the problem could not be solved
- Zero noise

### 5.2 Bad Failures

- A plausible-looking pick
- Silent use of provisional knowledge
- Full of confidence

### 5.3 Required Fields in a Failure Log

- Problem text
- Extracted assumptions
- Rejected rules and the reasons
- Phase reached

## 6. How to Think About Regression Tests

What to check:

- Did a correct answer break?
- Did it answer where it should have refused?

## 7. How to Compare Against LLMs

Compare:

- Hallucination rate
- Refusal correctness
- Assumption sensitivity
- Reproducibility

Do not compare:

- Fluency
- Speed

## 8. Human Audit

Periodically, humans review:

- Refusal cases
- Provisional answers
- High-evidence answers

The criterion: "Could I write this in a paper?"

## 9. Recommended Benchmark Composition

- Unsolvable problems: 30–40%
- Trap problems: 30%
- Pure axiomatic problems: 20%
- Edge cases: 10–20%

If everything gets solved, be suspicious.

## 10. Evaluation Practices to Avoid

- ❌ Majority voting
- ❌ Treating refusal as failure
- ❌ Hiding Phase 2
- ❌ Prompts tuned to game the benchmark

## Closing Note (Japanese)

VERANTYX is succeeding when it:

- Often says "I don't know"
- And makes clear why it does not know

A benchmark that looks perfect teaches nothing.