⸻
VERANTYX Evaluation & Benchmark Guide
How to Evaluate Reasoning — and How to Record Failure
⸻
0. Why Evaluation Is Different in VERANTYX
VERANTYX is not optimized for accuracy metrics.
It is optimized for auditable reasoning behavior.
Therefore, traditional evaluation questions like:
• “How many did it get right?”
• “What is the accuracy@k?”
are insufficient and sometimes misleading.
In VERANTYX, failure is not a bug — unexplained success is.
⸻
1. What VERANTYX Is Actually Evaluated On
VERANTYX evaluation focuses on process integrity, not output fluency.
Core evaluation axes:
1. Evidence grounding
2. Correct refusal
3. Noise resistance
4. Assumption sensitivity
5. Reproducibility
6. Audit clarity
⸻
2. Primary Outcome Categories (Do Not Collapse These)
Every evaluation result must fall into one of four explicit buckets (see the sketch after this list):
1. Proven
  • Supported by DB rules
  • Assumptions satisfied
  • No refutation hits
2. Refuted
  • One or more counterexample rules fired
  • Or necessity checks failed
3. Insufficient Evidence
  • No applicable proof or refutation
  • Phase 1 (and optionally Phase 2) found nothing reliable
4. Provisional (Mined)
  • Depends on Phase 2 rules
  • Explicitly labeled as non-final
⚠️ Never merge these into “correct / incorrect”.
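As a minimal sketch, the four buckets can be modeled as a closed enumeration so scoring code cannot silently collapse them. All names here are illustrative assumptions, not the actual VERANTYX API:

```python
from enum import Enum

class Outcome(Enum):
    """The four VERANTYX outcome buckets (hypothetical names)."""
    PROVEN = "proven"                  # supported by DB rules, assumptions satisfied
    REFUTED = "refuted"                # a counterexample rule fired or necessity failed
    INSUFFICIENT = "insufficient"      # no applicable proof or refutation
    PROVISIONAL = "provisional_mined"  # depends on Phase 2 (mined) rules, non-final

def describe(outcome: Outcome) -> str:
    # Exhaustive mapping: collapsing buckets into "correct/incorrect" is a bug.
    return {
        Outcome.PROVEN: "justified resolution",
        Outcome.REFUTED: "justified resolution",
        Outcome.INSUFFICIENT: "refusal (correct if the DB truly lacks coverage)",
        Outcome.PROVISIONAL: "non-final; must be visibly labeled as mined",
    }[outcome]
```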
⸻
3. Benchmark Design Principles
3.1 Prefer “Trap-Rich” Problems
Good VERANTYX benchmarks include:
• Missing assumptions
• Overgeneralized language (“always”, “all”)
• Vacuity cases
• Necessary vs. sufficient confusion
Bad benchmarks:
• Straightforward textbook exercises
• Problems solvable by surface pattern matching
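A sketch of what a trap-rich item might look like as data (the field names are illustrative, not the actual VERANTYX schema):

```python
# Illustrative trap item: an overgeneralization that a counterexample
# rule should refute. All keys are hypothetical.
trap_item = {
    "id": "trap-001",
    "statement": "Every continuous function is differentiable.",
    "expected_outcome": "refuted",            # e.g. |x| at 0 is a counterexample
    "trap_kind": "overgeneralized_language",  # unqualified "every"
}
```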
⸻
3.2 Single-Answer Accuracy Is Not Enough
Each benchmark item should be evaluated on:
• Was the selected outcome category correct?
• Did the system refuse when it should?
• Did it avoid hallucinated certainty?
A correct refusal scores higher than an unjustified answer.
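One way to make that ordering concrete, as a sketch (the weights are assumptions, not VERANTYX constants):

```python
def item_score(expected: str, actual: str) -> float:
    """Per-item scoring sketch: a correct refusal outranks an
    unjustified answer. Outcome strings follow the four buckets."""
    if actual == expected:
        return 1.0   # correct bucket, full credit
    if actual == "insufficient":
        return 0.5   # over-cautious refusal: partial credit
    return 0.0       # confident but unjustified: worst case
```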
⸻
4. Measuring Success (VERANTYX-style Metrics)
4.1 Core Metrics
Instead of accuracy, track:
• Justified Resolution Rate
  • % of answers that are Proven or Refuted with evidence
• Correct Refusal Rate
  • % of Insufficient Evidence verdicts issued when the DB truly lacks coverage
• False Confidence Rate (Critical)
  • % of answers given without a valid proof or refutation
Goal: drive the False Confidence Rate to near zero.
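A hedged computation sketch (the input triples and counting conventions are assumptions; adapt them to your logging schema):

```python
from collections import Counter

def core_metrics(results):
    """results: iterable of (outcome, db_has_coverage, has_evidence)
    triples. Field conventions here are illustrative assumptions."""
    results = list(results)
    n = max(1, len(results))  # guard against an empty run
    refusals = Counter(o for o, _, _ in results)["insufficient"]

    justified = sum(1 for o, _, ev in results
                    if o in ("proven", "refuted") and ev)
    correct_refusals = sum(1 for o, cov, _ in results
                           if o == "insufficient" and not cov)
    false_confidence = sum(1 for o, _, ev in results
                           if o in ("proven", "refuted") and not ev)

    return {
        "justified_resolution_rate": justified / n,
        "correct_refusal_rate": correct_refusals / max(1, refusals),
        "false_confidence_rate": false_confidence / n,  # drive toward zero
    }
```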
⸻
4.2 Phase Sensitivity Metrics
Track separately:
• Phase 1 success rate
• Phase 2 activation rate
• Phase 2 dependency rate
A growing DB should:
• Increase Phase 1 coverage
• Decrease Phase 2 usage
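That expectation can be checked mechanically across DB snapshots; a sketch, assuming a simple snapshot format:

```python
def phase_trend_ok(snapshots: list[dict]) -> bool:
    """snapshots: chronological metric dicts, e.g.
    {"phase1_success": 0.62, "phase2_activation": 0.21}
    (illustrative keys). A growing DB should push Phase 1
    coverage up and Phase 2 usage down."""
    first, last = snapshots[0], snapshots[-1]
    return (last["phase1_success"] >= first["phase1_success"]
            and last["phase2_activation"] <= first["phase2_activation"])
```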
⸻
5. Failure Is a First-Class Artifact
5.1 What Counts as a “Good Failure”
A good failure:
• Is labeled “Insufficient Evidence”
• Explains why no rule applied
• Produces zero noise
A bad failure:
• Selects a choice on a weak lexical match
• Uses mined rules silently
• Looks confident but is unjustified
⸻
5.2 Mandatory Failure Logging
For every failure, record:
• Input problem
• Extracted assumptions
• Which rules were skipped (and why)
• Which rules almost matched
• Phase reached (1 or 2)
Failures are data, not errors.
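A minimal sketch of such a record (field names are hypothetical, not the actual log schema):

```python
from dataclasses import dataclass, field

@dataclass
class FailureRecord:
    """One failure captured as data. All names are illustrative."""
    problem: str                     # input problem text
    assumptions: list[str]           # extracted assumptions
    skipped_rules: dict[str, str] = field(default_factory=dict)  # rule id -> reason skipped
    near_misses: list[str] = field(default_factory=list)         # rules that almost matched
    phase_reached: int = 1           # 1 or 2
```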
⸻
6. Regression Testing with Benchmarks
Every DB change should be evaluated against:
• A fixed benchmark suite
• A failure expectation list
Regression is not only:
• “Did it break a correct answer?”
but also:
• “Did it stop refusing where it should?”
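Both directions can be flagged from per-item outcome maps; a sketch under assumed conventions (items keyed by id, outcomes as the four bucket strings):

```python
def regressions(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Flag classic regressions (a justified answer broke) and refusal
    regressions (an expected refusal turned into an answer)."""
    flagged = []
    for item_id, old in before.items():
        new = after.get(item_id, "insufficient")
        if old in ("proven", "refuted") and new not in ("proven", "refuted"):
            flagged.append(f"{item_id}: lost justified resolution ({old} -> {new})")
        if old == "insufficient" and new in ("proven", "refuted"):
            # Answering where it used to refuse needs review: it is only
            # legitimate if new DB rules actually justify the answer.
            flagged.append(f"{item_id}: stopped refusing ({old} -> {new})")
    return flagged
```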
⸻
7. Comparing VERANTYX to LLMs (Correctly)
Do not compare:
• Raw accuracy
• Speed
• Fluency
Do compare:
• Hallucination rate
• Refusal correctness
• Sensitivity to missing assumptions
• Stability across re-runs
VERANTYX should lose on fluency — and win on discipline.
⸻
8. Human-in-the-Loop Evaluation
Recommended practice: a periodic human audit of:
• Refusal cases
• Provisional (mined) answers
• High-evidence proofs
Humans should ask:
“Would I trust this answer in a paper?”
If not, the system should not either.
⸻
9. Benchmark Suite Composition (Recommended)
A balanced VERANTYX benchmark includes:
• 30–40% unsolvable (DB-limited) problems
• 30% trap problems
• 20% clean axiomatic problems
• 10–20% exploratory / edge cases
If everything is solvable, the benchmark is broken.
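A sketch that validates a suite against these bands (the category tags are assumptions about how items are labeled):

```python
from collections import Counter

# Target shares from the recommended composition above.
BANDS = {
    "unsolvable": (0.30, 0.40),
    "trap": (0.30, 0.30),
    "axiomatic": (0.20, 0.20),
    "exploratory": (0.10, 0.20),
}

def composition_ok(tags: list[str], tolerance: float = 0.05) -> bool:
    """tags: one category label per benchmark item. Point targets
    (trap, axiomatic) are checked with a small tolerance."""
    if not tags:
        return False
    n = len(tags)
    shares = Counter(tags)
    return all(lo - tolerance <= shares[cat] / n <= hi + tolerance
               for cat, (lo, hi) in BANDS.items())
```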
⸻
10. Evaluation Anti-Patterns (Avoid These)
❌ Scoring by majority vote
❌ Treating refusal as failure
❌ Hiding Phase 2 usage
❌ Optimizing prompts to “pass benchmarks”
Benchmarks exist to expose limits, not hide them.
⸻
Closing Note
VERANTYX is successful when:
• It says “I don’t know” often
• And explains why
A benchmark that makes VERANTYX look perfect is a benchmark that teaches nothing.
⸻