pyrrho-modernbert-base-v1

Decide whether your retrieved sources support a confident answer, contradict each other, or simply don't contain it, all without an LLM call.

This is a fine-tune of answerdotai/ModernBERT-base on fitz-gov V5.1 for 3-class RAG governance classification: given a (query, retrieved contexts) pair, the model predicts one of:

| Verdict | Meaning |
|---|---|
| ABSTAIN | The sources do not contain enough information to answer. |
| DISPUTED | The sources contradict each other on the answer. |
| TRUSTWORTHY | The sources consistently and sufficiently support an answer. |

A drop-in replacement for the constraint+sklearn governance pipeline in fitz-sage. Single forward pass, ~30 ms on CPU after INT8 ONNX quantization, no external LLM dependency.


Results

Validated on the fitz-gov V5.1 eval split (584 cases, stratified 20% hold-out from tier1_core). All numbers are 3-seed mean ± std across seeds [42, 1337, 7].

| Metric | pyrrho v1 | fitz-sage v0.11 (sklearn baseline) | Δ |
|---|---|---|---|
| Overall accuracy (calibrated) | 86.13 ± 0.86 | 78.7 | +7.43 |
| False-trustworthy rate (safety) | 5.27 ± 0.21 | 5.7 | -0.43 (safer) |
| Trustworthy recall | 79.38 ± 1.64 | 70.0 | +9.38 |
| Disputed recall | 94.81 ± 1.28 | 86.1 | +8.71 |
| Abstain recall | 92.94 ± 1.11 | 86.5 | +6.44 |
| Macro F1 | 86.10 ± 0.80 | n/a | n/a |

Known limitations

  1. Multi-source-convergence cases can be misclassified as DISPUTED. When multiple authoritative sources state the same fact with slight numerical variation that falls within measurement tolerance (e.g., four climate agencies citing 1.09–1.20 °C of warming, or NIST and IUPAC both giving the speed of light), the model occasionally classifies the case as DISPUTED with high confidence. On the relevant fitz-gov subcategory (multi_source_convergence, n=7) the error rate is ~57%. A v2 release with augmented training data targeting this pattern is planned.

  2. Short, direct factual contexts can trigger over-abstention. Smoke-test example: the query "When was the iPhone released?" plus a single-sentence context confirming June 29, 2007 → predicted ABSTAIN with P(ABSTAIN)=0.92. The model was trained on 62.7% hard tier1 cases (rich methodological contexts), so it underweights the short-clean-answer pattern. Production RAG chunks (typically 200–500 chars) are tier1-like and largely unaffected.
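Until v2 lands, one pragmatic workaround (not part of the released pipeline; the function, field choices, and the 200-char threshold are all illustrative) is to flag ABSTAIN verdicts on a single very short context for a second look, since that is exactly the documented over-abstention pattern:

```python
def abstain_needs_review(pred_label, contexts, min_chars=200):
    """Heuristic guard (illustrative, not from the released pipeline):
    an ABSTAIN on a single context well under the 200-char production
    range matches the known over-abstention failure mode."""
    total_chars = sum(len(c) for c in contexts)
    return pred_label == "ABSTAIN" and len(contexts) == 1 and total_chars < min_chars

print(abstain_needs_review("ABSTAIN", ["The iPhone was released on June 29, 2007."]))  # True
```

Longer or multi-chunk contexts pass through untouched, so the guard only fires on the smoke-test-like shape described above.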


Usage

Direct (transformers)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1").eval()

query = "Has the company achieved profitability?"
contexts = [
    "The company posted its first profitable quarter, with net income of $4 million.",
    "The company recorded a quarterly loss of $12 million, the third consecutive losing quarter.",
]

# Build the input the same way the training data was formatted
text = f"Question: {query}\n\nSources:\n" + "\n".join(
    f"[{i}] {c}" for i, c in enumerate(contexts, start=1)
)

enc = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits[0]
probs = torch.softmax(logits, dim=-1).numpy()
labels = ["ABSTAIN", "DISPUTED", "TRUSTWORTHY"]
print(f"Predicted: {labels[int(probs.argmax())]}")
print(f"Probs    : A={probs[0]:.3f} D={probs[1]:.3f} T={probs[2]:.3f}")
```

CPU-optimized (ONNX + INT8)

For production CPU inference at ~30 ms / case, load the INT8 ONNX variant via optimum:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = ORTModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1",
    file_name="model_quantized.onnx",
)
# Same input format as above...
```

Calibrated decision rule

The headline numbers above use threshold calibration on the TRUSTWORTHY softmax probability. To match the published numbers, fall back from TRUSTWORTHY to the runner-up class when P(TRUSTWORTHY) < tau. The per-seed selected tau varied across runs (0.34–0.62); the safest default is tau = 0.50.

```python
TAU = 0.50
pred = int(probs.argmax())
if pred == 2 and probs[2] < TAU:     # TRUSTWORTHY id is 2
    pred = int(probs[:2].argmax())   # fall back to runner-up between ABSTAIN/DISPUTED
```
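The same rule as a self-contained, dependency-free helper (the function name is illustrative; `probs` can be any 3-element sequence in [ABSTAIN, DISPUTED, TRUSTWORTHY] order, e.g. the softmax output from the usage example above):

```python
LABELS = ["ABSTAIN", "DISPUTED", "TRUSTWORTHY"]

def calibrated_verdict(probs, tau=0.50):
    """Apply the tau fallback: demote a sub-threshold TRUSTWORTHY
    prediction to the stronger of ABSTAIN / DISPUTED."""
    pred = max(range(3), key=lambda i: probs[i])
    if pred == 2 and probs[2] < tau:                 # TRUSTWORTHY, but not confident enough
        pred = max(range(2), key=lambda i: probs[i])  # runner-up among ABSTAIN / DISPUTED
    return LABELS[pred]

print(calibrated_verdict([0.10, 0.15, 0.75]))  # TRUSTWORTHY (above tau, stands)
print(calibrated_verdict([0.30, 0.25, 0.45]))  # ABSTAIN (TRUSTWORTHY below tau, falls back)
```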

Training

| Hyperparameter | Value |
|---|---|
| Base model | answerdotai/ModernBERT-base |
| Architecture | ModernBERT (sequence classification head) |
| Labels (3-class) | ABSTAIN (0), DISPUTED (1), TRUSTWORTHY (2) |
| Max sequence length | 4096 tokens |
| Epochs | 5 (with early stopping, patience 2) |
| Per-device batch size | 16 |
| Effective batch size | 16 |
| Learning rate | 5e-5 |
| LR scheduler | cosine, 10% warmup |
| Weight decay | 0.01 |
| Label smoothing | 0.15 |
| Class weights | [2.3, 2.3, 1.0] (counters TRUSTWORTHY over-prediction from 53% class imbalance) |
| Loss | Weighted cross-entropy + label smoothing |
| Selection metric | ft_penalized_accuracy = accuracy - 3 * max(0, FT - 0.057) |
| Optimizer | adamw_torch_fused (bf16) |
| Hardware | NVIDIA RTX 5090 (Blackwell, sm_120) |
| Training time | ~80–500 s per run, depending on GPU contention |

Training data: fitz-gov V5.1 tier1_core, stratified 80/20 split by (label, difficulty) for train/eval. The 60-case tier0_sanity set is held out separately as a noise-prone diagnostic.
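Stratifying jointly on (label, difficulty) means each split preserves the joint distribution, not just the label marginals. A dependency-free sketch of that grouping (the `label`/`difficulty` field names are assumed; a real pipeline might instead use sklearn's train_test_split with a composite stratify key):

```python
import random
from collections import defaultdict

def stratified_split(cases, eval_frac=0.20, seed=42):
    """Group cases by (label, difficulty) and carve eval_frac off each
    group, so both splits keep the same joint distribution."""
    groups = defaultdict(list)
    for case in cases:
        groups[(case["label"], case["difficulty"])].append(case)
    rng = random.Random(seed)
    train, eval_split = [], []
    for group in groups.values():
        rng.shuffle(group)
        n_eval = round(len(group) * eval_frac)
        eval_split.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, eval_split

# 3 labels x 2 difficulties x 10 cases each -> 48 train / 12 eval
cases = [{"label": lab, "difficulty": diff}
         for lab in ("ABSTAIN", "DISPUTED", "TRUSTWORTHY")
         for diff in ("easy", "hard")
         for _ in range(10)]
train, eval_split = stratified_split(cases)
print(len(train), len(eval_split))  # 48 12
```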


Dataset

This model is trained and evaluated on fitz-gov V5.1, a 2,980-case benchmark for RAG governance (epistemic honesty). The eval split (584 cases) is a stratified 20% hold-out from tier1_core (2,920 cases, 62.7% hard difficulty, 17 domains, 113+ subcategories).

fitz-gov commit at training time: 3e1d22e22fdff726330a0d70503b07f73dacf817


Limitations & intended use

Intended use: as a CPU-friendly governance head inside a RAG pipeline that needs to decide when to answer, abstain, or flag a dispute. Drop-in replacement for the constraint+sklearn cascade in fitz-sage.

Not intended for:

  • Generating answers (this is a classification model, not a generator).
  • Token-level hallucination localization (see LettuceDetect for that; a complementary use).
  • Languages other than English. fitz-gov is English-only; multilingual variants are a v3+ consideration.

Safety axis: the false-trustworthy rate is the production safety metric. A case wrongly classified as TRUSTWORTHY is the dangerous error: the system would confidently surface a hallucinated or unsupported answer. Threshold calibration is tuned to keep this rate at or below the fitz-sage baseline (5.7%).
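Computed directly, the false-trustworthy rate is the fraction of eval cases whose gold label is ABSTAIN or DISPUTED but whose prediction is TRUSTWORTHY. A minimal sketch, assuming the label ids from the usage section and a rate taken over all cases (verify the denominator convention against your own eval harness):

```python
def false_trustworthy_rate(y_true, y_pred, trustworthy=2):
    """Fraction of cases where an ABSTAIN/DISPUTED gold label was
    predicted TRUSTWORTHY: the unsafe error direction."""
    unsafe = sum(1 for t, p in zip(y_true, y_pred)
                 if t != trustworthy and p == trustworthy)
    return unsafe / len(y_true)

# One unsafe error (gold DISPUTED predicted TRUSTWORTHY) out of 4 cases
print(false_trustworthy_rate([0, 1, 2, 0], [0, 2, 2, 0]))  # 0.25
```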


Citation

```bibtex
@misc{pyrrho_v1_2026,
  title  = {pyrrho-modernbert-base-v1},
  author = {Yan Fitzner},
  year   = {2026},
  url    = {https://huggingface.co/yafitzdev/pyrrho-modernbert-base-v1},
}
```

License

Apache 2.0; see LICENSE.

Related projects

  • fitz-sage: production RAG library that uses this model.
  • fitz-gov: the benchmark dataset.
  • pyrrho: training code and roadmap for the full model family.