pyrrho-modernbert-base-v1

Decide whether your retrieved sources support a confident answer, contradict each other, or simply don't contain it, all without an LLM call.

This is a fine-tune of answerdotai/ModernBERT-base on fitz-gov V5.1 for 3-class RAG governance classification: given a (query, retrieved contexts) pair, the model predicts one of:

| Verdict | Meaning |
|---|---|
| ABSTAIN | The sources do not contain enough information to answer. |
| DISPUTED | The sources contradict each other on the answer. |
| TRUSTWORTHY | The sources consistently and sufficiently support an answer. |

A drop-in replacement for the constraint+sklearn governance pipeline in fitz-sage. Single forward pass, ~30 ms on CPU after INT8 ONNX quantization, no external LLM dependency.


Results

Validated on the fitz-gov V5.1 eval split (584 cases, stratified 20% hold-out from tier1_core). All numbers are 3-seed mean ± std across seeds [42, 1337, 7].

| Metric | pyrrho v1 | fitz-sage v0.11 (sklearn baseline) | Δ |
|---|---|---|---|
| Overall accuracy (calibrated) | 86.13 ± 0.86 | 78.7 | +7.43 |
| False-trustworthy rate (safety) | 5.27 ± 0.21 | 5.7 | -0.43 (safer) |
| Trustworthy recall | 79.38 ± 1.64 | 70.0 | +9.38 |
| Disputed recall | 94.81 ± 1.28 | 86.1 | +8.71 |
| Abstain recall | 92.94 ± 1.11 | 86.5 | +6.44 |
| Macro F1 | 86.10 ± 0.80 | n/a | n/a |

Known limitations

  1. Multi-source-convergence cases can be misclassified as DISPUTED. When multiple authoritative sources state the same fact with slight numerical variation that falls within measurement tolerance (e.g., four climate agencies citing 1.09–1.20 °C of warming, or NIST and IUPAC both giving the speed of light), the model occasionally classifies the case as DISPUTED with high confidence. On the relevant fitz-gov subcategory (multi_source_convergence, n=7) the error rate is ~57%. A v2 release with augmented training data targeting this pattern is planned.

  2. Short, direct factual contexts can trigger over-abstention. Smoke-test example: the query "When was the iPhone released?" plus a single-sentence context confirming June 29, 2007 → predicted ABSTAIN with P(ABSTAIN)=0.92. The model was trained on 62.7% hard tier1 cases (rich methodological contexts), so it underweights the short-clean-answer pattern. Production RAG chunks (typically 200–500 chars) are tier1-like and largely unaffected.
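Until v2 lands, one pragmatic workaround (not part of the released pipeline; the function, field choices, and the 200-char threshold are all illustrative) is to flag ABSTAIN verdicts on a single very short context for a second look, since that is exactly the documented over-abstention pattern:

```python
def abstain_needs_review(pred_label, contexts, min_chars=200):
    """Heuristic guard (illustrative, not from the released pipeline):
    an ABSTAIN on a single context well under the 200-char production
    range matches the known over-abstention failure mode."""
    total_chars = sum(len(c) for c in contexts)
    return pred_label == "ABSTAIN" and len(contexts) == 1 and total_chars < min_chars

print(abstain_needs_review("ABSTAIN", ["The iPhone was released on June 29, 2007."]))  # True
```

Longer or multi-chunk contexts pass through untouched, so the guard only fires on the smoke-test-like shape described above.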


Usage

Direct (transformers)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1").eval()

query = "Has the company achieved profitability?"
contexts = [
    "The company posted its first profitable quarter, with net income of $4 million.",
    "The company recorded a quarterly loss of $12 million, the third consecutive losing quarter.",
]

# Build the input the same way the training data was formatted
text = f"Question: {query}\n\nSources:\n" + "\n".join(
    f"[{i}] {c}" for i, c in enumerate(contexts, start=1)
)

enc = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits[0]
probs = torch.softmax(logits, dim=-1).numpy()
labels = ["ABSTAIN", "DISPUTED", "TRUSTWORTHY"]
print(f"Predicted: {labels[int(probs.argmax())]}")
print(f"Probs    : A={probs[0]:.3f} D={probs[1]:.3f} T={probs[2]:.3f}")
```

CPU-optimized (ONNX + INT8)

For production CPU inference at ~30 ms / case, load the INT8 ONNX variant via optimum:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = ORTModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1",
    file_name="model_quantized.onnx",
)
# Same input format as above...
```

Calibrated decision rule

The headline numbers above use threshold calibration on the TRUSTWORTHY softmax probability. To match the published numbers, fall back from TRUSTWORTHY to the runner-up class when P(TRUSTWORTHY) < tau. The per-seed selected tau varied across runs (0.34–0.62); the safest default is tau = 0.50.

```python
TAU = 0.50
pred = int(probs.argmax())
if pred == 2 and probs[2] < TAU:     # TRUSTWORTHY id is 2
    pred = int(probs[:2].argmax())   # fall back to runner-up between ABSTAIN/DISPUTED
```
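The same rule as a self-contained, dependency-free helper (the function name is illustrative; `probs` can be any 3-element sequence in [ABSTAIN, DISPUTED, TRUSTWORTHY] order, e.g. the softmax output from the usage example above):

```python
LABELS = ["ABSTAIN", "DISPUTED", "TRUSTWORTHY"]

def calibrated_verdict(probs, tau=0.50):
    """Apply the tau fallback: demote a sub-threshold TRUSTWORTHY
    prediction to the stronger of ABSTAIN / DISPUTED."""
    pred = max(range(3), key=lambda i: probs[i])
    if pred == 2 and probs[2] < tau:                 # TRUSTWORTHY, but not confident enough
        pred = max(range(2), key=lambda i: probs[i])  # runner-up among ABSTAIN / DISPUTED
    return LABELS[pred]

print(calibrated_verdict([0.10, 0.15, 0.75]))  # TRUSTWORTHY (above tau, stands)
print(calibrated_verdict([0.30, 0.25, 0.45]))  # ABSTAIN (TRUSTWORTHY below tau, falls back)
```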

Training

| Hyperparameter | Value |
|---|---|
| Base model | answerdotai/ModernBERT-base |
| Architecture | ModernBERT (sequence classification head) |
| Labels (3-class) | ABSTAIN (0), DISPUTED (1), TRUSTWORTHY (2) |
| Max sequence length | 4096 tokens |
| Epochs | 5 (with early stopping, patience 2) |
| Per-device batch size | 16 |
| Effective batch size | 16 |
| Learning rate | 5e-5 |
| LR scheduler | cosine, 10% warmup |
| Weight decay | 0.01 |
| Label smoothing | 0.15 |
| Class weights | [2.3, 2.3, 1.0] (counters TRUSTWORTHY over-prediction from 53% class imbalance) |
| Loss | Weighted cross-entropy + label smoothing |
| Selection metric | ft_penalized_accuracy = accuracy - 3 * max(0, FT - 0.057) |
| Optimizer | adamw_torch_fused (bf16) |
| Hardware | NVIDIA RTX 5090 (Blackwell, sm_120) |
| Training time | ~80–500 s per run, depending on GPU contention |

Training data: fitz-gov V5.1 tier1_core, stratified 80/20 split by (label, difficulty) for train/eval. The 60-case tier0_sanity set is held out separately as a noise-prone diagnostic.
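Stratifying jointly on (label, difficulty) means each split preserves the joint distribution, not just the label marginals. A dependency-free sketch of that grouping (the `label`/`difficulty` field names are assumed; a real pipeline might instead use sklearn's train_test_split with a composite stratify key):

```python
import random
from collections import defaultdict

def stratified_split(cases, eval_frac=0.20, seed=42):
    """Group cases by (label, difficulty) and carve eval_frac off each
    group, so both splits keep the same joint distribution."""
    groups = defaultdict(list)
    for case in cases:
        groups[(case["label"], case["difficulty"])].append(case)
    rng = random.Random(seed)
    train, eval_split = [], []
    for group in groups.values():
        rng.shuffle(group)
        n_eval = round(len(group) * eval_frac)
        eval_split.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, eval_split

# 3 labels x 2 difficulties x 10 cases each -> 48 train / 12 eval
cases = [{"label": lab, "difficulty": diff}
         for lab in ("ABSTAIN", "DISPUTED", "TRUSTWORTHY")
         for diff in ("easy", "hard")
         for _ in range(10)]
train, eval_split = stratified_split(cases)
print(len(train), len(eval_split))  # 48 12
```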


Dataset

This model is trained and evaluated on fitz-gov V5.1, a 2,980-case benchmark for RAG governance (epistemic honesty). The eval split (584 cases) is a stratified 20% hold-out from tier1_core (2,920 cases, 62.7% hard difficulty, 17 domains, 113+ subcategories).

fitz-gov commit at training time: 3e1d22e22fdff726330a0d70503b07f73dacf817


Limitations & intended use

Intended use: as a CPU-friendly governance head inside a RAG pipeline that needs to decide when to answer, abstain, or flag a dispute. Drop-in replacement for the constraint+sklearn cascade in fitz-sage.

Not intended for:

  • Generating answers (this is a classification model, not a generator).
  • Token-level hallucination localization (see LettuceDetect for that; a complementary use).
  • Languages other than English. fitz-gov is English-only; multilingual variants are a v3+ consideration.

Safety axis: the false-trustworthy rate is the production safety metric. A case wrongly classified as TRUSTWORTHY is the dangerous error: the system would confidently surface a hallucinated or unsupported answer. Threshold calibration is tuned to keep this rate at or below the fitz-sage baseline (5.7%).
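Computed directly, the false-trustworthy rate is the fraction of eval cases whose gold label is ABSTAIN or DISPUTED but whose prediction is TRUSTWORTHY. A minimal sketch, assuming the label ids from the usage section and a rate taken over all cases (verify the denominator convention against your own eval harness):

```python
def false_trustworthy_rate(y_true, y_pred, trustworthy=2):
    """Fraction of cases where an ABSTAIN/DISPUTED gold label was
    predicted TRUSTWORTHY: the unsafe error direction."""
    unsafe = sum(1 for t, p in zip(y_true, y_pred)
                 if t != trustworthy and p == trustworthy)
    return unsafe / len(y_true)

# One unsafe error (gold DISPUTED predicted TRUSTWORTHY) out of 4 cases
print(false_trustworthy_rate([0, 1, 2, 0], [0, 2, 2, 0]))  # 0.25
```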


Citation

```bibtex
@misc{pyrrho_v1_2026,
  title  = {pyrrho-modernbert-base-v1},
  author = {Yan Fitzner},
  year   = {2026},
  url    = {https://huggingface.co/yafitzdev/pyrrho-modernbert-base-v1},
}
```

License

Apache 2.0; see LICENSE.

Related projects

  • fitz-sage: production RAG library that uses this model.
  • fitz-gov: the benchmark dataset.
  • pyrrho: training code and roadmap for the full model family.