Bosun-4B (4B)

Bosun-4B: judging which edges in an agent's memory graph are warranted

Launch post: Introducing Bosun →

The judge that keeps an agent's memory (its knowledge graph) clean. As an agent accumulates memory as a graph of facts linked by relationships, Bosun-4B decides, edge by edge, which connections are warranted (supported, non-redundant, still true), so the graph stays useful instead of growing into noise that drowns the model reading it back. Nothing else scores that "judge" step; Bosun-4B is a small, fast, calibrated model built for it, and you program it with a sentence.

Given two findings and an instruction, it emits P = sigmoid(logit_yes - logit_no) ∈ [0,1]: how strongly the pair satisfies the rule you supplied, with no opinion of its own. "Warranted" isn't one fixed rule (it might mean same-entity, cross-domain bridge, not-a-duplicate, or still-supported-by-evidence), so you define it per graph; Bosun-4B follows the rule, respects negation, and generalizes to rules it never trained on. That same capability is what RAG filtering, content moderation, and deduplication need too; knowledge-graph curation is simply where the need bites first and hardest.
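The scoring rule above can be worked through numerically. The sketch below uses plain Python; the function names and the 0.5 cut-off are illustrative assumptions, not part of Bosun's published contract.

```python
import math


def warrant_score(logit_yes: float, logit_no: float) -> float:
    """P = sigmoid(logit_yes - logit_no), computed in a numerically stable way."""
    d = logit_yes - logit_no
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def is_warranted(p: float, threshold: float = 0.5) -> bool:
    """Keep the edge when the score clears a threshold (0.5 is an assumed default)."""
    return p >= threshold


# Equal logits express no preference; a 2-point margin favours "yes".
print(warrant_score(0.0, 0.0))            # 0.5
print(round(warrant_score(3.0, 1.0), 3))  # 0.881
```

Only the difference between the two logits matters, so the score is unchanged if both logits shift by the same amount.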

LoRA fine-tune of Qwen/Qwen3-Reranker-4B, scored on the native reranker yes/no logits.

Changelog

v1.1: broader general judgment (current)

Same architecture and inference contract as v1.0; retrained on an expanded blend (DialAM-2024 argument edges, NLI, PAWS, e-CARE/COPA causal, dedup hard-negatives, completeness, and synthetic directional data, on top of v1.0). Still one model, programmed by a sentence, with no per-task fine-tuning.

New: directional and typed-edge judgment, covering supersession ("B replaces A"), depends-on, and supports / contradicts. Bosun now reads the ordered pair for asymmetric relations, not just symmetric similarity.
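Because asymmetric relations depend on order, one way to use this is to score both orderings and compare them. The sketch below assumes a hypothetical `score_fn(rule, text_a, text_b)` that returns Bosun's probability; the toy scorer stands in for the model only so that the example runs.

```python
from typing import Callable, Tuple

# Hypothetical signature: (rule, finding_a, finding_b) -> P in [0, 1]
ScoreFn = Callable[[str, str, str], float]


def score_both_orders(score_fn: ScoreFn, rule: str, a: str, b: str) -> Tuple[float, float]:
    """Score the ordered pair (a, b) and the reversed pair (b, a) under one rule."""
    return score_fn(rule, a, b), score_fn(rule, b, a)


def toy_scorer(rule: str, a: str, b: str) -> float:
    """Stand-in for the model: B 'supersedes' A when B ends with the later year."""
    year_a, year_b = int(a.split()[-1]), int(b.split()[-1])
    return 0.9 if year_b > year_a else 0.1


rule = "Connected only if finding B supersedes finding A."
forward, backward = score_both_orders(
    toy_scorer, rule, "Pricing policy issued 2021", "Pricing policy issued 2024"
)
print(forward, backward)  # 0.9 0.1
```

A symmetric similarity score would return the same value for both orderings, which is why the ordered pair matters for relations such as supersession.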

Generality on held-out public benchmarks (one instruction each), vs a frontier LLM on the same items:

| benchmark | Bosun-4B v1.1 | gemini-3.1-flash-lite | similarity baseline | fine-tuned specialist |
|---|---|---|---|---|
| PAWS (adversarial paraphrase) | 0.91 | 0.81 | ~chance (0.53 AUROC) | ~0.95 (DeBERTa) |
| e-CARE (causal direction) | 0.85 | 0.86 | 0.60 | ~0.75 (paper) |
| ANLI (adversarial NLI) | 0.57 | 0.74 | 0.33 | ~0.69 |

Bosun-4B beats gemini-3.1-flash-lite on PAWS (0.91 vs 0.81), roughly ties it on e-CARE (0.85 vs 0.86), and trails on ANLI (0.57 vs 0.74), while far outscoring it on steerable judgment (WarrantBench 0.945 vs 0.575). Edge curation (DialAM-2024): recall 0.71, beating Sonnet on both recall and precision.

No regression: FollowIR flat vs v1.0; WarrantBench steerability 0.885 → 0.945.

v1.0: launch

Symmetric programmable judge. WarrantBench steerability 0.885; FollowIR state-of-the-art (+17.9 p-MRR).

Inference contract

Native Qwen3-Reranker template; read the last-token logits:

<Instruct>: <your rule, e.g. "Connected only if the two findings share a specific named entity.">
<Query>: These two findings share the specified relationship.
<Document>: FINDING A:\n<text_a>\n\nFINDING B:\n<text_b>

score = sigmoid(logits[yes_id] - logits[no_id]) at the final position (logits_to_keep=1). The exact yes_id / no_id / template prefix+suffix and max_len are in serving.json.
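As a sketch of assembling the middle of that prompt: the helper below is hypothetical, and it assumes that the `\n` sequences in the template stand for real newlines and that a single space follows each tag. The prefix and suffix that wrap this text come from serving.json and are not reproduced here.

```python
QUERY = "These two findings share the specified relationship."


def build_body(rule: str, text_a: str, text_b: str) -> str:
    """Fill the <Instruct>/<Query>/<Document> fields for one pair of findings."""
    document = f"FINDING A:\n{text_a}\n\nFINDING B:\n{text_b}"
    return f"<Instruct>: {rule}\n<Query>: {QUERY}\n<Document>: {document}"


body = build_body(
    "Connected only if the two findings share a specific named entity.",
    "Acme Corp opened a plant in Leeds.",
    "The Leeds plant is run by Acme Corp.",
)
print(body)
```

Only the rule and the two findings change between calls; the query sentence stays fixed, so the instruction is the single place where the judgment is programmed.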

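import json

import torch
from huggingface_hub import hf_hub_download
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Hanno-Labs/bosun-4b"
# serving.json (from this repo) holds base_model, yes_id / no_id, the template and max_len
with open(hf_hub_download(repo, "serving.json")) as f:
    cfg = json.load(f)
# left padding keeps every sequence's final token at position -1
tok  = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer", padding_side="left")
base = AutoModelForCausalLM.from_pretrained(cfg["base_model"], torch_dtype=torch.bfloat16,
                                            attn_implementation="sdpa", trust_remote_code=True)
# merge the LoRA adapter into the base weights, then switch to inference mode on the GPU
model = PeftModel.from_pretrained(base, repo).merge_and_unload().eval().cuda()
# build ids = prefix + <Instruct/Query/Document> + suffix, then:
# with torch.no_grad():
#     lg = model(input_ids=input_ids, attention_mask=attention_mask, logits_to_keep=1).logits[:, -1, :]
#     p  = torch.sigmoid(lg[:, cfg["yes_id"]] - lg[:, cfg["no_id"]])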
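The tokenizer is configured for left padding, which is what lets a whole batch be read at a single final position. The sketch below shows the idea with plain lists; `pad_id = 0` and the token ids are made-up values, and in practice the tokenizer does this for you.

```python
from typing import List, Tuple


def left_pad(batch: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad token id lists to a common length and build the attention mask."""
    width = max(len(ids) for ids in batch)
    input_ids = [[pad_id] * (width - len(ids)) + ids for ids in batch]
    attention_mask = [[0] * (width - len(ids)) + [1] * len(ids) for ids in batch]
    return input_ids, attention_mask


input_ids, attention_mask = left_pad([[11, 12, 13], [21, 22]])
print(input_ids)       # [[11, 12, 13], [0, 21, 22]]
print(attention_mask)  # [[1, 1, 1], [0, 1, 1]]
# Every row ends on a real token, so position -1 is the right place to read logits.
print([row[-1] for row in input_ids])  # [13, 22]
```

With right padding the shorter row would end on a pad token, and the yes/no logits read at the final position would be meaningless for that row.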

Run locally (GGUF / llama.cpp)

CPU / Apple-Silicon / edge builds (f16, Q8_0, Q4_K_M; all calibration-safe at 4B) live at Hanno-Labs/bosun-4b-GGUF.

⚠️ Do not use llama.cpp's --rerank mode: it silently discards the <Instruct> and returns degenerate, instruction-blind scores. Use the completion + logits path documented in that repo (validated per-pair against this model's transformers reference; Q8_0 within ~0.001).

Results

Bosun-4B is state-of-the-art on FollowIR (public instruction-following retrieval), averaging +17.9 p-MRR on the full pool: it changes its judgments correctly when the instruction changes, where most retrievers move the wrong way. On a capped pool it matches gemini-3.1-flash-lite head-to-head (12.0 vs 12.0) at a fraction of the cost.
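For readers unfamiliar with p-MRR, the sketch below follows the per-document definition as we understand it from the FollowIR paper: it compares the rank of a document that the changed instruction makes irrelevant, before and after the change. The function name and example ranks are illustrative, and the paper should be checked for the exact aggregation; scores such as +17.9 are on a scale multiplied by 100.

```python
def p_mrr(rank_og: int, rank_new: int) -> float:
    """Per-document p-MRR in [-1, 1]; positive when the document is pushed down."""
    mrr_og, mrr_new = 1.0 / rank_og, 1.0 / rank_new
    if rank_og > rank_new:           # document moved up: the wrong direction
        return mrr_og / mrr_new - 1.0
    return 1.0 - mrr_new / mrr_og    # document moved down, or stayed put


print(round(p_mrr(2, 10), 3))  # 0.8
print(round(p_mrr(10, 2), 3))  # -0.8
print(round(p_mrr(5, 5), 3))   # 0.0
```

A positive average therefore means the judge demotes newly irrelevant documents when the instruction changes, while a negative average means it promotes them.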

WarrantBench (Hanno-Labs/warrantbench): follows arbitrary rules and their negations, and flips correctly on steerability triples. The 4B capacity closes the hardest-slice gap to the frontier LLM that the 0.6B leaves open.

Files

| file | what |
|---|---|
| adapter_model.safetensors, adapter_config.json | the LoRA adapter (load with PEFT over the base) |
| serving.json | inference contract: template + yes_id/no_id + max_len |
| tokenizer/ | Qwen tokenizer (left-padding) |

Links

From Hanno Labs.

Evaluation results

  • Steerability (score flips with the rule) on WarrantBench
    self-reported
    0.945
  • p-MRR (full pool, avg of 3 tasks) on FollowIR
    self-reported
    17.900