meta-qwen-14b-universal — a general Doubter wrapper for Qwen2.5-14B-Instruct

⚠️ This is a wrapper, not a standalone model. To load and run it you need the meta-spider framework — get it from Codeberg: https://codeberg.org/imperius/meta-spider (pip install -e meta-core). The files here (doubter_checkpoint.pt / doubter_sidecar.gguf) are the trained wrapper weights only; they attach to a frozen Qwen/Qwen2.5-14B-Instruct via that framework (PyTorch) or its llama.cpp deploy path (GGUF). See Usage and Framework below.

A trained meta-attention "Doubter" wrapper for Qwen/Qwen2.5-14B-Instruct. It is not a full model — it is a thin wrapper (~2–3% of the base) that reads the frozen base's own activations and injects cognitive tokens through gated cross-attention, so the model learns when to act vs. when to hold: answer from memory, call a tool, ask to clarify, or decline — calibrated to its own uncertainty. The base weights are never modified.

Unlike a narrow Doubter (trained on one task — e.g. MMLU-refusal or one agentic benchmark), this checkpoint is diverse-trained: one wrapper trained on a balanced mix of commit and hold scenarios across the whole decision space. The result is the first of our wrappers that does not collapse on any single axis — it is a general uncertainty modifier, not one over-fit to caution.

What's in here

| File | What it is |
| --- | --- |
| `doubter_checkpoint.pt` | the trained wrapper weights (selective encoder + cross-attention + gates), ~365 MB |
| `run.json` | the training manifest (base model, layers, encoder type, quantization, the diverse mix) |
| `doubter_sidecar.gguf` | the same wrapper exported for llama.cpp (CPU / Metal / edge), ~535 MB (float32; 232 tensors) |

What "general" means — results on a 6-axis agentic suite

Evaluated on a held-out diverse agentic suite (90 tasks: 6 decision axes × 15 tasks each; sources: When2Call, PopQA, SQuAD2), strictly disjoint from training. Grading depends on the axis type: action axes (call / abstain / clarify) are scored by log-prob MCQ; knowledge axes (memory / lookup / unknown) are scored by generation plus an LLM judge, because templated options do not work for open-ended knowledge questions. Compare deltas within an axis, not raw accuracy across axis types.
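As a rough illustration of log-prob MCQ grading for the action axes, the sketch below scores each candidate action by the log-probabilities the model assigns to its tokens and picks the highest-scoring one. The `pick_option` helper, the length normalization, and the numbers are hypothetical; the suite's actual prompt template and scoring details are not documented here.

```python
from typing import Dict, List


def pick_option(option_logprobs: Dict[str, List[float]],
                length_normalize: bool = True) -> str:
    """Return the option whose token sequence the model finds most likely."""
    def score(token_logprobs: List[float]) -> float:
        total = sum(token_logprobs)
        # Dividing by length keeps short options from winning by default
        # (an assumption, not necessarily what the suite does).
        return total / len(token_logprobs) if length_normalize else total

    return max(option_logprobs, key=lambda name: score(option_logprobs[name]))


# Hypothetical per-token log-probs for the three action options on one task.
example = {
    "call": [-0.9, -1.1],
    "abstain": [-0.3, -0.4, -0.2],
    "clarify": [-1.5, -0.8],
}
prediction = pick_option(example)
is_correct = prediction == "abstain"  # compare against the task's labeled action
```

Per-axis accuracy is then the fraction of tasks on that axis where `is_correct` is true.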

| axis | base | narrow (MMLU-refusal) | narrow (When2Call-FT) | this (diverse) |
| --- | --- | --- | --- | --- |
| call (use the right tool) | 0.40 | 0.27 | 0.40 | 0.47 ← best |
| abstain (no tool fits) | 0.27 | 0.33 | 0.93 | 0.67 |
| clarify (ambiguous) | 0.60 | 0.53 | 0.73 | 0.73 |
| memory (answer from knowledge) | 0.93 | 0.87 | 0.73 | 0.87 |
| lookup (need a fact) | 0.87 | 0.93 | 1.00 | 0.93 |
| unknown (unanswerable) | 0.13 | 0.93 | 1.00 | 0.93 |
| worst axis (floor) | 0.13 | 0.27 | 0.40 | 0.47 ← best |
| commit mean (call, memory) | 0.67 | 0.57 | 0.57 | 0.67 ← tied best (with base) |
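The two summary rows can be recomputed from the per-axis scores. The sketch below does this from the rounded values in the table, so a recomputed mean can differ from the printed one in the last digit; the variable and function names are illustrative only.

```python
# Per-axis scores copied from the table above (rounded to two decimals).
SCORES = {
    "base": {"call": 0.40, "abstain": 0.27, "clarify": 0.60,
             "memory": 0.93, "lookup": 0.87, "unknown": 0.13},
    "narrow (MMLU-refusal)": {"call": 0.27, "abstain": 0.33, "clarify": 0.53,
                              "memory": 0.87, "lookup": 0.93, "unknown": 0.93},
    "narrow (When2Call-FT)": {"call": 0.40, "abstain": 0.93, "clarify": 0.73,
                              "memory": 0.73, "lookup": 1.00, "unknown": 1.00},
    "this (diverse)": {"call": 0.47, "abstain": 0.67, "clarify": 0.73,
                       "memory": 0.87, "lookup": 0.93, "unknown": 0.93},
}
COMMIT_AXES = ("call", "memory")


def worst_axis(arm: dict) -> float:
    """Floor: the lowest score across all six axes."""
    return min(arm.values())


def commit_mean(arm: dict) -> float:
    """Mean score over the commit axes (call, memory)."""
    return sum(arm[axis] for axis in COMMIT_AXES) / len(COMMIT_AXES)


for name, arm in SCORES.items():
    print(f"{name:24s} floor={worst_axis(arm):.2f} commit_mean={commit_mean(arm):.3f}")
```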

How to read this.

- The base is badly miscalibrated: it answers everything (unknown 0.13, meaning it confidently answers the unanswerable) and under-uses tools.
- A narrow Doubter transfers abstention, even cross-domain (the When2Call-trained one fixes QA unknown 0.13→1.00), but it over-abstains: it sacrifices the commit axes (memory 0.93→0.73, call stuck at the base's 0.40).
- This diverse Doubter is the only arm that lifts the hold axes (unknown 0.13→0.93) without collapsing any commit axis. It raises call above every other arm (0.40→0.47), keeps memory high (0.87), and has the highest worst-axis floor (0.47). That is the point of a general modifier: it works across the whole decision space rather than just becoming cautious.

Trade-off, not domination: the narrow When2Call wrapper has higher peak hold scores. They are different points on the same calibration curve — the diverse wrapper is the balanced one. Which you want depends on the cost of over- vs. under-abstention in your deployment.
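One way to make that choice concrete is a simple cost-weighted comparison. The sketch below is a hypothetical proxy, not part of the evaluation: it treats errors on the commit axes as over-abstention, errors on the remaining four axes as under-abstention (grouping all non-commit axes as "hold" is an assumption), and weights them by deployment-specific costs that you supply.

```python
COMMIT_AXES = ("call", "memory")
# Assumption: every axis that is not a commit axis is treated as a hold axis.
HOLD_AXES = ("abstain", "clarify", "lookup", "unknown")

# Per-axis scores for the two wrappers being compared, from the results table.
ARMS = {
    "narrow (When2Call-FT)": {"call": 0.40, "abstain": 0.93, "clarify": 0.73,
                              "memory": 0.73, "lookup": 1.00, "unknown": 1.00},
    "this (diverse)": {"call": 0.47, "abstain": 0.67, "clarify": 0.73,
                       "memory": 0.87, "lookup": 0.93, "unknown": 0.93},
}


def mean_error(scores: dict, axes: tuple) -> float:
    """Average error rate (1 - accuracy) over the given axes."""
    return sum(1.0 - scores[axis] for axis in axes) / len(axes)


def expected_cost(scores: dict, cost_over: float, cost_under: float) -> float:
    """Weighted sum of commit-axis and hold-axis error rates."""
    return (cost_over * mean_error(scores, COMMIT_AXES)
            + cost_under * mean_error(scores, HOLD_AXES))


def cheapest_arm(cost_over: float, cost_under: float) -> str:
    """Name of the wrapper with the lowest expected cost under these weights."""
    return min(ARMS, key=lambda name: expected_cost(ARMS[name], cost_over, cost_under))


print(cheapest_arm(cost_over=3.0, cost_under=1.0))  # over-abstention is expensive
print(cheapest_arm(cost_over=1.0, cost_under=3.0))  # under-abstention is expensive
```

Under this proxy the two wrappers come out close when both costs are equal, so the ranking is driven almost entirely by which kind of mistake is more expensive for you.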

Training configuration (from `run.json`)

- Base: Qwen/Qwen2.5-14B-Instruct (frozen), nf4-quantized, bfloat16
- Encoder: selective (16 cognitive tokens, bottleneck 256, scalar tanh gates)
- Layers (read + inject): [32..47] (the last third of the network, chosen by a linear-probe sweep showing that the uncertainty signal is concentrated there)
- Data: a balanced diverse mix, 490 examples, commit 210 / hold 280, all disjoint from the test suite:
  - call ← When2Call train_pref.chosen_response (<TOOLCALL> → native tool call)
  - memory ← PopQA high-popularity + SQuAD2-train answerable (direct answer)
  - abstain / clarify ← When2Call train_sft
  - lookup ← PopQA long-tail (search tool call)
  - unknown ← SQuAD2-train unanswerable (refuse)
- 6 epochs, best val-loss 0.478. Started from the MMLU-refusal checkpoint.
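To give a feel for what a scalar tanh gate does, here is a minimal pure-Python sketch of a gated residual injection. It is not the framework's implementation: the `gated_inject` helper, the way the gain multiplies the gate, and the toy vectors are all assumptions made for illustration.

```python
import math
from typing import List

# Layers 32..47 inclusive, matching the configuration above.
TARGET_LAYERS = list(range(32, 48))


def gated_inject(hidden: List[float], update: List[float],
                 gate_param: float, gain: float = 1.0) -> List[float]:
    """Residual injection: hidden + gain * tanh(gate_param) * update.

    tanh keeps the learned scalar gate in (-1, 1); a gate parameter of 0
    (or a gain of 0) leaves the hidden state exactly as the base produced it.
    """
    gate = gain * math.tanh(gate_param)
    return [h + gate * u for h, u in zip(hidden, update)]


hidden = [0.5, -1.0, 2.0]   # toy hidden state at one position
update = [0.2, 0.4, -0.6]   # stand-in for the cross-attention output

unchanged = gated_inject(hidden, update, gate_param=0.0)        # closed gate
muted = gated_inject(hidden, update, gate_param=0.8, gain=0.0)  # gain turned off
active = gated_inject(hidden, update, gate_param=0.8)           # wrapper contributes
```

In this sketch a closed gate or zero gain reproduces the base's hidden state, which mirrors the statement that the base weights are never modified and that a gain of 0 recovers the base.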

Usage

```python
from meta_core import MetaSpiderConfig, MetaSpiderPipeline, Doubter

# Load the frozen base in nf4. Layers 32..47 are both read and injected into
# (see the training configuration).
cfg = MetaSpiderConfig(
    model_name="Qwen/Qwen2.5-14B-Instruct",
    device="cuda", dtype="bfloat16", quantization="nf4",
    target_layers=list(range(32, 48)),
    cross_attn_layers=list(range(32, 48)),
)
pipe = MetaSpiderPipeline.from_pretrained(cfg)

# Attach the trained wrapper weights to the base.
d = Doubter.from_checkpoint("doubter_checkpoint.pt")
pipe.attach(d)
d.set_gain(1.0)   # the uncertainty "volume" knob; 0 = base, ~1.5 = max caution

print(pipe.generate("What is the capital of France?"))          # answers from memory
print(pipe.generate("<an unanswerable / false-premise question>"))  # declines instead of hallucinating
```

Needs `pip install meta-core "transformers>=5.11" accelerate bitsandbytes` and torch ≥ 2.5.
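To see how the gain knob changes behavior on your own prompts, a small sweep can help. The helper below is a sketch that relies only on the two calls shown in the Usage snippet (`set_gain` and `generate`); the `sweep_gain` name and the default gain values are hypothetical.

```python
from typing import Dict, Iterable


def sweep_gain(pipe, doubter, prompt: str,
               gains: Iterable[float] = (0.0, 0.5, 1.0, 1.5)) -> Dict[float, str]:
    """Generate one completion per gain setting and return {gain: text}.

    `pipe` and `doubter` are expected to be the attached pipeline and wrapper
    from the Usage snippet. The gain is left at the last value tried.
    """
    outputs = {}
    for gain in gains:
        doubter.set_gain(gain)
        outputs[gain] = pipe.generate(prompt)
    return outputs


# Example (requires the loaded pipeline and wrapper from the Usage snippet):
# for gain, text in sweep_gain(pipe, d, "<an unanswerable question>").items():
#     print(gain, text)
```

Comparing the gain 0 output with the higher settings shows where the wrapper starts to override the base's default answer.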

Framework

Produced and consumed by the meta-spider framework (codeberg.org/imperius/meta-spider; components: meta-core / meta-loom / meta-agent / meta-deploy). The gain knob is set at inference time (d.set_gain(x) / META_GAIN), and a GGUF sidecar (metadeploy export) runs the same two-pass wrapper on CPU inside llama.cpp.

Caveats

- Model-specific: calibrated to the activation distribution of Qwen/Qwen2.5-14B-Instruct; it will not transfer cleanly to a different model or fine-tune.
- It does not add knowledge or capability. It surfaces an existing internal uncertainty signal and routes it into the agentic decision (answer / call / clarify / refuse).
- Two-pass inference (read → inject) adds latency vs. the bare base; the dominant cost is the number of cross-attention layers.
- Evaluated at N=15 per axis on suite v1. The suite has no gold answers, so lookup/memory grading is lenient: it credits "recognized uncertainty / didn't blind-guess". A stricter v2 with gold answers is planned.
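With only 15 tasks per axis, each score carries substantial sampling uncertainty. The sketch below computes a standard Wilson score interval for a single axis; it is an illustration added here, not part of the original evaluation, and reading 0.47 as 7 correct out of 15 is an inference from the stated N.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Example: the call axis for this wrapper, read as 7 correct out of 15.
low, high = wilson_interval(7, 15)
print(f"7/15: 95% interval ~ [{low:.2f}, {high:.2f}]")
```

For 7 out of 15 the interval spans roughly 0.25 to 0.70, so small differences between arms on a single axis should be read as suggestive rather than conclusive.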
