meta-qwen-14b-universal: a general Doubter wrapper for Qwen2.5-14B-Instruct

⚠️ This is a wrapper, not a standalone model. To load and run it you need the meta-spider framework, available from Codeberg: https://codeberg.org/imperius/meta-spider (pip install -e meta-core). The files here (doubter_checkpoint.pt / doubter_sidecar.gguf) contain only the trained wrapper weights; they attach to a frozen Qwen/Qwen2.5-14B-Instruct via that framework (PyTorch) or its llama.cpp deploy path (GGUF). See Usage and Framework below.

A trained meta-attention "Doubter" wrapper for Qwen/Qwen2.5-14B-Instruct. It is not a full model but a thin wrapper (~2–3% of the base) that reads the frozen base's own activations and injects cognitive tokens through gated cross-attention, so the model learns when to act and when to hold: answer from memory, call a tool, ask to clarify, or decline, calibrated to its own uncertainty. The base weights are never modified.

Unlike a narrow Doubter (trained on one task, e.g. MMLU-refusal or one agentic benchmark), this checkpoint is diverse-trained: one wrapper trained on a balanced mix of commit and hold scenarios across the whole decision space. The result is the first of our wrappers that does not collapse on any single axis: it is a general uncertainty modifier, not one over-fit to caution.

What's in here

| File | What it is |
| --- | --- |
| doubter_checkpoint.pt | the trained wrapper weights (selective encoder + cross-attention + gates), ~365 MB |
| run.json | the training manifest (base model, layers, encoder type, quantization, the diverse mix) |
| doubter_sidecar.gguf | the same wrapper exported for llama.cpp (CPU / Metal / edge), ~535 MB (float32; 232 tensors) |

What "general" means: results on a 6-axis agentic suite

Evaluated on a held-out diverse agentic suite (115 tasks, 6 decision axes; sources: When2Call, PopQA, SQuAD2), strictly disjoint from training. The grading method depends on the nature of the axis: action axes (call / abstain / clarify) are graded by log-prob MCQ; knowledge axes (memory / lookup / unknown) by generation + LLM-judge, because templated options do not work for open-ended knowledge questions. Compare deltas within an axis, not raw accuracy across axis types.
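As an illustration of log-prob MCQ grading on the action axes, here is a minimal sketch. It assumes each candidate option is scored by the sum of its token log-probs and that the highest-scoring option counts as the model's choice; the function names and the numbers are hypothetical and are not taken from the suite or the framework.

```python
from typing import Dict, List


def pick_option(option_logprobs: Dict[str, List[float]]) -> str:
    """Return the option label whose tokens have the highest total log-prob."""
    # Sum the token log-probs per option (assumed scoring rule, no length normalization).
    totals = {label: sum(lps) for label, lps in option_logprobs.items()}
    return max(totals, key=totals.get)


def mcq_accuracy(items: List[dict]) -> float:
    """Fraction of items where the top-scoring option equals the gold label."""
    correct = sum(pick_option(item["logprobs"]) == item["gold"] for item in items)
    return correct / len(items)


# Made-up log-probs for two action-axis items (illustrative only).
items = [
    {"gold": "call",
     "logprobs": {"call": [-0.2, -0.1], "abstain": [-1.5, -0.9], "clarify": [-2.0, -1.1]}},
    {"gold": "abstain",
     "logprobs": {"call": [-0.4, -0.3], "abstain": [-0.9, -0.8], "clarify": [-1.7, -1.2]}},
]

accuracy = mcq_accuracy(items)
print(accuracy)
```

The second item shows the failure mode the abstain axis measures: the model prefers "call" although no tool fits, so the item is graded as wrong.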

| axis | base | narrow (MMLU-refusal) | narrow (When2Call-FT) | this (diverse) |
| --- | --- | --- | --- | --- |
| call (use the right tool) | 0.40 | 0.27 | 0.40 | 0.47 ← best |
| abstain (no tool fits) | 0.27 | 0.33 | 0.93 | 0.67 |
| clarify (ambiguous) | 0.60 | 0.53 | 0.73 | 0.73 |
| memory (answer from knowledge) | 0.93 | 0.87 | 0.73 | 0.87 |
| lookup (need a fact) | 0.87 | 0.93 | 1.00 | 0.93 |
| unknown (unanswerable) | 0.13 | 0.93 | 1.00 | 0.93 |
| worst axis (floor) | 0.13 | 0.27 | 0.40 | 0.47 ← best |
| commit mean (call, memory) | 0.67 | 0.57 | 0.57 | 0.67 ← best (tied with base) |
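The two summary rows can be recomputed from the six per-axis scores. The sketch below does that with the rounded values from the table; the dictionary layout and the function name are illustrative choices, not part of the evaluation code.

```python
# Per-axis scores copied from the table above (rounded to two decimals).
scores = {
    "base": {"call": 0.40, "abstain": 0.27, "clarify": 0.60,
             "memory": 0.93, "lookup": 0.87, "unknown": 0.13},
    "narrow (MMLU-refusal)": {"call": 0.27, "abstain": 0.33, "clarify": 0.53,
                              "memory": 0.87, "lookup": 0.93, "unknown": 0.93},
    "narrow (When2Call-FT)": {"call": 0.40, "abstain": 0.93, "clarify": 0.73,
                              "memory": 0.73, "lookup": 1.00, "unknown": 1.00},
    "this (diverse)": {"call": 0.47, "abstain": 0.67, "clarify": 0.73,
                       "memory": 0.87, "lookup": 0.93, "unknown": 0.93},
}

COMMIT_AXES = ("call", "memory")  # the axes the table averages into "commit mean"


def summarize(axis_scores):
    """Return (worst-axis floor, mean over the commit axes) for one arm."""
    floor = min(axis_scores.values())
    commit_mean = sum(axis_scores[a] for a in COMMIT_AXES) / len(COMMIT_AXES)
    return floor, commit_mean


summary = {arm: summarize(axis_scores) for arm, axis_scores in scores.items()}
for arm, (floor, commit_mean) in summary.items():
    print(f"{arm:24s} floor={floor:.2f} commit_mean={commit_mean:.3f}")
```

The floor is simply the minimum over the six axes, so a wrapper that trades one axis away for another is penalized even when its average looks good. At two decimals, the commit mean of the diverse wrapper equals that of the base.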

How to read this.

  • The base is badly miscalibrated: it answers everything (unknown 0.13: it confidently answers the unanswerable) and under-uses tools.
  • A narrow Doubter transfers abstention, even cross-domain (the When2Call-trained one fixes QA unknown 0.13→1.00), but over-abstains: it sacrifices the commit axes (memory 0.93→0.73, call stuck at base).
  • This diverse Doubter is the only arm that lifts the hold axes (unknown 0.13→0.93) without collapsing any commit axis. It actually raises call above every other arm (0.40→0.47), keeps memory high (0.87), and has the highest worst-axis floor (0.47). That is the point of a general modifier: it works across the whole decision space rather than just becoming cautious.

Trade-off, not domination: the narrow When2Call wrapper has higher peak hold scores. They are different points on the same calibration curve, and the diverse wrapper is the balanced one. Which you want depends on the cost of over- vs. under-abstention in your deployment.
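One way to make that choice concrete is to weight the error rates by deployment cost. The sketch below is a rough illustration under two assumptions that are not from the evaluation: error on the commit axes (call, memory) stands in for over-abstention, and error on the remaining four axes stands in for under-abstention. The cost weights are hypothetical.

```python
COMMIT_AXES = ("call", "memory")
# Assumption: every axis that is not a commit axis is treated as a hold axis.
HOLD_AXES = ("abstain", "clarify", "lookup", "unknown")

# Scores for the two wrappers compared in the trade-off paragraph (from the table).
arms = {
    "narrow (When2Call-FT)": {"call": 0.40, "abstain": 0.93, "clarify": 0.73,
                              "memory": 0.73, "lookup": 1.00, "unknown": 1.00},
    "this (diverse)": {"call": 0.47, "abstain": 0.67, "clarify": 0.73,
                       "memory": 0.87, "lookup": 0.93, "unknown": 0.93},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def expected_cost(axis_scores, cost_commit_error, cost_hold_error):
    """Cost-weighted sum of the mean error on commit axes and on hold axes."""
    commit_error = 1.0 - mean(axis_scores[a] for a in COMMIT_AXES)
    hold_error = 1.0 - mean(axis_scores[a] for a in HOLD_AXES)
    return cost_commit_error * commit_error + cost_hold_error * hold_error


def cheapest_arm(cost_commit_error, cost_hold_error):
    """Name of the wrapper with the lowest expected cost under the given weights."""
    return min(arms, key=lambda name: expected_cost(arms[name], cost_commit_error, cost_hold_error))


print(cheapest_arm(cost_commit_error=3.0, cost_hold_error=1.0))  # failing to act is expensive
print(cheapest_arm(cost_commit_error=1.0, cost_hold_error=3.0))  # acting wrongly is expensive
```

With these illustrative weights the preferred wrapper flips between the two settings, which is the sense in which the two checkpoints sit at different points on one curve.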

Training configuration (from run.json)

  • Base: Qwen/Qwen2.5-14B-Instruct (frozen), nf4 quantized, bfloat16
  • Encoder: selective (16 cognitive tokens, bottleneck 256, scalar tanh gates)
  • Layers (read + inject): [32..47] (the late third, chosen by a linear-probe sweep showing the uncertainty signal is concentrated there)
  • Data: a balanced diverse mix, 490 examples, commit 210 / hold 280, all disjoint from the test suite:
    • call ← When2Call train_pref.chosen_response (<TOOLCALL> → native tool call)
    • memory ← PopQA high-popularity + SQuAD2-train answerable (direct answer)
    • abstain / clarify ← When2Call train_sft
    • lookup ← PopQA long-tail (search tool call)
    • unknown ← SQuAD2-train unanswerable (refuse)
  • 6 epochs, best val-loss 0.478. Started from the MMLU-refusal checkpoint.
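To give a feel for what "scalar tanh gates" on a cross-attention injection can look like, here is a toy pure-Python sketch. It is an assumption about the general technique, not the framework's implementation: it uses a single attention head, omits the learned query/key/value projections and the 256-wide bottleneck, and uses 3 toy tokens of width 4 instead of 16 cognitive tokens. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(hidden: Vector, tokens: List[Vector]) -> Vector:
    """Single-head attention: the hidden state is the query, tokens are keys and values."""
    scale = 1.0 / math.sqrt(len(hidden))
    weights = softmax([dot(hidden, t) * scale for t in tokens])
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(len(hidden))]


def gated_inject(hidden: Vector, tokens: List[Vector], gate_param: float, gain: float = 1.0) -> Vector:
    """Residual update: hidden + gain * tanh(gate_param) * cross_attend(hidden, tokens)."""
    gate = gain * math.tanh(gate_param)  # scalar gate; exactly 0 when gain or gate_param is 0
    update = cross_attend(hidden, tokens)
    return [h + gate * u for h, u in zip(hidden, update)]


hidden = [0.5, -1.0, 0.25, 2.0]
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

unchanged = gated_inject(hidden, tokens, gate_param=0.8, gain=0.0)  # gain 0: base behaviour
closed = gated_inject(hidden, tokens, gate_param=0.0, gain=1.0)     # closed gate: base behaviour
shifted = gated_inject(hidden, tokens, gate_param=0.8, gain=1.0)    # open gate: hidden state moves
print(unchanged == hidden, closed == hidden, shifted != hidden)
```

The design point this illustrates is that the injection is a residual term multiplied by a scalar, so setting that scalar to zero recovers the frozen base exactly.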

Usage

from meta_core import MetaSpiderConfig, MetaSpiderPipeline, Doubter

cfg = MetaSpiderConfig(
    model_name="Qwen/Qwen2.5-14B-Instruct",
    device="cuda", dtype="bfloat16", quantization="nf4",
    target_layers=list(range(32, 48)),
    cross_attn_layers=list(range(32, 48)),
)
pipe = MetaSpiderPipeline.from_pretrained(cfg)
d = Doubter.from_checkpoint("doubter_checkpoint.pt")
pipe.attach(d)
d.set_gain(1.0)   # the uncertainty "volume" knob; 0 = base, ~1.5 = max caution

print(pipe.generate("What is the capital of France?"))          # answers from memory
print(pipe.generate("<an unanswerable / false-premise question>"))  # declines instead of hallucinating

Requires pip install meta-core "transformers>=5.11" accelerate bitsandbytes, and torch ≥ 2.5.

Framework

Produced and consumed by the meta-spider framework (codeberg.org/imperius/meta-spider: meta-core / meta-loom / meta-agent / meta-deploy). The gain knob is applied at inference time (d.set_gain(x) / META_GAIN), and a GGUF sidecar (metadeploy export) runs the same two-pass wrapper on CPU inside llama.cpp.
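If you drive the gain from your own launcher script, a small helper can validate the value before passing it on. This is a hypothetical sketch: how the framework itself parses META_GAIN is not documented here, and the clamp range of 0 to 1.5 is an assumption based on the comment in the Usage snippet.

```python
import os


def gain_from_env(default: float = 1.0, lo: float = 0.0, hi: float = 1.5) -> float:
    """Read META_GAIN; fall back to `default` if unset or invalid, then clamp to [lo, hi]."""
    raw = os.environ.get("META_GAIN")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    if value != value:  # NaN never equals itself; treat it as invalid
        return default
    return min(max(value, lo), hi)


os.environ["META_GAIN"] = "0.5"
print(gain_from_env())
```

The result could then be handed to the wrapper with d.set_gain(gain_from_env()).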

Caveats

  • Model-specific: calibrated to the activation distribution of Qwen/Qwen2.5-14B-Instruct; it will not transfer cleanly to a different model or fine-tune.
  • It does not add knowledge or capability: it surfaces an existing internal uncertainty signal and routes it into the agentic decision (answer / call / clarify / refuse).
  • Two-pass inference (read → inject) adds latency vs. the bare base; the dominant cost is the number of cross-attention layers.
  • Evaluated at N=15/axis on suite v1. The suite has no gold answers, so lookup/memory grading is lenient (it checks "recognized uncertainty / didn't blind-guess"); a stricter v2 with gold answers is planned.
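To put N=15 per axis in perspective, the sketch below computes a 95% Wilson score interval for a single axis score. It assumes that a reported 0.47 corresponds to 7 correct out of 15; the choice of the Wilson interval is ours and is not part of the evaluation protocol.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(7, 15)
print(f"7/15 -> [{low:.2f}, {high:.2f}]")
```

At this sample size the interval spans more than 0.4, so single-axis differences of one or two tasks are best read as directional.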