meta-qwen-14b-universal: a general Doubter wrapper for Qwen2.5-14B-Instruct
⚠️ This is a wrapper, not a standalone model. To load and run it you need the meta-spider framework, available from Codeberg: https://codeberg.org/imperius/meta-spider (`pip install -e meta-core`). The files here (`doubter_checkpoint.pt` / `doubter_sidecar.gguf`) are the trained wrapper weights only; they attach to a frozen `Qwen/Qwen2.5-14B-Instruct` via that framework (PyTorch) or its llama.cpp deploy path (GGUF). See Usage and Framework below.
A trained meta-attention "Doubter" wrapper for Qwen/Qwen2.5-14B-Instruct. It is not a full
model: it is a thin wrapper (~2–3% of the base) that reads the frozen base's own activations and
injects cognitive tokens through gated cross-attention, so the model learns when to act vs. when
to hold: answer from memory, call a tool, ask to clarify, or decline, calibrated to its own
uncertainty. The base weights are never modified.
Unlike a narrow Doubter (trained on one task, e.g. MMLU-refusal or one agentic benchmark), this checkpoint is diverse-trained: one wrapper trained on a balanced mix of commit and hold scenarios across the whole decision space. The result is the first of our wrappers that does not collapse on any single axis: it is a general uncertainty modifier, not one over-fit to caution.
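The gated injection described above can be illustrated with a toy, dependency-free sketch. This is not the meta-spider implementation: it omits the learned query/key/value projections and the encoder that produces the cognitive tokens, and it assumes the inference-time gain simply multiplies a scalar tanh gate. The names `gated_cross_attention`, `gate_param`, and `gain` are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(hidden, cog_tokens, gate_param, gain=1.0):
    """Add a gated attention readout of cog_tokens to one hidden-state vector.

    hidden:      list of floats (one position's hidden state)
    cog_tokens:  list of vectors with the same dimension as hidden
    gate_param:  scalar passed through tanh (0.0 closes the gate)
    gain:        assumed multiplicative knob (0.0 reproduces the base)
    """
    scale = 1.0 / math.sqrt(len(hidden))
    weights = softmax([dot(hidden, t) * scale for t in cog_tokens])
    context = [
        sum(w * t[i] for w, t in zip(weights, cog_tokens))
        for i in range(len(hidden))
    ]
    g = gain * math.tanh(gate_param)
    # Residual injection: the hidden state is only shifted, never replaced.
    return [h + g * c for h, c in zip(hidden, context)]


hidden = [0.5, -1.0, 0.25, 2.0]
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
unchanged = gated_cross_attention(hidden, tokens, gate_param=0.8, gain=0.0)
shifted = gated_cross_attention(hidden, tokens, gate_param=0.8, gain=1.0)
```

With the gate closed (`gain=0.0` or `gate_param=0.0`) the output equals the input, which is the property that lets a wrapper of this kind leave the base model's behaviour intact when switched off.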
What's in here
| File | What it is |
|---|---|
| `doubter_checkpoint.pt` | the trained wrapper weights (selective encoder + cross-attention + gates), ~365 MB |
| `run.json` | the training manifest (base model, layers, encoder type, quantization, the diverse mix) |
| `doubter_sidecar.gguf` | the same wrapper exported for llama.cpp (CPU / Metal / edge), ~535 MB (float32; 232 tensors) |
What "general" means: results on a 6-axis agentic suite
Evaluated on a held-out diverse agentic suite (115 tasks, 6 decision axes; sources When2Call + PopQA + SQuAD2), strictly disjoint from training. Grading depends on the axis type: action axes (call / abstain / clarify) are scored by log-prob MCQ; knowledge axes (memory / lookup / unknown) are scored by generation + LLM-judge, because templated options are broken for open knowledge. Compare deltas within an axis, not raw accuracy across axis types.
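As a rough illustration of log-prob MCQ grading for the action axes, the sketch below picks the option with the highest log-probability and counts agreement with the gold label. The log-prob values are made up for illustration, and how the suite actually scores options (summed vs. length-normalized log-probs, prompt template) is not specified here, so treat `pick_option` and `mcq_accuracy` as hypothetical helpers.

```python
def pick_option(option_logprobs):
    """Return the label whose option text received the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)


def mcq_accuracy(items):
    """items: list of (option_logprobs, gold_label) pairs."""
    hits = sum(1 for logprobs, gold in items if pick_option(logprobs) == gold)
    return hits / len(items)


# Hypothetical per-option log-probs for three tasks (not real model outputs).
items = [
    ({"call": -1.2, "abstain": -3.4, "clarify": -2.8}, "call"),
    ({"call": -0.9, "abstain": -1.5, "clarify": -4.0}, "abstain"),
    ({"call": -5.1, "abstain": -2.2, "clarify": -0.7}, "clarify"),
]
accuracy = mcq_accuracy(items)
```

In this toy set the second task is graded wrong because the model prefers `call` when no tool fits, which is the kind of error the abstain axis measures.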
| axis | base | narrow (MMLU-refusal) | narrow (When2Call-FT) | this (diverse) |
|---|---|---|---|---|
| call (use the right tool) | 0.40 | 0.27 | 0.40 | 0.47 (best) |
| abstain (no tool fits) | 0.27 | 0.33 | 0.93 | 0.67 |
| clarify (ambiguous) | 0.60 | 0.53 | 0.73 | 0.73 |
| memory (answer from knowledge) | 0.93 | 0.87 | 0.73 | 0.87 |
| lookup (need a fact) | 0.87 | 0.93 | 1.00 | 0.93 |
| unknown (unanswerable) | 0.13 | 0.93 | 1.00 | 0.93 |
| worst axis (floor) | 0.13 | 0.27 | 0.40 | 0.47 (best) |
| commit mean (call, memory) | 0.67 | 0.57 | 0.57 | 0.67 (best, tied with base) |
How to read this.
- The base is badly miscalibrated: it answers everything (unknown 0.13: it confidently answers the unanswerable) and under-uses tools.
- A narrow Doubter transfers abstention, even cross-domain (the When2Call-trained one fixes QA unknown 0.13→1.00), but over-abstains: it sacrifices the commit axes (memory 0.93→0.73, call stuck at base).
- This diverse Doubter is the only arm that lifts the hold axes (unknown 0.13→0.93) without collapsing any commit axis: it raises call above every other arm (0.40→0.47), keeps memory high (0.87), and has the highest worst-axis floor (0.47). That is the point of a general modifier: it works across the whole decision space rather than just becoming cautious.
Trade-off, not domination: the narrow When2Call wrapper has higher peak hold scores. They are different points on the same calibration curve; the diverse wrapper is the balanced one. Which you want depends on the cost of over- vs. under-abstention in your deployment.
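The two summary rows of the table (worst-axis floor and commit mean) can be recomputed from the per-axis scores. The sketch below copies the rounded values from the table; the arm keys (`base`, `narrow_mmlu`, `narrow_w2c`, `diverse`) are shorthand introduced here.

```python
# Per-axis scores copied from the results table (rounded to two decimals).
scores = {
    "base":        {"call": 0.40, "abstain": 0.27, "clarify": 0.60,
                    "memory": 0.93, "lookup": 0.87, "unknown": 0.13},
    "narrow_mmlu": {"call": 0.27, "abstain": 0.33, "clarify": 0.53,
                    "memory": 0.87, "lookup": 0.93, "unknown": 0.93},
    "narrow_w2c":  {"call": 0.40, "abstain": 0.93, "clarify": 0.73,
                    "memory": 0.73, "lookup": 1.00, "unknown": 1.00},
    "diverse":     {"call": 0.47, "abstain": 0.67, "clarify": 0.73,
                    "memory": 0.87, "lookup": 0.93, "unknown": 0.93},
}


def worst_axis(arm):
    """Lowest score across all six axes (the 'floor')."""
    return min(arm.values())


def commit_mean(arm):
    """Mean of the two commit axes, as defined in the table."""
    return (arm["call"] + arm["memory"]) / 2


floors = {name: worst_axis(arm) for name, arm in scores.items()}
best_floor_arm = max(floors, key=floors.get)
for name, arm in scores.items():
    print(f"{name:12s} floor={floors[name]:.2f} commit_mean={commit_mean(arm):.3f}")
```

Because the inputs are already rounded, the recomputed commit means can differ from the table in the last digit; the ranking of the floors is unaffected.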
Training configuration (from run.json)
- Base: `Qwen/Qwen2.5-14B-Instruct` (frozen), nf4 quantized, bfloat16
- Encoder: `selective` (16 cognitive tokens, bottleneck 256, scalar tanh gates)
- Layers (read + inject): `[32..47]` (the late third, chosen by a linear-probe sweep showing the uncertainty signal is concentrated there)
- Data: a balanced diverse mix, 490 examples, commit 210 / hold 280, all disjoint from the test suite:
  - call: When2Call `train_pref.chosen_response` (`<TOOLCALL>` → native tool call)
  - memory: PopQA high-popularity + SQuAD2-train answerable (direct answer)
  - abstain / clarify: When2Call `train_sft`
  - lookup: PopQA long-tail (search tool call)
  - unknown: SQuAD2-train unanswerable (refuse)
- 6 epochs, best val-loss 0.478. Started from the MMLU-refusal checkpoint.
Usage
```python
from meta_core import MetaSpiderConfig, MetaSpiderPipeline, Doubter

cfg = MetaSpiderConfig(
    model_name="Qwen/Qwen2.5-14B-Instruct",
    device="cuda", dtype="bfloat16", quantization="nf4",
    target_layers=list(range(32, 48)),
    cross_attn_layers=list(range(32, 48)),
)
pipe = MetaSpiderPipeline.from_pretrained(cfg)
d = Doubter.from_checkpoint("doubter_checkpoint.pt")
pipe.attach(d)
d.set_gain(1.0)  # the uncertainty "volume" knob; 0 = base, ~1.5 = max caution

print(pipe.generate("What is the capital of France?"))  # answers from memory
print(pipe.generate("<an unanswerable / false-premise question>"))  # declines instead of hallucinating
```
Needs `pip install meta-core "transformers>=5.11" accelerate bitsandbytes` and torch ≥ 2.5. (The version specifier is quoted so that the shell does not treat `>` as a redirect.)
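To see what the gain knob does on a given prompt, one option is to sweep it and compare outputs. The helper below is a sketch, not part of meta-spider: `sweep_gain` is a hypothetical name, and it relies only on the two calls shown in the usage snippet, `set_gain(x)` on the Doubter and `generate(prompt)` on the pipeline.

```python
def sweep_gain(pipe, doubter, prompt, gains=(0.0, 0.5, 1.0, 1.5), restore_to=1.0):
    """Generate one answer per gain value and return {gain: output}.

    pipe and doubter are assumed to be the attached objects from the usage
    snippet; the gain is reset to restore_to afterwards, even on error.
    """
    outputs = {}
    try:
        for gain in gains:
            doubter.set_gain(gain)
            outputs[gain] = pipe.generate(prompt)
    finally:
        doubter.set_gain(restore_to)
    return outputs


# Usage, assuming `pipe` and `d` were created as in the snippet above:
#   for gain, text in sweep_gain(pipe, d, "Who won the 2031 World Cup?").items():
#       print(gain, "->", text)
```

Comparing the gain 0 output (base behaviour) with the higher-gain outputs on the same prompt is a quick way to pick a setting that matches the cost of over- vs. under-abstention in a deployment.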
Framework
Produced and consumed by the meta-spider framework
(codeberg.org/imperius/meta-spider: meta-core / meta-loom /
meta-agent / meta-deploy). The gain knob runs at inference time (`d.set_gain(x)` / `META_GAIN`), and a
GGUF sidecar (`metadeploy export`) runs the same two-pass wrapper on CPU inside llama.cpp.
Caveats
- Model-specific: calibrated to the activation distribution of `Qwen/Qwen2.5-14B-Instruct`; it will not transfer cleanly to a different model or fine-tune.
- It does not add knowledge or capability: it surfaces an existing internal uncertainty signal and routes it into the agentic decision (answer / call / clarify / refuse).
- Two-pass inference (read → inject) adds latency vs. the bare base; the dominant cost is the number of cross-attention layers.
- Evaluated at N=15/axis on suite v1 (no gold answers: lookup/memory grading is lenient = "recognized uncertainty / didn't blind-guess"); a stricter v2 with gold answers is planned.
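With N=15 per axis, a single task moves an axis score by about 0.067, so small gaps deserve caution. The sketch below computes a standard 95% Wilson score interval; the counts (6/15 and 7/15) are inferred here from the rounded call scores 0.40 and 0.47 and are an assumption, not reported numbers.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts behind the call-axis scores: base 6/15, diverse 7/15.
base_lo, base_hi = wilson_interval(6, 15)
div_lo, div_hi = wilson_interval(7, 15)
print(f"base    call: [{base_lo:.2f}, {base_hi:.2f}]")
print(f"diverse call: [{div_lo:.2f}, {div_hi:.2f}]")
```

The two intervals overlap heavily, so a one-task difference on a single axis is weak evidence by itself; larger shifts such as unknown 0.13 to 0.93 are far less sensitive to this.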