Brumby 14B Base – GGUF

GGUF quantizations of Manifest AI's Brumby 14B Base, a 14B-parameter language model built on power retention, a novel attention mechanism that replaces softmax attention with a degree-2 symmetric power feature map and a linear recurrence.

What is Power Retention?

Power retention (paper) is a recurrent attention mechanism that:

  • Expands keys and queries via a symmetric power feature map phi(x) (degree-2 polynomial expansion; for head_dim=128 this gives D=9216)
  • Maintains a recurrent state S (updated as S = decay * S + outer(phi(k), v)) together with a normalizer s (updated as s = decay * s + phi(k))
  • Produces the output as (phi(q) @ S) / (phi(q) @ s), i.e. normalized linear attention
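The update and read-out above can be sketched in a few lines of NumPy. This is an illustrative toy, not the production kernel: the names phi and power_retention_step are ours, the feature map here is a plain upper-triangular pairwise product (the real map uses a larger expansion, D=9216 per head), and real implementations chunk and fuse these operations.

```python
import numpy as np

def phi(x):
    # Toy degree-2 symmetric power feature map:
    # all pairwise products x_i * x_j for i <= j.
    i, j = np.triu_indices(len(x))
    return x[i] * x[j]

def power_retention_step(S, s, q, k, v, decay=0.99):
    """One recurrent step: decay and update the state, then read it out."""
    pk = phi(k)
    S = decay * S + np.outer(pk, v)   # state S: (D, head_dim)
    s = decay * s + pk                # normalizer s: (D,)
    pq = phi(q)
    out = (pq @ S) / (pq @ s)         # normalized linear attention
    return S, s, out

# Tiny example: head_dim=4 gives D = 4*5/2 = 10
d = 4
D = d * (d + 1) // 2
S, s = np.zeros((D, d)), np.zeros(D)
rng = np.random.default_rng(0)
for _ in range(3):
    q, k, v = rng.normal(size=(3, d))
    S, s, out = power_retention_step(S, s, q, k, v)
print(out.shape)  # (4,)
```

Note that each step touches only the fixed-size state (S, s), never a growing KV cache, which is where the constant per-token cost comes from.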

This gives the model constant-time per-token inference (like RWKV/Mamba) while maintaining strong language modeling performance.

Quantizations

File                         Quant   Size    Description
Brumby-14B-Base.Q2_K.gguf    Q2_K    5.4 GB  Smallest, lower quality
Brumby-14B-Base.Q3_K_M.gguf  Q3_K_M  6.9 GB  Small, good balance
Brumby-14B-Base.Q4_K_M.gguf  Q4_K_M  8.4 GB  Recommended
Brumby-14B-Base.Q5_K_M.gguf  Q5_K_M  9.8 GB  Good quality
Brumby-14B-Base.Q6_K.gguf    Q6_K    12 GB   Very good quality
Brumby-14B-Base.Q8_0.gguf    Q8_0    15 GB   Near-lossless

How to Run

Important: This model requires a custom llama.cpp build with power retention support.

1. Build llama.cpp with power retention

git clone https://github.com/audreyt/llama.cpp
cd llama.cpp
git checkout power-retention
cmake -B build -DGGML_CUDA=ON  # or without CUDA for CPU-only
cmake --build build -j

2. Run inference

./build/bin/llama-cli \
  --model Brumby-14B-Base.Q4_K_M.gguf \
  -p "Once upon a time" \
  -n 100 \
  --temp 0.7 \
  --repeat-penalty 1.2

Model Details

  • Architecture: Qwen2PR (Qwen2 backbone with power retention replacing softmax attention)
  • Parameters: 14.77B
  • Context: 131,072 tokens
  • Heads: 40 query / 8 KV (GQA, 5 query heads per KV head)
  • Head dim: 128
  • Layers: 40
  • Vocab: 151,936 (Qwen2 tokenizer)
  • Recurrent state: ~1.4 GB per sequence (9216 × 129 × 8 KV heads × 40 layers × 4 bytes)
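The ~1.4 GB state figure follows directly from the dimensions above. A quick sanity check, assuming the 129 columns are 128 value dimensions plus 1 normalizer column, stored in fp32:

```python
D = 9216          # expanded key/query dimension per head
cols = 128 + 1    # value dims plus one normalizer column (assumed split)
kv_heads = 8
layers = 40
bytes_fp32 = 4

state_bytes = D * cols * kv_heads * layers * bytes_fp32
print(round(state_bytes / 2**30, 2))  # 1.42 (GiB)
```

Unlike a softmax KV cache, this footprint is independent of sequence length.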

Credits

  • Original model: Brumby 14B Base by Manifest AI
  • llama.cpp fork with power retention support: github.com/audreyt/llama.cpp (power-retention branch)
