@Banaxi-Tech on Hugging Face: "Today we are releasing BananaMind-KV1-8M-2Bit-Experimental, a KV-cache-aware…"

Today we are releasing BananaMind-KV1-8M-2Bit-Experimental, a KV-cache-aware trained model that stores its generation KV cache in 2-bit precision instead of the usual 16-bit precision.

Result: a 5.33x smaller KV cache vs FP16, with a mean KLD of 0.0916 nats/token against a 16-bit KV cache reference on WikiText-2.

Model: BananaMind/BananaMind-KV1-8M-2Bit-Experimental

The important part: this is not just post-training KV cache quantization.
Instead, we take a BitNet-style approach.

KV1 is trained with a 2-bit-aware K/V path. Rather than training a normal model and quantizing the cache afterwards, the model learns during training to operate under the low-bit KV constraint. This is closer in spirit to the BitNet idea of training for the low-bit regime, applied here to the KV cache.

During generation, each K/V vector is quantized into 4 affine levels and packed into uint8 tensors, with four 2-bit values stored per byte.
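
The quantize-and-pack step described above can be sketched in NumPy. This is a minimal illustration, not the KV1 implementation: the function names `quantize_2bit` and `dequantize_2bit`, the per-vector min/max choice of the four levels, and the bit order inside each byte are all assumptions made for the example.

```python
import numpy as np

def quantize_2bit(x):
    """Quantize a 1-D float vector to 4 affine levels and pack 4 codes per byte.

    Assumption: levels are spread evenly between the vector's min and max.
    """
    x = np.asarray(x, dtype=np.float32)
    assert x.ndim == 1 and x.size % 4 == 0, "sketch expects a length divisible by 4"
    lo = float(x.min())
    hi = float(x.max())
    scale = (hi - lo) / 3.0 if hi > lo else 1.0  # 4 levels means 3 steps
    codes = np.clip(np.rint((x - lo) / scale), 0, 3).astype(np.uint8)
    c = codes.reshape(-1, 4)
    # Assumed bit order: first code in the lowest two bits of the byte.
    packed = c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)
    return packed.astype(np.uint8), scale, lo

def dequantize_2bit(packed, scale, lo):
    """Unpack four 2-bit codes per byte and map them back to floats."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.reshape(-1).astype(np.float32) * scale + lo

x = np.linspace(-1.0, 1.0, 32)
packed, scale, lo = quantize_2bit(x)
x_hat = dequantize_2bit(packed, scale, lo)
print(packed.dtype, packed.size)  # 32 values fit in 8 bytes
```

In a scheme like this, each vector also carries its scale and offset, so the effective cost is somewhat more than 2 bits per value. That kind of metadata overhead may be why the reported shrink is 5.33x rather than the 8x that 16 bits to 2 bits would give on its own; the post does not describe the actual metadata layout.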

WikiText-2 eval vs 16-bit KV cache reference:

Mean KLD: 0.0916 nats/token (0.1322 bits/token)
Average KV cache shrink vs FP16: 5.33x
Evaluated positions: 372,675
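
For readers who want to reproduce this kind of number, here is a hedged sketch of how a mean KLD between a reference run and a low-bit run could be computed from per-position logits, plus the nats-to-bits conversion. The function names are hypothetical, and the direction KL(reference || quantized) is an assumption; the post does not state which direction the eval script uses.

```python
import math
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kld_nats(ref_logits, test_logits):
    """Mean KL(ref || test) in nats over all positions.

    Both inputs have shape (positions, vocab).
    """
    log_p = log_softmax(np.asarray(ref_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(test_logits, dtype=np.float64))
    kld = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kld.mean())

def nats_to_bits(nats):
    """Convert nats to bits by dividing by ln(2)."""
    return nats / math.log(2)

# The two reported figures are the same quantity in different units.
print(round(nats_to_bits(0.0916), 4))
```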

If this approach gets adopted in models like Qwen or Gemma, it may become possible to run 128K or even 256K context on a normal machine!
Try it here: BananaMind/BananaMind-KV1-8M-2Bit-Experimental
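
To put the long-context claim in rough numbers, here is a back-of-the-envelope KV cache size estimate. The model shape below is purely hypothetical (it is not the configuration of Qwen, Gemma, or KV1), and the effective bits per value are simply derived from the reported 5.33x shrink.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to cache K and V for one sequence."""
    values = 2 * layers * kv_heads * head_dim * seq_len  # factor 2: K and V
    return values * bits_per_value / 8

# Hypothetical model shape, chosen only for illustration.
HYPOTHETICAL_SHAPE = dict(layers=32, kv_heads=8, head_dim=128)
seq_len = 128 * 1024  # 128K tokens

fp16_bytes = kv_cache_bytes(seq_len=seq_len, bits_per_value=16, **HYPOTHETICAL_SHAPE)

# Effective bits per value implied by the reported 5.33x shrink (about 3 bits).
effective_bits = 16 / 5.33
low_bit_bytes = kv_cache_bytes(seq_len=seq_len, bits_per_value=effective_bits,
                               **HYPOTHETICAL_SHAPE)

print(f"FP16 cache:    {fp16_bytes / 2**30:.1f} GiB")
print(f"Low-bit cache: {low_bit_bytes / 2**30:.1f} GiB")
```

Under these assumptions the FP16 cache is 16 GiB at 128K tokens, while the low-bit cache is about 3 GiB. Whether quality holds up at that scale is a separate question that the 8M experiment does not answer.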

Code: https://github.com/Banaxi-Tech/kv1
