# LeanLlama-8B-INT4
LeanLlama-8B-INT4 is a 4-bit quantized variant of LeanLlama-8B that combines NF4 weight quantization with learned KV cache compression. It reduces both model weight memory and inference-time KV cache memory, making it suitable for deployment on consumer GPUs.
## What changed
- Weight quantization (Phase 1): All transformer weights (including embeddings) are quantized to NF4 with double quantization, reducing the model from ~16 GB to ~5.7 GB on disk.
- KV cache compression (Phase 2): Inherited from LeanLlama-8B. Learned projection modules compress the value representations stored in the KV cache at a subset of layers, reducing the memory footprint of long-context inference.
The base Llama 3.1 8B Instruct weights are approximated via NF4 quantization. The KV cache compression modules themselves remain in fp16 to preserve compression fidelity.
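To make the weight-quantization step concrete, below is a minimal NumPy sketch of blockwise NF4 quantization: each block of 64 weights is scaled by its absolute maximum, then each value is mapped to the nearest of 16 fixed code values. The code values shown are the NF4 codebook used by bitsandbytes (quantiles of a standard normal); double quantization of the per-block scales is omitted for brevity, and real storage packs two 4-bit indices per byte rather than one per `uint8`.

```python
import numpy as np

# The 16 NF4 code values (normal-distribution quantiles, as in bitsandbytes).
NF4_LEVELS = np.array([
    -1.0, -0.6961928010, -0.5250730515, -0.3949174881,
    -0.2844413817, -0.1847734302, -0.0910500363, 0.0,
    0.0795802996, 0.1609302014, 0.2461123019, 0.3379152417,
    0.4407098293, 0.5626170039, 0.7229568362, 1.0,
])

def nf4_quantize(weights: np.ndarray, block_size: int = 64):
    """Blockwise absmax NF4 quantization. Returns 4-bit indices + scales."""
    flat = weights.reshape(-1, block_size)
    absmax = np.abs(flat).max(axis=1, keepdims=True)   # one fp scale per block
    normed = flat / absmax                             # now in [-1, 1]
    # Nearest-neighbor lookup into the 16-entry codebook.
    idx = np.abs(normed[..., None] - NF4_LEVELS).argmin(axis=-1)
    return idx.astype(np.uint8), absmax

def nf4_dequantize(idx: np.ndarray, absmax: np.ndarray, shape) -> np.ndarray:
    """Reverse the mapping: codebook lookup, then rescale by the block absmax."""
    return (NF4_LEVELS[idx] * absmax).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 64)).astype(np.float32)
idx, scales = nf4_quantize(w)
w_hat = nf4_dequantize(idx, scales, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Because the codebook is shaped like a normal distribution, NF4 spends its 16 levels where normally initialized weights actually concentrate, which is why it loses less accuracy than uniform INT4 at the same bit width.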
## Quality
Expected quality relative to the uncompressed Llama 3.1 8B Instruct baseline:
| Metric | Delta |
|---|---|
| Perplexity | ~+7% |
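For reference, the delta above is a relative perplexity change. A quick sketch of how such a number is derived from held-out next-token losses (the NLL values below are hypothetical placeholders, not measured results for this model):

```python
import math

# Hypothetical mean next-token negative log-likelihoods (nats/token)
# on a shared held-out set.
nll_baseline = 1.9000   # uncompressed Llama 3.1 8B Instruct
nll_quant = 1.9677      # quantized + KV-compressed variant

# Perplexity is exp(mean NLL); the delta is the relative change.
ppl_baseline = math.exp(nll_baseline)
ppl_quant = math.exp(nll_quant)
delta = (ppl_quant - ppl_baseline) / ppl_baseline
print(f"perplexity delta: {delta:+.1%}")
```

Note that because perplexity is exponential in the loss, a ~7% perplexity increase corresponds to only about a 0.07-nat rise in per-token loss.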
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "miike-ai/LeanLlama-8B-INT4",
    trust_remote_code=True,   # loads the custom KV cache compression modules
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("miike-ai/LeanLlama-8B-INT4")

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
No special configuration is needed. The NF4 dequantization and KV cache compression both run transparently inside the model's forward pass.
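To see why the KV cache compression matters at long context, here is a back-of-envelope sizing of an uncompressed fp16 KV cache using the public Llama 3.1 8B architecture config (32 layers, 8 KV heads under GQA, head dimension 128). The savings from the learned compression depend on which layers are compressed and the projection ratio, neither of which is assumed here:

```python
# fp16 KV cache size for Llama 3.1 8B, no compression.
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2

def kv_cache_bytes(seq_len: int) -> int:
    # 2x: both keys and values are cached at every layer.
    return 2 * layers * kv_heads * head_dim * bytes_fp16 * seq_len

per_token_kib = kv_cache_bytes(1) / 1024            # 128 KiB per token
full_context_gib = kv_cache_bytes(128_000) / 2**30  # ~15.6 GiB at 128K tokens
print(f"{per_token_kib:.0f} KiB/token, {full_context_gib:.1f} GiB at 128K")
```

At 128K tokens the uncompressed cache alone would rival the quantized weights in size, which is why compressing even a subset of layers' value representations meaningfully extends the usable context on a consumer GPU.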
## Base model
- Architecture: Llama 3.1
- Parameters: 8B
- Source: meta-llama/Llama-3.1-8B-Instruct via miike-ai/LeanLlama-8B
- Context window: 128K tokens
- Quantization: NF4 (bitsandbytes) with double quantization
- License: Llama 3.1 Community License