# Nandi-Mini-600M-Early-Checkpoint

## Introduction
Nandi-Mini-600M-Early-Checkpoint is an early-stage checkpoint (after 200 billion tokens) from the upcoming Nandi-Mini-600M model family, a compact multilingual language model focused on strong efficiency, deployment flexibility, and Indic language support.
The model is being trained entirely from scratch and is designed to deliver strong performance at low compute and memory budgets. This checkpoint is shared to provide an early look into the model's scaling behavior and training progress.
This release is an early checkpoint and not the final converged model. Performance is expected to improve further with continued training and scaling.
We will share a technical blog post soon. Stay tuned!
## Architectural Highlights
Nandi-Mini-600M introduces several efficiency-focused architectural optimizations designed for compact yet capable language models.
### Shared KV (Shared Key-Value Vectors)
Shared KV is one of the core architectural ideas explored in Nandi-Mini. Instead of computing separate Key and Value projections, both reuse a shared latent representation, while a lightweight Key normalization step is applied specifically for attention computation.
This design reduces KV-cache memory usage by ~50% during inference with only a small increase in compute overhead, since RoPE and Key normalization are applied dynamically during attention computation.
Nandi supports two KV cache modes:

- `"kv_cache_mode": "shared"` uses Shared KV, reducing KV-cache memory by ~50% at slightly higher compute overhead.
- `"kv_cache_mode": "vanilla"` uses standard separate Key-Value caching for maximum inference compatibility and lower compute overhead.
#### KV-Cache Memory Comparison

- Vanilla KV: standard KV-cache memory usage
- Shared KV: ~50% lower KV-cache footprint
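The idea can be sketched in a few lines. This is a minimal single-head illustration under stated assumptions, not the released implementation: the function name `shared_kv_attention` is hypothetical, RoPE is omitted, and the RMSNorm has no learned scale. Both Keys and Values are read from one cached latent tensor, with Keys derived on the fly via normalization.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned scale (illustrative only)
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def shared_kv_attention(q, kv_cache):
    # kv_cache: a single shared latent tensor of shape [seq_len, d].
    # Vanilla attention would cache two tensors (K and V) of this size;
    # here only one is stored, giving the ~50% memory saving.
    k = rms_norm(kv_cache)   # lightweight Key normalization at attention time
    v = kv_cache             # Values reuse the shared latent directly
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

The extra compute is only the per-step normalization (and, in the real model, RoPE) applied to the cached latent, which is small relative to the attention matmuls.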
Shared KV is part of our broader focus on deployable foundation models optimized for:
- On-premise AI systems
- Memory-constrained deployments
- Edge devices
- Long-context inference workloads
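A back-of-the-envelope estimate shows where the ~50% figure comes from. The layer count, KV-head count, and head dimension below are hypothetical placeholders (the card does not publish them); only the 2,048-token context length is from this card.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2, mode="vanilla"):
    # Vanilla caches two tensors per layer (K and V);
    # shared mode caches a single shared latent tensor.
    per_token = layers * kv_heads * head_dim * dtype_bytes
    tensors = 2 if mode == "vanilla" else 1
    return tensors * per_token * seq_len

# Hypothetical dimensions for a ~600M-parameter model (not published in the card)
vanilla = kv_cache_bytes(layers=24, kv_heads=4, head_dim=64, seq_len=2048, mode="vanilla")
shared = kv_cache_bytes(layers=24, kv_heads=4, head_dim=64, seq_len=2048, mode="shared")
print(vanilla / shared)  # → 2.0, i.e. shared mode halves the cache footprint
```

Whatever the real dimensions are, the ratio stays 2:1, since the saving comes from caching one tensor instead of two.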
This remains an active research area within the Nandi model family, and we plan to share deeper technical details in upcoming engineering blogs.
## Model Details
- Type: Causal Language Model
- Training Stage: Early pretraining checkpoint (200 billion tokens)
- Parameters: ~600M
- Architecture: Transformer decoder
- Positional Encoding: RoPE
- Normalization: RMSNorm + QK Norm
- Activation: SwiGLU
- Attention: GQA + Shared KV
- Embeddings: Tied embeddings with factorized design
- Context length: 2,048 tokens (planned to be extended to 32,000 tokens)
- Vocabulary Size: 131,072
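The "tied embeddings with factorized design" entry above can be illustrated with a parameter count. Only the 131,072 vocabulary size is from this card; the hidden size `d` and bottleneck rank `r` below are hypothetical, chosen just to show the shape of the saving.

```python
vocab = 131_072   # vocabulary size from the model card
d = 1024          # hypothetical hidden size (not published)
r = 256           # hypothetical factorization rank (not published)

# Standard tied embedding: one V x d matrix shared between input and output.
full_tied = vocab * d

# Factorized design: a V x r lookup followed by an r x d projection.
factorized = vocab * r + r * d

print(full_tied, factorized)
```

With a large vocabulary, the embedding table dominates a 600M-parameter budget, so factorizing it frees a substantial fraction of parameters for the transformer layers themselves.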
## Benchmark Results

### General Benchmarks
| Model | Trained Tokens (T) | HellaSwag | WinoGrande | OBQA | PIQA | GPQA | ARC-e | ARC-c | MMLU | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| MobiLlama-0.5B-Base | 1.3 | 39.65 | 53.67 | 30.60 | 70.35 | 24.33 | 52.82 | 23.63 | 24.18 | 39.90 |
| Qwen-2-0.5B-Base | 12 | 49.01 | 57.69 | 33.20 | 68.98 | 27.23 | 54.79 | 25.42 | 44.06 | 45.05 |
| Qwen2.5-0.5B-Base | 18 | 52.16 | 56.82 | 35.40 | 70.29 | 24.10 | 64.64 | 29.86 | 47.41 | 47.59 |
| Qwen3-0.6B-Base | 36 | 53.77 | 59.19 | 34.40 | 70.29 | 30.80 | 65.44 | 33.78 | 50.34 | 49.75 |
| Qwen3.5-0.8B-Base | 36 | 54.87 | 60.54 | 35.80 | 70.02 | 31.25 | 70.50 | 38.23 | 52.73 | 51.74 |
| SmolLM-360M-Base | 0.6 | 53.33 | 57.22 | 37.60 | 70.56 | 21.20 | 70.24 | 33.27 | 24.92 | 46.04 |
| SmolLM2-360M-Base | 4 | 56.30 | 59.19 | 37.60 | 71.81 | 25.22 | 67.88 | 36.68 | 25.55 | 47.53 |
| Nandi-Mini-600M-Early-Checkpoint-Base | 0.2 | 44.86 | 54.77 | 34.80 | 68.60 | 26.33 | 64.73 | 29.70 | 29.01 | 44.10 |
### Tokenization Fertility Score Across Languages
| Language | SmolLM3-3B | Qwen3-0.6B-Base | Sarvam-1 | Nandi-Mini-600M |
|---|---|---|---|---|
| English | 1.17 | 1.16 | 1.32 | 1.18 |
| Bengali | 8.66 | 7.51 | 1.55 | 1.44 |
| Gujarati | 10.47 | 9.37 | 1.55 | 1.53 |
| Hindi | 2.71 | 5.14 | 1.25 | 1.32 |
| Kannada | 16.43 | 12.96 | 2.10 | 1.90 |
| Malayalam | 17.77 | 14.56 | 2.49 | 2.05 |
| Marathi | 3.73 | 6.70 | 1.55 | 1.55 |
| Oriya | 19.07 | 15.75 | 2.18 | 2.68 |
| Punjabi | 9.23 | 8.66 | 1.47 | 1.42 |
| Tamil | 13.56 | 10.93 | 2.06 | 2.05 |
| Telugu | 15.40 | 13.38 | 2.09 | 1.77 |
| Assamese | 9.26 | 8.13 | 4.31 | 1.51 |
## Supported Languages
The model is trained on English and a diverse set of Indic languages, including:
Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia
## Usage
```bash
pip install transformers==5.4.0
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "FrontiersMind/Nandi-Mini-600M-Early-Checkpoint"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16
).eval()

# Select the KV cache mode: "shared" (~50% lower KV-cache memory)
# or "vanilla" (standard separate K/V caching)
model.config.kv_cache_mode = "shared"
# model.config.kv_cache_mode = "vanilla"

prompt = """The night was quiet and the streets were empty"""

model_inputs = tokenizer(
    [prompt],
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **model_inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.3,
    top_k=20,
    top_p=0.95,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id,
    use_cache=True,  # enable the KV cache (shared or vanilla, per kv_cache_mode)
)

response = tokenizer.decode(
    outputs[0],
    skip_special_tokens=True
)
print(response)
```