# Param-1-5B
Param-1-5B is a bilingual (English–Hindi) large language model in the Param-1 family. With 5 billion parameters, it extends Param-1-2.9B with enhanced mathematical reasoning and code understanding and generation.
The model is pretrained from scratch and designed to serve as a strong foundation for downstream tasks such as mathematical problem solving and code-related understanding and generation.
## Key Highlights
- 5B parameter dense Transformer model
- Bilingual: English and Hindi
- Enhanced domains: Math and Code (compared to Param-1-2.9B)
- Updated dataset mixture and percentage ratios
- Designed as a pretrained (PT) base model
## Model Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "bharatgenai/Param-1-5B"

# The repository ships custom model code, so trust_remote_code is enabled
# for both the tokenizer and the model.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

prompt = "Your prompt here"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
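The `generate` call above combines temperature scaling, top-k filtering, and nucleus (top-p) sampling. A minimal, self-contained sketch of how those filters interact (illustrative only, not the transformers internals; the toy logits are hypothetical):

```python
import math

def top_k_top_p_filter(logits, top_k=50, top_p=0.95, temperature=0.6):
    """Return a renormalised distribution over tokens surviving
    temperature + top-k + top-p filtering (same defaults as above)."""
    # Temperature: divide logits before softmax (lower = sharper).
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank tokens by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens.
    keep = set(order[:top_k])
    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    cum, nucleus = 0.0, set()
    for i in order:
        if cum >= top_p:
            break
        nucleus.add(i)
        cum += probs[i]
    keep &= nucleus
    # Renormalise over the surviving tokens.
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

dist = top_k_top_p_filter([2.0, 1.0, 0.1])
# with these defaults, tokens 0 and 1 survive the nucleus filter
```

In practice `model.generate` applies the same filters per decoding step and then draws one token from the surviving distribution.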
## Benchmarks

Scores marked "(few)" were obtained in a few-shot evaluation setting.
| Task | Param-1-5B (PT) |
|---|---|
| ARC Challenge | 55.1 (few) |
| ARC Easy | 75.6 (few) |
| HellaSwag | 76.0 (few) |
| HellaSwag Hi | 44.6 (few) |
| MMLU En | 50.8 (few) |
| MMLU Hi | 39.1 (few) |
| TriviaQA | 48.2 |
| TruthfulQA - Gen | 32.9 |
| TruthfulQA - MC1 | 27.1 |
| TruthfulQA - MC2 | 38.6 |
| PIQA | 79.6 |
| SuperGLUE - BoolQ | 72.7 |
| SuperGLUE - WiC | 50.5 |
| SuperGLUE - WSC | 48.1 |
| SuperGLUE - RTE | 65.0 |
## Model Architecture

- Architecture: Decoder-only Transformer
- Number of parameters: ~5B
- Hidden size: 4096
- Intermediate size: 14336
- Number of layers: 16
- Number of attention heads: 32
- Number of key-value heads: 8
- Maximum position embeddings: 4096
- Positional embeddings: Rotary (RoPE)
- Attention mechanism: Grouped-query attention (GQA)
- Activation function: SiLU
- Precision: bf16-mixed
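As a rough sanity check, the architecture card above implies the following non-embedding parameter count, assuming standard Llama-style GQA blocks with a SwiGLU MLP (gate, up, and down projections); the exact layout of Param-1-5B may differ, and the remainder of the ~5B budget would sit in the (large, bilingual) embedding and output matrices:

```python
# Config values taken from the Model Architecture list above.
hidden = 4096
intermediate = 14336
layers = 16
heads = 32
kv_heads = 8
head_dim = hidden // heads  # 128

def attention_params():
    q = hidden * heads * head_dim          # query projection
    kv = 2 * hidden * kv_heads * head_dim  # shared key/value projections (GQA)
    o = heads * head_dim * hidden          # output projection
    return q + kv + o

def mlp_params():
    # SwiGLU: gate + up (hidden -> intermediate) and down (intermediate -> hidden)
    return 3 * hidden * intermediate

per_layer = attention_params() + mlp_params()
total = layers * per_layer
print(f"~{total / 1e9:.2f}B non-embedding parameters")  # ~3.49B
```

Note how GQA shrinks the key/value projections by a factor of `heads / kv_heads = 4` relative to standard multi-head attention.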
## Training Data
The model is pretrained on a large-scale bilingual corpus with a strong focus on English and Hindi, along with dedicated Math and Code data.
### Dataset Composition (Total ≈ 6.61T tokens)

- English natural language: 3.6T tokens
- Hindi natural language: 2.77T tokens
- Math & Code: 238.4B tokens
  - Math: 40%
  - Code: 60%
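As a quick sanity check, the three slices above sum to the stated ≈6.61T total; the implied mixture ratios can be computed directly from the card's token counts:

```python
# Token counts from the dataset composition above, in billions.
corpus_b = {"English": 3600.0, "Hindi": 2770.0, "Math & Code": 238.4}

total_b = sum(corpus_b.values())  # 6608.4B, i.e. ~6.61T
shares = {name: tokens / total_b for name, tokens in corpus_b.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")  # English ~54.5%, Hindi ~41.9%, Math & Code ~3.6%

# The Math & Code slice splits 40/60 between math and code.
math_b = 0.4 * corpus_b["Math & Code"]  # ~95.4B math tokens
code_b = 0.6 * corpus_b["Math & Code"]  # ~143.0B code tokens
```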
Compared to Param-1-2.9B, Param-1-5B includes:
- A larger overall training corpus
- Explicit inclusion of Math and Code datasets
- Improved performance in mathematical and code understanding/generation
- Continued emphasis on English–Hindi bilingual capability
## Training Details

- Training framework: NVIDIA NeMo
- Training infrastructure: Yotta's Shakti Cloud
- Hardware: 1024× NVIDIA H100 GPUs
- Total training tokens: 6.61 trillion
## Limitations
- This is a pretrained base model and may require fine-tuning for instruction-following or chat use cases.
- The model may reflect biases present in large-scale web and code data.
- Performance on low-resource Hindi domains may vary depending on task complexity.
## License
This model is released under the BharatGen non-commercial license.
Please refer to the LICENSE file for detailed terms and conditions.