Param-1-5B

Param-1-5B is a bilingual (English–Hindi) large language model developed under the Param-1 family. With 5 billion parameters, this model extends the capabilities of Param-1-2.9B by incorporating enhanced mathematical reasoning and code understanding/generation.

The model is pretrained from scratch and designed to serve as a strong foundation for downstream tasks such as mathematical problem solving and code understanding and generation.


Key Highlights

  • 5B parameter dense Transformer model
  • Bilingual: English and Hindi
  • Enhanced domains: Math and Code (compared to Param-1-2.9B)
  • Updated dataset mixture and percentage ratios
  • Designed as a pretrained (PT) base model

Model Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "bharatgenai/Param-1-5B"

# trust_remote_code is enabled for the tokenizer as well, since the checkpoint ships custom code
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

prompt = "Your prompt here"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False  # KV cache disabled here; set True for faster decoding if the custom code supports it
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

Benchmarks

Task                 Param-1-5B (PT)
ARC Challenge        55.1 (few-shot)
ARC Easy             75.6 (few-shot)
HellaSwag            76.0 (few-shot)
HellaSwag Hi         44.6 (few-shot)
MMLU En              50.8 (few-shot)
MMLU Hi              39.1 (few-shot)
TriviaQA             48.2
TruthfulQA - Gen     32.9
TruthfulQA - MC1     27.1
TruthfulQA - MC2     38.6
PIQA                 79.6
SuperGLUE - BoolQ    72.7
SuperGLUE - WiC      50.5
SuperGLUE - WSC      48.1
SuperGLUE - RTE      65.0

Model Architecture

  • Architecture: Transformer (Decoder-only)
  • Number of parameters: ~5B
  • Hidden size: 4096
  • Intermediate size: 14336
  • Number of layers: 16
  • Number of attention heads: 32
  • Number of key-value heads: 8
  • Maximum position embedding: 4096
  • Positional embeddings: Rotary (RoPE)
  • Attention mechanism: Grouped-query attention
  • Activation function: SiLU
  • Precision: bf16-mixed
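The grouped-query attention dimensions implied by the numbers above can be derived with a little arithmetic. This is a back-of-the-envelope sketch based only on the listed hyperparameters, not code from the model repository:

```python
# Derive GQA dimensions from the architecture numbers listed above.
hidden_size = 4096
num_attention_heads = 32
num_key_value_heads = 8

head_dim = hidden_size // num_attention_heads            # size of each attention head
heads_per_kv_group = num_attention_heads // num_key_value_heads  # query heads sharing one KV head
kv_proj_dim = num_key_value_heads * head_dim             # width of the K and V projections

print(head_dim, heads_per_kv_group, kv_proj_dim)         # 128 4 1024

# With 8 KV heads instead of 32, the KV cache shrinks by the group factor.
kv_cache_reduction = num_attention_heads / num_key_value_heads   # 4.0
```

So each group of 4 query heads attends through a shared key-value head, cutting KV-cache memory at inference time by a factor of 4 relative to standard multi-head attention.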

Training Data

The model is pretrained on a large-scale bilingual corpus with a strong focus on English and Hindi, along with dedicated Math and Code data.

Dataset Composition (Total ≈ 6.61T tokens)

  • English Natural Language: 3.6T
  • Hindi Natural Language: 2.77T
  • Math & Code: 238.4B
    • Math: 40%
    • Code: 60%
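The composition above can be cross-checked with simple arithmetic; the token counts are taken from the list, and the per-domain Math/Code figures are derived from the stated percentages:

```python
# Token counts from the dataset composition list above.
english = 3.60e12       # English natural language
hindi = 2.77e12         # Hindi natural language
math_code = 238.4e9     # combined Math & Code corpus

total = english + hindi + math_code
print(f"Total: {total / 1e12:.2f}T tokens")          # ≈ 6.61T

# Per-domain split of the Math & Code corpus (40% / 60%).
math_tokens = 0.40 * math_code                       # ≈ 95.4B
code_tokens = 0.60 * math_code                       # ≈ 143.0B

print(f"English share: {english / total:.1%}")       # ≈ 54.5%
print(f"Hindi share:   {hindi / total:.1%}")         # ≈ 41.9%
```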

Compared to Param-1-2.9B, Param-1-5B includes:

  • A larger overall training corpus
  • Explicit inclusion of Math and Code datasets
  • Improved performance in mathematical and code understanding/generation
  • Continued emphasis on English–Hindi bilingual capability

Training Details

  • Training framework: NVIDIA NeMo
  • Training infrastructure: Yotta's Shakti Cloud
  • Hardware: 1024 NVIDIA H100 GPUs
  • Total training tokens: 6.61 Trillion
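From the parameter and token counts above, a rough training-compute estimate follows from the common FLOPs ≈ 6·N·D rule of thumb. The 6·N·D approximation is a standard heuristic, not a figure reported on this card:

```python
# Rough training-compute estimate via the FLOPs ≈ 6 * N * D heuristic.
params = 5e9        # ~5B parameters (N)
tokens = 6.61e12    # 6.61T training tokens (D)

flops = 6 * params * tokens
print(f"Estimated training compute: {flops:.2e} FLOPs")  # ≈ 1.98e23

tokens_per_param = tokens / params
print(f"Tokens per parameter: {tokens_per_param:.0f}")   # ≈ 1322
```

At roughly 1,300 tokens per parameter, the model is trained far beyond the ~20 tokens-per-parameter compute-optimal ratio, which is typical for models intended to be small and cheap to serve.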

Limitations

  • This is a pretrained base model and may require fine-tuning for instruction-following or chat use cases.
  • The model may reflect biases present in large-scale web and code data.
  • Performance on low-resource Hindi domains may vary depending on task complexity.

License

This model is released under the BharatGen non-commercial license.

Please refer to the LICENSE file for detailed terms and conditions.

