Qalb-1.0-8B-Instruct (Urdu Llama 3.1)

Unsloth Urdu License

Qalb-1.0-8B-Instruct is a state-of-the-art Urdu language model designed to bridge the gap in low-resource language processing. Built on the powerful Llama-3.1-8B architecture, Qalb has been rigorously adapted for the Urdu language through a two-stage process: Continued Pre-training on a massive Urdu corpus of 1.97 billion tokens, followed by Supervised Fine-Tuning for instruction following.

Unlike general multilingual models that struggle with Urdu grammar and cultural nuance, Qalb delivers fluent, culturally accurate, and context-aware responses.

🌟 Key Features

  • State-of-the-Art Performance: Outperforms previous best models (Alif-1.0 and LLaMA-3.1 Base) on 6 out of 7 benchmarks.
  • Deep Urdu Understanding: Pre-trained on a diverse mix of news, literature, government documents, and social media to capture the depth of the language.
  • Ethical & Safe: Fine-tuned to provide helpful, harmless, and honest responses, refusing to generate toxic or misleading content.
  • Reasoning Capable: Excellent performance on logical reasoning, mathematical word problems, and commonsense tasks in Urdu.
  • Bilingual Proficiency: Retains strong English capabilities while excelling in Urdu, making it ideal for translation and code-switching tasks.

📊 Performance Benchmarks

Qalb establishes a new standard for Urdu LLMs, achieving an Overall Score of 90.34. It significantly outperforms the base model and the previous state-of-the-art.

🏆 Comparison vs. SOTA Models

| Task | Qalb (Ours) | Alif-1.0-Instruct | LLaMA-3.1-8B-Instruct |
|---|---|---|---|
| Overall Score | 90.34 | 87.1 | 45.7 |
| Translation | 94.41 | 89.3 | 58.9 |
| Classification | 96.38 | 93.9 | 61.4 |
| Sentiment Analysis | 95.79 | 94.3 | 54.3 |
| Ethics | 90.83 | 85.7 | 27.3 |
| Reasoning | 88.59 | 83.5 | 45.6 |
| QA (Question Answering) | 80.40 | 73.8 | 30.5 |
| Generation | 85.97 | 90.2 | 42.8 |

> Note: Scores are on a 0-100 scale. Qalb outperforms the previous best model (Alif) in 6 out of 7 categories.
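For reference, the reported Overall Score is consistent with an unweighted mean of the seven category scores (a sketch; the paper may use a different aggregation):

```python
# Qalb's category scores from the table above
qalb_scores = {
    "Translation": 94.41,
    "Classification": 96.38,
    "Sentiment Analysis": 95.79,
    "Ethics": 90.83,
    "Reasoning": 88.59,
    "QA": 80.40,
    "Generation": 85.97,
}

# The unweighted mean reproduces the reported Overall Score
overall = round(sum(qalb_scores.values()) / len(qalb_scores), 2)
print(overall)  # 90.34
```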

🚀 How to Use

Google Colab

Open In Colab

Method 1: Using Unsloth (Recommended - Fast & Efficient)

The easiest way to run Qalb is using the Unsloth library, which provides 2x faster inference.

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "enstazao/Qalb-1.0-8B-Instruct",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True, # <--- Currently set to use 4-bit quantization
)
FastLanguageModel.for_inference(model)


urdu_system_prompt = "آپ ایک مددگار اور بے ضرر مصنوعی ذہانت کے اسسٹنٹ ہیں۔ آپ اردو میں سوالات کے درست جوابات دیتے ہیں۔"

questions = [
    "پاکستان کا قومی کھیل کیا ہے؟",                         
    "لاہور شہر کیوں مشہور ہے؟ مختصر وضاحت کریں۔",
    "سوال: لیاقت علی خان کون تھے؟",
    "کراچی کو روشنیوں کا شہر کیوں کہا جاتا ہے؟",             
    "انگریزی میں ترجمہ کریں: 'محنت کامیابی کی کنجی ہے۔'"
]

print("🚀 Starting Batch Generation...\n")


for user_input in questions:
    print(f"🔹 Question: {user_input}")

    # Manually Format Prompt (Llama-3 Style)
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{urdu_system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens = 256,
        temperature = 0.1,
        top_p = 0.9,
        repetition_penalty = 1.1,
        do_sample = True,
        eos_token_id = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
    )

    response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
    
    print(f"✅ Answer: {response}")
    print("-" * 50)

Method 2: Using Hugging Face Transformers

Qalb is also compatible with the standard transformers library if Unsloth is not available.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch


model_name = "enstazao/Qalb-1.0-8B-Instruct"
urdu_system_prompt = "آپ ایک مددگار اور بے ضرر مصنوعی ذہانت کے اسسٹنٹ ہیں۔ آپ اردو میں سوالات کے درست جوابات دیتے ہیں۔"


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
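As a rough back-of-the-envelope estimate (ignoring per-block quantization constants, the KV cache, and activation memory), NF4 stores each weight in about 4 bits, so the 8B weights occupy roughly 4 GB instead of the ~16 GB needed in BF16:

```python
params = 8e9                  # approximate parameter count
bf16_gb = params * 2 / 1e9    # 2 bytes per BF16 weight
nf4_gb = params * 0.5 / 1e9   # ~0.5 bytes per NF4 (4-bit) weight
print(bf16_gb, nf4_gb)  # 16.0 4.0
```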



print("⏳ Loading model in 4-bit...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config, # <--- Apply 4-bit here
    device_map="auto"               # <--- Required for quantization
)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]


questions = [
    "پاکستان کا قومی کھیل کیا ہے؟",                         
    "لاہور شہر کیوں مشہور ہے؟ مختصر وضاحت کریں۔",            
    "سوال: لیاقت علی خان کون تھے؟",                     
    "سوال: اسلام آباد شہر کے بارے میں بتائیں۔",  
    "انگریزی میں ترجمہ کریں: 'محنت کامیابی کی کنجی ہے۔'"
]

print("Model Loaded. Starting Generation...\n")

# Loop through questions
for user_input in questions:
    print(f"🔹 Question: {user_input}")
    
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{urdu_system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens = 256,
        temperature = 0.1,
        top_p = 0.9,
        repetition_penalty = 1.1,
        do_sample = True,
        eos_token_id = terminators
    )

    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
    
    print(f"✅ Answer: {response}")
    print("-" * 50)

Limitations & Bias

While Qalb has been trained to be helpful and harmless, it may still reflect biases present in the training data. Users should fact-check critical information, especially in medical, legal, or religious contexts.

Citation

If you use Qalb in your research, please cite:

@article{qalb2025,
  title={Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training},
  author={Hassan, Muhammad Taimoor and Ahmed, Jawad and Awais, Muhammad},
  journal={arXiv preprint arXiv:2601.08141},
  year={2026},
  eprint={2601.08141},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.08141},
  doi={10.48550/arXiv.2601.08141}
}