# Model Card: sin-qwen-3B-base
sin-qwen-3B-base is a foundational Small Language Model (SLM) for the Sinhala language, developed by full-parameter continual pre-training (CPT) of Qwen2.5-3B on a 23.6 GB Sinhala corpus.
This model is designed to help bridge the "digital divide" in Sri Lanka by providing a high-performance, linguistically accurate model that is light enough to run on edge devices (standard laptops and smartphones with 4–8 GB of VRAM).
## Key Features
- Native Sinhala Tokenizer: Uses a custom-trained WordPiece tokenizer (32,000-token vocabulary) optimized for the Sinhala script.
- ZWJ Preservation: Unlike standard multilingual models, this model correctly preserves the Zero Width Joiner (ZWJ, U+200D), ensuring 100% accurate rendering of complex Sinhala clusters such as rakāransaya (ක්‍ර), yansaya (ත්‍ය), and ksha (ක්‍ෂ). A quick verification sketch follows this list.
- Edge-Ready: At 3B parameters, it offers fast inference and low memory overhead, making it ideal for local deployment in Sri Lanka (a quantized-loading sketch follows the inference script below).
- World Knowledge: Demonstrates emergent understanding of local entities, including Sri Lankan ministries, politicians, schools, and geographical locations.
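The tokenizer claims above are easy to check. The following is a minimal verification sketch, assuming the tokenizer is published under the same repo ID used in the inference script below; exact round-trip behavior depends on the tokenizer's normalizer and decoder settings.

```python
from transformers import AutoTokenizer

# Repo ID taken from the inference script below
tokenizer = AutoTokenizer.from_pretrained("eshangj/sin-qwen-3B-base")
print(tokenizer.vocab_size)  # expected: 32000

ZWJ = "\u200d"  # Zero Width Joiner
# Conjunct clusters from the feature list: rakāransaya, yansaya, ksha
clusters = ["ක්‍ර", "ත්‍ය", "ක්‍ෂ"]

for text in clusters:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    decoded = tokenizer.decode(ids)
    # A ZWJ-preserving tokenizer must reproduce the input exactly, ZWJ included
    status = "preserved" if decoded == text and ZWJ in decoded else "lost"
    print(f"{text}: {status}")
```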
## Technical Specifications
- Base Model: Qwen2.5-3B
- Training Method: Full-parameter Continual Pre-training (CPT)
- Dataset: 23.6 GB Sinhala corpus (web-crawled news, literature, and government documents)
- Training Steps: 10,000
- Hardware: NVIDIA A6000 (48GB VRAM)
- Optimizer: AdamW with cosine learning-rate decay
- Learning Rate: 3e-5, with 500 warmup steps (see the configuration sketch below)
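For reproducibility, these hyperparameters map directly onto a Hugging Face `TrainingArguments` object. The sketch below is illustrative, not the actual training script; the batch size and gradient accumulation values are assumptions, since the card does not state them.

```python
from transformers import TrainingArguments

# Illustrative CPT configuration mirroring the listed hyperparameters.
# per_device_train_batch_size / gradient_accumulation_steps are assumed
# values sized for a single 48 GB A6000; the card does not state them.
training_args = TrainingArguments(
    output_dir="sin-qwen-3B-base",
    max_steps=10_000,                # Training Steps: 10,000
    learning_rate=3e-5,              # Learning Rate: 3e-5
    warmup_steps=500,                # 500 warmup steps
    lr_scheduler_type="cosine",      # cosine learning-rate decay
    optim="adamw_torch",             # AdamW optimizer
    bf16=True,                       # bfloat16, matching the inference script
    per_device_train_batch_size=2,   # assumption
    gradient_accumulation_steps=16,  # assumption
    logging_steps=50,
    save_steps=1000,
)
```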
## Usage & Inference
This is a base model and is designed for text completion. For the best results, provide the start of a sentence or a news headline.
### Inference Script (Preventing Prompt-Response Merging)
The following script appends a trailing space to the prompt and handles special tokens explicitly to prevent the "space problem", in which the prompt and the first generated token merge (e.g., "ලංකාව" + "දී" becoming "ලංකාවදී").
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# --- 1. Setup ---
model_path = "eshangj/sin-qwen-3B-base"  # Update with your HF repo ID
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Explicitly map special tokens (the custom WordPiece tokenizer uses
# BERT-style [PAD]/[UNK]/[CLS]/[SEP] tokens)
tokenizer.pad_token = "[PAD]"
tokenizer.unk_token = "[UNK]"
tokenizer.bos_token = "[CLS]"
tokenizer.eos_token = "[SEP]"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)

def generate_sinhala(prompt, max_new_tokens=150):
    # Fix: ensure a trailing space so the prompt and the first generated
    # token do not merge
    if not prompt.endswith(" "):
        prompt += " "
    # Format with the BOS token
    formatted_prompt = f"{tokenizer.bos_token} {prompt}"
    inputs = tokenizer(
        formatted_prompt,
        return_tensors="pt",
        add_special_tokens=False,  # BOS was added manually above
        return_token_type_ids=False
    ).to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.2,
            no_repeat_ngram_size=3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# --- 2. Example ---
prompt = "විද්‍යාව"
print(f"Prompt: {prompt}")
print(f"Generated: {generate_sinhala(prompt)}")
```
## Examples
Prompt: විද්‍යාව
Response: විද්‍යාව හා සම්බන්ධ කිරීම සහ සමාජ ජාලා තුළ පවතින ආර්ථික දුෂ්කරතා සම්බන්ධයෙන් අවධානය යොමු කළ යුතු බව ජනාධිපති මෛත්‍රීපාල සිරිසේන මහතා පවසයි. අද (09) පෙරවරුවේ ජනාධිපති කාර්යාලයේ පැවති සාකච්ඡාවකදී ජනාධිපතිවරයා මේ අදහස් පළ කළේය. මෙම තත්ත්වය මත එම විද්‍යාවට අදාළ විෂය පථයන්, සංස්කෘතික කටයුතු දෙපාර්තමේන්තුව හෝ වෙනත් විෂයයක් මඟින් සිදුකරන බවයි ඔහු කියා සිටියේ. එමෙන්ම, දේශපාලන අධිකාරියක් ලෙස මෙයද සිදුවන බවද ජනාධිපතිවරයා පෙන්වාදුන්නේය.
Prompt: පාසල් අධ්‍යාපනය
Response: පාසල් අධ්‍යාපනය සඳහා සහභාගි වන පළමු අවස්ථාව මෙය වන බව අධ්‍යාපන අමාත්‍ය අකිල විරාජ් කාරියවසම් මහතා පවසනවා. ඒ අනුව අද පස්වරුවේ කොළඹ පැවති මාධ්‍ය හමුවකට එක්වෙමින් අමාත්‍යවරයා මේ අදහස් පළ කළා.
## Limitations & Biases
- Base Model Nature: This model is not a chatbot; it performs best at document completion. For Q&A tasks, further Instruction Fine-Tuning (IFT) is required (a minimal sketch follows this list).
- Data Bias: Due to the heavy concentration of news articles in the training data, the model tends to favor a formal, news-reporting tone.
- Hallucinations: As with all LLMs, the model may generate factually incorrect information while maintaining perfect grammatical structure.
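As a rough starting point for such fine-tuning, the sketch below uses TRL's `SFTTrainer`. It is purely illustrative: the toy dataset, the question/answer prompt template, and all hyperparameters are assumptions, and a real IFT run would need a sizable Sinhala instruction dataset.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical toy dataset; the "ප්‍රශ්නය:/පිළිතුර:" (question/answer)
# template is an assumption, not a format the model was trained on.
train_data = Dataset.from_list([
    {"text": "ප්‍රශ්නය: ශ්‍රී ලංකාවේ අගනුවර කුමක්ද? පිළිතුර: ශ්‍රී ජයවර්ධනපුර කෝට්ටේ."},
])

trainer = SFTTrainer(
    model="eshangj/sin-qwen-3B-base",
    train_dataset=train_data,
    args=SFTConfig(
        output_dir="sin-qwen-3B-ift",
        dataset_text_field="text",
        max_steps=100,  # assumption: tune for the real dataset
    ),
)
trainer.train()
```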
## Citation
If you use this model in your research, please cite it:
```bibtex
@misc{SinQwen2025,
  title={SinQwen-3B: A Native Small Language Model for Edge Deployment in Sri Lanka},
  author={Eshan G.J.},
  year={2025},
  publisher={Hugging Face}
}
```