# Model Card: sin-qwen-3B-base
sin-qwen-3B-base is a foundational Small Language Model (SLM) for the Sinhala language, developed by full-parameter continual pre-training (CPT) of Qwen2.5-3B on a 23.6 GB Sinhala corpus.
This model is designed to help bridge the "digital divide" in Sri Lanka by providing a high-performance, linguistically accurate model that is light enough to run on edge devices (standard laptops and smartphones with 4–8 GB of VRAM).
## Key Features
- Native Sinhala Tokenizer: Uses a custom-trained WordPiece tokenizer (32,000-token vocabulary) optimized for the Sinhala script.
- ZWJ Preservation: Unlike standard multilingual models, this model correctly preserves the Zero Width Joiner (ZWJ, U+200D), ensuring 100% accurate rendering of complex Sinhala clusters such as rakāransaya (ක්‍ර), yansaya (ත්‍ය), and ksha (ක්‍ෂ). A quick verification sketch follows this list.
- Edge-Ready: At 3B parameters, it offers fast inference and low memory overhead, making it ideal for local deployment in Sri Lanka (a quantized-loading sketch follows the inference script below).
- World Knowledge: Demonstrates emergent understanding of local entities, including Sri Lankan ministries, politicians, schools, and geographical locations.
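The tokenizer claims above are easy to check. The following is a minimal verification sketch, assuming the tokenizer is published under the same repo ID used in the inference script below; exact round-trip behavior depends on the tokenizer's normalizer and decoder settings.

```python
from transformers import AutoTokenizer

# Repo ID taken from the inference script below
tokenizer = AutoTokenizer.from_pretrained("eshangj/sin-qwen-3B-base")
print(tokenizer.vocab_size)  # expected: 32000

ZWJ = "\u200d"  # Zero Width Joiner
# Conjunct clusters from the feature list: rakāransaya, yansaya, ksha
clusters = ["ක්‍ර", "ත්‍ය", "ක්‍ෂ"]

for text in clusters:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    decoded = tokenizer.decode(ids)
    # A ZWJ-preserving tokenizer must reproduce the input exactly, ZWJ included
    status = "preserved" if decoded == text and ZWJ in decoded else "lost"
    print(f"{text}: {status}")
```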
## Technical Specifications
- Base Model: Qwen2.5-3B
- Training Method: Full-parameter Continual Pre-training (CPT)
- Dataset: 23.6 GB Sinhala corpus (web-crawled news, literature, and government documents)
- Training Steps: 10,000
- Hardware: NVIDIA A6000 (48GB VRAM)
- Optimizer: AdamW with cosine learning-rate decay
- Learning Rate: 3e-5, with 500 warmup steps (see the configuration sketch below)
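For reproducibility, these hyperparameters map directly onto a Hugging Face `TrainingArguments` object. The sketch below is illustrative, not the actual training script; the batch size and gradient accumulation values are assumptions, since the card does not state them.

```python
from transformers import TrainingArguments

# Illustrative CPT configuration mirroring the listed hyperparameters.
# per_device_train_batch_size / gradient_accumulation_steps are assumed
# values sized for a single 48 GB A6000; the card does not state them.
training_args = TrainingArguments(
    output_dir="sin-qwen-3B-base",
    max_steps=10_000,                # Training Steps: 10,000
    learning_rate=3e-5,              # Learning Rate: 3e-5
    warmup_steps=500,                # 500 warmup steps
    lr_scheduler_type="cosine",      # cosine learning-rate decay
    optim="adamw_torch",             # AdamW optimizer
    bf16=True,                       # bfloat16, matching the inference script
    per_device_train_batch_size=2,   # assumption
    gradient_accumulation_steps=16,  # assumption
    logging_steps=50,
    save_steps=1000,
)
```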
## Usage & Inference
This is a base model and is designed for text completion. For the best results, provide the start of a sentence or a news headline.
### Inference Script (Preventing Prompt-Response Merging)
The following script appends a trailing space to the prompt and handles special tokens explicitly to prevent the "space problem", in which the prompt and the first generated token merge (e.g., "ලංකාව" + "දී" becoming "ලංකාවදී").
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# --- 1. Setup ---
model_path = "eshangj/sin-qwen-3B-base"  # Update with your HF repo ID
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Explicitly map special tokens (the custom WordPiece tokenizer uses
# BERT-style [PAD]/[UNK]/[CLS]/[SEP] tokens)
tokenizer.pad_token = "[PAD]"
tokenizer.unk_token = "[UNK]"
tokenizer.bos_token = "[CLS]"
tokenizer.eos_token = "[SEP]"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)

def generate_sinhala(prompt, max_new_tokens=150):
    # Fix: ensure a trailing space so the prompt and the first generated
    # token do not merge
    if not prompt.endswith(" "):
        prompt += " "
    # Format with the BOS token
    formatted_prompt = f"{tokenizer.bos_token} {prompt}"
    inputs = tokenizer(
        formatted_prompt,
        return_tensors="pt",
        add_special_tokens=False,  # BOS was added manually above
        return_token_type_ids=False
    ).to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.2,
            no_repeat_ngram_size=3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# --- 2. Example ---
prompt = "විද්‍යාව"
print(f"Prompt: {prompt}")
print(f"Generated: {generate_sinhala(prompt)}")
```
## Examples
Prompt: විද්‍යාව
Response: විද්‍යාව හා සම්බන්ධ කිරීම සහ සමාජ ජාලා තුළ පවතින ආර්ථික දුෂ්කරතා සම්බන්ධයෙන් අවධානය යොමු කළ යුතු බව ජනාධිපති මෛත්‍රීපාල සිරිසේන මහතා පවසයි. අද (09) පෙරවරුවේ ජනාධිපති කාර්යාලයේ පැවති සාකච්ඡාවකදී ජනාධිපතිවරයා මේ අදහස් පළ කළේය. මෙම තත්ත්වය මත එම විද්‍යාවට අදාළ විෂය පථයන්, සංස්කෘතික කටයුතු දෙපාර්තමේන්තුව හෝ වෙනත් විෂයයක් මඟින් සිදුකරන බවයි ඔහු කියා සිටියේ. එමෙන්ම, දේශපාලන අධිකාරියක් ලෙස මෙයද සිදුවන බවද ජනාධිපතිවරයා පෙන්වාදුන්නේය.
Prompt: පාසල් අධ්‍යාපනය
Response: පාසල් අධ්‍යාපනය සඳහා සහභාගි වන පළමු අවස්ථාව මෙය වන බව අධ්‍යාපන අමාත්‍ය අකිල විරාජ් කාරියවසම් මහතා පවසනවා. ඒ අනුව අද පස්වරුවේ කොළඹ පැවති මාධ්‍ය හමුවකට එක්වෙමින් අමාත්‍යවරයා මේ අදහස් පළ කළා.
## Limitations & Biases
- Base Model Nature: This model is not a chatbot; it performs best at document completion. For Q&A tasks, further Instruction Fine-Tuning (IFT) is required (a minimal sketch follows this list).
- Data Bias: Due to the heavy concentration of news articles in the training data, the model tends to favor a formal, news-reporting tone.
- Hallucinations: As with all LLMs, the model may generate factually incorrect information while maintaining perfect grammatical structure.
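As a rough starting point for such fine-tuning, the sketch below uses TRL's `SFTTrainer`. It is purely illustrative: the toy dataset, the question/answer prompt template, and all hyperparameters are assumptions, and a real IFT run would need a sizable Sinhala instruction dataset.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical toy dataset; the "ප්‍රශ්නය:/පිළිතුර:" (question/answer)
# template is an assumption, not a format the model was trained on.
train_data = Dataset.from_list([
    {"text": "ප්‍රශ්නය: ශ්‍රී ලංකාවේ අගනුවර කුමක්ද? පිළිතුර: ශ්‍රී ජයවර්ධනපුර කෝට්ටේ."},
])

trainer = SFTTrainer(
    model="eshangj/sin-qwen-3B-base",
    train_dataset=train_data,
    args=SFTConfig(
        output_dir="sin-qwen-3B-ift",
        dataset_text_field="text",
        max_steps=100,  # assumption: tune for the real dataset
    ),
)
trainer.train()
```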
## Citation
If you use this model in your research, please cite it:
```bibtex
@misc{SinQwen2025,
  title={SinQwen-3B: A Native Small Language Model for Edge Deployment in Sri Lanka},
  author={Eshan G.J.},
  year={2025},
  publisher={Hugging Face}
}
```