SindhiLM-Qwen-0.5B (v1.0)
Developed by Aakash Meghwar (Founder, Text Tech Solutions)
Project Overview
This model is a specialized large language model (LLM) for the Sindhi language, developed as part of the broader SindhiLM Project. It is a fine-tuned version of the Qwen2.5-0.5B architecture, optimized to understand and generate Sindhi text using a custom-built BPE tokenizer.
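A BPE tokenizer builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair in a corpus. Below is a minimal, self-contained sketch of that merge loop on a toy English corpus; it illustrates the idea only and is not the actual SindhiLM tokenizer or its training data:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: character-level words with frequencies.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(3):  # learn 3 merges
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
print(merges)  # learned merge rules, most frequent first
```

A production tokenizer (e.g. one built with the Hugging Face `tokenizers` library) adds normalization, pre-tokenization, and special tokens on top of this core loop, which matters for an Arabic-script language like Sindhi.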
The model was trained for 10,000 steps using a 4-bit LoRA (Low-Rank Adaptation) approach. This project adheres to the "Green AI" philosophy, providing a high-performance, compact model that is accessible for low-resource languages and deployable on consumer-grade hardware.
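LoRA keeps the base weights frozen and learns a low-rank update ΔW = B·A, so only 2·d·r parameters per adapted d×d projection are trained instead of d². A back-of-the-envelope sketch of the savings (896 is Qwen2.5-0.5B's hidden size; rank 16 is an illustrative choice, not this project's actual training config):

```python
# Parameter savings from a rank-r LoRA update on one d x d weight matrix.
d = 896   # Qwen2.5-0.5B hidden size
r = 16    # illustrative LoRA rank (assumption, not the actual config)

full_params = d * d       # full fine-tune: update every weight
lora_params = 2 * d * r   # LoRA: A is r x d, B is d x r

print(full_params)                # 802816
print(lora_params)                # 28672
print(full_params / lora_params)  # 28.0
```

Combined with 4-bit quantization of the frozen base weights, this is what makes fine-tuning a model of this size feasible on consumer-grade hardware.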
About the Author
I am Aakash Meghwar, a Computational Linguist specializing in South Asian languages.
- M.S. in Applied Linguistics & Text Analytics (HSE, Russia; graduating June 2026)
- B.S. in English Language & Literature (NUML, Islamabad)
- Founder: Text Tech Solutions
- Published Researcher: author of "Compact Transformer Models for Classical Urdu Poetry" (Corporum Journal)
Open for Collaboration & PhD Opportunities
I am actively seeking PhD opportunities and Research Collaborations in the following areas:
- Low-Resource NLP: Developing efficient models for Sindhi, Urdu, and Siraiki.
- Literary Informatics: Computational stylistics and affective registers in classical poetry.
- SindhiLM Evolution: I am currently developing SindhiLM-v2 (featuring an improved Sindhi-BPE tokenizer and knowledge distillation).
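Knowledge distillation, as planned for SindhiLM-v2, trains a small student to match a larger teacher's softened output distribution. A minimal stdlib sketch of the temperature-scaled soft-target loss (toy logits and an illustrative temperature, not SindhiLM-v2's actual setup):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between softened teacher targets and student predictions."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # student predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]  # toy teacher logits for one token
student = [3.0, 1.5, 0.5]  # toy student logits
print(distill_loss(teacher, student))
```

In practice this term is blended with the ordinary hard-label cross-entropy, and the temperature softens the teacher's distribution so the student also learns from the relative probabilities of wrong tokens.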
Are you looking for a researcher to join your NLP lab or organization? I am open to discussing projects involving model compression, cross-lingual transfer, and South Asian language technology.
Contact & Links
- Email: aakashmeghwar01@gmail.com
- LinkedIn: Aakash Meghwar
- Projects: SindhiLM | Urdu Poetry Research
How to Use
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
model_id = "Qwen/Qwen2.5-0.5B"
adapter_id = "aakashMeghwar01/SindhiLM-Qwen-0.5B"
# Load the custom Sindhi tokenizer (shipped in the adapter repo) and the base model
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
# Load Sindhi Adapters
model = PeftModel.from_pretrained(model, adapter_id)
# Example Sindhi Input
prompt = "سنڌي ٻوليءَ جي اهميت"  # "The importance of the Sindhi language"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))