Update README.md
README.md
CHANGED

---
language:
- sd
tags:
- sindhi
- qwen3
- continued-pretraining
- sindh-text-generation
- lora
base_model: unsloth/Qwen3-8B-bnb-4bit
library_name: peft
license: apache-2.0
---

# Qwen3-8B Sindhi CPT (Continued Pre-Training)

This is a **LoRA adapter** for [Qwen3-8B](https://huggingface.co/unsloth/Qwen3-8B-bnb-4bit), produced by continued pre-training on **~164M tokens of Sindhi text**.

---
## Model Details

| Property | Value |
|---|---|
| Base Model | `unsloth/Qwen3-8B-bnb-4bit` |
| Training Type | Continued Pre-Training (CPT) |
| Training Tokens | ~164M Sindhi tokens |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| Sequence Length | 2048 |
| Quantization | 4-bit (bnb) |
| Framework | Unsloth + HuggingFace PEFT |
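
For reference, the rank and alpha above correspond to a standard PEFT LoRA setup. A minimal sketch of how an equivalent adapter could be created with Unsloth is shown below; only the rank, alpha, sequence length, and 4-bit quantization come from the table, while the target modules, dropout, and gradient-checkpointing settings are illustrative assumptions, not necessarily the exact configuration used for this adapter.

```python
from unsloth import FastLanguageModel

# Base model and sequence length from the table above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# LoRA rank 32 / alpha 64 as listed above.
# target_modules and the remaining arguments are illustrative assumptions.
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    lora_alpha = 64,
    lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",
)
```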

---

## Usage

### Option 1 — Load with Unsloth (recommended, faster)
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "hellosindh/qwen3-sindhi-cpt",
    load_in_4bit = True,
    max_seq_length = 2048,
)

# Enable fast inference
FastLanguageModel.for_inference(model)
```
### Option 2 — Load base + adapter separately with PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype = torch.bfloat16,
    device_map = "auto",
    load_in_4bit = True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("hellosindh/qwen3-sindhi-cpt")

# Apply Sindhi adapter on top
model = PeftModel.from_pretrained(base_model, "hellosindh/qwen3-sindhi-cpt")
```
### Generate Sindhi text
```python
inputs = tokenizer("سنڌ جي ماڻهو", return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens = 200,
    temperature = 0.8,
    do_sample = True,
    repetition_penalty = 1.1,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---
## Training Details

- **Dataset**: ~164M Sindhi tokens from multiple sources
- **Tokenizer**: Qwen3 original tokenizer (no modifications)
- **Hardware**: NVIDIA A100 40GB
- **Framework**: [Unsloth](https://github.com/unslothai/unsloth) for efficient training
- **Optimizer**: AdamW 8-bit
- **Learning Rate**: `5e-5` with cosine scheduler
- **Final Loss**: ~1.20
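
As a rough sketch, the hyperparameters above map onto a standard `TrainingArguments` setup as follows. Batch size, gradient accumulation, warmup, and epoch count are not stated in this card, so those values are placeholders rather than the settings actually used.

```python
from transformers import TrainingArguments

# Values marked "from the card" come from the list above; the rest are placeholders.
training_args = TrainingArguments(
    output_dir = "qwen3-sindhi-cpt",       # example path
    learning_rate = 5e-5,                  # from the card
    lr_scheduler_type = "cosine",          # from the card
    optim = "adamw_8bit",                  # 8-bit AdamW (bitsandbytes), from the card
    per_device_train_batch_size = 2,       # placeholder
    gradient_accumulation_steps = 8,       # placeholder
    warmup_ratio = 0.03,                   # placeholder
    num_train_epochs = 1,                  # placeholder
    bf16 = True,
    logging_steps = 50,
)
```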

---

## Intended Use

- Sindhi text generation
- Synthetic data generation for low-resource Sindhi NLP
- Base for further fine-tuning on Sindhi tasks (NER, QA, summarization); see the merging sketch below
- Pretraining data augmentation for encoder models like [SindhiBERT](https://huggingface.co/hellosindh/sindhi-bert-base)
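
For the fine-tuning use case above, one option is to merge the adapter into the base weights first, so the result can be treated as a plain checkpoint. A minimal sketch using PEFT's `merge_and_unload`; the output directory is just an example, and the base is loaded in bfloat16 rather than 4-bit because merging into quantized weights is lossy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in bf16 so the LoRA weights can be merged cleanly.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype = torch.bfloat16,
    device_map = "auto",
)
model = PeftModel.from_pretrained(base, "hellosindh/qwen3-sindhi-cpt")

# Fold the adapter into the base weights and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("qwen3-8b-sindhi-merged")   # example output path

tokenizer = AutoTokenizer.from_pretrained("hellosindh/qwen3-sindhi-cpt")
tokenizer.save_pretrained("qwen3-8b-sindhi-merged")
```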

---

## Limitations

- This is a **continued pre-training** adapter, not an instruction-tuned model
- Outputs may not be factually accurate — intended for linguistic pattern learning
- Best used as a base for task-specific fine-tuning