---
library_name: transformers
tags:
- Computational_linguistics
- Low_resource_Language
- LLM
- GPT
license: apache-2.0
datasets:
- ambile-official/Sindhi_Mega_Corpus_118_Million_Tokens
language:
- sd
metrics:
- perplexity
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---
# SindhiLM: A Specialized GPT-2 Model for Sindhi
SindhiLM is a causal language model trained from scratch for text generation in the **Sindhi language**. Because it is trained on Sindhi text, it targets Sindhi morphology and syntax directly, and it reaches lower perplexity than the mBERT and base GPT-2 baselines on the held-out test set reported below.
## Model Details
- **Developed by:** Aakash Meghwar
- **Model type:** Causal Language Model (GPT-2 architecture)
- **Language:** Sindhi (sd)
- **Library Name:** Transformers
- **Base Model:** openai-community/gpt2
## Evaluation Results
The model was evaluated using **Perplexity (PPL)** on a held-out Sindhi test set. Lower perplexity means the model assigns higher probability to the held-out text.
| Model | Perplexity (Lower is Better) |
| :--- | :--- |
| **mBERT (Baseline)** | 2,360,312 |
| **GPT-2 (Base)** | 500,000 |
| **SindhiLM (Ours)** | **212,503** |
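
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The snippet below is a minimal sketch of that definition, assuming per-token natural-log probabilities are already available. The function name `perplexity` and the sample values are hypothetical, and this is not necessarily the evaluation script used to produce the table above.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over a list of natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative values only: a model that gives each of 4 tokens probability 0.25.
example_log_probs = [math.log(0.25)] * 4
print(perplexity(example_log_probs))
```

A model that spreads probability uniformly over k candidates at every step has a perplexity of about k, so the example above comes out at roughly 4.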
## Training Data
The model was trained on the **Sindhi Mega Corpus**, consisting of approximately 118 million tokens. This dataset includes diverse Sindhi literature, news, and web content.
## How to Get Started
You can use this model directly with the Hugging Face `transformers` library:
```python
from transformers import pipeline
# Load the model
generator = pipeline("text-generation", model="aakashMeghwar01/SindhiLM")
# Generate Sindhi text
prompt = "سنڌ جي ثقافت"
output = generator(prompt, max_new_tokens=20, do_sample=True, temperature=0.7)
print(output[0]['generated_text'])
```