---
library_name: transformers
tags:
- Computational_linguistics
- Low_resource_Language
- LLM
- GPT
license: apache-2.0
datasets:
- ambile-official/Sindhi_Mega_Corpus_118_Million_Tokens
language:
- sd
metrics:
- perplexity
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---
# SindhiLM: A Specialized GPT-2 Model for Sindhi

SindhiLM is a causal language model trained from scratch to generate text in the **Sindhi language**. Because it is trained specifically on Sindhi text, it reaches lower perplexity on a held-out Sindhi test set than the general-purpose baselines listed under Evaluation Results.

## Model Details
- **Developed by:** Aakash Meghwar
- **Model type:** Causal Language Model (GPT-2 architecture)
- **Language:** Sindhi (sd)
- **Library Name:** Transformers
- **Base Model:** openai-community/gpt2

## Evaluation Results
The model was evaluated using **Perplexity (PPL)** on a held-out Sindhi test set. Lower perplexity means the model assigns higher probability to the held-out text, so it predicts Sindhi more accurately.

| Model | Perplexity (Lower is Better) |
| :--- | :--- |
| **mBERT (Baseline)** | 2,360,312 |
| **GPT-2 (Base)** | 500,000 |
| **SindhiLM (Ours)** | **212,503** |
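
To make the metric concrete: perplexity is the exponential of the mean negative log-likelihood per token. The sketch below only illustrates that definition and is not the evaluation script used for SindhiLM. The function name `perplexity` and the sample log-probabilities are hypothetical, chosen for illustration.

```python
import math


def perplexity(token_log_probs):
    """Return perplexity given per-token natural-log probabilities.

    Hypothetical helper for illustration; in practice the log-probabilities
    would come from a language model scored on held-out text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    # Perplexity is exp of the mean negative log-likelihood.
    return math.exp(mean_nll)


# Made-up example: a model that assigns probability 0.25 to every token
# behaves like a uniform guess among four options.
uniform_guess = [math.log(0.25)] * 4
print(perplexity(uniform_guess))
```

A model that predicted every token with probability 1 would score a perplexity of 1, the lowest possible value; less certain predictions push the number up.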



## Training Data
The model was trained on the **Sindhi Mega Corpus**, consisting of approximately 118 million tokens. This dataset includes diverse Sindhi literature, news, and web content.

## How to Get Started
You can use this model directly with the Hugging Face `transformers` library:

```python
from transformers import pipeline

# Load the model
generator = pipeline("text-generation", model="aakashMeghwar01/SindhiLM")

# Generate Sindhi text
prompt = "سنڌ جي ثقافت"
output = generator(prompt, max_new_tokens=20, do_sample=True, temperature=0.7)

print(output[0]['generated_text'])
```