---
library_name: transformers
tags:
- Computational_linguistics
- Low_resource_Language
- LLM
- GPT
license: apache-2.0
datasets:
- ambile-official/Sindhi_Mega_Corpus_118_Million_Tokens
language:
- sd
metrics:
- perplexity
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---
# SindhiLM: A Specialized GPT-2 Model for Sindhi

SindhiLM is a causal language model trained from scratch to provide high-quality text generation for the **Sindhi language**. By focusing specifically on Sindhi morphology and syntax, it achieves substantially lower perplexity than general multilingual models.
## Model Details
- **Developed by:** Aakash Meghwar
- **Model type:** Causal Language Model (GPT-2 architecture)
- **Language:** Sindhi (sd)
- **Library Name:** Transformers
- **Base Model:** openai-community/gpt2
## Evaluation Results
The model was evaluated using **Perplexity (PPL)** on a held-out Sindhi test set. Lower perplexity indicates a better understanding of the language.

| Model | Perplexity (Lower is Better) |
| :--- | :--- |
| **mBERT (Baseline)** | 2,360,312 |
| **GPT-2 (Base)** | 500,000 |
| **SindhiLM (Ours)** | **212,503** |
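Perplexity is the exponential of the average negative log-likelihood the model assigns to each test token. A minimal sketch of the computation, using toy log-probabilities rather than values from the evaluation above:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token
# has a perplexity of 4 (it is "as confused as" a uniform
# choice among 4 options at each step).
print(perplexity([math.log(0.25)] * 10))
```

In practice the per-token log-probabilities come from running the model over the held-out test set; the toy values here only illustrate the formula.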
## Training Data
The model was trained on the **Sindhi Mega Corpus**, consisting of approximately 118 million tokens. This dataset includes diverse Sindhi literature, news, and web content.
## How to Get Started
You can use this model directly with the Hugging Face `transformers` library:
```python
from transformers import pipeline

# Load the model from the Hugging Face Hub
generator = pipeline("text-generation", model="aakashMeghwar01/SindhiLM")

# Generate Sindhi text from a prompt ("The culture of Sindh")
prompt = "سنڌ جي ثقافت"
output = generator(prompt, max_new_tokens=20, do_sample=True, temperature=0.7)

print(output[0]["generated_text"])
```
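The `temperature=0.7` setting above sharpens the sampling distribution by dividing the model's logits before the softmax. A minimal illustration with toy logits (not produced by SindhiLM):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax.

    Lower temperature concentrates probability mass on the
    highest-scoring tokens; temperature 1.0 leaves the
    distribution unchanged.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 1.0))  # baseline distribution
print(softmax_with_temperature(logits, 0.7))  # sharper: top token gains mass
```

With `do_sample=True`, tokens are drawn from this rescaled distribution, so 0.7 makes generations more focused than pure sampling while keeping some variety.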