---
library_name: transformers
tags:
- Computational_linguistics
- Low_resource_Language
- LLM
- GPT
license: apache-2.0
datasets:
- ambile-official/Sindhi_Mega_Corpus_118_Million_Tokens
language:
- sd
metrics:
- perplexity
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# SindhiLM: A Specialized GPT-2 Model for Sindhi

SindhiLM is a causal language model trained from scratch to provide high-quality text generation for the **Sindhi language**. By focusing specifically on Sindhi morphology and syntax, it substantially outperforms general multilingual models.

## Model Details

- **Developed by:** Aakash Meghwar
- **Model type:** Causal Language Model (GPT-2 architecture)
- **Language:** Sindhi (sd)
- **Library Name:** Transformers
- **Base Model:** openai-community/gpt2

## Evaluation Results

The model was evaluated using **Perplexity (PPL)** on a held-out Sindhi test set. Lower perplexity indicates a better fit to the language.

| Model | Perplexity (Lower is Better) |
| :--- | :--- |
| **mBERT (Baseline)** | 2,360,312 |
| **GPT-2 (Base)** | 500,000 |
| **SindhiLM (Ours)** | **212,503** |

## Training Data

The model was trained on the **Sindhi Mega Corpus**, consisting of approximately 118 million tokens of diverse Sindhi literature, news, and web content.

## How to Get Started

You can use this model directly with the Hugging Face `transformers` library:

```python
from transformers import pipeline

# Load the model
generator = pipeline("text-generation", model="aakashMeghwar01/SindhiLM")

# Generate Sindhi text
prompt = "سنڌ جي ثقافت"  # "The culture of Sindh"
output = generator(prompt, max_new_tokens=20, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```
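For reference, the perplexity values reported above are the exponential of the model's mean per-token cross-entropy loss. A minimal, self-contained sketch of that conversion (the function name is illustrative; the card's exact evaluation script is not published here):

```python
import math

def perplexity(mean_cross_entropy_nats: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_nats)

# A perfect model (loss 0) has perplexity 1; perplexity grows
# exponentially with the loss, so large corpus-level losses
# produce the six-figure values seen in the table above.
print(perplexity(0.0))  # 1.0
```

In `transformers`, the same conversion can be applied to the `loss` returned when calling a causal LM with `labels=inputs["input_ids"]` on held-out text.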