---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
license: apache-2.0
language:
- ur
pipeline_tag: text-generation
---

This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration.

# ALIF Base 100M

**ALIF Base 100M** is an Urdu generative language model from the **ALIF الف** series (a Final Year Project at Habib University), developed by **Orature AI**.

## Model Details

* **Developed by:** Orature AI (S.M Ali Naqvi, Zainab Haider, Haya Fatima, Ali M Asad, Hammad Sajid)
* **Supervised by:** Dr. Abdul Samad (Habib University)
* **Model type:** Decoder-only Transformer, GPT-like
* **Variant:** ALIF-Base-100M
* **Language(s) (NLP):** Urdu (ur)
* **License:** Apache 2.0
* **Architecture:** Transformer (GPT-based)
* **Framework:** PyTorch
* **Tokenizer:** Custom SentencePiece tokenizer
* **Hyperparameters:**
  * **Vocabulary Size:** 32,000
  * **Embedding Size:** 768
  * **Attention Heads:** 12
  * **Layers:** 12

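As a rough sanity check, these hyperparameters are consistent with the ~100M parameter count in the model's name. The sketch below uses a standard GPT-style estimate (4·d² attention weights and an 8·d² feed-forward block per layer, assuming a 4·d hidden width and a token embedding tied to the output head); the exact count depends on implementation details not listed here, such as biases and positional embeddings.

```python
# Rough parameter estimate from the listed hyperparameters.
# Assumptions (not stated in the card): 4*d feed-forward width,
# tied input/output embeddings; biases, layer norms, and positional
# embeddings are ignored as small corrections.
vocab_size = 32_000
d = 768        # embedding size
n_layers = 12

embedding = vocab_size * d               # token embedding (tied with head)
per_layer = 4 * d * d + 8 * d * d        # attention + feed-forward weights
total = embedding + n_layers * per_layer

print(f"~{total / 1e6:.0f}M parameters")  # on the order of 100M
```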
## How to Get Started with the Model

First, download the `modeling_gpt.py` file from the repository. Then, in a separate file, use the following code to generate text with the model:

```python
from modeling_gpt import GPTLanguageModel
from transformers import AutoTokenizer
import torch

model_name = "orature/ALIF-Base-100M"
model = GPTLanguageModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode an Urdu prompt and add a batch dimension
prompt_urdu = "ایک دفعہ کا ذکر ہے کہ "  # "Once upon a time, "
inputs = tokenizer.encode(prompt_urdu)
inputs_tensor = torch.tensor(inputs).unsqueeze(0)

# Generate text; `generate` returns a batch of token ids
outputs = model.generate(inputs_tensor, max_new_tokens=64, temperature=0.7)
generated_text = tokenizer.decode(outputs[0].tolist())

print(f"Prompt: {prompt_urdu}")
print(f"Generated Text: {generated_text}")
```

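The `temperature` argument controls how sharply the model's next-token distribution is peaked before sampling: values below 1 favor high-probability tokens, while values above 1 flatten the distribution. Below is a minimal pure-Python sketch of temperature sampling; it illustrates the idea only, and the model's actual `generate` implementation may differ (e.g., by also applying top-k filtering).

```python
import math
import random

def sample_with_temperature(logits, temperature=0.7, rng=random):
    """Sample an index from raw logits after temperature scaling."""
    scaled = [l / temperature for l in logits]
    # Softmax with max-subtraction for numerical stability
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index according to the resulting distribution
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

# A low temperature concentrates probability mass on the highest logit
logits = [2.0, 1.0, 0.1]
random.seed(0)
samples = [sample_with_temperature(logits, temperature=0.2) for _ in range(100)]
print(samples.count(0))  # most draws pick the top token
```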
## Model Description

**ALIF Base 100M** is designed to generate coherent and contextually relevant Urdu text. It uses a custom Urdu tokenizer trained on the ALIF-Urdu-Corpus and was pretrained on a large, diverse corpus of Urdu text.

**Key Features:**

* Optimized for the nuances of the Urdu language.
* Provides a strong foundation for further fine-tuning.
* Generates the next tokens in a sequence, making it suitable for a variety of text generation tasks.
* Part of a series aiming to provide efficient and accessible small language models (SLMs) for Urdu.

## Intended Uses & Limitations

**Intended Uses:**

* **Text Generation:** Creative writing, content generation, and story completion in Urdu.
* **Research:** A base for further research in Urdu NLP and low-resource language modeling.
* **Fine-tuning:** Can be fine-tuned for specific downstream tasks such as sentiment analysis, summarization, or domain-specific chatbots in Urdu.
* **Educational Purposes:** Understanding SLM behavior for Urdu.

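The fine-tuning use case follows a standard causal language modeling loop: predict each token from the ones before it and minimize cross-entropy against the shifted targets. The snippet below is an illustrative sketch only: it trains a tiny stand-in model on random token ids, whereas a real run would load the model with `GPTLanguageModel.from_pretrained(...)` and feed batches of tokenized Urdu text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Tiny stand-in model (embedding -> linear head), for illustration only;
# the real workflow would fine-tune the pretrained GPTLanguageModel.
vocab_size = 256  # toy vocabulary; ALIF uses 32,000
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Toy "dataset": random token ids standing in for tokenized Urdu text
batch = torch.randint(0, vocab_size, (4, 32))

losses = []
for step in range(50):
    logits = model(batch[:, :-1])           # predict the next token
    loss = F.cross_entropy(
        logits.reshape(-1, vocab_size),     # (batch * time, vocab)
        batch[:, 1:].reshape(-1),           # targets shifted by one
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```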
**Limitations:**

* The model is trained primarily on Urdu and may not perform well on other languages or code-switched text unless specifically designed for it (e.g., an Ur-En variant).
* As a base generative model, it may generate plausible-sounding but incorrect or nonsensical information (hallucinations).
* The model may reflect biases present in the training data. The ALIF-Urdu-Corpus was curated from diverse sources, but biases (e.g., societal, gender, regional) may still exist.
* Performance on highly specific or technical domains may be limited without further fine-tuning.
* The model has no real-time knowledge; its information is limited to its training data.
* **Safety:** Although efforts were made to curate the data, the model might still generate offensive, harmful, or inappropriate content. Users should implement appropriate safeguards for downstream applications.

**Out-of-Scope Uses:**

* Generating high-stakes advice (medical, legal, financial) without human oversight.
* Impersonation or generating misleading information.
* Applications that could lead to harm or discrimination.
* Complex scientific, technical, mathematical, or legal reasoning without further fine-tuning.
* Any use that violates ethical guidelines or legal standards.