---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
license: apache-2.0
language:
- ur
pipeline_tag: text-generation
---

This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration.
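
For reference, this is a minimal sketch of the `PyTorchModelHubMixin` pattern (illustrative only; the actual model class, `GPTLanguageModel`, lives in `modeling_gpt.py` in this repository):

```python
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

class TinyModel(nn.Module, PyTorchModelHubMixin):
    """Hypothetical model; stands in for GPTLanguageModel."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        return self.proj(x)

# The mixin adds Hub helpers to any nn.Module:
#   model.save_pretrained("local-dir")
#   model.push_to_hub("username/repo-name")
#   model = TinyModel.from_pretrained("username/repo-name")
```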


# ALIF Base 100M

**ALIF Base 100M** is an Urdu generative language model from the **ALIF الف** series (a Final Year Project at Habib University), developed by **Orature AI**.

## Model Details

*   **Developed by:** Orature AI (S.M Ali Naqvi, Zainab Haider, Haya Fatima, Ali M Asad, Hammad Sajid)
*   **Supervised by:** Dr. Abdul Samad (Habib University)
*   **Model type:** Decoder-only Transformer, GPT-like
*   **Variant:** ALIF-Base-100M
*   **Language(s) (NLP):** Urdu (ur)
*   **License:** Apache 2.0
*   **Architecture:** Transformer (GPT-based)
*   **Framework:** PyTorch
*   **Tokenizer:** Custom SentencePiece tokenizer
*   **Hyperparameters:**
    *   **Vocabulary Size:** 32000
    *   **Embedding Size:** 768
    *   **Attention Heads:** 12
    *   **Layers:** 12 
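
As a rough sanity check on the "100M" name, the listed hyperparameters can be turned into a back-of-the-envelope parameter estimate. The sketch below assumes a standard GPT block with tied input/output embeddings and a 4x-wide MLP, and omits positional embeddings, layer norms, and biases (the context length is not listed here):

```python
# Approximate parameter count from the listed hyperparameters.
# Assumptions: standard GPT block, tied embeddings; positional embeddings,
# layer norms, and biases omitted -- an estimate, not the official count.
vocab_size, d_model, n_layers = 32_000, 768, 12

token_embeddings = vocab_size * d_model   # ~24.6M
attention = 4 * d_model * d_model         # Q, K, V, and output projections
mlp = 2 * (d_model * 4 * d_model)         # up- and down-projections
per_layer = attention + mlp               # ~7.1M per transformer block

total = token_embeddings + n_layers * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # ~110M, roughly in line with the 100M name
```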

## How to Get Started with the Model

First, download the `modeling_gpt.py` file from this repository. Then, in a separate file, use the following code to generate text with the model:

```python
from modeling_gpt import GPTLanguageModel
from transformers import AutoTokenizer
import torch

model_name = "orature/ALIF-Base-100M"
model = GPTLanguageModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# For text generation
prompt_urdu = "ایک دفعہ کا ذکر ہے کہ " # "Once upon a time, "
inputs = tokenizer.encode(prompt_urdu)
inputs_tensor = torch.tensor(inputs).unsqueeze(0)  # Add batch dimension

# Generate text
outputs = model.generate(inputs_tensor, max_new_tokens=64, temperature=0.7)
generated_text = tokenizer.decode(outputs[0].tolist())  # outputs has shape (batch, tokens)

print(f"Prompt: {prompt_urdu}")
print(f"Generated Text: {generated_text}")
```
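
`max_new_tokens` bounds the length of the continuation, and `temperature` controls sampling randomness (in standard temperature sampling, values below 1.0 sharpen the distribution toward more deterministic output, while higher values increase diversity).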

## Model Description

**ALIF Base 100M** is designed to generate coherent and contextually relevant Urdu text. It leverages a custom Urdu tokenizer trained on the ALIF-Urdu-Corpus and was pretrained on a large corpus of diverse Urdu text.
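
As a quick illustration, the tokenizer can be inspected on its own (loaded the same way as in the getting-started example above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("orature/ALIF-Base-100M")
ids = tokenizer.encode("اردو زبان")  # "Urdu language"
print(ids)                    # token ids from the 32,000-entry vocabulary
print(tokenizer.decode(ids))  # round-trips back to the original text
```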

**Key Features:**
*   Optimized for Urdu language nuances.
*   A strong foundation for further fine-tuning on downstream Urdu tasks.
*   Autoregressive next-token generation, making it suitable for a variety of text generation tasks.
*   Part of a series aiming to provide efficient and accessible SLMs for Urdu.

## Intended Uses & Limitations

**Intended Uses:**
*   **Text Generation:** Creative writing, content generation, story completion in Urdu.
*   **Research:** Base for further research in Urdu NLP, low-resource language modeling.
*   **Fine-tuning:** Can be fine-tuned for specific downstream tasks like sentiment analysis, summarization, or domain-specific chatbots in Urdu (see the sketch after this list).
*   **Educational Purposes:** Understanding SLM behavior for Urdu.
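
As a starting point for the fine-tuning use case above, here is a hypothetical sketch. It assumes the model's forward pass accepts `(input_ids, targets)` and returns `(logits, loss)`, as is common in compact GPT implementations; check `modeling_gpt.py` for the actual interface. `dataloader` stands in for your own batched Urdu task data.

```python
import torch
from modeling_gpt import GPTLanguageModel

model = GPTLanguageModel.from_pretrained("orature/ALIF-Base-100M")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
for input_ids, targets in dataloader:  # hypothetical: your own Urdu dataset
    logits, loss = model(input_ids, targets)  # assumed (logits, loss) interface
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
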
**Limitations:**
*   The model is primarily trained on Urdu and may not perform well on other languages or code-switched text unless specifically designed for it (e.g., an Ur-En variant).
*   As a base generative model, it may generate plausible-sounding but incorrect or nonsensical information (hallucinations).
*   The model may reflect biases present in the training data. The ALIF-Urdu-Corpus was curated from diverse sources, but biases (e.g., societal, gender, regional) may still exist.
*   Performance on highly specific or technical domains may be limited without further fine-tuning.
*   The model does not have real-time knowledge and its information is limited to its training data.
*   Safety: While efforts are made to curate data, the model might generate offensive, harmful, or inappropriate content. Users should implement appropriate safeguards for downstream applications.

**Out-of-Scope Uses:**
*   Generating high-stakes advice (medical, legal, financial) without human oversight.
*   Impersonation or generating misleading information.
*   Applications that could lead to harm or discrimination.
*   Complex scientific, technical, mathematical, or legal reasoning without further fine-tuning.
*   Any use that violates ethical guidelines or legal standards.