AliMuhammad73 committed
Commit 1b4d7d1 · verified · 1 Parent(s): f960f9b

Update README.md

Files changed (1):
  1. README.md +7 -11
README.md CHANGED
@@ -13,26 +13,23 @@ This model has been pushed to the Hub using the [PytorchModelHubMixin](https://h
 - Docs: [More Information Needed]
 
 
-# [ALIF Base 100M]
+# ALIF Base 100M
 
-**[ALIF Base 100M]** is an Urdu generative language model from the **ALIF الف** series (a Final Year Project at Habib University), developed by **Orature AI**. This model is a [decoder-only Transformer / Naive GPT-2 based] architecture, specifically pretrained for the Urdu language.
+**ALIF Base 100M** is an Urdu generative language model from the **ALIF الف** series (a Final Year Project at Habib University), developed by **Orature AI**.
 
 ## Model Details
 
 * **Developed by:** Orature AI (S.M Ali Naqvi, Zainab Haider, Haya Fatima, Ali M Asad, Hammad Sajid)
-<!-- * **Supervised by:** Dr. Abdul Samad (Habib University) -->
+* **Supervised by:** Dr. Abdul Samad (Habib University)
 * **Model type:** Decoder-only Transformer, GPT-like
 * **Variant:** ALIF-Base-100M
 * **Language(s) (NLP):** Urdu (ur)
 * **License:** Apache 2.0
-<!-- * **Finetuned from model (if applicable):** [e.g., `OratureAI/ALIF-Base-1B`] -->
-<!-- * **Related Models:** Other models in the ALIF الف series by Orature AI. -->
-<!-- * **Project Repository/Paper:** [Link to ALIF GitHub Repo or Paper arXiv/Website] -->
 * **Architecture:** Transformer (GPT-Based)
 * **Framework:** PyTorch
 * **Tokenizer:** SentencePiece Custom Tokenizer
 * **Hyperparameters:**
-* **Vocabulary Size:** 20000
+* **Vocabulary Size:** 32000
 * **Embedding Size:** 768
 * **Attention Heads:** 12
 * **Layers:** 12
@@ -66,7 +63,7 @@ print(f"Generated Text: {generated_text}")
 
 ## Model Description
 
-**ALIF Base 100M** is designed to [generate coherent and contextually relevant Urdu text / understand and follow instructions in Urdu]. It leverages a custom Urdu tokenizer trained on the ALIF-Urdu-Corpus and was pretrained on a large corpus of diverse Urdu text.
+**ALIF Base 100M** is designed to generate coherent and contextually relevant Urdu text. It leverages a custom Urdu tokenizer trained on the ALIF-Urdu-Corpus and was pretrained on a large corpus of diverse Urdu text.
 
 **Key Features:**
 * Optimized for Urdu language nuances.
@@ -81,11 +78,10 @@ print(f"Generated Text: {generated_text}")
 * **Research:** Base for further research in Urdu NLP, low-resource language modeling.
 * **Fine-tuning:** Can be fine-tuned for specific downstream tasks like sentiment analysis, summarization, or domain-specific chatbots in Urdu.
 * **Educational Purposes:** Understanding SLM behavior for Urdu.
-* **(For Instruct Models):** Conversational AI, Q&A, task completion in Urdu.
-
+*
 **Limitations:**
 * The model is primarily trained on Urdu and may not perform well on other languages or code-switched text unless specifically designed for it (e.g., an Ur-En variant).
-* As a base generative model (especially for non-instruct versions), it may generate plausible-sounding but incorrect or nonsensical information (hallucinations).
+* As a base generative model, it may generate plausible-sounding but incorrect or nonsensical information (hallucinations).
 * The model may reflect biases present in the training data. The ALIF-Urdu-Corpus was curated from diverse sources, but biases (e.g., societal, gender, regional) may still exist.
 * Performance on highly specific or technical domains may be limited without further fine-tuning.
 * The model does not have real-time knowledge and its information is limited to its training data.
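
One of the changes in this commit bumps the vocabulary size from 20000 to 32000, which shifts the total parameter count. As a rough sanity check against the "100M" in the model name, here is a back-of-the-envelope GPT-2-style count using the hyperparameters from the updated card. The context length (1024) and the 4x MLP hidden ratio are assumptions not stated in the card, and biases, LayerNorm weights, and embedding tying are ignored.

```python
# Rough parameter count for a GPT-2-style decoder with the card's
# hyperparameters: vocab 32000, embedding 768, 12 heads, 12 layers.
# CTX (context length) is an assumed placeholder, not from the card.
VOCAB, D_MODEL, N_LAYERS, CTX = 32_000, 768, 12, 1024

embed = VOCAB * D_MODEL        # token embeddings (often tied with the LM head)
pos = CTX * D_MODEL            # learned positional embeddings
attn = 4 * D_MODEL * D_MODEL   # Q, K, V, and output projections
mlp = 8 * D_MODEL * D_MODEL    # two linear layers with a 4x hidden size
per_layer = attn + mlp         # ignoring biases and LayerNorm (small)
total = embed + pos + N_LAYERS * per_layer

print(f"~{total / 1e6:.0f}M parameters")  # prints "~110M parameters"
```

This lands near the 100M of the model's name; the exact published figure would depend on the real context length and whether the LM head is tied to the embeddings.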