---
license: apache-2.0
---

# ALIF Base 100M

**ALIF Base 100M** is an Urdu generative language model from the **ALIF الف** series (a Final Year Project at Habib University), developed by **Orature AI**. The model is a decoder-only Transformer (GPT-2-style) architecture pretrained specifically for the Urdu language.

## Model Details

* **Developed by:** Orature AI (S.M Ali Naqvi, Zainab Haider, Haya Fatima, Ali M Asad, Hammad Sajid)
<!-- * **Supervised by:** Dr. Abdul Samad (Habib University) -->
* **Model type:** Decoder-only Transformer, GPT-like
* **Variant:** ALIF-Base-100M
* **Language(s) (NLP):** Urdu (ur)
* **License:** Apache 2.0
<!-- * **Finetuned from model (if applicable):** [e.g., `OratureAI/ALIF-Base-1B`] -->
<!-- * **Related Models:** Other models in the ALIF الف series by Orature AI. -->
<!-- * **Project Repository/Paper:** [Link to ALIF GitHub Repo or Paper arXiv/Website] -->
* **Architecture:** Transformer (GPT-based)
* **Framework:** PyTorch
* **Tokenizer:** Custom SentencePiece tokenizer
* **Hyperparameters:**
  * **Vocabulary Size:** 20,000
  * **Embedding Size:** 768
  * **Attention Heads:** 12
  * **Layers:** 12
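
These hyperparameters roughly account for the "100M" in the model's name. A back-of-the-envelope estimate, assuming a standard GPT-2-style block (Q/K/V/output attention projections and a 4x MLP expansion; positional embeddings and bias terms, which are not stated here, are omitted):

```python
# Rough parameter-count estimate from the hyperparameters above.
# Assumes a standard GPT-2-style block; this is an approximation,
# not the model's exact configuration.

vocab_size, d_model, n_layers = 20_000, 768, 12

embedding = vocab_size * d_model              # token embedding table
attention_per_layer = 4 * d_model * d_model   # Q, K, V and output projections
mlp_per_layer = 2 * d_model * (4 * d_model)   # up- and down-projection
total = embedding + n_layers * (attention_per_layer + mlp_per_layer)

print(f"{total / 1e6:.1f}M parameters")  # prints "100.3M parameters"
```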

## How to Get Started with the Model

First, download the `modeling_gpt.py` file from the repository. You can then create a script in the same directory and use the following code to generate text with the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from modeling_gpt import GPTLanguageModel  # defines the custom architecture

model_name = "OratureAI/[MODEL_NAME_ON_HF_HUB]"  # e.g., OratureAI/ALIF-Base-100M
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode the prompt and add a batch dimension
prompt_urdu = "ایک دفعہ کا ذکر ہے کہ "  # "Once upon a time, "
inputs = tokenizer.encode(prompt_urdu)
inputs_tensor = torch.tensor(inputs).unsqueeze(0)

# Generate text and decode it back to a string
outputs = model.generate(inputs_tensor, max_new_tokens=128, temperature=0.7)
generated_text = tokenizer.decode(outputs[0].squeeze().tolist())

print(f"Prompt: {prompt_urdu}")
print(f"Generated Text: {generated_text}")
```
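
The example passes `temperature=0.7` to `generate`. Temperature divides the logits before the softmax: values below 1 sharpen the next-token distribution, values above 1 flatten it. A minimal sketch of the effect (plain Python, no model required):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically
    stable softmax (subtracting the max before exponentiating)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                         # hypothetical next-token logits
sharp = softmax_with_temperature(logits, 0.7)    # the setting used above
flat = softmax_with_temperature(logits, 1.5)

print(max(sharp) > max(flat))  # True: lower temperature is more peaked
```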

## Model Description

ALIF Base 100M is designed to generate coherent and contextually relevant Urdu text. It uses a custom Urdu tokenizer trained on the ALIF-Urdu-Corpus and was pretrained on a large, diverse corpus of Urdu text.

**Key Features:**
* Optimized for the nuances of the Urdu language.
* Strong foundational capabilities for further fine-tuning.
* Generates the next tokens in a sequence, making it suitable for a range of text-generation tasks.
* Part of a series aiming to provide efficient and accessible SLMs for Urdu.
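
The next-token generation described above is an autoregressive loop: the model repeatedly predicts one token and appends it to the context. A minimal sketch, with a hypothetical toy "model" standing in for the real forward pass:

```python
# A sketch of greedy autoregressive decoding, the loop a decoder-only model
# like ALIF Base 100M performs internally. The toy transition table below is
# hypothetical and stands in for a forward pass returning next-token logits.

def toy_next_token(context):
    """Stand-in for argmax over the model's next-token distribution."""
    transitions = {0: 1, 1: 2, 2: 3, 3: 0}  # hypothetical toy dynamics
    return transitions[context[-1]]

def generate(prompt_ids, max_new_tokens):
    """Greedy decoding: repeatedly append the model's top prediction."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(toy_next_token(ids))
    return ids

print(generate([0], 5))  # [0, 1, 2, 3, 0, 1]
```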

## Intended Uses & Limitations

**Intended Uses:**
* **Text Generation:** Creative writing, content generation, and story completion in Urdu.
* **Research:** A base for further research in Urdu NLP and low-resource language modeling.
* **Fine-tuning:** Can be fine-tuned for downstream tasks such as sentiment analysis, summarization, or domain-specific chatbots in Urdu.
* **Educational Purposes:** Understanding SLM behavior for Urdu.
* **Instruct variants:** Conversational AI, Q&A, and task completion in Urdu.

**Limitations:**
* The model is trained primarily on Urdu and may not perform well on other languages or code-switched text unless a variant is specifically designed for it (e.g., an Ur-En variant).
* As a base generative model, it may produce plausible-sounding but incorrect or nonsensical information (hallucinations).
* The model may reflect biases present in its training data. The ALIF-Urdu-Corpus was curated from diverse sources, but societal, gender, or regional biases may still exist.
* Performance on highly specific or technical domains may be limited without further fine-tuning.
* The model has no real-time knowledge; its information is limited to its training data.
* **Safety:** Although efforts were made to curate the data, the model might still generate offensive, harmful, or inappropriate content. Users should implement appropriate safeguards for downstream applications.

**Out-of-Scope Uses:**
* Generating high-stakes advice (medical, legal, financial) without human oversight.
* Impersonation or generating misleading information.
* Applications that could lead to harm or discrimination.
* Complex scientific, technical, mathematical, or legal reasoning without further fine-tuning.
* Any use that violates ethical guidelines or legal standards.