Commit dfc282e by AliMuhammad73 (verified; parent a310461): Update README.md (+91 −1)
---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
license: apache-2.0
language:
- ur
pipeline_tag: text-generation
---

This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- Library: [More Information Needed]
- Docs: [More Information Needed]
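The mixin mentioned above is what gives the custom model class its `from_pretrained`/`push_to_hub` methods. A minimal sketch of the pattern, where `TinyUrduLM` is a hypothetical stand-in and not the actual ALIF model class:

```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

class TinyUrduLM(nn.Module, PyTorchModelHubMixin):
    """Hypothetical toy model; inheriting the mixin adds Hub serialization."""

    def __init__(self, vocab_size: int = 20_000, d_model: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(ids))

model = TinyUrduLM()
# The mixin provides these helpers on the class itself:
# model.save_pretrained("local-dir")              # writes config + weights
# model.push_to_hub("user/repo")                  # uploads to the Hub
# model = TinyUrduLM.from_pretrained("user/repo") # reloads with saved config
```

Because the mixin records the `__init__` arguments, `from_pretrained` can rebuild the model with the same configuration it was pushed with.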

# ALIF Base 100M

**ALIF Base 100M** is an Urdu generative language model from the **ALIF الف** series (a Final Year Project at Habib University), developed by **Orature AI**. It is a decoder-only Transformer with a GPT-2-style architecture, pretrained specifically for the Urdu language.

## Model Details

* **Developed by:** Orature AI (S.M Ali Naqvi, Zainab Haider, Haya Fatima, Ali M Asad, Hammad Sajid)
<!-- * **Supervised by:** Dr. Abdul Samad (Habib University) -->
* **Model type:** Decoder-only Transformer, GPT-like
* **Variant:** ALIF-Base-100M
* **Language(s) (NLP):** Urdu (ur)
* **License:** Apache 2.0
<!-- * **Finetuned from model (if applicable):** [e.g., `OratureAI/ALIF-Base-1B`] -->
<!-- * **Related Models:** Other models in the ALIF الف series by Orature AI. -->
<!-- * **Project Repository/Paper:** [Link to ALIF GitHub Repo or Paper arXiv/Website] -->
* **Architecture:** Transformer (GPT-based)
* **Framework:** PyTorch
* **Tokenizer:** Custom SentencePiece tokenizer
* **Hyperparameters:**
  * **Vocabulary Size:** 20,000
  * **Embedding Size:** 768
  * **Attention Heads:** 12
  * **Layers:** 12
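These hyperparameters are consistent with the "100M" in the model's name. A back-of-the-envelope parameter count, assuming a standard GPT-2-style block (4x MLP expansion, biases, two LayerNorms per block), a 1024-token context window, and tied input/output embeddings; none of these details are confirmed by the card:

```python
vocab_size = 20_000
d_model = 768
n_layers = 12
context_len = 1024  # assumption: context length is not stated in the card

token_emb = vocab_size * d_model   # input embedding table (tied with the LM head)
pos_emb = context_len * d_model    # learned positional embeddings

attn = 4 * d_model * d_model + 4 * d_model  # Q, K, V, and output projections (+ biases)
mlp = 8 * d_model * d_model + 5 * d_model   # two linear layers with 4x expansion (+ biases)
layer_norms = 2 * (2 * d_model)             # two LayerNorms per block (scale + shift)
per_block = attn + mlp + layer_norms

total = token_emb + pos_emb + n_layers * per_block + 2 * d_model  # + final LayerNorm
print(f"~{total / 1e6:.1f}M parameters")  # → ~101.2M parameters
```

Under these assumptions the estimate lands at roughly 101M parameters, matching the variant name.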

## How to Get Started with the Model

First, download the `modeling_gpt.py` file from the repo. You can then use the following code in a separate file to generate text with the model:

```python
from modeling_gpt import GPTLanguageModel
from transformers import AutoTokenizer
import torch

model_name = "orature/ALIF-Base-100M"
model = GPTLanguageModel.from_pretrained(model_name)  # from_pretrained comes from PyTorchModelHubMixin
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode the prompt and add a batch dimension
prompt_urdu = "ایک دفعہ کا ذکر ہے کہ "  # "Once upon a time, "
inputs = tokenizer.encode(prompt_urdu)
inputs_tensor = torch.tensor(inputs).unsqueeze(0)

# Generate text and decode the whole sequence back to Urdu
outputs = model.generate(inputs_tensor, max_new_tokens=128, temperature=0.7)
generated_text = tokenizer.decode(outputs[0].tolist())

print(f"Prompt: {prompt_urdu}")
print(f"Generated Text: {generated_text}")
```
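The `temperature=0.7` argument in the snippet above controls how sharply the model commits to its top predictions. A library-free sketch of how temperature rescales next-token logits before sampling (the logit values here are made up for illustration):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    # Divide logits by the temperature, then softmax into probabilities
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index from the resulting distribution
    return rng.choices(range(len(probs)), weights=probs)[0], probs

rng = random.Random(0)
logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
_, probs_cool = sample_with_temperature(logits, 0.7, rng)  # T < 1: sharper
_, probs_hot = sample_with_temperature(logits, 1.5, rng)   # T > 1: flatter
print(probs_cool[0] > probs_hot[0])  # → True: low temperature favors the top token
```

Temperatures below 1 (like the 0.7 used above) make generation more conservative; values above 1 make it more varied but also more error-prone.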

## Model Description

**ALIF Base 100M** is designed to generate coherent and contextually relevant Urdu text. It leverages a custom Urdu tokenizer trained on the ALIF-Urdu-Corpus and was pretrained on a large corpus of diverse Urdu text.

**Key Features:**
* Optimized for the nuances of the Urdu language.
* Strong foundational capabilities for further fine-tuning.
* Predicts the next token in a sequence, making it suitable for a variety of text generation tasks.
* Part of a series aiming to provide efficient and accessible small language models (SLMs) for Urdu.

## Intended Uses & Limitations

**Intended Uses:**
* **Text Generation:** Creative writing, content generation, and story completion in Urdu.
* **Research:** A base for further research in Urdu NLP and low-resource language modeling.
* **Fine-tuning:** Can be fine-tuned for downstream tasks such as sentiment analysis, summarization, or domain-specific chatbots in Urdu.
* **Educational Purposes:** Understanding SLM behavior for Urdu.

**Limitations:**
* The model is trained primarily on Urdu and may not perform well on other languages or code-switched text unless a variant is specifically designed for it (e.g., an Ur-En variant).
* As a base generative model, it may generate plausible-sounding but incorrect or nonsensical information (hallucinations).
* The model may reflect biases present in the training data. The ALIF-Urdu-Corpus was curated from diverse sources, but societal, gender, or regional biases may still exist.
* Performance on highly specific or technical domains may be limited without further fine-tuning.
* The model has no real-time knowledge; its information is limited to its training data.
* **Safety:** Although the training data was curated, the model might still generate offensive, harmful, or inappropriate content. Users should implement appropriate safeguards for downstream applications.

**Out-of-Scope Uses:**
* Generating high-stakes advice (medical, legal, financial) without human oversight.
* Impersonation or generating misleading information.
* Applications that could lead to harm or discrimination.
* Complex scientific, technical, mathematical, or legal reasoning without further fine-tuning.
* Any use that violates ethical guidelines or legal standards.