AliMuhammad73 committed on
Commit 5003c37 · verified · 1 Parent(s): 7bc80e5

Push model using huggingface_hub.

Files changed (2)
  1. README.md +1 -113
  2. config.json +0 -1
README.md CHANGED
@@ -2,120 +2,8 @@
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
- language:
- - ur
  ---
- <!-- ---
- tags:
- - model_hub_mixin
- - pytorch_model_hub_mixin

  This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
  - Library: [More Information Needed]
- - Docs: [More Information Needed] -->
- ---
- license: apache-2.0
-
-
- ---
-
- # ALIF Base 100M
-
- **ALIF Base 100M** is an Urdu generative language model from the **ALIF الف** series (a Final Year Project at Habib University), developed by **Orature AI**. It uses a decoder-only, GPT-2-style Transformer architecture pretrained specifically for the Urdu language.
-
- ## Model Details
-
- * **Developed by:** Orature AI (S.M Ali Naqvi, Zainab Haider, Haya Fatima, Ali M Asad, Hammad Sajid)
- <!-- * **Supervised by:** Dr. Abdul Samad (Habib University) -->
- * **Model type:** Decoder-only Transformer, GPT-like
- * **Variant:** ALIF-Base-100M
- * **Language(s) (NLP):** Urdu (ur)
- * **License:** Apache 2.0
- <!-- * **Finetuned from model (if applicable):** [e.g., `OratureAI/ALIF-Base-1B`] -->
- <!-- * **Related Models:** Other models in the ALIF الف series by Orature AI. -->
- <!-- * **Project Repository/Paper:** [Link to ALIF GitHub Repo or Paper arXiv/Website] -->
- * **Architecture:** Transformer (GPT-based)
- * **Framework:** PyTorch
- * **Tokenizer:** Custom SentencePiece tokenizer
- * **Hyperparameters:**
-   * **Vocabulary Size:** 20000
-   * **Embedding Size:** 768
-   * **Attention Heads:** 12
-   * **Layers:** 12
-
- ## How to Get Started with the Model
-
- First, download the modeling_gpt.py file from the repo. Then, in a separate file, use the following code to generate text from the model:
-
- ```python
- from modeling_gpt import GPTLanguageModel
- from transformers import AutoTokenizer
- import torch
-
- model_name = "orature/ALIF-Base-100M"
- # GPTLanguageModel uses the PyTorchModelHubMixin, so it can load weights
- # directly from the Hub
- model = GPTLanguageModel.from_pretrained(model_name)
- tokenizer = AutoTokenizer.from_pretrained(model_name)
-
- # Encode the prompt and add a batch dimension
- prompt_urdu = "ایک دفعہ کا ذکر ہے کہ "  # "Once upon a time, "
- inputs = tokenizer.encode(prompt_urdu)
- inputs_tensor = torch.tensor(inputs).unsqueeze(0)
-
- # Generate text; generate() already returns a batched sequence
- outputs = model.generate(inputs_tensor, max_new_tokens=128, temperature=0.7)
- generated_text = tokenizer.decode(outputs[0].tolist())
-
- print(f"Prompt: {prompt_urdu}")
- print(f"Generated Text: {generated_text}")
- ```
-
- ## Model Description
-
- **ALIF Base 100M** is designed to generate coherent and contextually relevant Urdu text. It leverages a custom Urdu tokenizer trained on the ALIF-Urdu-Corpus and was pretrained on a large corpus of diverse Urdu text.
-
- **Key Features:**
- * Optimized for the nuances of the Urdu language.
- * Strong foundational capabilities for further fine-tuning.
- * Generates next tokens in a sequence, making it suitable for a range of text generation tasks.
- * Part of a series aiming to provide efficient and accessible SLMs for Urdu.
-
- ## Intended Uses & Limitations
-
- **Intended Uses:**
- * **Text Generation:** Creative writing, content generation, story completion in Urdu.
- * **Research:** Base for further research in Urdu NLP and low-resource language modeling.
- * **Fine-tuning:** Can be fine-tuned for downstream tasks such as sentiment analysis, summarization, or domain-specific chatbots in Urdu.
- * **Educational Purposes:** Studying SLM behavior for Urdu.
- * **(For instruct variants):** Conversational AI, Q&A, task completion in Urdu.
-
- **Limitations:**
- * The model is trained primarily on Urdu and may not perform well on other languages or code-switched text unless specifically designed for it (e.g., an Ur-En variant).
- * As a base generative model, it may generate plausible-sounding but incorrect or nonsensical information (hallucinations).
- * The model may reflect biases present in the training data. The ALIF-Urdu-Corpus was curated from diverse sources, but biases (e.g., societal, gender, regional) may still exist.
- * Performance on highly specific or technical domains may be limited without further fine-tuning.
- * The model has no real-time knowledge; its information is limited to its training data.
- * Safety: Despite data curation efforts, the model might generate offensive, harmful, or inappropriate content. Users should implement appropriate safeguards for downstream applications.
-
- **Out-of-Scope Uses:**
- * Generating high-stakes advice (medical, legal, financial) without human oversight.
- * Impersonation or generating misleading information.
- * Applications that could lead to harm or discrimination.
- * Complex scientific, technical, mathematical, or legal reasoning without further fine-tuning.
- * Any use that violates ethical guidelines or legal standards.
-
-
- <!-- ## Citation
-
- If you use this model in your research or applications, please cite the following paper:
-
- @misc{alif2025,
-   title={},
-   author={},
-   year={},
-   publisher={},
-   howpublished={},
-   note={},
-   url={}
- }
- -->
 
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin

  ---

  This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
  - Library: [More Information Needed]
+ - Docs: [More Information Needed]
config.json CHANGED
@@ -1,5 +1,4 @@
  {
- "model_type": "llama",
  "block_size": 1024,
  "n_embd": 768,
  "n_head": 12,
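
The fields kept in config.json after this commit can be sanity-checked against the "100M" in the model name. A minimal sketch, using an inline stand-in for the repo's config.json (the `n_layer` and `vocab_size` values are not visible in the truncated diff above and are taken from the model card's hyperparameter list; the 12·n_layer·d² rule of thumb for GPT-style blocks is an approximation, not the exact count):

```python
import json

# Inline stand-in for the repo's config.json after this commit
# ("model_type" was removed; n_layer and vocab_size are assumed from
# the model card: 12 layers, 20000-token vocabulary)
config = json.loads("""
{
  "block_size": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "vocab_size": 20000
}
""")
assert "model_type" not in config  # the key this commit removes

# Rough GPT-style parameter estimate: ~12 * n_layer * n_embd^2 for the
# transformer blocks, plus token and position embedding tables
n_params = (
    12 * config["n_layer"] * config["n_embd"] ** 2
    + config["vocab_size"] * config["n_embd"]
    + config["block_size"] * config["n_embd"]
)
print(f"~{n_params / 1e6:.0f}M parameters")  # ≈ 101M, consistent with the 100M name
```

The estimate landing near 100M is a useful cross-check that the config fields and the model card's hyperparameters describe the same network.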