---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
license: apache-2.0
language:
- ur
pipeline_tag: text-generation
---

This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration.

# ALIF Base 100M

**ALIF Base 100M** is an Urdu generative language model from the **ALIF الف** series (a Final Year Project at Habib University), developed by **Orature AI**.

## Model Details

* **Developed by:** Orature AI (S.M Ali Naqvi, Zainab Haider, Haya Fatima, Ali M Asad, Hammad Sajid)
* **Supervised by:** Dr. Abdul Samad (Habib University)
* **Model type:** Decoder-only Transformer, GPT-like
* **Variant:** ALIF-Base-100M
* **Language(s) (NLP):** Urdu (ur)
* **License:** Apache 2.0
* **Architecture:** Transformer (GPT-based)
* **Framework:** PyTorch
* **Tokenizer:** Custom SentencePiece tokenizer
* **Hyperparameters:**
  * **Vocabulary Size:** 32,000
  * **Embedding Size:** 768
  * **Attention Heads:** 12
  * **Layers:** 12

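As a rough sanity check, these hyperparameters are consistent with the ~100M parameter count in the model's name. The sketch below uses a standard GPT-style estimate (4·d² attention weights and an 8·d² feed-forward block per layer, assuming a 4·d hidden width and a token embedding tied to the output head); the exact count depends on implementation details not listed here, such as biases and positional embeddings.

```python
# Rough parameter estimate from the listed hyperparameters.
# Assumptions (not stated in the card): 4*d feed-forward width,
# tied input/output embeddings; biases, layer norms, and positional
# embeddings are ignored as small corrections.
vocab_size = 32_000
d = 768        # embedding size
n_layers = 12

embedding = vocab_size * d               # token embedding (tied with head)
per_layer = 4 * d * d + 8 * d * d        # attention + feed-forward weights
total = embedding + n_layers * per_layer

print(f"~{total / 1e6:.0f}M parameters")  # on the order of 100M
```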
## How to Get Started with the Model

First, download the `modeling_gpt.py` file from the repository. Then, in a separate file, use the following code to generate text with the model:

```python
from modeling_gpt import GPTLanguageModel
from transformers import AutoTokenizer
import torch

model_name = "orature/ALIF-Base-100M"
model = GPTLanguageModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode an Urdu prompt and add a batch dimension
prompt_urdu = "ایک دفعہ کا ذکر ہے کہ "  # "Once upon a time, "
inputs = tokenizer.encode(prompt_urdu)
inputs_tensor = torch.tensor(inputs).unsqueeze(0)

# Generate text; `generate` returns a batch of token ids
outputs = model.generate(inputs_tensor, max_new_tokens=64, temperature=0.7)
generated_text = tokenizer.decode(outputs[0].tolist())

print(f"Prompt: {prompt_urdu}")
print(f"Generated Text: {generated_text}")
```

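The `temperature` argument controls how sharply the model's next-token distribution is peaked before sampling: values below 1 favor high-probability tokens, while values above 1 flatten the distribution. Below is a minimal pure-Python sketch of temperature sampling; it illustrates the idea only, and the model's actual `generate` implementation may differ (e.g., by also applying top-k filtering).

```python
import math
import random

def sample_with_temperature(logits, temperature=0.7, rng=random):
    """Sample an index from raw logits after temperature scaling."""
    scaled = [l / temperature for l in logits]
    # Softmax with max-subtraction for numerical stability
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index according to the resulting distribution
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

# A low temperature concentrates probability mass on the highest logit
logits = [2.0, 1.0, 0.1]
random.seed(0)
samples = [sample_with_temperature(logits, temperature=0.2) for _ in range(100)]
print(samples.count(0))  # most draws pick the top token
```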
## Model Description

**ALIF Base 100M** is designed to generate coherent and contextually relevant Urdu text. It uses a custom Urdu tokenizer trained on the ALIF-Urdu-Corpus and was pretrained on a large, diverse corpus of Urdu text.

**Key Features:**

* Optimized for the nuances of the Urdu language.
* Provides a strong foundation for further fine-tuning.
* Generates the next tokens in a sequence, making it suitable for a variety of text generation tasks.
* Part of a series aiming to provide efficient and accessible small language models (SLMs) for Urdu.

## Intended Uses & Limitations

**Intended Uses:**

* **Text Generation:** Creative writing, content generation, and story completion in Urdu.
* **Research:** A base for further research in Urdu NLP and low-resource language modeling.
* **Fine-tuning:** Can be fine-tuned for specific downstream tasks such as sentiment analysis, summarization, or domain-specific chatbots in Urdu.
* **Educational Purposes:** Understanding SLM behavior for Urdu.

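The fine-tuning use case follows a standard causal language modeling loop: predict each token from the ones before it and minimize cross-entropy against the shifted targets. The snippet below is an illustrative sketch only: it trains a tiny stand-in model on random token ids, whereas a real run would load the model with `GPTLanguageModel.from_pretrained(...)` and feed batches of tokenized Urdu text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Tiny stand-in model (embedding -> linear head), for illustration only;
# the real workflow would fine-tune the pretrained GPTLanguageModel.
vocab_size = 256  # toy vocabulary; ALIF uses 32,000
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Toy "dataset": random token ids standing in for tokenized Urdu text
batch = torch.randint(0, vocab_size, (4, 32))

losses = []
for step in range(50):
    logits = model(batch[:, :-1])           # predict the next token
    loss = F.cross_entropy(
        logits.reshape(-1, vocab_size),     # (batch * time, vocab)
        batch[:, 1:].reshape(-1),           # targets shifted by one
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```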
**Limitations:**

* The model is trained primarily on Urdu and may not perform well on other languages or code-switched text unless specifically designed for it (e.g., an Ur-En variant).
* As a base generative model, it may generate plausible-sounding but incorrect or nonsensical information (hallucinations).
* The model may reflect biases present in the training data. The ALIF-Urdu-Corpus was curated from diverse sources, but biases (e.g., societal, gender, regional) may still exist.
* Performance on highly specific or technical domains may be limited without further fine-tuning.
* The model has no real-time knowledge; its information is limited to its training data.
* **Safety:** Although efforts were made to curate the data, the model might still generate offensive, harmful, or inappropriate content. Users should implement appropriate safeguards for downstream applications.

**Out-of-Scope Uses:**

* Generating high-stakes advice (medical, legal, financial) without human oversight.
* Impersonation or generating misleading information.
* Applications that could lead to harm or discrimination.
* Complex scientific, technical, mathematical, or legal reasoning without further fine-tuning.
* Any use that violates ethical guidelines or legal standards.