---
license: mit
language:
- ps # Pashto
library_name: transformers
tags:
- text-generation
- pashto
- bloom
- zamai-bloom
datasets:
- tasal9/pashto_base_bloom
pipeline_tag: text-generation
widget:
- text: "پښتو ژبه"
---
# ZamAI Bloom Pashto - checkpoint5207 (and Final Model)
This model card is for `checkpoint5207` and the final fine-tuned version of a Bloom model for Pashto text generation, developed under the ZamAI Bloom project.
## Model Description
This model is a fine-tuned version of [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) on a Pashto text corpus. The goal of this project was to create a language model proficient in generating coherent and contextually relevant Pashto text.
* **Base Model:** `bigscience/bloom-560m`
* **Fine-tuning Checkpoint:** `checkpoint5207`
* **Final Model:** [tasal9/zamai-bloom-ps-final](https://huggingface.co/tasal9/zamai-bloom-ps-final)
## Intended Uses & Limitations
### Intended Uses
This model is intended for:
* Generating Pashto text.
* Assisting with Pashto language content creation.
* Research in Pashto NLP.
* Educational purposes for Pashto language learning.
### Limitations and Bias
* The model's performance is dependent on the quality and diversity of the training data. It may generate text that reflects biases present in the data.
* It might produce factually incorrect or nonsensical text, especially for complex topics or out-of-domain prompts.
* The model may not be suitable for critical applications without further evaluation and mitigation of potential harms.
* Performance on specific Pashto dialects might vary depending on their representation in the training data.
## How to use
You can use this model with the Hugging Face `transformers` library for text generation.
First, install the library:
```bash
pip install transformers torch
```
Then, you can use the model in Python:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "tasal9/zamai-bloom-ps-final" # Or the specific checkpoint identifier if using a checkpoint directly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلي په اړه" # Example prompt: "Write a poem in Pashto about spring"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate text
# Adjust generation parameters as needed (max_length, num_beams, do_sample, top_k, top_p, etc.)
outputs = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
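Beam search (as above) produces conservative, deterministic output; for open-ended prompts, sampling is often a better fit. The minimal variation below reuses `model` and `inputs` from the snippet above; the parameter values are common starting points, not values tuned for this model:
```python
# Sampling-based generation: more diverse output than beam search.
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,   # sample from the distribution instead of beam search
    top_k=50,         # keep only the 50 most likely next tokens
    top_p=0.95,       # nucleus sampling threshold
    temperature=0.8,  # <1.0 sharpens the distribution slightly
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```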
## Training Data
The most recent training run (June 2025) used a locally prepared dataset:
* **Source:** `datasets/base_pashto_clean`, built with `prepare_base_dataset.py` from `pashto_data/base_model/cleaned_base_data.txt`, which primarily contains Pashto text from a bilingual Pashto-English glossary.
* **Size:** [Number of documents, tokens, GBs]
* **Preprocessing:** Cleaning via `prepare_base_dataset.py`; tokenization details are given under Training Procedure below.
* **Language Variety:** [Predominant dialects, formal/informal text]

**Note on dataset identifiers:** The `datasets` field in this card's metadata lists `tasal9/pashto_base_bloom`, which may refer to an earlier version or a different collection of Pashto data. The training run culminating in this model update (June 2025) exclusively used the locally processed `datasets/base_pashto_clean` described above and in the Training Details section at the end of this card.
## Training Procedure
### Preprocessing
The texts were tokenized using the `AutoTokenizer` associated with the base Bloom model; a sketch of typical causal-LM preprocessing is shown below.
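As a hedged illustration only (the block size and the concatenate-and-chunk strategy are assumptions, not details confirmed by the training script), causal-LM preprocessing often looks like this:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

def tokenize_and_chunk(texts, block_size=512):
    """Tokenize raw Pashto texts and pack them into fixed-length blocks."""
    ids = []
    for text in texts:
        # Concatenate all documents, separated by the EOS token.
        ids.extend(tokenizer(text)["input_ids"])
        ids.append(tokenizer.eos_token_id)
    # Drop the trailing remainder so every block has exactly block_size tokens.
    n_blocks = len(ids) // block_size
    return [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```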
### Fine-tuning
The model was fine-tuned using the Hugging Face `transformers` library with PyTorch; a hedged sketch of a matching `Trainer` setup follows the list.
* **Training script:** [Link to `train_base_model.py` if applicable]
* **Hyperparameters:**
  * Learning rate: 2e-5
  * Per-device batch size: 4
  * Number of epochs: 3
  * Optimizer: AdamW
  * Weight decay: 0.01
  * Warmup steps: 500
  * Gradient accumulation steps: 1
  * Seed: 42
* **Infrastructure:**
  * Hardware: [e.g., 1x NVIDIA A100 40GB]
  * Training time: [X hours]
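As an illustration (this is not the project's actual `train_base_model.py`), the hyperparameters above map onto a `Trainer` configuration roughly like the following; the tiny stand-in corpus and output path are placeholders:
```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tiny stand-in corpus; in practice, the tokenized Pashto training set.
texts = ["پښتو ژبه زده کړئ.", "زما هیواد افغانستان دی."]
train_dataset = Dataset.from_dict(tokenizer(texts))

args = TrainingArguments(
    output_dir="zamai-bloom-ps",  # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_steps=500,
    gradient_accumulation_steps=1,
    seed=42,
)

# mlm=False configures standard causal-LM (next-token) training;
# Trainer's default optimizer is AdamW, matching the list above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```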
This model card covers `checkpoint5207`, saved at step 5207 of training; the final model is the state after all training epochs/steps completed.
## Evaluation Results
Quantitative results (e.g., perplexity, BLEU, or ROUGE on a held-out Pashto test set) have not yet been reported; the placeholders below mark what is to be filled in:
* **Test set:** [Describe the held-out test set]
* **Metrics:** [e.g., Perplexity, BLEU, ROUGE]
* **Results for checkpoint5207:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]
* **Results for final model:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]

Qualitative observations can also be included. A minimal perplexity sketch follows the list.
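The sketch below computes an approximate token-level perplexity with the fine-tuned model; the evaluation sentence is a placeholder, not a real test set:
```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Placeholder held-out sentence; replace with a real test set.
eval_texts = ["پښتو ژبه د افغانستان یوه رسمي ژبه ده."]

total_loss, total_tokens = 0.0, 0
with torch.no_grad():
    for text in eval_texts:
        enc = tokenizer(text, return_tensors="pt")
        # With labels set, the model returns the mean cross-entropy loss
        # (averaged over the shifted tokens; weighting by n is approximate).
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].size(1)
        total_loss += out.loss.item() * n
        total_tokens += n

print(f"Perplexity: {math.exp(total_loss / total_tokens):.2f}")
```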
## Model Card Contact
* **Author:** Yaqoob Tasal
* **Username:** tasal9
* **Organization:** ZamAI
* **GitHub:** [github.com/tasal9](https://github.com/tasal9)
## Citation
If you use this model or its checkpoints, please consider citing:
```bibtex
@misc{zamai_bloom_pashto_2025,
  author       = {Yaqoob Tasal},
  title        = {ZamAI Bloom Pashto - Fine-tuned Language Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/tasal9/zamai-bloom-ps-final}}
}
```
And the original Bloom model:
```bibtex
@article{scao2022bloom,
  title   = {BLOOM: A 176B-Parameter Open-Access Multilingual Language Model},
  author  = {Le Scao, Teven and Fan, Angela and Akiki, Christopher and others},
  journal = {arXiv preprint arXiv:2211.05100},
  year    = {2022}
}
```
## Training Details (Cleaned Base Model - June 2025)
This model version was trained from `bigscience/bloom-560m` using the `train_base_model.py` script.
- **Training Data:** The model was trained on a locally prepared dataset located at `datasets/base_pashto_clean`. This dataset was created using `prepare_base_dataset.py` and is derived from `pashto_data/base_model/cleaned_base_data.txt`, which primarily contains Pashto text from a bilingual Pashto-English glossary.
- **Training Objective:** To establish a foundational Pashto language model with improved coherence and reduced issues (e.g., repetition, off-language generation) compared to any prior versions trained on noisier data.
- **Output Directory (during training):** `models/pashto-bloom-base-clean-colab`
- **Key Training Hyperparameters:** (a hedged configuration sketch follows the list)
  - Epochs: 3
  - Per-Device Batch Size: 2
  - Gradient Accumulation Steps: 4
  - Learning Rate: 5e-5
  - FP16 (Mixed Precision): True
  - Optimizer: AdamW
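A minimal sketch of how this run's configuration might look in code, assuming the cleaned dataset was saved with the `datasets` library's `save_to_disk` (the two paths come from this card; everything else is illustrative, not the actual `train_base_model.py`):
```python
from datasets import load_from_disk
from transformers import TrainingArguments

# Path from this card; assumes the dataset was saved via save_to_disk().
train_dataset = load_from_disk("datasets/base_pashto_clean")

args = TrainingArguments(
    output_dir="models/pashto-bloom-base-clean-colab",  # from this card
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size: 2 x 4 = 8
    learning_rate=5e-5,
    fp16=True,                      # mixed-precision training
)
```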