---
license: mit
language:
- ps # Pashto
library_name: transformers
tags:
- text-generation
- pashto
- bloom
- zamai-bloom
datasets:
- tasal9/pashto_base_bloom
pipeline_tag: text-generation
widget:
- text: "پښتو ژبه"
---
# ZamAI Bloom Pashto - checkpoint5207 (and Final Model)
This model card is for `checkpoint5207` and the final fine-tuned version of a Bloom model for Pashto text generation, developed under the ZamAI Bloom project.
## Model Description
This model is a fine-tuned version of [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) on a Pashto text corpus. The goal of this project was to create a language model proficient in generating coherent and contextually relevant Pashto text.
* **Base Model:** `bigscience/bloom-560m`
* **Fine-tuning Checkpoint:** `checkpoint5207`
* **Final Model:** [tasal9/zamai-bloom-ps-final](https://huggingface.co/tasal9/zamai-bloom-ps-final)
## Intended Uses & Limitations
### Intended Uses
This model is intended for:
* Generating Pashto text.
* Assisting with Pashto language content creation.
* Research in Pashto NLP.
* Educational purposes for Pashto language learning.
### Limitations and Bias
* The model's performance is dependent on the quality and diversity of the training data. It may generate text that reflects biases present in the data.
* It might produce factually incorrect or nonsensical text, especially for complex topics or out-of-domain prompts.
* The model may not be suitable for critical applications without further evaluation and mitigation of potential harms.
* Performance on specific Pashto dialects might vary depending on their representation in the training data.
## How to use
You can use this model with the Hugging Face `transformers` library for text generation.
First, install the library:
```bash
pip install transformers torch
```
Then, you can use the model in Python:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "tasal9/zamai-bloom-ps-final" # Or the specific checkpoint identifier if using a checkpoint directly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلي په اړه" # Example prompt: "Write a poem in Pashto about spring"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate text
# Adjust generation parameters as needed (max_length, num_beams, do_sample, top_k, top_p, etc.)
outputs = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
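Beam search (as above) produces conservative, deterministic output; for open-ended prompts, sampling is often a better fit. The minimal variation below reuses `model` and `inputs` from the snippet above; the parameter values are common starting points, not values tuned for this model:
```python
# Sampling-based generation: more diverse output than beam search.
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,   # sample from the distribution instead of beam search
    top_k=50,         # keep only the 50 most likely next tokens
    top_p=0.95,       # nucleus sampling threshold
    temperature=0.8,  # <1.0 sharpens the distribution slightly
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```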
## Training Data
The most recent training run (June 2025) used a locally prepared dataset:
* **Source:** `datasets/base_pashto_clean`, built with `prepare_base_dataset.py` from `pashto_data/base_model/cleaned_base_data.txt`, which primarily contains Pashto text from a bilingual Pashto-English glossary.
* **Size:** [Number of documents, tokens, GBs]
* **Preprocessing:** Cleaning via `prepare_base_dataset.py`; tokenization details are given under Training Procedure below.
* **Language Variety:** [Predominant dialects, formal/informal text]

**Note on dataset identifiers:** The `datasets` field in this card's metadata lists `tasal9/pashto_base_bloom`, which may refer to an earlier version or a different collection of Pashto data. The training run culminating in this model update (June 2025) exclusively used the locally processed `datasets/base_pashto_clean` described above and in the Training Details section at the end of this card.
## Training Procedure
### Preprocessing
The texts were tokenized using the `AutoTokenizer` associated with the base Bloom model; a sketch of typical causal-LM preprocessing is shown below.
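As a hedged illustration only (the block size and the concatenate-and-chunk strategy are assumptions, not details confirmed by the training script), causal-LM preprocessing often looks like this:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

def tokenize_and_chunk(texts, block_size=512):
    """Tokenize raw Pashto texts and pack them into fixed-length blocks."""
    ids = []
    for text in texts:
        # Concatenate all documents, separated by the EOS token.
        ids.extend(tokenizer(text)["input_ids"])
        ids.append(tokenizer.eos_token_id)
    # Drop the trailing remainder so every block has exactly block_size tokens.
    n_blocks = len(ids) // block_size
    return [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```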
### Fine-tuning
The model was fine-tuned using the Hugging Face `transformers` library with PyTorch; a hedged sketch of a matching `Trainer` setup follows the list.
* **Training script:** [Link to `train_base_model.py` if applicable]
* **Hyperparameters:**
  * Learning rate: 2e-5
  * Per-device batch size: 4
  * Number of epochs: 3
  * Optimizer: AdamW
  * Weight decay: 0.01
  * Warmup steps: 500
  * Gradient accumulation steps: 1
  * Seed: 42
* **Infrastructure:**
  * Hardware: [e.g., 1x NVIDIA A100 40GB]
  * Training time: [X hours]
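As an illustration (this is not the project's actual `train_base_model.py`), the hyperparameters above map onto a `Trainer` configuration roughly like the following; the tiny stand-in corpus and output path are placeholders:
```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tiny stand-in corpus; in practice, the tokenized Pashto training set.
texts = ["پښتو ژبه زده کړئ.", "زما هیواد افغانستان دی."]
train_dataset = Dataset.from_dict(tokenizer(texts))

args = TrainingArguments(
    output_dir="zamai-bloom-ps",  # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_steps=500,
    gradient_accumulation_steps=1,
    seed=42,
)

# mlm=False configures standard causal-LM (next-token) training;
# Trainer's default optimizer is AdamW, matching the list above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```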
This model card covers `checkpoint5207`, saved at step 5207 of training; the final model is the state after all training epochs/steps completed.
## Evaluation Results
Quantitative results (e.g., perplexity, BLEU, or ROUGE on a held-out Pashto test set) have not yet been reported; the placeholders below mark what is to be filled in:
* **Test set:** [Describe the held-out test set]
* **Metrics:** [e.g., Perplexity, BLEU, ROUGE]
* **Results for checkpoint5207:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]
* **Results for final model:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]

Qualitative observations can also be included. A minimal perplexity sketch follows the list.
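The sketch below computes an approximate token-level perplexity with the fine-tuned model; the evaluation sentence is a placeholder, not a real test set:
```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Placeholder held-out sentence; replace with a real test set.
eval_texts = ["پښتو ژبه د افغانستان یوه رسمي ژبه ده."]

total_loss, total_tokens = 0.0, 0
with torch.no_grad():
    for text in eval_texts:
        enc = tokenizer(text, return_tensors="pt")
        # With labels set, the model returns the mean cross-entropy loss
        # (averaged over the shifted tokens; weighting by n is approximate).
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].size(1)
        total_loss += out.loss.item() * n
        total_tokens += n

print(f"Perplexity: {math.exp(total_loss / total_tokens):.2f}")
```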
## Model Card Contact
* **Author:** Yaqoob Tasal
* **Username:** tasal9
* **Organization:** ZamAI
* **GitHub:** [github.com/tasal9](https://github.com/tasal9)
## Citation
If you use this model or its checkpoints, please consider citing:
```bibtex
@misc{zamai_bloom_pashto_2025,
  author       = {Yaqoob Tasal},
  title        = {ZamAI Bloom Pashto - Fine-tuned Language Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/tasal9/zamai-bloom-ps-final}}
}
```
And the original Bloom model:
```bibtex
@article{scao2022bloom,
  title   = {BLOOM: A 176B-Parameter Open-Access Multilingual Language Model},
  author  = {Le Scao, Teven and Fan, Angela and Akiki, Christopher and others},
  journal = {arXiv preprint arXiv:2211.05100},
  year    = {2022}
}
```
## Training Details (Cleaned Base Model - June 2025)
This model version was trained from `bigscience/bloom-560m` using the `train_base_model.py` script.
- **Training Data:** The model was trained on a locally prepared dataset located at `datasets/base_pashto_clean`. This dataset was created using `prepare_base_dataset.py` and is derived from `pashto_data/base_model/cleaned_base_data.txt`, which primarily contains Pashto text from a bilingual Pashto-English glossary.
- **Training Objective:** To establish a foundational Pashto language model with improved coherence and reduced issues (e.g., repetition, off-language generation) compared to any prior versions trained on noisier data.
- **Output Directory (during training):** `models/pashto-bloom-base-clean-colab`
- **Key Training Hyperparameters:** (a hedged configuration sketch follows the list)
  - Epochs: 3
  - Per-Device Batch Size: 2
  - Gradient Accumulation Steps: 4
  - Learning Rate: 5e-5
  - FP16 (Mixed Precision): True
  - Optimizer: AdamW
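A minimal sketch of how this run's configuration might look in code, assuming the cleaned dataset was saved with the `datasets` library's `save_to_disk` (the two paths come from this card; everything else is illustrative, not the actual `train_base_model.py`):
```python
from datasets import load_from_disk
from transformers import TrainingArguments

# Path from this card; assumes the dataset was saved via save_to_disk().
train_dataset = load_from_disk("datasets/base_pashto_clean")

args = TrainingArguments(
    output_dir="models/pashto-bloom-base-clean-colab",  # from this card
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size: 2 x 4 = 8
    learning_rate=5e-5,
    fp16=True,                      # mixed-precision training
)
```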