---
license: mit
language:
- ps
library_name: transformers
tags:
- text-generation
- pashto
- bloom
- zamai-bloom
datasets:
- tasal9/pashto_base_bloom
pipeline_tag: text-generation
widget:
- text: "پښتو ژبه"
---

# ZamAI Bloom Pashto - checkpoint5207 (and Final Model)

This model card is for `checkpoint5207` and the final fine-tuned version of a Bloom model for Pashto text generation, developed under the ZamAI Bloom project.

## Model Description

This model is a fine-tuned version of [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) on a Pashto text corpus. The goal of this project was to create a language model proficient in generating coherent and contextually relevant Pashto text.

**Base Model:** `bigscience/bloom-560m`

**Fine-tuning Checkpoint:** `checkpoint5207`

**Final Model:** [tasal9/zamai-bloom-ps-final](https://huggingface.co/tasal9/zamai-bloom-ps-final)

## Intended Uses & Limitations

### Intended Uses

This model is intended for:

* Generating Pashto text.
* Assisting with Pashto language content creation.
* Research in Pashto NLP.
* Educational purposes for Pashto language learning.

### Limitations and Bias

* The model's performance depends on the quality and diversity of the training data, and it may generate text that reflects biases present in that data.
* It might produce factually incorrect or nonsensical text, especially for complex topics or out-of-domain prompts.
* The model may not be suitable for critical applications without further evaluation and mitigation of potential harms.
* Performance on specific Pashto dialects may vary depending on their representation in the training data.

## How to use

You can use this model with the Hugging Face `transformers` library for text generation.

First, install the library:

```bash
pip install transformers torch
```

Then, you can use the model in Python:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final"  # or a specific checkpoint identifier if using a checkpoint directly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلي په اړه"  # Example prompt: "Write a poem in Pashto about spring"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text; adjust parameters as needed (max_length, num_beams, do_sample, top_k, top_p, etc.).
outputs = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)
```
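
For more varied output, beam search can be swapped for sampling. The parameter values below are illustrative defaults rather than settings tuned for this model:

```python
# Sampling-based generation; top_k, top_p, and temperature are illustrative values.
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```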

## Training Data

Details of the dataset(s) used for fine-tuning:

* **Source:** [e.g., web-scraped data, specific Pashto corpora, data from `datasets/base_pashto/`]
* **Size:** [e.g., number of documents, tokens, GBs]
* **Preprocessing:** [e.g., cleaning steps, tokenization details]
* **Language Variety:** [e.g., predominant dialects, formal/informal text]

The `datasets` field in the metadata lists `tasal9/pashto_base_bloom`; see the note on dataset identifiers in the Training Details section below.

## Training Procedure

### Preprocessing

The texts were tokenized using the `AutoTokenizer` associated with the base Bloom model.

[Add any other specific preprocessing steps you took.]
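
A minimal sketch of this tokenization step follows; the truncation length is an illustrative assumption, not a documented setting:

```python
from transformers import AutoTokenizer

# Tokenizer associated with the base Bloom model.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

def tokenize(batch):
    # max_length=512 is an illustrative choice for this sketch.
    return tokenizer(batch["text"], truncation=True, max_length=512)
```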

### Fine-tuning

The model was fine-tuned using the Hugging Face `transformers` library with PyTorch.

* **Training script:** [Link to your `train_base_model.py` if applicable]
* **Hyperparameters** (mirrored in the sketch after this list):
  * Learning rate: 2e-5
  * Batch size: 4 (per device; adjust for available GPU memory, e.g., 8 or 16)
  * Number of epochs: 3 (adjust based on convergence and overfitting)
  * Optimizer: AdamW
  * Weight decay: 0.01
  * Warmup steps: 500 (or a warmup ratio, e.g., 0.1)
  * Gradient accumulation steps: 1 (increase if the per-device batch size is limited by memory)
  * Seed: 42 (for reproducibility)
* **Infrastructure:**
  * Hardware: [e.g., 1x NVIDIA A100 40GB]
  * Training time: [e.g., X hours]
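
As a minimal sketch, these hyperparameters map onto `transformers.TrainingArguments` roughly as follows; `output_dir` is a placeholder and the dataset/model wiring is omitted:

```python
from transformers import TrainingArguments

# Values mirror the hyperparameter list above; output_dir is hypothetical.
training_args = TrainingArguments(
    output_dir="zamai-bloom-ps-finetune",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_steps=500,
    gradient_accumulation_steps=1,
    seed=42,
)
# AdamW is the default optimizer used by transformers.Trainer.
```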

This model card covers `checkpoint5207`, which was saved at step 5207 of the training process. The final model is the result after all training epochs/steps completed.

## Evaluation Results

Quantitative results (e.g., perplexity or BLEU on a held-out test set) are to be reported here:

* **Test set:** [Describe your test set]
* **Metrics:** [e.g., perplexity, BLEU, ROUGE]
* **Results for checkpoint5207:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]
* **Results for final model:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]

Qualitative observations can also be included.
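
As a hedged sketch, perplexity on held-out Pashto text can be estimated from the model's causal-LM loss; the text variable below is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

held_out_text = "پښتو ژبه"  # replace with real held-out text
inputs = tokenizer(held_out_text, return_tensors="pt")
with torch.no_grad():
    # Passing input_ids as labels yields the shifted cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
perplexity = torch.exp(loss).item()
print(perplexity)
```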

## Model Card Contact

**Author:** Yaqoob Tasal

**Username:** tasal9

**Organization:** ZamAI

**GitHub:** [tasal9](https://github.com/tasal9)

## Citation

If you use this model or its checkpoints, please consider citing:

```bibtex
@misc{zamai_bloom_pashto_2025,
  author = {Yaqoob Tasal},
  title = {ZamAI Bloom Pashto - Fine-tuned Language Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/tasal9/zamai-bloom-ps-final}}
}
```

And the original Bloom model:

```bibtex
@article{scao2022bloom,
  title={BLOOM: A 176B-Parameter Open-Access Multilingual Language Model},
  author={Scao, Teven Le and Fan, Angela and Akiki, Christopher and Pavlick, Ellie and Ili{\'c}, Suzana and others},
  journal={arXiv preprint arXiv:2211.05100},
  year={2022}
}
```

---

## Training Details (Cleaned Base Model - June 2025)

This model version was trained from `bigscience/bloom-560m` using the `train_base_model.py` script.

- **Training Data:** The model was trained on a locally prepared dataset located at `datasets/base_pashto_clean`. This dataset was created using `prepare_base_dataset.py` and is derived from `pashto_data/base_model/cleaned_base_data.txt`, which primarily contains Pashto text from a bilingual Pashto-English glossary.
- **Training Objective:** To establish a foundational Pashto language model with improved coherence and fewer issues (e.g., repetition, off-language generation) than prior versions trained on noisier data.
- **Output Directory (during training):** `models/pashto-bloom-base-clean-colab`
- **Key Training Hyperparameters:**
  - Epochs: 3
  - Per-Device Batch Size: 2
  - Gradient Accumulation Steps: 4 (effective batch size of 8 per device)
  - Learning Rate: 5e-5
  - FP16 (Mixed Precision): True
  - Optimizer: AdamW

**Note on Dataset Identifiers:** The `datasets` field in this model card's metadata lists `tasal9/pashto_base_bloom`. That identifier may refer to an earlier version or a different collection of Pashto data; the specific training run culminating in this June 2025 update exclusively used the locally processed `datasets/base_pashto_clean` described above.
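
A hedged sketch of how the cleaned corpus could be materialized with the `datasets` library is shown below; the actual logic of `prepare_base_dataset.py` is not reproduced here, so treat this as an assumption about its shape:

```python
from datasets import load_dataset

# Load the cleaned Pashto corpus as a line-per-example text dataset.
dataset = load_dataset("text", data_files="pashto_data/base_model/cleaned_base_data.txt")

# Persist it locally so training scripts can load it from disk.
dataset.save_to_disk("datasets/base_pashto_clean")
```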