---
license: mit
language:
- ps # Pashto
library_name: transformers
tags:
- text-generation
- pashto
- bloom
- zamai-bloom
datasets:
- tasal9/pashto_base_bloom
pipeline_tag: text-generation
widget:
- text: "پښتو ژبه"
---
# ZamAI Bloom Pashto - checkpoint5207 (and Final Model)
This model card covers `checkpoint5207` and the final fine-tuned version of a Bloom model for Pashto text generation, developed under the ZamAI Bloom project.

**Note on Dataset Identifiers:** The `datasets` field in this card's metadata lists `tasal9/pashto_base_bloom`. That identifier may refer to an earlier version or a different collection of Pashto data; the training run behind the June 2025 update of this model used only the locally processed `datasets/base_pashto_clean`, as described in the Training Details section at the end of this card.
## Model Description
This model is a fine-tuned version of [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) on a Pashto text corpus. The goal of this project was to create a language model proficient in generating coherent and contextually relevant Pashto text.
**Base Model:** `bigscience/bloom-560m`
**Fine-tuning Checkpoint:** `checkpoint5207`
**Final Model:** [tasal9/zamai-bloom-ps-final](https://huggingface.co/tasal9/zamai-bloom-ps-final)
## Intended Uses & Limitations
### Intended Uses
This model is intended for:
* Generating Pashto text.
* Assisting with Pashto language content creation.
* Research in Pashto NLP.
* Educational purposes for Pashto language learning.
### Limitations and Bias
* The model's performance is dependent on the quality and diversity of the training data. It may generate text that reflects biases present in the data.
* It might produce factually incorrect or nonsensical text, especially for complex topics or out-of-domain prompts.
* The model may not be suitable for critical applications without further evaluation and mitigation of potential harms.
* Performance on specific Pashto dialects might vary depending on their representation in the training data.
## How to use
You can use this model with the Hugging Face `transformers` library for text generation.
First, install the library:
```bash
pip install transformers torch
```
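For a quick smoke test, the high-level `pipeline` API is sufficient. This is a minimal sketch; the prompt and `max_length` value are illustrations, not recommended settings:
```python
from transformers import pipeline

# Load the model behind a text-generation pipeline.
generator = pipeline("text-generation", model="tasal9/zamai-bloom-ps-final")

# Prompt: "Pashto language"
print(generator("پښتو ژبه", max_length=50)[0]["generated_text"])
```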
For more control over tokenization and generation parameters, load the tokenizer and model directly:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "tasal9/zamai-bloom-ps-final" # Or the specific checkpoint identifier if using a checkpoint directly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلي په اړه" # Example prompt: "Write a poem in Pashto about spring"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate text
# Adjust generation parameters as needed (max_length, num_beams, do_sample, top_k, top_p, etc.)
outputs = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
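Continuing from the example above, sampling often yields more varied output than beam search; the parameter values below are illustrative starting points, not tuned defaults:
```python
# Sampled generation: trades the determinism of beam search for diversity.
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```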
## Training Data
Describe the dataset(s) used for fine-tuning; for the June 2025 cleaned-base run, see the Training Details section at the end of this card.
* **Source:** [e.g., Web scraped data, specific Pashto corpora, data from `datasets/base_pashto/`]
* **Size:** [e.g., Number of documents, tokens, GBs]
* **Preprocessing:** [e.g., Cleaning steps, tokenization details]
* **Language Variety:** [e.g., Predominant dialects, formal/informal text]
If your dataset is on the Hugging Face Hub, link to it.
## Training Procedure
### Preprocessing
The texts were tokenized using the `AutoTokenizer` associated with the base Bloom model.
[Add any other specific preprocessing steps you took.]
### Fine-tuning
The model was fine-tuned using the Hugging Face `transformers` library with PyTorch.
* **Training script:** [Link to your `train_base_model.py` if applicable]
* **Hyperparameters** (a sketch of how these map onto `transformers.TrainingArguments` follows at the end of this section):
  * Learning rate: 2e-5
  * Batch size: 4 (adjust to GPU memory, e.g., 8 or 16)
  * Number of epochs: 3 (adjust based on convergence and overfitting)
  * Optimizer: AdamW
  * Weight decay: 0.01
  * Warmup steps: 500 (or a warmup ratio, e.g., 0.1)
  * Gradient accumulation steps: 1 (increase if the effective batch size is limited by memory)
  * Seed: 42 (for reproducibility)
* **Infrastructure:**
* Hardware: [e.g., 1x NVIDIA A100 40GB, or specify your hardware]
* Training time: [e.g., X hours]
`checkpoint5207` was saved at step 5207 of the training run; the final model is the result after all training epochs/steps completed.
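For concreteness, here is a minimal sketch of how the hyperparameters above could be wired into the `transformers` Trainer API. The output directory, the toy corpus, and the tokenization settings are hypothetical placeholders; the actual `train_base_model.py` script may differ.
```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Base model to fine-tune.
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy stand-in corpus; the real run used a full Pashto dataset.
texts = ["پښتو ژبه د افغانستان یوه رسمي ژبه ده.", "زه پښتو زده کوم."]
raw = Dataset.from_dict({"text": texts})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_dataset = raw.map(tokenize, batched=True, remove_columns=["text"])

# Hyperparameters from the list above; output_dir is a placeholder,
# and AdamW is the Trainer's default optimizer.
args = TrainingArguments(
    output_dir="models/zamai-bloom-ps",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_steps=500,
    gradient_accumulation_steps=1,
    seed=42,
)

# mlm=False selects the causal-LM collator appropriate for Bloom.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```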
## Evaluation Results
Provide quantitative results if available (e.g., perplexity, BLEU scores on a held-out test set).
* **Test set:** [Describe your test set]
* **Metrics:** [e.g., Perplexity, BLEU, ROUGE]
* **Results for checkpoint5207:**
* [Metric 1]: [Value]
* [Metric 2]: [Value]
* **Results for final model:**
* [Metric 1]: [Value]
* [Metric 2]: [Value]
Qualitative observations can also be included.
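As a starting point, perplexity on held-out text can be computed directly from the model's causal-LM loss. This is a minimal sketch; the single evaluation sentence is a stand-in for a real test set:
```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Stand-in for a real held-out test set.
# "Pashto is an official language of Afghanistan."
eval_text = "پښتو ژبه د افغانستان یوه رسمي ژبه ده."
enc = tokenizer(eval_text, return_tensors="pt")

# For causal LMs, passing input_ids as labels yields the mean
# cross-entropy over predicted tokens; exp(loss) is the perplexity.
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"Perplexity: {math.exp(loss.item()):.2f}")
```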
## Model Card Contact
* **Author:** Yaqoob Tasal
* **Username:** tasal9
* **Organization:** ZamAI
* **GitHub:** [github.com/tasal9](https://github.com/tasal9)
## Citation
If you use this model or its checkpoints, please consider citing:
```bibtex
@misc{zamai_bloom_pashto_2025,
  author       = {Yaqoob Tasal},
  title        = {ZamAI Bloom Pashto - Fine-tuned Language Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/tasal9/zamai-bloom-ps-final}}
}
```
And the original Bloom model:
```bibtex
@article{scao2022bloom,
  title   = {BLOOM: A 176B-Parameter Open-Access Multilingual Language Model},
  author  = {Le Scao, Teven and Fan, Angela and Akiki, Christopher and others},
  journal = {arXiv preprint arXiv:2211.05100},
  year    = {2022}
}
```
---
## Training Details (Cleaned Base Model - June 2025)
This model version was trained from `bigscience/bloom-560m` using the `train_base_model.py` script.
- **Training Data:** The model was trained on a locally prepared dataset at `datasets/base_pashto_clean` (see the loading sketch at the end of this section). The dataset was created with `prepare_base_dataset.py` and is derived from `pashto_data/base_model/cleaned_base_data.txt`, which primarily contains Pashto text from a bilingual Pashto-English glossary.
- **Training Objective:** To establish a foundational Pashto language model with improved coherence and fewer issues (e.g., repetition, off-language generation) than prior versions trained on noisier data.
- **Output Directory (during training):** `models/pashto-bloom-base-clean-colab`
- **Key Training Hyperparameters:**
  - Epochs: 3
  - Per-device batch size: 2
  - Gradient accumulation steps: 4
  - Learning rate: 5e-5
  - FP16 (mixed precision): enabled
  - Optimizer: AdamW
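The on-disk format of `datasets/base_pashto_clean` is not documented here. Assuming `prepare_base_dataset.py` wrote it with `datasets.Dataset.save_to_disk`, it can be reloaded as follows (a sketch, not a confirmed interface):
```python
from datasets import load_from_disk

# Assumes the dataset was written with Dataset.save_to_disk;
# adjust if prepare_base_dataset.py used another format (e.g., JSON lines).
dataset = load_from_disk("datasets/base_pashto_clean")
print(dataset)     # dataset summary (splits, columns, row counts)
print(dataset[0])  # first example
```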