---
license: mit
language:
- ps # Pashto
library_name: transformers
tags:
- text-generation
- pashto
- bloom
- zamai-bloom
datasets:
- tasal9/pashto_base_bloom
pipeline_tag: text-generation
widget:
- text: "پښتو ژبه"
---

**Note on dataset identifiers:** the `datasets` field in the metadata above lists `tasal9/pashto_base_bloom`. That identifier may refer to an earlier version or a different collection of Pashto data; the training run behind this model update (June 2025) used only the locally processed `datasets/base_pashto_clean` dataset, as described in the "Training Details" section below.

# ZamAI Bloom Pashto - checkpoint5207 (and Final Model)

This model card is for `checkpoint5207` and the final fine-tuned version of a Bloom model for Pashto text generation, developed under the ZamAI Bloom project.

## Model Description

This model is a fine-tuned version of [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) on a Pashto text corpus. The goal of this project was to create a language model proficient in generating coherent and contextually relevant Pashto text.

* **Base Model:** [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m)
* **Fine-tuning Checkpoint:** `checkpoint5207`
* **Final Model:** [tasal9/zamai-bloom-ps-final](https://huggingface.co/tasal9/zamai-bloom-ps-final)

## Intended Uses & Limitations

### Intended Uses

This model is intended for:
* Generating Pashto text.
* Assisting with Pashto language content creation.
* Research in Pashto NLP.
* Educational purposes for Pashto language learning.

### Limitations and Bias

* The model's performance is dependent on the quality and diversity of the training data. It may generate text that reflects biases present in the data.
* It might produce factually incorrect or nonsensical text, especially for complex topics or out-of-domain prompts.
* The model may not be suitable for critical applications without further evaluation and mitigation of potential harms.
* Performance on specific Pashto dialects might vary depending on their representation in the training data.

## How to use

You can use this model with the Hugging Face `transformers` library for text generation.

First, install the library:
```bash
pip install transformers torch
```

Then, you can use the model in Python:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final" # Or the specific checkpoint identifier if using a checkpoint directly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلي په اړه" # Example prompt: "Write a poem in Pashto about spring"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
# Adjust generation parameters as needed (max_length, num_beams, do_sample, top_k, top_p, etc.)
outputs = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)
```
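
Beam search, as in the example above, tends toward conservative, repetitive output on open-ended prompts; sampling-based decoding is often preferable for creative generation. A sketch with illustrative, untuned parameter values (not settings validated for this model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("پښتو ژبه", return_tensors="pt")  # "The Pashto language"
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,          # sample from the distribution instead of beam search
    top_k=50,                # consider only the 50 most likely next tokens
    top_p=0.95,              # nucleus sampling: smallest set with 95% cumulative mass
    temperature=0.8,         # <1 sharpens the distribution slightly
    repetition_penalty=1.2,  # discourage verbatim repetition
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

`temperature`, `top_k`, and `top_p` interact, so it is worth varying one at a time when tuning output quality.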

## Training Data

The placeholders below should be filled in with details of the fine-tuning dataset(s). For the June 2025 model version, the dataset actually used is documented in the "Training Details" section at the end of this card.
* **Source:** [e.g., web-scraped data, specific Pashto corpora, data from `datasets/base_pashto/`]
* **Size:** [e.g., number of documents, tokens, GBs]
* **Preprocessing:** [e.g., cleaning steps, tokenization details]
* **Language Variety:** [e.g., predominant dialects, formal/informal text]

If the dataset is on the Hugging Face Hub, link to it here.

## Training Procedure

### Preprocessing

The texts were tokenized using the `AutoTokenizer` associated with the base Bloom model.
[Add any other specific preprocessing steps you took.]
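
The exact preprocessing pipeline is not documented here, but causal-LM fine-tuning with `transformers` typically concatenates the tokenized corpus and splits it into fixed-length blocks. A minimal sketch of that grouping step (the `group_into_blocks` helper and the `block_size` value are illustrative assumptions, not the project's actual code):

```python
def group_into_blocks(token_ids, block_size=512):
    """Concatenate token ids and split them into fixed-length blocks,
    dropping the incomplete tail (the usual causal-LM preprocessing)."""
    total = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, total, block_size)]

# With real data, token_ids would come from the Bloom tokenizer,
# e.g. tokenizer(text)["input_ids"]; dummy ids are used here.
blocks = group_into_blocks(list(range(1050)), block_size=512)
print(len(blocks), len(blocks[0]))  # 2 512  (26 leftover ids dropped)
```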

### Fine-tuning

The model was fine-tuned using the Hugging Face `transformers` library with PyTorch.
* **Training script:** [link to your `train_base_model.py` if applicable]
* **Hyperparameters:**
    * Learning rate: 2e-5
    * Batch size: 4 (adjust to GPU memory; e.g., 8 or 16)
    * Number of epochs: 3 (adjust based on convergence and overfitting)
    * Optimizer: AdamW
    * Weight decay: 0.01
    * Warmup steps: 500 (or a warmup ratio, e.g., 0.1)
    * Gradient accumulation steps: 1 (increase if the per-device batch size is limited by memory)
    * Seed: 42 (for reproducibility)
* **Infrastructure:**
    * Hardware: [e.g., 1x NVIDIA A100 40GB]
    * Training time: [e.g., X hours]
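
The hyperparameters listed above map directly onto `transformers.TrainingArguments`. A sketch under the assumption that the standard `Trainer` API was used; the `output_dir` value is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="models/zamai-bloom-ps",  # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_steps=500,
    gradient_accumulation_steps=1,
    seed=42,
    # AdamW is the default optimizer in recent transformers versions,
    # so no explicit optimizer argument is needed.
)
```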

This specific model card refers to `checkpoint5207`, which was saved at step 5207 of training; the final model is the state reached after all training epochs/steps completed.

## Evaluation Results

Provide quantitative results if available (e.g., perplexity, BLEU scores on a held-out test set).
* **Test set:** [Describe your test set]
* **Metrics:** [e.g., Perplexity, BLEU, ROUGE]
* **Results for checkpoint5207:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]
* **Results for final model:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]

Qualitative observations can also be included.
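
If perplexity is reported, it is conventionally the exponential of the mean cross-entropy loss on the held-out set. A minimal worked example (the loss value below is a made-up illustration, not a measured result for this model):

```python
import math

# Perplexity = exp(mean cross-entropy loss in nats).
# Trainer.evaluate() reports this loss as `eval_loss`.
mean_eval_loss = 2.0  # illustrative value only
perplexity = math.exp(mean_eval_loss)
print(round(perplexity, 3))  # 7.389
```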

## Model Card Contact

**Author:** Yaqoob Tasal  
**Username:** tasal9  
**Organization:** ZamAI  
[GitHub: https://github.com/tasal9](https://github.com/tasal9)

## Citation

If you use this model or its checkpoints, please consider citing:

```bibtex
@misc{zamai_bloom_pashto_2025,
  author    = {Yaqoob Tasal},
  title     = {ZamAI Bloom Pashto - Fine-tuned Language Model},
  year      = {2025},
  publisher = {Hugging Face},
  journal   = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/tasal9/zamai-bloom-ps-final}}
}
```

And the original Bloom model:
```bibtex
@article{scao2022bloom,
  title={BLOOM: A 176B-Parameter Open-Access Multilingual Language Model},
  author={Le Scao, Teven and Fan, Angela and Akiki, Christopher and others},
  journal={arXiv preprint arXiv:2211.05100},
  year={2022}
}
```

---

Remember to replace placeholders like dataset details, hyperparameters, and evaluation results with your actual project details. Save this as a `README.md` file in your model repository on the Hugging Face Hub.

## Training Details (Cleaned Base Model - June 2025)

This model version was trained from `bigscience/bloom-560m` using the `train_base_model.py` script.

- **Training Data:** The model was trained on a locally prepared dataset located at `datasets/base_pashto_clean`. This dataset was created using `prepare_base_dataset.py` and is derived from `pashto_data/base_model/cleaned_base_data.txt`, which primarily contains Pashto text from a bilingual Pashto-English glossary.
- **Training Objective:** To establish a foundational Pashto language model with improved coherence and reduced issues (e.g., repetition, off-language generation) compared to any prior versions trained on noisier data.
- **Output Directory (during training):** `models/pashto-bloom-base-clean-colab`
- **Key Training Hyperparameters:**
    - Epochs: 3
    - Per Device Batch Size: 2
    - Gradient Accumulation Steps: 4
    - Learning Rate: 5e-5
    - FP16 (Mixed Precision): True
    - Optimizer: AdamW
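
With these settings, the effective batch size per optimizer update is the per-device batch size times the gradient accumulation steps (assuming a single device, which the card does not state explicitly):

```python
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1  # assumption; not stated in the card

# Gradients from 4 micro-batches of 2 are accumulated before each update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 8
```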