---
license: mit
language:
- ps # Pashto
library_name: transformers
tags:
- text-generation
- pashto
- bloom
- zamai-bloom
datasets:
- tasal9/pashto_base_bloom
pipeline_tag: text-generation
widget:
- text: "پښتو ژبه"
---

**Note on dataset identifiers:** the `datasets` field in the metadata above lists `tasal9/pashto_base_bloom`. That identifier may refer to an earlier version or a different collection of Pashto data; the training run behind this model update (June 2025) used only the locally processed `datasets/base_pashto_clean` dataset, as described in the "Training Details" section below.

# ZamAI Bloom Pashto - checkpoint5207 (and Final Model)

This model card is for `checkpoint5207` and the final fine-tuned version of a Bloom model for Pashto text generation, developed under the ZamAI Bloom project.

## Model Description

This model is a fine-tuned version of [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) on a Pashto text corpus. The goal of this project was to create a language model proficient in generating coherent and contextually relevant Pashto text.

* **Base Model:** [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m)
* **Fine-tuning Checkpoint:** `checkpoint5207`
* **Final Model:** [tasal9/zamai-bloom-ps-final](https://huggingface.co/tasal9/zamai-bloom-ps-final)

## Intended Uses & Limitations

### Intended Uses

This model is intended for:
* Generating Pashto text.
* Assisting with Pashto language content creation.
* Research in Pashto NLP.
* Educational purposes for Pashto language learning.

### Limitations and Bias

* The model's performance is dependent on the quality and diversity of the training data. It may generate text that reflects biases present in the data.
* It might produce factually incorrect or nonsensical text, especially for complex topics or out-of-domain prompts.
* The model may not be suitable for critical applications without further evaluation and mitigation of potential harms.
* Performance on specific Pashto dialects might vary depending on their representation in the training data.

## How to use

You can use this model with the Hugging Face `transformers` library for text generation.

First, install the library:
```bash
pip install transformers torch
```

Then, you can use the model in Python:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final" # Or the specific checkpoint identifier if using a checkpoint directly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلي په اړه" # Example prompt: "Write a poem in Pashto about spring"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
# Adjust generation parameters as needed (max_length, num_beams, do_sample, top_k, top_p, etc.)
outputs = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)
```
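
Beam search, as in the example above, tends toward conservative, repetitive output on open-ended prompts; sampling-based decoding is often preferable for creative generation. A sketch with illustrative, untuned parameter values (not settings validated for this model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("پښتو ژبه", return_tensors="pt")  # "The Pashto language"
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,          # sample from the distribution instead of beam search
    top_k=50,                # consider only the 50 most likely next tokens
    top_p=0.95,              # nucleus sampling: smallest set with 95% cumulative mass
    temperature=0.8,         # <1 sharpens the distribution slightly
    repetition_penalty=1.2,  # discourage verbatim repetition
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

`temperature`, `top_k`, and `top_p` interact, so it is worth varying one at a time when tuning output quality.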

## Training Data

The placeholders below should be filled in with details of the fine-tuning dataset(s). For the June 2025 model version, the dataset actually used is documented in the "Training Details" section at the end of this card.
* **Source:** [e.g., web-scraped data, specific Pashto corpora, data from `datasets/base_pashto/`]
* **Size:** [e.g., number of documents, tokens, GBs]
* **Preprocessing:** [e.g., cleaning steps, tokenization details]
* **Language Variety:** [e.g., predominant dialects, formal/informal text]

If the dataset is on the Hugging Face Hub, link to it here.

## Training Procedure

### Preprocessing

The texts were tokenized using the `AutoTokenizer` associated with the base Bloom model.
[Add any other specific preprocessing steps you took.]
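
The exact preprocessing pipeline is not documented here, but causal-LM fine-tuning with `transformers` typically concatenates the tokenized corpus and splits it into fixed-length blocks. A minimal sketch of that grouping step (the `group_into_blocks` helper and the `block_size` value are illustrative assumptions, not the project's actual code):

```python
def group_into_blocks(token_ids, block_size=512):
    """Concatenate token ids and split them into fixed-length blocks,
    dropping the incomplete tail (the usual causal-LM preprocessing)."""
    total = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, total, block_size)]

# With real data, token_ids would come from the Bloom tokenizer,
# e.g. tokenizer(text)["input_ids"]; dummy ids are used here.
blocks = group_into_blocks(list(range(1050)), block_size=512)
print(len(blocks), len(blocks[0]))  # 2 512  (26 leftover ids dropped)
```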

### Fine-tuning

The model was fine-tuned using the Hugging Face `transformers` library with PyTorch.
* **Training script:** [link to your `train_base_model.py` if applicable]
* **Hyperparameters:**
    * Learning rate: 2e-5
    * Batch size: 4 (adjust to GPU memory; e.g., 8 or 16)
    * Number of epochs: 3 (adjust based on convergence and overfitting)
    * Optimizer: AdamW
    * Weight decay: 0.01
    * Warmup steps: 500 (or a warmup ratio, e.g., 0.1)
    * Gradient accumulation steps: 1 (increase if the per-device batch size is limited by memory)
    * Seed: 42 (for reproducibility)
* **Infrastructure:**
    * Hardware: [e.g., 1x NVIDIA A100 40GB]
    * Training time: [e.g., X hours]
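
The hyperparameters listed above map directly onto `transformers.TrainingArguments`. A sketch under the assumption that the standard `Trainer` API was used; the `output_dir` value is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="models/zamai-bloom-ps",  # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_steps=500,
    gradient_accumulation_steps=1,
    seed=42,
    # AdamW is the default optimizer in recent transformers versions,
    # so no explicit optimizer argument is needed.
)
```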

This specific model card refers to `checkpoint5207`, which was saved at step 5207 of training; the final model is the state reached after all training epochs/steps completed.

## Evaluation Results

Provide quantitative results if available (e.g., perplexity, BLEU scores on a held-out test set).
* **Test set:** [Describe your test set]
* **Metrics:** [e.g., Perplexity, BLEU, ROUGE]
* **Results for checkpoint5207:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]
* **Results for final model:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]

Qualitative observations can also be included.
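
If perplexity is reported, it is conventionally the exponential of the mean cross-entropy loss on the held-out set. A minimal worked example (the loss value below is a made-up illustration, not a measured result for this model):

```python
import math

# Perplexity = exp(mean cross-entropy loss in nats).
# Trainer.evaluate() reports this loss as `eval_loss`.
mean_eval_loss = 2.0  # illustrative value only
perplexity = math.exp(mean_eval_loss)
print(round(perplexity, 3))  # 7.389
```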

## Model Card Contact

**Author:** Yaqoob Tasal  
**Username:** tasal9  
**Organization:** ZamAI  
[GitHub: https://github.com/tasal9](https://github.com/tasal9)

## Citation

If you use this model or its checkpoints, please consider citing:

```bibtex
@misc{zamai_bloom_pashto_2025,
  author    = {Yaqoob Tasal},
  title     = {ZamAI Bloom Pashto - Fine-tuned Language Model},
  year      = {2025},
  publisher = {Hugging Face},
  journal   = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/tasal9/zamai-bloom-ps-final}}
}
```

And the original Bloom model:
```bibtex
@article{scao2022bloom,
  title={BLOOM: A 176B-Parameter Open-Access Multilingual Language Model},
  author={Le Scao, Teven and Fan, Angela and Akiki, Christopher and others},
  journal={arXiv preprint arXiv:2211.05100},
  year={2022}
}
```

---

Remember to replace placeholders like dataset details, hyperparameters, and evaluation results with your actual project details. Save this as a `README.md` file in your model repository on the Hugging Face Hub.

## Training Details (Cleaned Base Model - June 2025)

This model version was trained from `bigscience/bloom-560m` using the `train_base_model.py` script.

- **Training Data:** The model was trained on a locally prepared dataset located at `datasets/base_pashto_clean`. This dataset was created using `prepare_base_dataset.py` and is derived from `pashto_data/base_model/cleaned_base_data.txt`, which primarily contains Pashto text from a bilingual Pashto-English glossary.
- **Training Objective:** To establish a foundational Pashto language model with improved coherence and reduced issues (e.g., repetition, off-language generation) compared to any prior versions trained on noisier data.
- **Output Directory (during training):** `models/pashto-bloom-base-clean-colab`
- **Key Training Hyperparameters:**
    - Epochs: 3
    - Per Device Batch Size: 2
    - Gradient Accumulation Steps: 4
    - Learning Rate: 5e-5
    - FP16 (Mixed Precision): True
    - Optimizer: AdamW
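
With these settings, the effective batch size per optimizer update is the per-device batch size times the gradient accumulation steps (assuming a single device, which the card does not state explicitly):

```python
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1  # assumption; not stated in the card

# Gradients from 4 micro-batches of 2 are accumulated before each update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 8
```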