---
license: mit # Or your chosen license: apache-2.0, cc-by-4.0, etc.
language:
- ps # Pashto
library_name: transformers
tags:
- text-generation
- pashto
- bloom
- zamai-bloom
datasets:
- tasal9/pashto_base_bloom
pipeline_tag: text-generation
widget:
- text: "پښتو ژبه"
---

# ZamAI Bloom Pashto - checkpoint5207 (and Final Model)

This model card is for `checkpoint5207` and the final fine-tuned version of a Bloom model for Pashto text generation, developed under the ZamAI Bloom project.

## Model Description

This model is a fine-tuned version of [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) on a Pashto text corpus. The goal of this project was to create a language model proficient in generating coherent and contextually relevant Pashto text.

* **Base Model:** `bigscience/bloom-560m`
* **Fine-tuning Checkpoint:** `checkpoint5207`
* **Final Model:** [tasal9/zamai-bloom-ps-final](https://huggingface.co/tasal9/zamai-bloom-ps-final)

## Intended Uses & Limitations

### Intended Uses

This model is intended for:

* Generating Pashto text.
* Assisting with Pashto language content creation.
* Research in Pashto NLP.
* Educational purposes for Pashto language learning.

### Limitations and Bias

* The model's performance depends on the quality and diversity of the training data. It may generate text that reflects biases present in that data.
* It might produce factually incorrect or nonsensical text, especially for complex topics or out-of-domain prompts.
* The model may not be suitable for critical applications without further evaluation and mitigation of potential harms.
* Performance on specific Pashto dialects may vary depending on their representation in the training data.

## How to use

You can use this model with the Hugging Face `transformers` library for text generation. First, install the library:

```bash
pip install transformers torch
```

Then, you can use the model in Python:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/zamai-bloom-ps-final"  # Or the specific checkpoint identifier if using a checkpoint directly

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلي په اړه"  # Example prompt: "Write a poem in Pashto about spring"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
# Adjust generation parameters as needed (max_length, num_beams, do_sample, top_k, top_p, etc.)
outputs = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Training Data

Describe the dataset(s) used for fine-tuning.

* **Source:** [e.g., Web scraped data, specific Pashto corpora, data from `datasets/base_pashto/`]
* **Size:** [e.g., Number of documents, tokens, GBs]
* **Preprocessing:** [e.g., Cleaning steps, tokenization details]
* **Language Variety:** [e.g., Predominant dialects, formal/informal text]

If your dataset is on the Hugging Face Hub, link to it.

## Training Procedure

### Preprocessing

The texts were tokenized using the `AutoTokenizer` associated with the base Bloom model.
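
For illustration, the following is a minimal sketch of this tokenization step, assuming the raw corpus is read line by line from `pashto_data/base_model/cleaned_base_data.txt` and truncated to a maximum sequence length of 512 tokens (an assumed value); the actual pipeline is implemented in `prepare_base_dataset.py` and may differ in its details.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer of the base model; the fine-tuned model reuses its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# Load the cleaned Pashto corpus as one text example per line (path from the
# Training Details section; adjust to your own copy of the data).
raw_dataset = load_dataset(
    "text",
    data_files={"train": "pashto_data/base_model/cleaned_base_data.txt"},
)

def tokenize_fn(batch):
    # Truncate overly long lines; padding is left to the data collator at training time.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized_dataset = raw_dataset.map(tokenize_fn, batched=True, remove_columns=["text"])
```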
[Add any other specific preprocessing steps you took.]

### Fine-tuning

The model was fine-tuned using the Hugging Face `transformers` library with PyTorch.

* **Training script:** [Link to your `train_base_model.py` if applicable]
* **Hyperparameters:**
  * Learning rate: 2e-5
  * Batch size: 4 # Adjust based on your GPU memory (e.g., 8, 16)
  * Number of epochs: 3 # Adjust based on convergence and overfitting
  * Optimizer: AdamW
  * Weight decay: 0.01
  * Warmup steps: 500 # Or warmup_ratio, e.g., 0.1
  * Gradient accumulation steps: 1 # Increase if the per-device batch size is limited by memory
  * Seed: 42 # For reproducibility
* **Infrastructure:**
  * Hardware: [e.g., 1x NVIDIA A100 40GB, or specify your hardware]
  * Training time: [e.g., X hours]

This specific model card refers to `checkpoint5207`, which was saved at step 5207 of the training process. The final model is the model after completion of all training epochs/steps.

## Evaluation Results

Provide quantitative results if available (e.g., perplexity, BLEU scores on a held-out test set).

* **Test set:** [Describe your test set]
* **Metrics:** [e.g., Perplexity, BLEU, ROUGE]
* **Results for checkpoint5207:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]
* **Results for final model:**
  * [Metric 1]: [Value]
  * [Metric 2]: [Value]

Qualitative observations can also be included.

## Model Card Contact

* **Author:** Yaqoob Tasal
* **Username:** tasal9
* **Organization:** ZamAI
* [GitHub: https://github.com/tasal9](https://github.com/tasal9)

## Citation

If you use this model or its checkpoints, please consider citing:

```bibtex
@misc{zamai_bloom_pashto_2025,
  author       = {Yaqoob Tasal},
  title        = {ZamAI Bloom Pashto - Fine-tuned Language Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/tasal9/zamai-bloom-ps-final}}
}
```

And the original Bloom model:

```bibtex
@article{scao2022bloom,
  title   = {BLOOM: A 176B-Parameter Open-Access Multilingual Language Model},
  author  = {Scao, Teven Le and Fan, Angela and Akiki, Christopher and others},
  journal = {arXiv preprint arXiv:2211.05100},
  year    = {2022}
}
```

---

Remember to replace placeholders like dataset details, hyperparameters, and evaluation results with your actual project details. Save this as a `README.md` file in your model repository on the Hugging Face Hub.

## Training Details (Cleaned Base Model - June 2025)

This model version was trained from `bigscience/bloom-560m` using the `train_base_model.py` script.

- **Training Data:** The model was trained on a locally prepared dataset located at `datasets/base_pashto_clean`. This dataset was created using `prepare_base_dataset.py` and is derived from `pashto_data/base_model/cleaned_base_data.txt`, which primarily contains Pashto text from a bilingual Pashto-English glossary.
- **Training Objective:** To establish a foundational Pashto language model with improved coherence and fewer issues (e.g., repetition, off-language generation) compared to prior versions trained on noisier data.
- **Output Directory (during training):** `models/pashto-bloom-base-clean-colab`
- **Key Training Hyperparameters:**
  - Epochs: 3
  - Per Device Batch Size: 2
  - Gradient Accumulation Steps: 4
  - Learning Rate: 5e-5
  - FP16 (Mixed Precision): True
  - Optimizer: AdamW

**Note on Dataset Identifiers:** The `datasets` field in this model card's metadata lists `tasal9/pashto_base_bloom`. That identifier may refer to an earlier version or a different collection of Pashto data. The specific training run culminating in this model update (June 2025) exclusively used the locally processed `datasets/base_pashto_clean` described above.
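
For reference, here is a minimal sketch of how the hyperparameters listed above could be expressed with the `transformers` `Trainer` API. This is an illustrative reconstruction rather than the actual contents of `train_base_model.py`; the `tokenized_dataset` is assumed to come from the preprocessing sketch earlier in this card, and the `save_strategy` and `logging_steps` values are placeholders.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

training_args = TrainingArguments(
    output_dir="models/pashto-bloom-base-clean-colab",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size of 8 per device
    learning_rate=5e-5,
    fp16=True,
    optim="adamw_torch",
    save_strategy="epoch",           # assumption; checkpointing cadence not documented
    logging_steps=50,                # assumption
)

# Causal LM objective: the collator copies input ids to labels (no masking).
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],  # from the preprocessing sketch above
    data_collator=data_collator,
)
trainer.train()
```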