base_model:
  - facebook/dinov2-base
  - HooshvareLab/gpt2-fa
pipeline_tag: image-to-text
---

# Persian Image Captioning (PIC) Model

A vision-encoder-decoder model that pairs a `facebook/dinov2-base` image encoder with a `HooshvareLab/gpt2-fa` Persian GPT-2 decoder to generate Persian captions for images.

## Intended Use
- **Primary Use Cases**: Generating detailed Persian captions for images, particularly in contexts that require cultural and linguistic accuracy. The model serves as a core component of the PTIR framework for text-image retrieval, enabling applications in medical imaging, cultural heritage, and other domain-specific scenarios.
- **Out-of-Scope Uses**: Not intended for non-Persian languages, real-time applications without further optimization, or tasks beyond image captioning such as object detection or image generation.

## Training Data
The model was trained on a custom dataset of approximately 1.2 million Persian image-caption pairs, aggregated from diverse sources. Captions were generated with vision-language models and then refined for cultural and linguistic accuracy; they include detailed descriptions of object counts, shapes, colors, environmental contexts, age groups, and animal breeds.

Evaluation was performed on the COCO-PIC validation dataset, available at [Hugging Face Datasets](https://huggingface.co/datasets/rasoulasadianub/coco-pic), which is derived from the COCO dataset with Persian captions.

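For quick inspection, the validation data can be loaded with the standard `datasets` API. A minimal sketch; the split name and record layout are assumptions, so check the dataset card:

```python
from datasets import load_dataset

# Load COCO-PIC from the Hugging Face Hub.
# The split name ("validation") is an assumption; verify it on the dataset card.
dataset = load_dataset("rasoulasadianub/coco-pic", split="validation")
print(dataset[0])  # inspect one image-caption record
```
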
## Evaluation
- **Metrics**: BLEU, ROUGE, and CIDEr for caption quality, plus Hit@K for retrieval integration (a minimal Hit@K sketch follows this list).
- **Results**: Outperforms baselines in caption quality, with notable gains on detailed descriptions. In retrieval, PTIR (using this model) achieves Hit@1: 22% and Hit@200: 80%.
- **Comparisons**: Superior to Persian baselines and CLIP-based models in both accuracy and efficiency.
- **Dataset**: Tested on subsets of the training data and the COCO-PIC validation set.

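Hit@K measures the fraction of queries whose ground-truth item appears among the top-K retrieved results. A minimal, library-free sketch of the computation (illustrative only, not the paper's evaluation code):

```python
def hit_at_k(rankings, ground_truths, k):
    """Fraction of queries whose correct id appears in the top-k results.

    rankings: one ranked list of candidate ids per query (best match first).
    ground_truths: the correct id for each query.
    """
    hits = sum(1 for ranked, truth in zip(rankings, ground_truths) if truth in ranked[:k])
    return hits / len(ground_truths)

# Example: the first query's target is in the top 2, the second's is not -> Hit@2 = 0.5
print(hit_at_k([[3, 1, 2], [5, 4, 6]], [1, 9], k=2))
```
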
## Usage
To use the model, install the required libraries:
```bash
pip install transformers torch datasets arabic-reshaper python-bidi
```

Load and generate captions in Python:
```python
import torch
from transformers import VisionEncoderDecoderModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import arabic_reshaper
from bidi.algorithm import get_display
import matplotlib.pyplot as plt

# Load the encoder-decoder model, tokenizer, and image processor
model_name = "shenasa/persian-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # GPT-2 has no pad token by default
image_processor = AutoImageProcessor.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def generate_caption(image_path):
    image = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        output_ids = model.generate(pixel_values)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return caption

def visualize_caption(image_path, caption):
    image = Image.open(image_path).convert("RGB")
    # Reshape the Persian text and apply the bidi algorithm so it renders
    # correctly right-to-left in matplotlib
    reshaped_caption = arabic_reshaper.reshape(caption)
    bidi_text = get_display(reshaped_caption)
    plt.imshow(image)
    plt.axis("off")
    plt.title(bidi_text)
    plt.show()

# Example
image_path = "path/to/your/image.jpg"
caption = generate_caption(image_path)
visualize_caption(image_path, caption)
```
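
The example above relies on `generate`'s default decoding. For longer or more fluent captions, standard decoding arguments can be passed; the values below are illustrative, not the settings used in the paper:

```python
with torch.no_grad():
    output_ids = model.generate(
        pixel_values,
        max_length=64,        # cap caption length (illustrative value)
        num_beams=4,          # beam search often reads more fluently than greedy decoding
        early_stopping=True,  # stop once all beams have emitted EOS
    )
```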

## Limitations and Biases
- **Limitations**: Primarily optimized for Persian; performance may degrade on non-Persian content or highly specialized images (e.g., abstract art). Quality depends on the training dataset, which may not cover all cultural nuances.
- **Biases**: Potential biases inherited from source datasets (e.g., COCO-derived data), including underrepresentation of certain demographics or regions. Captions were refined for cultural accuracy, but users should evaluate fairness in their specific applications.

## Citation
If you use this model, please cite the original paper:
```bibtex
@inproceedings{asadian2025pic,
  author    = {Asadian, Rasoul and Akhavanpour, Alireza},
  title     = {Persian Text-Image Retrieval: A Framework Based on Image Captioning and Scalable Vector Search},
  booktitle = {IEEE CSICC},
  year      = {2025},
  doi       = {10.1109/CSICC65765.2025.10967407},
  url       = {https://ieeexplore.ieee.org/document/10967407}
}
```

## Additional Information
- **Repository**: [GitHub - PTIR](https://github.com/rasoulasadiyan/PTIR)
- **Demo**: Available at [PTIR Demo](https://rasoulasadiyan.github.io/PTIR)
- **Related Work**: Based on prior implementations like [PIC in TensorFlow](https://github.com/rasoulasadiyan/Persian-Image-Captioning-PIC)
- **Dataset**: [COCO-PIC Dataset](https://huggingface.co/datasets/rasoulasadianub/coco-pic)
- **Acknowledgments**: This work advances Persian AI resources, building on open-source tools like Hugging Face and Milvus.