---
license: mit
language:
- fa
metrics:
- bleu
- rouge
base_model:
- facebook/dinov2-base
- HooshvareLab/gpt2-fa
pipeline_tag: image-to-text
---

# Persian Image Captioning (PIC) Model

A vision encoder-decoder model that generates detailed Persian captions for images, pairing a `facebook/dinov2-base` image encoder with a `HooshvareLab/gpt2-fa` GPT-2 decoder.

## Intended Use

- **Primary Use Cases**: Generating detailed Persian captions for images, particularly in contexts requiring cultural and linguistic accuracy. The model serves as a core component of the PTIR framework for text-image retrieval, enabling applications in medical imaging, cultural heritage, and other domain-specific scenarios.
- **Out-of-Scope Uses**: Not intended for languages other than Persian, for real-time applications without further optimization, or for tasks beyond image captioning, such as object detection or image generation.

## Training Data

The model was trained on a custom dataset of approximately 1.2 million Persian image-caption pairs, aggregated from diverse sources. Captions were generated with Vision-Language Models and refined for cultural and linguistic accuracy; they include detailed descriptions of object counts, shapes, colors, environmental contexts, age groups, and animal breeds.

Evaluation was performed on the COCO-PIC validation dataset, available at [Hugging Face Datasets](https://huggingface.co/datasets/rasoulasadianub/coco-pic), which is derived from the COCO dataset with Persian captions.
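
As a quick sanity check, the evaluation data can be loaded with the `datasets` library. This is a minimal sketch, assuming a `validation` split; the split and column names are not documented here, so consult the dataset card for the actual schema:

```python
from datasets import load_dataset

# Load the Persian COCO captions dataset from the Hugging Face Hub.
# NOTE: the "validation" split name is an assumption; check
# https://huggingface.co/datasets/rasoulasadianub/coco-pic for the schema.
dataset = load_dataset("rasoulasadianub/coco-pic", split="validation")

print(dataset)            # number of rows and column names
print(dataset[0].keys())  # fields of a single example
```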

## Evaluation

- **Metrics**: Evaluated using BLEU, ROUGE, CIDEr, and Hit@K for retrieval integration; a scoring sketch follows this list.
- **Results**: Outperforms baselines in caption quality, with notable gains in the level of descriptive detail. In retrieval, PTIR (using this model) achieves Hit@1 of 22% and Hit@200 of 80%.
- **Comparisons**: Superior to Persian baselines and CLIP-based models in both accuracy and efficiency.
- **Dataset**: Tested on subsets of the training data and the COCO-PIC validation set.
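
As a reference point, caption quality can be scored with the Hugging Face `evaluate` library. This is a minimal sketch with placeholder captions, not the paper's full protocol (CIDEr and Hit@K are omitted, and the default tokenizers are English-centric):

```python
import evaluate

# Placeholder predictions and (possibly multiple) references per image;
# a real evaluation would use captions generated on the COCO-PIC validation set.
predictions = ["یک گربه روی مبل نشسته است"]
references = [["یک گربه خاکستری روی مبل نشسته است"]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
```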

## Usage

To use the model, install the required libraries:

```bash
pip install transformers torch datasets arabic-reshaper python-bidi
```

Load and generate captions in Python:

```python
import torch
from transformers import VisionEncoderDecoderModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import arabic_reshaper
from bidi.algorithm import get_display
import matplotlib.pyplot as plt

# Load the model, tokenizer, and image processor from the Hub.
model_name = "shenasa/persian-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # GPT-2 defines no pad token
image_processor = AutoImageProcessor.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def generate_caption(image_path):
    """Generate a Persian caption for a single image."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        output_ids = model.generate(pixel_values)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return caption


def visualize_caption(image_path, caption):
    """Display the image with its caption rendered right-to-left."""
    image = Image.open(image_path).convert("RGB")
    # Reshape Persian letters and reorder for correct RTL display in matplotlib.
    reshaped_caption = arabic_reshaper.reshape(caption)
    bidi_text = get_display(reshaped_caption)
    plt.imshow(image)
    plt.axis("off")
    plt.title(bidi_text)
    plt.show()


# Example
image_path = "path/to/your/image.jpg"
caption = generate_caption(image_path)
visualize_caption(image_path, caption)
```
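
For multiple images, the image processor accepts a list, so captions can be generated in a single batched call. A minimal sketch building on the code above; the beam-search and length settings are illustrative choices, not the configuration used in the paper:

```python
def generate_captions_batch(image_paths, num_beams=4, max_length=64):
    """Generate Persian captions for a list of image paths in one batch."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    pixel_values = image_processor(images, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        output_ids = model.generate(
            pixel_values,
            num_beams=num_beams,    # illustrative beam width
            max_length=max_length,  # illustrative cap on caption length
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

# Example
paths = ["image1.jpg", "image2.jpg"]
for path, cap in zip(paths, generate_captions_batch(paths)):
    print(path, "->", cap)
```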

## Limitations and Biases

- **Limitations**: Primarily optimized for Persian; performance may degrade for other languages or for highly specialized images (e.g., abstract art). Output quality also depends on the training dataset, which may not cover all cultural nuances.
- **Biases**: Potential biases inherited from the source datasets (e.g., COCO-derived data), including underrepresentation of certain demographics or regions. Captions were refined for cultural accuracy, but users should evaluate fairness in their specific applications.

## Citation

If you use this model, please cite the original paper:

```bibtex
@inproceedings{asadian2025pic,
  author    = {Asadian, Rasoul and Akhavanpour, Alireza},
  title     = {Persian Text-Image Retrieval: A Framework Based on Image Captioning and Scalable Vector Search},
  booktitle = {IEEE CSICC},
  year      = {2025},
  doi       = {10.1109/CSICC65765.2025.10967407},
  url       = {https://ieeexplore.ieee.org/document/10967407}
}
```

## Additional Information

- **Repository**: [GitHub - PTIR](https://github.com/rasoulasadiyan/PTIR)
- **Demo**: Available at [PTIR Demo](https://rasoulasadiyan.github.io/PTIR)
- **Related Work**: Based on prior implementations such as [PIC in TensorFlow](https://github.com/rasoulasadiyan/Persian-Image-Captioning-PIC)
- **Dataset**: [COCO-PIC Dataset](https://huggingface.co/datasets/rasoulasadianub/coco-pic)
- **Acknowledgments**: This work advances Persian AI resources, building on open-source tools such as Hugging Face and Milvus.