base_model:
- facebook/dinov2-base
- HooshvareLab/gpt2-fa
pipeline_tag: image-to-text
---

# Persian Image Captioning (PIC) Model

## Intended Use
- **Primary Use Cases**: Generating detailed Persian captions for images, particularly in contexts requiring cultural and linguistic accuracy. It serves as a core component of the PTIR framework for text-image retrieval, enabling applications in medical imaging, cultural heritage, and other domain-specific scenarios.
- **Out-of-Scope Uses**: Not intended for non-Persian languages, real-time applications without optimization, or tasks beyond image captioning such as object detection or image generation.

## Training Data
The model was trained on a custom dataset of approximately 1.2 million Persian image-caption pairs. This dataset was aggregated from diverse sources, with captions generated using advanced Vision-Language Models and refined for cultural and linguistic accuracy. Captions include detailed descriptions of object counts, shapes, colors, environmental contexts, age groups, and animal breeds.

Evaluation was performed on the COCO-PIC validation dataset, available at [Hugging Face Datasets](https://huggingface.co/datasets/rasoulasadianub/coco-pic), which is derived from the COCO dataset with Persian captions.

## Evaluation
- **Metrics**: Caption quality is evaluated with BLEU, ROUGE, and CIDEr; retrieval integration is evaluated with Hit@K.
- **Results**: Outperforms baselines in caption quality, with significant improvements in detailed descriptions. In retrieval, PTIR (using this model) achieves Hit@1 of 22% and Hit@200 of 80%.
- **Comparisons**: Superior to Persian baselines and CLIP-based models in accuracy and efficiency.
- **Dataset**: Tested on subsets of the training data and the COCO-PIC validation set.
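
For the retrieval results above, Hit@K is the fraction of queries whose ground-truth image appears among the top-K retrieved items. A minimal sketch of the metric (the `hit_at_k` function and the toy data are illustrative, not taken from the PTIR codebase):

```python
def hit_at_k(ranked_results, ground_truth, k):
    """Fraction of queries whose ground-truth item appears in the top-k results."""
    hits = sum(1 for query, truth in ground_truth.items()
               if truth in ranked_results[query][:k])
    return hits / len(ground_truth)

# Toy example: ranked image IDs returned for two text queries
ranked = {"q1": ["img3", "img1", "img7"], "q2": ["img5", "img2", "img9"]}
truth = {"q1": "img1", "q2": "img9"}
print(hit_at_k(ranked, truth, 1))  # neither ground-truth image is ranked first -> 0.0
print(hit_at_k(ranked, truth, 3))  # both appear within the top 3 -> 1.0
```

By this definition Hit@1 of 22% and Hit@200 of 80% mean the correct image is the single top result for 22% of queries and appears somewhere in the top 200 for 80% of them.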

## Usage
To use the model, install the required libraries:
```bash
pip install transformers torch datasets arabic-reshaper python-bidi
```

Load and generate captions in Python:
```python
import torch
from transformers import VisionEncoderDecoderModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import arabic_reshaper
from bidi.algorithm import get_display
import matplotlib.pyplot as plt

# Load the model, tokenizer, and image processor from the Hugging Face Hub
model_name = "shenasa/persian-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS
image_processor = AutoImageProcessor.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def generate_caption(image_path):
    """Generate a Persian caption for the image at image_path."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        output_ids = model.generate(pixel_values)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return caption

def visualize_caption(image_path, caption):
    """Show the image with its caption rendered right-to-left."""
    image = Image.open(image_path).convert("RGB")
    # Reshape Persian letters and reorder the text so matplotlib renders RTL correctly
    reshaped_caption = arabic_reshaper.reshape(caption)
    bidi_text = get_display(reshaped_caption)
    plt.imshow(image)
    plt.axis("off")
    plt.title(bidi_text)
    plt.show()

# Example
image_path = "path/to/your/image.jpg"
caption = generate_caption(image_path)
visualize_caption(image_path, caption)
```

## Limitations and Biases
- **Limitations**: Primarily optimized for Persian; performance may degrade on non-Persian or highly specialized images (e.g., abstract art). Dependent on the quality of the training dataset, which may not cover all cultural nuances.
- **Biases**: Potential biases from source datasets (e.g., COCO-derived), including underrepresentation of certain demographics or regions. Efforts were made to refine captions for cultural accuracy, but users should evaluate for fairness in specific applications.

## Citation
If you use this model, please cite the original paper:
```bibtex
@inproceedings{asadian2025pic,
  author    = {Asadian, Rasoul and Akhavanpour, Alireza},
  title     = {Persian Text-Image Retrieval: A Framework Based on Image Captioning and Scalable Vector Search},
  booktitle = {IEEE CSICC},
  year      = {2025},
  doi       = {10.1109/CSICC65765.2025.10967407},
  url       = {https://ieeexplore.ieee.org/document/10967407}
}
```

## Additional Information
- **Repository**: [GitHub - PTIR](https://github.com/rasoulasadiyan/PTIR)
- **Demo**: Available at [PTIR Demo](https://rasoulasadiyan.github.io/PTIR)
- **Related Work**: Based on prior implementations such as [PIC in TensorFlow](https://github.com/rasoulasadiyan/Persian-Image-Captioning-PIC)
- **Dataset**: [COCO-PIC Dataset](https://huggingface.co/datasets/rasoulasadianub/coco-pic)
- **Acknowledgments**: This work advances Persian AI resources, building on open-source tools like Hugging Face and Milvus.