base_model:
  - facebook/dinov2-base
  - HooshvareLab/gpt2-fa
pipeline_tag: image-to-text
---

# Persian Image Captioning (PIC) Model

A vision-encoder-decoder model that pairs a `facebook/dinov2-base` image encoder with a `HooshvareLab/gpt2-fa` Persian GPT-2 decoder to generate Persian captions for images.

## Intended Use
- **Primary Use Cases**: Generating detailed Persian captions for images, particularly in contexts that require cultural and linguistic accuracy. The model serves as a core component of the PTIR framework for text-image retrieval, enabling applications in medical imaging, cultural heritage, and other domain-specific scenarios.
- **Out-of-Scope Uses**: Not intended for non-Persian languages, real-time applications without further optimization, or tasks beyond image captioning such as object detection or image generation.

## Training Data
The model was trained on a custom dataset of approximately 1.2 million Persian image-caption pairs, aggregated from diverse sources. Captions were generated with vision-language models and then refined for cultural and linguistic accuracy; they include detailed descriptions of object counts, shapes, colors, environmental contexts, age groups, and animal breeds.

Evaluation was performed on the COCO-PIC validation dataset, available at [Hugging Face Datasets](https://huggingface.co/datasets/rasoulasadianub/coco-pic), which is derived from the COCO dataset with Persian captions.

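For quick inspection, the validation data can be loaded with the standard `datasets` API. A minimal sketch; the split name and record layout are assumptions, so check the dataset card:

```python
from datasets import load_dataset

# Load COCO-PIC from the Hugging Face Hub.
# The split name ("validation") is an assumption; verify it on the dataset card.
dataset = load_dataset("rasoulasadianub/coco-pic", split="validation")
print(dataset[0])  # inspect one image-caption record
```
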
## Evaluation
- **Metrics**: BLEU, ROUGE, and CIDEr for caption quality, plus Hit@K for retrieval integration (a minimal Hit@K sketch follows this list).
- **Results**: Outperforms baselines in caption quality, with notable gains on detailed descriptions. In retrieval, PTIR (using this model) achieves Hit@1: 22% and Hit@200: 80%.
- **Comparisons**: Superior to Persian baselines and CLIP-based models in both accuracy and efficiency.
- **Dataset**: Tested on subsets of the training data and the COCO-PIC validation set.

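Hit@K measures the fraction of queries whose ground-truth item appears among the top-K retrieved results. A minimal, library-free sketch of the computation (illustrative only, not the paper's evaluation code):

```python
def hit_at_k(rankings, ground_truths, k):
    """Fraction of queries whose correct id appears in the top-k results.

    rankings: one ranked list of candidate ids per query (best match first).
    ground_truths: the correct id for each query.
    """
    hits = sum(1 for ranked, truth in zip(rankings, ground_truths) if truth in ranked[:k])
    return hits / len(ground_truths)

# Example: the first query's target is in the top 2, the second's is not -> Hit@2 = 0.5
print(hit_at_k([[3, 1, 2], [5, 4, 6]], [1, 9], k=2))
```
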
## Usage
To use the model, install the required libraries:
```bash
pip install transformers torch datasets arabic-reshaper python-bidi
```

Load and generate captions in Python:
```python
import torch
from transformers import VisionEncoderDecoderModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import arabic_reshaper
from bidi.algorithm import get_display
import matplotlib.pyplot as plt

# Load the encoder-decoder model, tokenizer, and image processor
model_name = "shenasa/persian-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # GPT-2 has no pad token by default
image_processor = AutoImageProcessor.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def generate_caption(image_path):
    image = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        output_ids = model.generate(pixel_values)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return caption

def visualize_caption(image_path, caption):
    image = Image.open(image_path).convert("RGB")
    # Reshape the Persian text and apply the bidi algorithm so it renders
    # correctly right-to-left in matplotlib
    reshaped_caption = arabic_reshaper.reshape(caption)
    bidi_text = get_display(reshaped_caption)
    plt.imshow(image)
    plt.axis("off")
    plt.title(bidi_text)
    plt.show()

# Example
image_path = "path/to/your/image.jpg"
caption = generate_caption(image_path)
visualize_caption(image_path, caption)
```
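
The example above relies on `generate`'s default decoding. For longer or more fluent captions, standard decoding arguments can be passed; the values below are illustrative, not the settings used in the paper:

```python
with torch.no_grad():
    output_ids = model.generate(
        pixel_values,
        max_length=64,        # cap caption length (illustrative value)
        num_beams=4,          # beam search often reads more fluently than greedy decoding
        early_stopping=True,  # stop once all beams have emitted EOS
    )
```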

## Limitations and Biases
- **Limitations**: Primarily optimized for Persian; performance may degrade on non-Persian content or highly specialized images (e.g., abstract art). Quality depends on the training dataset, which may not cover all cultural nuances.
- **Biases**: Potential biases inherited from source datasets (e.g., COCO-derived data), including underrepresentation of certain demographics or regions. Captions were refined for cultural accuracy, but users should evaluate fairness in their specific applications.

## Citation
If you use this model, please cite the original paper:
```bibtex
@inproceedings{asadian2025pic,
  author    = {Asadian, Rasoul and Akhavanpour, Alireza},
  title     = {Persian Text-Image Retrieval: A Framework Based on Image Captioning and Scalable Vector Search},
  booktitle = {IEEE CSICC},
  year      = {2025},
  doi       = {10.1109/CSICC65765.2025.10967407},
  url       = {https://ieeexplore.ieee.org/document/10967407}
}
```

## Additional Information
- **Repository**: [GitHub - PTIR](https://github.com/rasoulasadiyan/PTIR)
- **Demo**: Available at [PTIR Demo](https://rasoulasadiyan.github.io/PTIR)
- **Related Work**: Based on prior implementations like [PIC in TensorFlow](https://github.com/rasoulasadiyan/Persian-Image-Captioning-PIC)
- **Dataset**: [COCO-PIC Dataset](https://huggingface.co/datasets/rasoulasadianub/coco-pic)
- **Acknowledgments**: This work advances Persian AI resources, building on open-source tools like Hugging Face and Milvus.