---
license: mit
language:
- fa
metrics:
- bleu
- rouge
base_model:
- facebook/dinov2-base
- HooshvareLab/gpt2-fa
pipeline_tag: image-to-text
---

# Persian Image Captioning (PIC) Model

A vision encoder-decoder model that generates detailed Persian captions for images, pairing a `facebook/dinov2-base` image encoder with a `HooshvareLab/gpt2-fa` GPT-2 decoder.

## Intended Use

- **Primary Use Cases**: Generating detailed Persian captions for images, particularly in contexts requiring cultural and linguistic accuracy. The model serves as a core component of the PTIR framework for text-image retrieval, enabling applications in medical imaging, cultural heritage, and other domain-specific scenarios.
- **Out-of-Scope Uses**: Not intended for languages other than Persian, for real-time applications without further optimization, or for tasks beyond image captioning, such as object detection or image generation.

## Training Data

The model was trained on a custom dataset of approximately 1.2 million Persian image-caption pairs, aggregated from diverse sources. Captions were generated with Vision-Language Models and refined for cultural and linguistic accuracy; they include detailed descriptions of object counts, shapes, colors, environmental contexts, age groups, and animal breeds.

Evaluation was performed on the COCO-PIC validation dataset, available at [Hugging Face Datasets](https://huggingface.co/datasets/rasoulasadianub/coco-pic), which is derived from the COCO dataset with Persian captions.
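
As a quick sanity check, the evaluation data can be loaded with the `datasets` library. This is a minimal sketch, assuming a `validation` split; the split and column names are not documented here, so consult the dataset card for the actual schema:

```python
from datasets import load_dataset

# Load the Persian COCO captions dataset from the Hugging Face Hub.
# NOTE: the "validation" split name is an assumption; check
# https://huggingface.co/datasets/rasoulasadianub/coco-pic for the schema.
dataset = load_dataset("rasoulasadianub/coco-pic", split="validation")

print(dataset)            # number of rows and column names
print(dataset[0].keys())  # fields of a single example
```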

## Evaluation

- **Metrics**: Evaluated using BLEU, ROUGE, CIDEr, and Hit@K for retrieval integration; a scoring sketch follows this list.
- **Results**: Outperforms baselines in caption quality, with notable gains in the level of descriptive detail. In retrieval, PTIR (using this model) achieves Hit@1 of 22% and Hit@200 of 80%.
- **Comparisons**: Superior to Persian baselines and CLIP-based models in both accuracy and efficiency.
- **Dataset**: Tested on subsets of the training data and the COCO-PIC validation set.
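
As a reference point, caption quality can be scored with the Hugging Face `evaluate` library. This is a minimal sketch with placeholder captions, not the paper's full protocol (CIDEr and Hit@K are omitted, and the default tokenizers are English-centric):

```python
import evaluate

# Placeholder predictions and (possibly multiple) references per image;
# a real evaluation would use captions generated on the COCO-PIC validation set.
predictions = ["یک گربه روی مبل نشسته است"]
references = [["یک گربه خاکستری روی مبل نشسته است"]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
```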

## Usage

To use the model, install the required libraries:

```bash
pip install transformers torch datasets arabic-reshaper python-bidi
```

Load and generate captions in Python:

```python
import torch
from transformers import VisionEncoderDecoderModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import arabic_reshaper
from bidi.algorithm import get_display
import matplotlib.pyplot as plt

# Load the model, tokenizer, and image processor from the Hub.
model_name = "shenasa/persian-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # GPT-2 defines no pad token
image_processor = AutoImageProcessor.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def generate_caption(image_path):
    """Generate a Persian caption for a single image."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        output_ids = model.generate(pixel_values)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return caption


def visualize_caption(image_path, caption):
    """Display the image with its caption rendered right-to-left."""
    image = Image.open(image_path).convert("RGB")
    # Reshape Persian letters and reorder for correct RTL display in matplotlib.
    reshaped_caption = arabic_reshaper.reshape(caption)
    bidi_text = get_display(reshaped_caption)
    plt.imshow(image)
    plt.axis("off")
    plt.title(bidi_text)
    plt.show()


# Example
image_path = "path/to/your/image.jpg"
caption = generate_caption(image_path)
visualize_caption(image_path, caption)
```
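
For multiple images, the image processor accepts a list, so captions can be generated in a single batched call. A minimal sketch building on the code above; the beam-search and length settings are illustrative choices, not the configuration used in the paper:

```python
def generate_captions_batch(image_paths, num_beams=4, max_length=64):
    """Generate Persian captions for a list of image paths in one batch."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    pixel_values = image_processor(images, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        output_ids = model.generate(
            pixel_values,
            num_beams=num_beams,    # illustrative beam width
            max_length=max_length,  # illustrative cap on caption length
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

# Example
paths = ["image1.jpg", "image2.jpg"]
for path, cap in zip(paths, generate_captions_batch(paths)):
    print(path, "->", cap)
```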

## Limitations and Biases

- **Limitations**: Primarily optimized for Persian; performance may degrade for other languages or for highly specialized images (e.g., abstract art). Output quality also depends on the training dataset, which may not cover all cultural nuances.
- **Biases**: Potential biases inherited from the source datasets (e.g., COCO-derived data), including underrepresentation of certain demographics or regions. Captions were refined for cultural accuracy, but users should evaluate fairness in their specific applications.

## Citation

If you use this model, please cite the original paper:

```bibtex
@inproceedings{asadian2025pic,
  author    = {Asadian, Rasoul and Akhavanpour, Alireza},
  title     = {Persian Text-Image Retrieval: A Framework Based on Image Captioning and Scalable Vector Search},
  booktitle = {IEEE CSICC},
  year      = {2025},
  doi       = {10.1109/CSICC65765.2025.10967407},
  url       = {https://ieeexplore.ieee.org/document/10967407}
}
```

## Additional Information

- **Repository**: [GitHub - PTIR](https://github.com/rasoulasadiyan/PTIR)
- **Demo**: Available at [PTIR Demo](https://rasoulasadiyan.github.io/PTIR)
- **Related Work**: Based on prior implementations such as [PIC in TensorFlow](https://github.com/rasoulasadiyan/Persian-Image-Captioning-PIC)
- **Dataset**: [COCO-PIC Dataset](https://huggingface.co/datasets/rasoulasadianub/coco-pic)
- **Acknowledgments**: This work advances Persian AI resources, building on open-source tools such as Hugging Face and Milvus.