Add comprehensive model card for AnyCap Project
#1 by nielsr (HF Staff) · opened

README.md (added, 159 lines)
---
license: mit
pipeline_tag: image-to-text
library_name: transformers
tags:
- multimodal
- captioning
- controllable
---

# AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

<p align="center">
<img src="https://huggingface.co/qishisuren/AnyCapModel/resolve/main/assets/anycap_overview.jpg" width="500"/>
</p>

<p align="center">
🤗 <a href="https://huggingface.co/qishisuren/AnyCapModel">Model Weights</a> | <a href="https://huggingface.co/datasets/qishisuren/AnyCapEval">AnyCapEval Benchmark</a> | <a href="https://huggingface.co/papers/2507.12841">Paper</a> | <a href="https://github.com/qishisuren123/AnyCap">Code</a>
</p>

---

## Overview

**AnyCap** is a unified, controllable omni-modal captioning framework that supports caption generation for images, audio, and video with fine-grained style control. It was presented in the paper "[AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning](https://huggingface.co/papers/2507.12841)".

At its core, **AnyCapModel (ACM)** is a lightweight, plug-and-play framework that improves the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models and incorporates user instructions and modality features to generate improved captions. The project also introduces the **AnyCapDataset (ACD)** to address data scarcity, and **AnyCapEval** for reliable evaluation that decouples content accuracy from stylistic fidelity.
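
The plug-and-play flow described above can be sketched as follows. This is a minimal illustration of the interface only; `CaptionRequest` and `refine` are hypothetical names chosen for this sketch, not the actual ACM API (see the repository for the real implementation):

```python
# Minimal sketch of the plug-and-play flow described above. All names here
# (CaptionRequest, refine) are hypothetical illustrations, NOT the actual
# ACM API; see the AnyCap repository for the real implementation.
from dataclasses import dataclass

@dataclass
class CaptionRequest:
    modality: str      # "image", "audio", or "video"
    instruction: str   # the user's content/style instruction
    base_caption: str  # caption produced by the frozen base model

def refine(request: CaptionRequest) -> str:
    """ACM-style refinement interface: reuse the base caption and condition
    on the user instruction plus modality features, instead of retraining
    the base model."""
    # In the real model this mapping is learned; here we only show what
    # goes in: (modality, instruction, base caption) -> improved caption.
    return (
        f"[{request.modality}] Instruction: {request.instruction}\n"
        f"Base caption: {request.base_caption}\n"
        f"Refined caption: <generated by ACM>"
    )

req = CaptionRequest(
    modality="image",
    instruction="One short sentence, focus on the action.",
    base_caption="A dog is running on the beach near the water.",
)
print(refine(req))
```

The point of the sketch is the data flow: the base model is never updated, and the refiner sees both the original caption and the user's control instruction.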

---

## Highlights

- **Unified Multi-modal Captioning:** One framework covers image, audio, and video captioning with controllable styles.
- **Customizable Caption Styles:** Control caption styles through predefined instructions and models.
- **Open Benchmark & Evaluation:** AnyCapEval, an industry-level multi-modal benchmark with comprehensive evaluation protocols.
- **End-to-End Open Source:** Full training pipeline, evaluation toolkits, dataset pipeline, and open benchmark.

---

## Quick Start

### Installation

To get started with AnyCap, clone the repository and install the dependencies:

```bash
git clone https://github.com/qishisuren123/AnyCap.git
cd AnyCap
pip install -r requirements.txt
```

You may also need to install Fairseq manually:

```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
```

### Sample Usage with 🤗 Transformers

This model can be loaded and used with the `transformers` library, for example for image captioning:

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests

# Load model and processor
model_id = "qishisuren/AnyCapModel"  # Replace with a specific checkpoint if needed
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Example image input (replace with your image path or a PIL Image object)
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/image_captioning.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Prepare messages for captioning
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Apply chat template and process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens (for causal LMs,
# the output of generate() includes the prompt tokens)
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(generated_text)
```

For more detailed usage, including evaluation and other modalities, please refer to the [official GitHub repository](https://github.com/qishisuren123/AnyCap).

---

## Benchmark & Evaluation

### AnyCapEval Benchmark
<p align="center">
<img src="https://huggingface.co/qishisuren/AnyCapModel/resolve/main/assets/bench_result.jpg" width="760"/>
</p>

**Figure 2 – Evaluation methodology of AnyCapEval.**
(a) Examples demonstrating **content** scoring with *Key-point Density* (KPD) and **style** scoring rules.
(b) KPD correlation analysis, showing that KPD length-based metrics achieve the highest Pearson/Spearman/Kendall correlations with human judgments.
(c) Radar chart illustrating the large performance gains delivered by **ACM** integration across ten dimensions (IApt–Thm).
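
As a rough sketch of the key-point density idea, one plausible reading (an assumption on our part, not the paper's exact definition) is the number of reference key points a caption covers, normalized by caption length:

```python
# Hedged sketch of a key-point-density-style score. This is only an
# illustration of the idea (key points covered per unit of caption length);
# the exact AnyCapEval scoring rules are defined in the paper and repo.

def keypoint_density(caption: str, key_points: list[str]) -> float:
    """Number of reference key points mentioned, per caption word."""
    words = caption.lower().split()
    if not words:
        return 0.0
    text = caption.lower()
    hits = sum(1 for kp in key_points if kp.lower() in text)
    return hits / len(words)

caption = "A brown dog runs along the beach at sunset"
key_points = ["dog", "beach", "sunset", "ball"]
print(round(keypoint_density(caption, key_points), 3))  # 3 of 4 key points over 9 words -> 0.333
```

A length-normalized score of this shape rewards captions that pack in correct content without padding, which matches the stated motivation for KPD.
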

| | GPT-4o | **GPT-4o + ACM** | InternVL2.5-8B | **InternVL2.5-8B + ACM** |
|---|:---:|:---:|:---:|:---:|
| **Average ↑** | 2.79 | **4.15** | 2.75 | **3.98** |


> **Key takeaway:** ACM boosts GPT-4o's content scores by **+45%** and style scores by **+12%**, and yields similar gains on strong open models, highlighting the reliability and coverage of AnyCapEval.

For detailed evaluation scripts and instructions, please refer to the [GitHub repository](https://github.com/qishisuren123/AnyCap).

---

## Dataset

### AnyCapDataset (Coming Soon)

High-quality, fully annotated datasets for all three modalities (image, audio, and video) will be released soon on the Hugging Face Hub. Stay tuned!

---

## Contributing

We welcome contributions! Please open issues or submit PRs with feedback and improvements.

---

## Citation

```bibtex
@misc{ren2025anycapprojectunifiedframework,
      title={AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning},
      author={Yiming Ren and Zhiqiang Lin and Yu Li and Gao Meng and Weiyun Wang and Junjie Wang and Zicheng Lin and Jifeng Dai and Yujiu Yang and Wenhai Wang and Ruihang Chu},
      year={2025},
      eprint={2507.12841},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.12841},
}
```

---

## License

This project is licensed under the MIT License; see the [LICENSE](https://github.com/qishisuren123/AnyCap/blob/main/LICENSE) file for details.