---
license: mit
pipeline_tag: image-to-text
library_name: transformers
tags:
- multimodal
- captioning
- controllable
---

# AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
🤗 Model Weights | AnyCapEval Benchmark | Paper | Code
---

## Overview

**AnyCap** is a unified and controllable omni-modal captioning framework that supports caption generation for images, audio, and video with fine-grained style control. It was presented in the paper "[AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning](https://huggingface.co/papers/2507.12841)".

At its core, **AnyCapModel (ACM)** is a lightweight, plug-and-play framework that improves the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. The project also introduces **AnyCapDataset (ACD)** to address data scarcity and **AnyCapEval** for reliable evaluation by decoupling content accuracy from stylistic fidelity.

---

## Highlights

- **Unified multi-modal captioning:** one framework covers image, audio, and video captioning with controllable styles.
- **Customizable caption styles:** control caption styles through predefined instructions and models.
- **Open benchmark and evaluation:** AnyCapEval, a multi-modal benchmark with comprehensive evaluation protocols.
- **End-to-end open source:** full training pipeline, evaluation toolkits, dataset pipeline, and open benchmark.
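The plug-and-play design described above can be sketched as a two-stage flow: a frozen base captioner produces a draft, and ACM refines it under a user instruction. The function names below are illustrative stand-ins for this idea, not the actual AnyCap API:

```python
# Illustrative sketch of the ACM refinement flow; both functions are
# hypothetical stand-ins, NOT the real AnyCap API.

def base_caption(modality_input: str) -> str:
    """Stand-in for any frozen base captioner (image, audio, or video)."""
    return f"A photo described from {modality_input}."

def acm_refine(caption: str, instruction: str, modality_input: str) -> str:
    """Stand-in for AnyCapModel: conditions on the base model's caption,
    the user instruction, and the modality features. The base model
    itself is never retrained."""
    return f"({instruction}) {caption}"

draft = base_caption("example.jpg")
controlled = acm_refine(draft, "one short sentence, main subject only", "example.jpg")
print(controlled)
```

The key design point is that the base captioner stays frozen; only the lightweight refinement stage consumes the instruction, which is what makes ACM reusable across existing foundation models.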
---

## Quick Start

### Installation

To get started with AnyCap, clone the repository and install the dependencies:

```bash
git clone https://github.com/qishisuren123/AnyCap.git
cd AnyCap
pip install -r requirements.txt
```

You may also need to install Fairseq manually:

```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
```

### Sample Usage with 🤗 Transformers

This model can be loaded and used with the `transformers` library, for example, for image captioning:

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests

# Load model and processor
model_id = "qishisuren/AnyCapModel"  # Replace with a specific checkpoint if needed
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Example image input (replace with your image path or a PIL Image object)
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/image_captioning.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Prepare messages for captioning
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Apply the chat template and process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

# Generate the caption
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the echoed prompt
generated_ids = generated_ids[:, inputs.input_ids.shape[1]:]
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```

For more detailed usage, including evaluation and other modalities, please refer to the [official GitHub repository](https://github.com/qishisuren123/AnyCap).

---

## Benchmark & Evaluation

### AnyCapEval Benchmark