---
license: apache-2.0
library_name: transformers
pipeline_tag: image-to-text
---
|
|
|
|
|
<div align="center">

<h1>OmniCaptioner: One Captioner to Rule Them All</h1>

<p align="center">
<a href="https://alpha-innovator.github.io/OmniCaptioner-project-page/"><b>HomePage</b></a> | <a href="https://github.com/Alpha-Innovator/OmniCaptioner">Github</a> | <a href="https://huggingface.co/papers/2504.07089">Paper</a>
</p>

</div>
|
|
|
|
|
|
|
|
## 💻 Finetuning Code
|
|
### 1. Create a conda environment and install PyTorch |
|
|
```bash
conda create -n OmniCap python=3.9
conda activate OmniCap
# The step title calls for PyTorch here; pick the build matching your CUDA version
# (see https://pytorch.org), and skip this if requirements.txt already pins one.
pip install torch torchvision
```
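You can then verify that PyTorch was installed with GPU support:

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```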
|
|
### 2. Install dependencies
|
|
```bash
pip install -r requirements.txt
```
|
|
### 3. Install flash-attn |
|
|
```bash
pip install flash-attn --no-build-isolation
```
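If the build succeeded, the package should import cleanly; a quick check:

```bash
python -c "import flash_attn; print(flash_attn.__version__)"
```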
|
|
### 4. Prepare data |
|
|
You can list the paths to your caption data files in `./data/caption_data.yaml`, as in the sketch below.
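The exact schema is defined by the repository's data loader, so the keys below are illustrative assumptions only, not the confirmed format:

```yaml
# Hypothetical layout for ./data/caption_data.yaml -- adapt the keys
# to whatever the repository's data loader actually expects.
datasets:
  - json_path: /path/to/caption_annotations.json  # assumed: annotation file per dataset
    sampling_rate: 1.0                            # assumed: fraction of samples to use
```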
|
|
|
|
|
### 5. Start finetuning |
|
|
```bash
bash scripts/finetune_caption_slurm.sh
```
|
|
## 🚀 Inference Code
|
|
|
|
|
You can caption an image in AIGC (AI-generated content) style using the following command:
|
|
|
|
|
|
|
|
```bash
CUDA_VISIBLE_DEVICES=0 python src/inference_single_image.py \
    --model_path your_model_path \
    --image_path your_image_path \
    --image_type aigc
```
|
|
|
|
|
You can caption an image in OCR style using the following command:
|
|
|
|
|
```bash
CUDA_VISIBLE_DEVICES=0 python src/inference_single_image.py \
    --model_path your_model_path \
    --image_path your_image_path \
    --image_type ocr
```
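Since the model card declares `library_name: transformers` and `pipeline_tag: image-to-text`, the checkpoint may also work with the generic `transformers` pipeline. The sketch below rests on that assumption; the model and image paths are placeholders, and style control (the `--image_type` flag above) is handled by the repo's own script, not by this generic interface:

```python
from transformers import pipeline

# Assumption: the checkpoint is compatible with the generic image-to-text
# pipeline, as the model card's pipeline_tag suggests.
captioner = pipeline("image-to-text", model="your_model_path", device=0)

# Pass a local image path (a PIL.Image or URL also works); the pipeline
# returns a list of dicts with a "generated_text" field.
result = captioner("your_image_path")
print(result[0]["generated_text"])
```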
|
|
## 📊 Evaluation Code with LLM
|
|
|
|
|
```bash
# Set up VLMEvalkit in its own environment
cd VLMEvalkit
conda create -n VLMEvalkit python=3.9
conda activate VLMEvalkit
pip install -e .

# Evaluate the Qwen2.5-based captioners on MMMU_DEV_VAL
CUDA_VISIBLE_DEVICES=0 nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-qwen2-5-3B --verbose > output_omnicap_qwen2-5-3B_MMMU_DEV_VAL.log 2>&1 &
CUDA_VISIBLE_DEVICES=0,1 nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-qwen2-5-7B --verbose > output_omnicap_qwen2-5-7B_MMMU_DEV_VAL.log 2>&1 &
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-qwen2-5-32B --verbose > output_omnicap_qwen2-5-32B_MMMU_DEV_VAL.log 2>&1 &

# Evaluate the DeepSeek-distill-based captioners on MMMU_DEV_VAL
CUDA_VISIBLE_DEVICES=0 nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-deepseek-distill-7B --verbose > output_omnicap_deepseek_distill_7B_MMMU_DEV_VAL.log 2>&1 &
CUDA_VISIBLE_DEVICES=0,1 nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-deepseek-distill-32B --verbose > output_omnicap_deepseek_distill_32B_MMMU_DEV_VAL.log 2>&1 &
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python run.py --data MMMU_DEV_VAL --model Omnicaptioner-deepseek-distill-70B --verbose > output_omnicap_deepseek_distill_70B_MMMU_DEV_VAL.log 2>&1 &
```
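Each run detaches via `nohup`, so output goes to the named log file instead of the terminal. You can follow a run while it executes, for example:

```bash
tail -f output_omnicap_qwen2-5-7B_MMMU_DEV_VAL.log
```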
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find the provided code or models useful for your research, please consider citing:
|
|
```bibtex
@misc{omnicaptioner2025,
      title={OmniCaptioner: One Captioner to Rule Them All},
      author={Anonymous Authors},
      year={2025},
      eprint={2504.07089},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.07089}
}
```