---
license: mit
pipeline_tag: image-to-text
library_name: transformers
tags:
- chart-captioning
- multimodal
- vision-language-model
---

# ChartCap: Mitigating Hallucination of Dense Chart Captioning

This repository contains the model presented in the paper [**ChartCap: Mitigating Hallucination of Dense Chart Captioning**](https://huggingface.co/papers/2508.03164).

**Project Page:** [https://junyoung-00.github.io/ChartCap/](https://junyoung-00.github.io/ChartCap/)\
**Code:** [https://github.com/junyoung-00/ChartCap](https://github.com/junyoung-00/ChartCap)

## Model Description

`Phi-3.5-vision-instruct-ChartCap` is a version of [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) fine-tuned on ChartCap.

The model generates dense, high-quality captions for charts: it aims to accurately capture the structural elements and key insights discernible from a chart while avoiding extraneous or hallucinated content.

## Required Packages

```bash
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```
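
The pinned versions above can be installed with pip. A minimal sketch, assuming a CUDA-capable environment (`flash_attn` compiles against the installed `torch`, so it is installed last):

```bash
pip install numpy==1.24.4 Pillow==10.3.0 Requests==2.31.0 torch==2.3.0 torchvision==0.18.0 transformers==4.43.0 accelerate==0.30.0
pip install flash_attn==2.5.8
```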

## How to Use

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import torch

model_id = "junyoung-00/Phi-3.5-vision-instruct-ChartCap"

# Phi-3.5-vision ships custom modeling code, so trust_remote_code is required
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load an example chart image (URL or local path)
image_url = "https://your-server.com/example_chart.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Phi-3.5-vision uses numbered image placeholders such as <|image_1|>
prompt = "Please provide a detailed caption for the chart."
messages = [
    {"role": "user", "content": f"<|image_1|>\n{prompt}"}
]

# Render the chat template to a string; the processor pairs it with the image
prompt_text = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt_text, [image], return_tensors="pt").to(model.device)

# Generate the caption
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=processor.tokenizer.eos_token_id,
)

# Strip the prompt tokens before decoding so only the caption remains
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response.strip())
```
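
If the chart image is stored locally instead of behind a URL, only the loading step changes; a minimal variant (the path below is a hypothetical placeholder):

```python
from PIL import Image

# Hypothetical local path; point this at your own chart image
image = Image.open("charts/example_chart.png").convert("RGB")
```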

## Citation

If you find this model or the associated research helpful, please cite:

```bibtex
@inproceedings{lim2025chartcap,
  title     = {ChartCap: Mitigating Hallucination of Dense Chart Captioning},
  author    = {Junyoung Lim and Jaewoo Ahn and Gunhee Kim},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year      = {2025}
}
```