Question about the ChartCap fine-tuned model's benchmark performance

#2
by HeatFlux - opened

Hello,

I compared the Phi-3.5-vision-instruct-ChartCap model you shared with microsoft/Phi-3.5-vision-instruct.
For the evaluation I used the ChartQA dataset included in the ChartCap corpora.

microsoft/Phi-3.5-vision-instruct
strict_acc, relaxed_acc = 72.76, 81.64

Phi-3.5-vision-instruct-ChartCap
strict_acc, relaxed_acc = 71.24, 79.20

The base Phi-3.5-vision-instruct model performed slightly better than the fine-tuned model.
Could I ask for your opinion on how best to interpret this result?

Also, if you have any evaluation results on ChartQA, I would greatly appreciate it if you could share them.

Thank you in advance,

Thank you for running the comparison and for sharing the results.

The slightly lower accuracy of the fine-tuned model on ChartQA is most likely due to the effect of large-scale captioning fine-tuning. This process biases the model’s instruction-following behavior toward producing captions rather than concise answers, which is a common phenomenon when a model is adapted strongly to a specific downstream task.
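One quick way to probe this hypothesis is to append an explicit format constraint to the question and compare the outputs of both checkpoints. Below is a minimal sketch following the standard Phi-3.5-vision inference recipe from the base model card; the repo id, image path, and question are placeholders, not the actual evaluation setup:

```python
# Hypothetical probe: does an explicit brevity instruction restore short answers?
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Phi-3.5-vision-instruct-ChartCap"  # placeholder; substitute the actual repo id
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="cuda"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("chart.png")  # placeholder chart image
question = "What is the highest value in the chart?"  # placeholder question
messages = [{
    "role": "user",
    # The trailing constraint is the variable under test.
    "content": f"<|image_1|>\n{question} Answer with a single word or number.",
}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=32,
    do_sample=False,
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]  # drop prompt tokens
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```

If the gap narrows under this prompting, that supports the caption-bias interpretation rather than a loss of chart-understanding ability.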

In our own experiments on ChartQA, we observed the same pattern: the model frequently outputs captions or descriptive explanations instead of the short number or string expected by the evaluation metric.
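For context, ChartQA is typically scored with relaxed accuracy: a numeric prediction may deviate from the target by up to 5%, while everything else falls back to an exact (case-insensitive) string match. The sketch below is an illustrative reimplementation of that rule, not the exact evaluation script, and it shows why a caption-style output fails even when it contains the correct number:

```python
def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed match: numeric answers may deviate by up to
    `tolerance` from the target; non-numeric answers must match exactly
    (case-insensitive). The whole prediction string is compared."""
    prediction, target = prediction.strip(), target.strip()
    try:
        pred_val, tgt_val = float(prediction), float(target)
        if tgt_val == 0.0:
            return pred_val == tgt_val
        return abs(pred_val - tgt_val) / abs(tgt_val) <= tolerance
    except ValueError:
        return prediction.lower() == target.lower()

# A concise answer within the 5% tolerance scores as correct:
assert relaxed_accuracy("42.1", "42")
# A caption containing the right number still scores as wrong, because the
# full string is not parseable as a number and does not match exactly:
assert not relaxed_accuracy("The chart shows the value peaked at 42.", "42")
```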

junyoung-00 changed discussion status to closed
