Question about the ChartCap fine-tuned model's benchmark performance

#2
by HeatFlux - opened

Hello,

I compared the Phi-3.5-vision-instruct-ChartCap model you shared with microsoft/Phi-3.5-vision-instruct.
For the evaluation I used the ChartQA dataset included in the ChartCap corpora.

microsoft/Phi-3.5-vision-instruct
strict_acc, relaxed_acc = 72.76, 81.64

Phi-3.5-vision-instruct-ChartCap
strict_acc, relaxed_acc = 71.24, 79.20

The base Phi-3.5-vision-instruct model performed slightly better than the fine-tuned model.
Could I ask for your opinion on how best to interpret this result?

Also, if you have any evaluation results on ChartQA, I would greatly appreciate it if you could share them.

Thank you in advance,

Thank you for running the comparison and for sharing the results.

The slightly lower accuracy of the fine-tuned model on ChartQA is most likely due to the effect of large-scale captioning fine-tuning. This process biases the model’s instruction-following behavior toward producing captions rather than concise answers, which is a common phenomenon when a model is adapted strongly to a specific downstream task.
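One quick way to probe this hypothesis is to append an explicit format constraint to the question and compare the outputs of both checkpoints. Below is a minimal sketch following the standard Phi-3.5-vision inference recipe from the base model card; the repo id, image path, and question are placeholders, not the actual evaluation setup:

```python
# Hypothetical probe: does an explicit brevity instruction restore short answers?
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Phi-3.5-vision-instruct-ChartCap"  # placeholder; substitute the actual repo id
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="cuda"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("chart.png")  # placeholder chart image
question = "What is the highest value in the chart?"  # placeholder question
messages = [{
    "role": "user",
    # The trailing constraint is the variable under test.
    "content": f"<|image_1|>\n{question} Answer with a single word or number.",
}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=32,
    do_sample=False,
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]  # drop prompt tokens
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```

If the gap narrows under this prompting, that supports the caption-bias interpretation rather than a loss of chart-understanding ability.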

In our own experiments on ChartQA, we observed the same pattern: the model frequently outputs captions or descriptive explanations instead of the short number or string expected by the evaluation metric.
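For context, ChartQA is typically scored with relaxed accuracy: a numeric prediction may deviate from the target by up to 5%, while everything else falls back to an exact (case-insensitive) string match. The sketch below is an illustrative reimplementation of that rule, not the exact evaluation script, and it shows why a caption-style output fails even when it contains the correct number:

```python
def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed match: numeric answers may deviate by up to
    `tolerance` from the target; non-numeric answers must match exactly
    (case-insensitive). The whole prediction string is compared."""
    prediction, target = prediction.strip(), target.strip()
    try:
        pred_val, tgt_val = float(prediction), float(target)
        if tgt_val == 0.0:
            return pred_val == tgt_val
        return abs(pred_val - tgt_val) / abs(tgt_val) <= tolerance
    except ValueError:
        return prediction.lower() == target.lower()

# A concise answer within the 5% tolerance scores as correct:
assert relaxed_accuracy("42.1", "42")
# A caption containing the right number still scores as wrong, because the
# full string is not parseable as a number and does not match exactly:
assert not relaxed_accuracy("The chart shows the value peaked at 42.", "42")
```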

junyoung-00 changed discussion status to closed
