How to use junyoung-00/Phi-3.5-vision-instruct-ChartCap with Transformers:
# Use a pipeline as a high-level helper
# Warning: Pipeline type "image-to-text" is no longer supported in transformers v5.
# You must load the model directly (see below) or downgrade to v4.x with:
# pip install "transformers<5.0.0"
from transformers import pipeline

pipe = pipeline("image-to-text", model="junyoung-00/Phi-3.5-vision-instruct-ChartCap", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("junyoung-00/Phi-3.5-vision-instruct-ChartCap", trust_remote_code=True, dtype="auto")
Question about the benchmark performance of the ChartCap fine-tuned model
Hello,
I compared the Phi-3.5-vision-instruct-ChartCap model you shared with microsoft/Phi-3.5-vision-instruct.
The dataset used was the ChartQA dataset included in the ChartCap corpora.
microsoft/Phi-3.5-vision-instruct: static_acc = 72.76, related_acc = 81.64
Phi-3.5-vision-instruct-ChartCap: static_acc = 71.24, related_acc = 79.20
The base microsoft/Phi-3.5-vision-instruct model scored slightly higher than the fine-tuned model on both metrics (by 1.52 and 2.44 points, respectively).
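The two scores are reported under the poster's own names, static_acc and related_acc. Assuming they correspond to strict (exact-match) accuracy and the relaxed accuracy commonly used for ChartQA, where a numeric answer counts as correct if it is within 5% of the target, a minimal sketch of both metrics might look like this. The helper names are hypothetical and the number parsing is simplified; this is not the poster's evaluation script.

```python
def _to_float(text: str):
    """Parse a numeric answer, ignoring thousands separators and a trailing '%'.
    Returns None when the text is not a plain number."""
    try:
        return float(text.strip().rstrip("%").replace(",", ""))
    except ValueError:
        return None


def strict_match(pred: str, target: str) -> bool:
    """Exact match after trimming whitespace and lowercasing."""
    return pred.strip().lower() == target.strip().lower()


def relaxed_match(pred: str, target: str, tolerance: float = 0.05) -> bool:
    """Relaxed match (assumed definition): numeric answers may differ from a
    non-zero target by up to `tolerance` relative error; anything else falls
    back to strict_match."""
    p, t = _to_float(pred), _to_float(target)
    if p is not None and t:  # skip when t is None or 0.0 (relative error undefined)
        return abs(p - t) / abs(t) <= tolerance
    return strict_match(pred, target)


def accuracy(preds, targets, match) -> float:
    """Percentage of predictions accepted by `match` against their targets."""
    hits = sum(match(p, t) for p, t in zip(preds, targets))
    return 100.0 * hits / len(targets)


if __name__ == "__main__":
    preds = ["42", "19.5", "Blue", "The chart shows 30"]
    targets = ["42", "20", "blue", "30"]
    print(accuracy(preds, targets, strict_match))   # 50.0
    print(accuracy(preds, targets, relaxed_match))  # 75.0
```

The last prediction shows why verbose answers hurt both scores: a correct number embedded in a sentence neither matches the target string nor parses as a number.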
Could I ask for your opinion on how best to interpret this result?
Also, if you have any evaluation results on ChartQA, I would greatly appreciate it if you could share them.
Thank you in advance,
Thank you for running the comparison and for sharing the results.
The slightly lower accuracy of the fine-tuned model on ChartQA is most likely a side effect of the large-scale captioning fine-tuning. This training biases the model's instruction-following behavior toward producing captions rather than concise answers, which is a common phenomenon when a model is strongly adapted to a specific downstream task.
In our own experiments on ChartQA, we observed the same pattern: the model frequently outputs captions or descriptive explanations instead of the short number or string expected by the evaluation metric.
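To illustrate this failure mode: a short-answer metric compares the whole response with the target, so a caption-style reply that contains the correct value still scores zero. The sketch below pairs a strict comparison with a hypothetical, deliberately naive post-processing step (naive_extract, not part of ChartCap or of either party's evaluation) that keeps only the last number in a response. It recovers some answers but can pick the wrong number, as the final example shows.

```python
import re

# An optionally signed number with optional thousands separators and decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def exact_match(pred: str, target: str) -> bool:
    """Strict short-answer comparison (whitespace- and case-insensitive)."""
    return pred.strip().lower() == target.strip().lower()


def naive_extract(output: str) -> str:
    """Hypothetical heuristic: keep only the last number in a verbose response,
    or return the trimmed response unchanged if it contains no number."""
    numbers = NUMBER.findall(output)
    return numbers[-1].replace(",", "") if numbers else output.strip()


if __name__ == "__main__":
    reply = ("The bar chart compares revenue across regions; "
             "Europe has the highest value at 1,250.")
    print(exact_match(reply, "1250"))                 # False: whole caption vs. short answer
    print(exact_match(naive_extract(reply), "1250"))  # True after extraction
    # Brittle: the heuristic picks the year, not the plotted value.
    print(naive_extract("Sales peaked at 45 in 2018."))  # 2018
```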