---
license: mit
language:
- en
---

# ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

Paper Link: https://arxiv.org/abs/2407.04172

The abstract of the paper states:
> Given the ubiquity of charts as a data analysis, visualization, and decision-making tool across industries and sciences, there has been a growing interest in developing pre-trained foundation models as well as general purpose instruction-tuned models for chart understanding and reasoning. However, existing methods suffer crucial drawbacks across two critical axes affecting the performance of chart representation models: they are trained on data generated from underlying data tables of the charts, ignoring the visual trends and patterns in chart images, *and* use weakly aligned vision-language backbone models for domain-specific training, limiting their generalizability when encountering charts in the wild. We address these important drawbacks and introduce ChartGemma, a novel chart understanding and reasoning model developed over PaliGemma. Rather than relying on underlying data tables, ChartGemma is trained on instruction-tuning data generated directly from chart images, thus capturing both high-level trends and low-level visual information from a diverse set of charts. Our simple approach achieves state-of-the-art results across 5 benchmarks spanning chart summarization, question answering, and fact-checking, and our elaborate qualitative studies on real-world charts show that ChartGemma generates more realistic and factually correct summaries compared to its contemporaries.

# Web Demo
If you wish to quickly try our model, you can use our public web demo hosted on Hugging Face Spaces, which offers a friendly interface!

[ChartGemma Web Demo](https://huggingface.co/spaces/ahmed-masry/ChartGemma)

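If you prefer to query the demo programmatically, the `gradio_client` package can connect to a public Space. This is only a minimal sketch, not an official API: the demo app defines its own endpoint names and argument order, so inspect them with `view_api()` before building a `predict` call.

```python
from gradio_client import Client

# Connect to the public ChartGemma Space (assumes the Space is running).
client = Client("ahmed-masry/ChartGemma")

# Print the Space's actual endpoints and parameters; the predict call you
# build from this listing depends on how the demo app is defined.
print(client.view_api())
```
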
# Inference
You can easily use our model for inference with the Hugging Face Transformers library!
You just need to do the following:
1. Change the **_image_path_** to the path of your chart image on your system
2. Write the **_input_text_**

We recommend using beam search with a beam size of 4, but if your machine has limited memory, you can remove the `num_beams` argument from the `generate` method (see the greedy-decoding variant after the example below).
```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
import torch

# Download an example chart image from the ChartQA dataset into the current directory
torch.hub.download_url_to_file('https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/multi_col_1229.png', 'chart_example_1.png')

image_path = "chart_example_1.png"
input_text = "program of thought: what is the sum of Facebook Messenger and WhatsApp values in the 18-29 age group?"

# Load Model
model = PaliGemmaForConditionalGeneration.from_pretrained("ahmed-masry/chartgemma", torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained("ahmed-masry/chartgemma")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Process Inputs
image = Image.open(image_path).convert('RGB')
inputs = processor(text=input_text, images=image, return_tensors="pt")
prompt_length = inputs['input_ids'].shape[1]
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate: decode only the newly generated tokens, skipping the prompt
generate_ids = model.generate(**inputs, num_beams=4, max_new_tokens=512)
output_text = processor.batch_decode(generate_ids[:, prompt_length:], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output_text)
```
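
As noted above, if your machine has limited memory, you can drop beam search in favor of greedy decoding. A minimal variant of the generation step, reusing the `model`, `processor`, `inputs`, and `prompt_length` from the example above:

```python
# Greedy decoding: omits num_beams=4, trading a little output quality
# for a smaller memory footprint during generation.
generate_ids = model.generate(**inputs, max_new_tokens=512)
output_text = processor.batch_decode(generate_ids[:, prompt_length:], skip_special_tokens=True)[0]
print(output_text)
```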

# Contact
If you have any questions about this work, please contact **[Ahmed Masry](https://ahmedmasryku.github.io/)** at **amasry17@ku.edu.tr** or **ahmed.elmasry24653@gmail.com**.

# Reference
Please cite our paper if you use our model in your research.

```bibtex
@misc{masry2024chartgemmavisualinstructiontuningchart,
      title={ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild},
      author={Ahmed Masry and Megh Thakkar and Aayush Bajaj and Aaryaman Kartha and Enamul Hoque and Shafiq Joty},
      year={2024},
      eprint={2407.04172},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.04172},
}
```