Cannot reproduce results on InfographicsVQA
#1
by zhuowan - opened
I am running inference on InfographicsVQA with the pix2struct-infographics-vqa-base and pix2struct-infographics-vqa-large models. However, I get 29.53 ANLS for base and 34.31 ANLS for large, which do not match the 38.2 and 40.0 reported in the original paper. Could anyone help with this?
Here is my inference code:
import requests
from PIL import Image
import torch
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-infographics-vqa-base").to("cuda")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-infographics-vqa-base")
image_url = "https://blogs.constantcontact.com/wp-content/uploads/2019/03/Social-Media-Infographic.png"
image = Image.open(requests.get(image_url, stream=True).raw)
question = "Which social platform has heavy female audience?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda")
predictions = model.generate(**inputs)
pred = processor.decode(predictions[0], skip_special_tokens=True)
gt = 'pinterest'
print(pred)
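
For reference, the script above only prints a single prediction; the ANLS numbers quoted in the question also depend on how predictions are scored. Below is a minimal sketch of ANLS (Average Normalized Levenshtein Similarity) as it is commonly defined for DocVQA-style benchmarks, assuming a 0.5 threshold and case-insensitive matching after stripping whitespace. The function names (`levenshtein`, `anls_score`, `anls`) are hypothetical helpers, not part of Transformers, and the normalization may differ from the official evaluation script, so small score differences are possible.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        previous = current
    return previous[-1]


def anls_score(prediction: str, answers: List[str], threshold: float = 0.5) -> float:
    """Score for one question: best similarity over all accepted answers.

    Assumption: both strings are lowercased and stripped before comparison.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for answer in answers:
        gt = answer.strip().lower()
        longest = max(len(pred), len(gt))
        nl = levenshtein(pred, gt) / longest if longest else 0.0
        # Similarity counts only when the normalized distance is below the threshold.
        similarity = 1.0 - nl if nl < threshold else 0.0
        best = max(best, similarity)
    return best


def anls(predictions: List[str], references: List[List[str]]) -> float:
    """Mean per-question score over a dataset."""
    scores = [anls_score(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

With the variables from the script above, a single example would be scored as `anls_score(pred, [gt])`; multiply the dataset-level `anls(...)` value by 100 to compare against figures reported as percentages.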