Commit 0b170b2
Parent(s): c2e43c2
Update README.md
README.md CHANGED
@@ -27,6 +27,12 @@ tags:

 # TL;DR

+## Details for Pix2Struct-RefExp (based on their [pre-processing](https://github.com/google-research/pix2struct/blob/main/pix2struct/preprocessing/convert_refexp.py))
+
+-> __Input__: An image with a bounding box drawn around a candidate object and a header containing the referring expression (stored in the image feature).
+
+-> __Output__: A boolean flag (parse feature) indicating whether the candidate object is the correct referent of the referring expression.
+
 Pix2Struct is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. The full list of available models can be found in Table 1 of the paper:

 
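The input/output contract described in the added README lines can be sketched in Python as below. This is a minimal sketch, not the author's inference code. The helper names (`draw_candidate_box`, `parse_flag`, `is_correct_referent`) are hypothetical. It assumes the decoder emits the literal text `true` or `false`, and it assumes that forcing `is_vqa = True` on the image processor makes the processor render the referring expression as a header, which has not been verified against this checkpoint's configuration.

```python
def draw_candidate_box(image, box, color="red", width=3):
    """Return a copy of a PIL image with the candidate box drawn on it.

    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates.
    The color and line width are illustrative choices, not values from the source.
    """
    from PIL import ImageDraw  # imported lazily so the parser below has no dependencies

    annotated = image.copy()
    ImageDraw.Draw(annotated).rectangle(box, outline=color, width=width)
    return annotated


def parse_flag(decoded):
    """Map decoded text to a boolean, or None if it is neither 'true' nor 'false'.

    Assumption: the model's parse output is the string 'true' or 'false'.
    """
    text = decoded.strip().lower()
    if text in ("true", "false"):
        return text == "true"
    return None


def is_correct_referent(image, box, expression,
                        model_id="gitlost-murali/pix2struct-refexp-large"):
    """Ask the model whether the boxed object matches the referring expression."""
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(model_id)
    model = Pix2StructForConditionalGeneration.from_pretrained(model_id)

    # Assumption: with is_vqa enabled, `text` is rendered as a header above the image
    # rather than tokenized as a decoder prompt.
    processor.image_processor.is_vqa = True

    inputs = processor(
        images=draw_candidate_box(image, box),
        text=expression,
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, max_new_tokens=8)
    decoded = processor.decode(output_ids[0], skip_special_tokens=True)
    return parse_flag(decoded)
```

To pick a referent among several candidate boxes, one would call `is_correct_referent` once per box and keep the boxes for which it returns `True`. In that case the processor and model should be loaded once outside the loop rather than on every call.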