---
pipeline_tag: image-text-to-text
tags:
- florence2
- smollm
- custom_code
license: apache-2.0
---

## FloSmolV
A vision model for **Image-Text to Text** generation, built by combining [HuggingFaceTB/SmolLM-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-360M-Instruct) and [microsoft/Florence-2-base](https://huggingface.co/microsoft/Florence-2-base).

**Florence-2-base** generates text (captions) from input images quickly, and those captions can then serve as context for a language model to answer questions. **SmolLM-360M** is a compact model from the Hugging Face team that produces fast text responses to input queries. Combining the two yields a Visual Question Answering model that can answer questions about images.
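The two-stage flow can be sketched as a simple composition. The function and stub components below are illustrative only — they are not part of the released model's API:

```python
# Illustrative sketch of the FloSmolV two-stage pipeline: a captioner
# (Florence-2) turns the image into text, and that caption becomes the
# context for the language model (SmolLM) answering the question.
def answer_question(captioner, llm, image, question):
    caption = captioner(image)  # e.g. a Florence-2 caption of the image
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return llm(prompt)          # e.g. SmolLM-360M-Instruct text generation

# Stub components stand in for the real models in this sketch:
caption_stub = lambda image: "a red cable car ascending a mountain"
llm_stub = lambda prompt: "A cable car."

print(answer_question(caption_stub, llm_stub, None, "What is the object in the image?"))
# prints: A cable car.
```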
## Usage

Make sure to install the necessary dependencies.
```bash
pip install -qU transformers accelerate einops bitsandbytes flash_attn timm
```
```python
# load a free image from Pixabay
from PIL import Image
import requests

url = "https://cdn.pixabay.com/photo/2023/11/01/11/15/cable-car-8357178_640.jpg"
img = Image.open(requests.get(url, stream=True).raw)

# download the model (it ships custom code, so trust_remote_code is required)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("dmedhi/flosmolv", trust_remote_code=True).cuda()
answer = model(img, "what is the object in the image?")
print(answer)
```
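Since the dependency list includes `bitsandbytes`, loading in 4-bit may reduce GPU memory use. The configuration below is an untested sketch — whether the custom Florence-2 + SmolLM code path supports quantized loading has not been verified for `dmedhi/flosmolv`:

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization settings (illustrative; untested with this model)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
```

If supported, pass it to `from_pretrained` via `quantization_config=quant_config`.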
You can find more details about the model and its configuration script here: https://huggingface.co/dmedhi/flosmolv/tree/main