---
pipeline_tag: image-text-to-text
tags:
- florence2
- smollm
- custom_code
license: apache-2.0
---

## FloSmolV
A vision model for **Image-Text to Text** generation, built by combining [HuggingFaceTB/SmolLM-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-360M-Instruct) and [microsoft/Florence-2-base](https://huggingface.co/microsoft/Florence-2-base).

**Florence-2-base** generates text (captions) from input images quickly, and those captions can then serve as context for a language model to answer questions. **SmolLM-360M** is a compact model from the Hugging Face team that produces fast text responses to input queries. Combining the two yields a Visual Question Answering model that can answer questions about images.
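The two-stage flow can be sketched as a simple composition. The function and stub components below are illustrative only — they are not part of the released model's API:

```python
# Illustrative sketch of the FloSmolV two-stage pipeline: a captioner
# (Florence-2) turns the image into text, and that caption becomes the
# context for the language model (SmolLM) answering the question.
def answer_question(captioner, llm, image, question):
    caption = captioner(image)  # e.g. a Florence-2 caption of the image
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return llm(prompt)          # e.g. SmolLM-360M-Instruct text generation

# Stub components stand in for the real models in this sketch:
caption_stub = lambda image: "a red cable car ascending a mountain"
llm_stub = lambda prompt: "A cable car."

print(answer_question(caption_stub, llm_stub, None, "What is the object in the image?"))
# prints: A cable car.
```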
## Usage

Make sure to install the necessary dependencies.
```bash
pip install -qU transformers accelerate einops bitsandbytes flash_attn timm
```
```python
# load a free image from Pixabay
from PIL import Image
import requests

url = "https://cdn.pixabay.com/photo/2023/11/01/11/15/cable-car-8357178_640.jpg"
img = Image.open(requests.get(url, stream=True).raw)

# download the model (it ships custom code, so trust_remote_code is required)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("dmedhi/flosmolv", trust_remote_code=True).cuda()
answer = model(img, "what is the object in the image?")
print(answer)
```
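Since the dependency list includes `bitsandbytes`, loading in 4-bit may reduce GPU memory use. The configuration below is an untested sketch — whether the custom Florence-2 + SmolLM code path supports quantized loading has not been verified for `dmedhi/flosmolv`:

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization settings (illustrative; untested with this model)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
```

If supported, pass it to `from_pretrained` via `quantization_config=quant_config`.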
You can find more details about the model and its configuration script here: https://huggingface.co/dmedhi/flosmolv/tree/main