---
title: README
emoji: ๐
colorFrom: purple
colorTo: red
sdk: static
pinned: false
---
> [!NOTE]
> This is the organization for official Transformers-converted checkpoints of Microsoft's Florence-2 model. Try the model itself [here](https://huggingface.co/spaces/gokaygokay/Florence-2). This integration unlocks use of Florence-2 with all the libraries and APIs in the Hugging Face ecosystem.
Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. It leverages the FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.
Resources and Technical Documentation:
+ [Florence-2 technical report](https://arxiv.org/abs/2311.06242)
+ [Jupyter Notebook for inference and visualization of Florence-2-large](https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb)
| Model | Model size | Model Description |
| ------- | ------------- | ------------- |
| Florence-2-base [[HF]](https://huggingface.co/florence-community/Florence-2-base) | 0.23B | Pretrained model with FLD-5B |
| Florence-2-large [[HF]](https://huggingface.co/florence-community/Florence-2-large) | 0.77B | Pretrained model with FLD-5B |
| Florence-2-base-ft [[HF]](https://huggingface.co/florence-community/Florence-2-base-ft) | 0.23B | Finetuned model on a collection of downstream tasks |
| Florence-2-large-ft [[HF]](https://huggingface.co/florence-community/Florence-2-large-ft) | 0.77B | Finetuned model on a collection of downstream tasks |
Use the code below to get started with the model.
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Florence2ForConditionalGeneration

# Load the finetuned base checkpoint and its processor
model = Florence2ForConditionalGeneration.from_pretrained(
    "florence-community/Florence-2-base-ft",
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("florence-community/Florence-2-base-ft")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# "<OD>" is the object-detection task prompt
task_prompt = "<OD>"
inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(model.device, torch.bfloat16)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
image_size = image.size
parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=image_size)
print(parsed_answer)
```