Trying to get bounding box confidence values for object detection
I am trying to produce bounding boxes, confidence levels, and labels for predictions on an image.
The code I am using is below.
image = Image.open(image_path)
inputs = self.processor(text=self.prompt, images=image, return_tensors="pt")
generated_ids = self.model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=128,
    num_beams=2,
    do_sample=False,
    return_dict_in_generate=True,
    output_scores=True,
)
generated_text = self.processor.batch_decode(
    generated_ids.sequences, skip_special_tokens=False
)[0]
parsed_answer = self.processor.post_process_generation(
    generated_text, task=self.prompt, image_size=(image.width, image.height)
)
transition_scores = self.model.compute_transition_scores(
    generated_ids.sequences,
    generated_ids.scores,
    generated_ids.beam_indices,
    normalize_logits=False,
)
bounding_box_tokens = generated_ids.sequences[0][4:-1].numpy()
bounding_box_scores = transition_scores[0][3:-1].numpy()
bounding_box_indexs = np.where(
    np.logical_and(bounding_box_tokens >= 50269, bounding_box_tokens <= 51268)
)
bounding_box_scores = bounding_box_scores[bounding_box_indexs]
score_split_arrays = np.exp(
    np.mean(
        np.array_split(bounding_box_scores, len(bounding_box_scores) / 4),
        axis=1,
    )
)
return (
    torch.Tensor(parsed_answer[self.prompt]["bboxes"]),
    torch.Tensor(score_split_arrays),
    parsed_answer[self.prompt]["labels"],
)
As you can see, this is very similar to the example implementation. The key question is whether my approach to isolating tokens is correct. I trim the ends of the token and score arrays because those positions seem to belong to tokens that mark the ends of the sequence. I then find the indices where the token id falls in the range that I believe corresponds to location tokens, and use those same indices to pick out the scores. This assumes that each score sits at the same index as its token. Is that the case? More generally, does this approach actually do what I believe it does based on my explanation?
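One way to check the index assumption is to run the same logic on a small synthetic sequence where the expected answer is known. The sketch below is not the model's own post-processing. It assumes that entry k of the score array belongs to token k + 1 of the generated sequence (the first token is a decoder start token with no score), that the scores are per-token log-probabilities, and that the location-token id range is the one quoted in the question. The function name `box_confidences` and the `offset` parameter are hypothetical.

```python
import numpy as np

# Assumed location-token id range, copied from the question (not verified here).
LOC_MIN, LOC_MAX = 50269, 51268


def box_confidences(sequence, token_log_probs, offset=1):
    """Return one confidence per box from per-token log-probabilities.

    sequence: 1-D token ids, including `offset` leading tokens without a score.
    token_log_probs: 1-D array; entry k is assumed to score sequence[offset + k].
    """
    sequence = np.asarray(sequence)
    token_log_probs = np.asarray(token_log_probs, dtype=float)

    # Align tokens with scores by dropping the unscored leading tokens.
    tokens = sequence[offset:offset + len(token_log_probs)]
    if len(tokens) != len(token_log_probs):
        raise ValueError("sequence and scores cannot be aligned")

    # Keep only the scores whose token id falls in the location range.
    is_location = (tokens >= LOC_MIN) & (tokens <= LOC_MAX)
    location_scores = token_log_probs[is_location]
    if location_scores.size % 4 != 0:
        raise ValueError("location tokens do not form complete boxes")

    # Four coordinates per box; exp(mean(log p)) is the geometric mean of p.
    return np.exp(location_scores.reshape(-1, 4).mean(axis=1))


# Synthetic example: start token, one text token, two labelled boxes, end token.
sequence = [2, 0, 100, 50269, 50369, 50469, 50569,
            200, 50300, 50400, 50500, 50600, 2]
token_log_probs = ([-0.1, -0.5] + [np.log(0.9)] * 4
                   + [-0.3] + [np.log(0.5)] * 4 + [-0.2])
confidences = box_confidences(sequence, token_log_probs)
```

Masking on the aligned arrays, rather than slicing each array by hand first, makes it harder to introduce an off-by-one error between tokens and scores. Whether the offset of one holds for a given model and generation mode is worth confirming by comparing the array lengths.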
Hi @leoxiaobin, do you have an intuition here? My guess is that you would only be able to get confidence levels per token, which would not directly be a confidence score for a specific bounding box.
From your experience modeling Florence-2, would you have any suggestions for a good confidence score here?
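To make the per-token versus per-box distinction concrete, here is a small sketch of three common ways to collapse the four coordinate-token probabilities of a box into one number. None of these is confirmed as the Florence-2 authors' method, and none is a calibrated detection confidence. The function name `aggregate_box_scores` is hypothetical, and the input is assumed to be log-probabilities in groups of four.

```python
import numpy as np


def aggregate_box_scores(coord_log_probs):
    """Collapse per-coordinate log-probabilities into per-box scores.

    coord_log_probs: flat sequence whose length is a multiple of 4,
    ordered box by box (an assumption of this sketch).
    """
    log_probs = np.asarray(coord_log_probs, dtype=float).reshape(-1, 4)
    return {
        # Geometric mean of the four token probabilities.
        "geometric_mean": np.exp(log_probs.mean(axis=1)),
        # Joint probability of the four tokens.
        "product": np.exp(log_probs.sum(axis=1)),
        # Probability of the least certain coordinate.
        "minimum": np.exp(log_probs.min(axis=1)),
    }


# One box with three confident coordinates and one uncertain coordinate.
scores = aggregate_box_scores(np.log([0.9, 0.9, 0.9, 0.4]))
```

The geometric mean hides a single weak coordinate more than the product or the minimum do, so the choice depends on how strongly one uncertain coordinate should lower the box score.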
Try the latest commit for the confidence score.
Hi @haipingwu, thanks for this update! Quick question: can the scores also be extracted from description_with_bboxes_or_polygons and phrase_grounding?
Is it possible to get a score for the CAPTION_TO_PHRASE_GROUNDING task?
Hey!
Did you solve the problem?