Fix KV cache compatibility with transformers 4.50+

#42

by kebabman - opened Jan 9

base: refs/heads/main ← from: refs/pr/42

+10 -10

With transformers >= 4.50, use_cache=True fails with:
AttributeError: 'NoneType' object has no attribute 'shape'

Cause: transformers 4.50 changed the representation of an empty cache from None to an EncoderDecoderCache object.
The modeling code checks "if past_key_values is not None", which now passes, and then fails when it accesses
past_key_values[0][0].shape while the cache entries are still None.
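The failure mode described above can be reproduced without loading the model. The sketch below is only an illustration: EmptyCacheStandIn is a hypothetical stand-in (it is not the real EncoderDecoderCache class), and legacy_past_length and try_legacy are made-up names that mimic the unguarded check pattern.

```python
class EmptyCacheStandIn:
    """Hypothetical stand-in: a non-None cache whose entries hold no tensors yet."""

    def __init__(self, num_layers):
        # Each layer slot is a (key, value) pair that has not been filled.
        self._layers = [(None, None) for _ in range(num_layers)]

    def __getitem__(self, idx):
        return self._layers[idx]

    def __len__(self):
        return len(self._layers)


def legacy_past_length(past_key_values):
    # Unguarded pattern: only the outer object is checked against None.
    if past_key_values is not None:
        return past_key_values[0][0].shape[2]
    return 0


def try_legacy(past_key_values):
    # Return the length, or the error message if the unguarded access fails.
    try:
        return legacy_past_length(past_key_values)
    except AttributeError as exc:
        return str(exc)


print(try_legacy(None))                  # old-style empty cache: no error
print(try_legacy(EmptyCacheStandIn(2)))  # new-style empty cache: AttributeError message
```

With None the check short-circuits as before; with the stand-in object the outer check passes and the inner access hits a None entry, which is the AttributeError quoted in this description.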

Fix: Add null checks before accessing cache tensor shapes.
Backward compatible with all transformers versions.

Changes (8 locations):

- Attention shape checks: add "past_key_value[0] is not None and"
- Attention elif conditions: add "and past_key_value[0] is not None"
- kv_seq_len update: wrap in null check
- Decoder forward ternary: add full null check chain
- prepare_inputs_for_generation: add full null check chain
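As a rough sketch of what a "full null check chain" could look like, the helper below guards each level before reading a tensor shape. It is not the code from this PR: cached_length is a hypothetical name, the list-of-tuples cache and SimpleNamespace tensors are stand-ins, and the (batch, heads, seq_len, head_dim) layout is an assumption.

```python
from types import SimpleNamespace


def cached_length(past_key_values):
    """Return the cached sequence length, or 0 when no key tensor is stored yet."""
    if (
        past_key_values is not None
        and past_key_values[0] is not None
        and past_key_values[0][0] is not None
    ):
        # Assumed layout: (batch, heads, seq_len, head_dim).
        return past_key_values[0][0].shape[2]
    return 0


# Stand-in inputs covering the three cases the checks distinguish.
no_cache = None                      # legacy "empty" representation
empty_cache = [(None, None)]         # non-None container with unfilled entries
fake_tensor = SimpleNamespace(shape=(1, 16, 5, 64))
filled_cache = [(fake_tensor, fake_tensor)]

for cache in (no_cache, empty_cache, filled_cache):
    print(cached_length(cache))
```

Because every level is checked before it is dereferenced, the same helper accepts both the None representation and a container whose entries are still None, which is what makes this style of check backward compatible.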

Tested on transformers 4.57.1 with both ROCm and CUDA.
Enables ~10-25% speedup from proper KV caching.

Fix KV cache compatibility with transformers 4.50+ (7c0fd564)

Ready to merge

This branch is ready to get merged automatically.
