Instructions to use microsoft/Florence-2-base-ft with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/Florence-2-base-ft with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="microsoft/Florence-2-base-ft", trust_remote_code=True)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
```

- Notebooks
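Florence-2 is steered by task prompt tokens (for example `<CAPTION>` or `<OD>` for object detection), and detection outputs come back as a string of labels interleaved with quantized `<loc_k>` coordinate tokens, which the processor's `post_process_generation` converts into pixel boxes. As a rough sketch of what that decoding does, assuming the documented scheme of coordinates quantized into 1000 bins (the helper name is made up, and the processor's exact dequantization rounding may differ slightly):

```python
import re

def parse_od_output(text, image_size):
    """Parse a Florence-2 <OD>-style output string such as
    'car<loc_52><loc_333><loc_932><loc_774>' into (label, box) pairs.
    Coordinates are quantized into 1000 bins over the image dimensions."""
    width, height = image_size
    results = []
    # Each match: a label followed by four <loc_k> coordinate tokens (x1, y1, x2, y2).
    for label, x1, y1, x2, y2 in re.findall(
        r"([^<]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>", text
    ):
        box = [
            int(x1) / 1000 * width,
            int(y1) / 1000 * height,
            int(x2) / 1000 * width,
            int(y2) / 1000 * height,
        ]
        results.append((label.strip(), box))
    return results

boxes = parse_od_output("car<loc_52><loc_333><loc_932><loc_774>", (640, 480))
```

In practice you should rely on `processor.post_process_generation(generated_text, task="<OD>", image_size=(w, h))`; the sketch is only meant to show what the `<loc_k>` tokens encode.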
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use microsoft/Florence-2-base-ft with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "microsoft/Florence-2-base-ft"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Florence-2-base-ft",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker

```shell
docker model run hf.co/microsoft/Florence-2-base-ft
```
- SGLang
How to use microsoft/Florence-2-base-ft with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "microsoft/Florence-2-base-ft" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Florence-2-base-ft",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "microsoft/Florence-2-base-ft" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Florence-2-base-ft",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
How to use microsoft/Florence-2-base-ft with Docker Model Runner:
```shell
docker model run hf.co/microsoft/Florence-2-base-ft
```
Which one should I use, Florence-2-base-ft or Florence-2-base?
From the description, I can't understand the main difference between Florence-2-base-ft and Florence-2-base.
My question is: if I want to fine-tune Florence-2 on a specific object detection dataset, which one should I start from, the ft version or the one without ft?
For now I use Florence-2-base-ft, but the results are not good enough; I found that a CNN-based method can beat Florence-2-base-ft very easily, so I guess maybe something went wrong during my fine-tuning phase.
Hey, I have the same question. Did you find the answer? What is the difference between the models?
same question, what's the difference?
Here is what GPT-5 Thinking summarized from the model description:
Short version: same architecture/size; different training stage.
- Florence-2-base = the pre-trained checkpoint, trained on Microsoft's FLD-5B (5.4B annotations on 126M images). It's the raw foundation model.
- Florence-2-base-ft = the same model further fine-tuned on a curated mixture of downstream tasks (captioning, grounding/REC, detection, OCR, etc.) to make a single "generalist" model that performs better out of the box on those tasks.
What that means in practice
- Measured gains: Microsoft's model card shows the ft model improving typical benchmarks over the base model's zero-shot numbers. For example, COCO Caption CIDEr is 140.0 for base-ft (after fine-tuning) vs. 133.0 for base (zero-shot); COCO detection mAP is 41.4 (base-ft) vs. 34.7 (base, zero-shot).
- Under the hood: both are the same sequence-to-sequence VLM; the ft checkpoint is just an extra stage of supervised multi-task tuning layered on top of the FLD-5B pretraining described in the technical report.
Which to pick?
- If you just want plug-and-play captioning/OD/OCR, use Florence-2-base-ft. That's what it's for.
- If you plan to fine-tune on your own data, starting from base is reasonable; some community users also report cases where base behaves better for niche prompts, but that's anecdotal.
That’s the whole difference: pre-trained vs. the same model after multi-task fine-tuning. Same size, same prompts, better out-of-the-box performance on common tasks with the ft checkpoint.
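On the fine-tuning question above: whichever checkpoint you start from, ground-truth boxes have to be serialized into the same label-plus-`<loc_k>`-token text format that Florence-2 generates, since detection is trained as text generation. A hedged sketch of that quantization step, assuming 1000 coordinate bins (the function name is hypothetical, and the exact rounding used by the official processor may differ):

```python
def box_to_loc_tokens(label, box, image_size):
    """Serialize one ground-truth box into Florence-2's target-text format,
    e.g. 'car<loc_52><loc_333><loc_932><loc_773>'.
    Coordinates (x1, y1, x2, y2) in pixels are quantized into 1000 bins."""
    width, height = image_size
    x1, y1, x2, y2 = box
    bins = [
        int(x1 / width * 1000),
        int(y1 / height * 1000),
        int(x2 / width * 1000),
        int(y2 / height * 1000),
    ]
    # Clamp to the valid token range <loc_0> .. <loc_999>.
    bins = [min(max(b, 0), 999) for b in bins]
    return label + "".join(f"<loc_{b}>" for b in bins)

# One box from a hypothetical 640x480 detection sample:
target = box_to_loc_tokens("car", (33.3, 160.0, 596.5, 371.5), (640, 480))
```

If the fine-tuned model underperforms a CNN detector, it is worth checking that the targets really follow this serialization and that the matching task prompt (e.g. `<OD>`) is used at both training and inference time.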
If the MS team has any comments, feel free to pitch in.
Thanks Novmik, you prompted ChatGPT; I couldn't do that. Thanks for the knowledge, you know a lot -.-