Florence-2 base-ft ONNX (INT8 dynamic-quantized)

ONNX export of microsoft/Florence-2-base-ft with post-export INT8 dynamic quantization applied to all four sub-models. Roughly half the disk footprint and inference RAM of the fp16 variant.

Converted artifact. Training credit: Microsoft Research.

What this repo contains

config.json
generation_config.json
preprocessor_config.json
tokenizer.json
tokenizer_config.json
vocab.json
merges.txt
special_tokens_map.json

vision_encoder_quantized.onnx
encoder_model_quantized.onnx
decoder_model_quantized.onnx
embed_tokens_quantized.onnx

Total: ~270 MB. All four ONNX files are required at inference.
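
Each sub-model is opened as its own onnxruntime session. A minimal loading sketch (the generation loop that chains vision encoder, embed tokens, encoder, and decoder together is omitted here):

    import onnxruntime as ort

    # One InferenceSession per sub-model; all four take part in generation.
    sessions = {
        name: ort.InferenceSession(
            f"{name}_quantized.onnx",
            providers=["CPUExecutionProvider"],
        )
        for name in ("vision_encoder", "encoder_model", "decoder_model", "embed_tokens")
    }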

How it was produced

  1. Export to fp32 ONNX:
    optimum-cli export onnx \
        --model microsoft/Florence-2-base-ft \
        --task image-to-text \
        --trust-remote-code \
        <fp32-output>
    
  2. Apply dynamic INT8 quantization to each .onnx file using onnxruntime.quantization.quantize_dynamic (weight-only, per-channel); a sketch follows.
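
A sketch of step 2 under stated assumptions: the fp32-output and quantized paths are illustrative, and keyword availability (notably per_channel) can vary across onnxruntime versions.

    from pathlib import Path
    from onnxruntime.quantization import QuantType, quantize_dynamic

    fp32_dir = Path("fp32-output")   # directory written by optimum-cli in step 1
    out_dir = Path("quantized")
    out_dir.mkdir(exist_ok=True)

    for onnx_file in fp32_dir.glob("*.onnx"):
        quantize_dynamic(
            model_input=str(onnx_file),
            model_output=str(out_dir / onnx_file.name.replace(".onnx", "_quantized.onnx")),
            weight_type=QuantType.QInt8,  # INT8 weights with per-channel scales
            per_channel=True,
        )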

Toolchain: optimum 1.24.0, transformers 4.45.2, onnxruntime 1.19.x.

When to pick quantized vs fp16

This repo (INT8): CPU, NPU (OpenVINO EP), mobile, browser (transformers.js). ~270 MB.
Heliosoph/florence-2-base-ft-fp16-onnx (fp16): GPU, maximum quality. ~520 MB.
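
For the NPU path, onnxruntime picks the first available provider from the list you pass at session creation; a sketch (provider availability depends on the installed onnxruntime build):

    import onnxruntime as ort

    # Prefer the OpenVINO EP when the build ships it; otherwise fall back to CPU.
    session = ort.InferenceSession(
        "encoder_model_quantized.onnx",
        providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
    )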

Known degradation: Dense OCR over small text loses noticeable accuracy at INT8. Captioning and object detection are largely unaffected. Test on your workload before committing.

Task prompts

Identical to the fp16 variant; see Heliosoph/florence-2-base-ft-fp16-onnx for the full list.

License

MIT, same as upstream. LICENSE file included.
