# Phi-3.5-vision-instruct → ONNX (INT4)
INT4-quantized ONNX export of Phi-3.5-vision-instruct, a 4.2B-parameter multimodal vision-language model from Microsoft. It accepts images and text prompts and generates text output. Optimized for CPU inference with INT4 RTN block-32 quantization.
Mirrored for use with inference4j, an inference-only AI library for Java.
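
The INT4 RTN (round-to-nearest) block-32 scheme named above quantizes weights in groups of 32, with one scale shared per group. Below is a minimal sketch of the idea, assuming symmetric scaling with no zero point; the actual exporter may store zero points and pack two 4-bit codes per byte, so this illustrates the scheme rather than reproducing the export exactly.

```java
/** Hedged sketch of round-to-nearest (RTN) INT4 quantization with block size 32.
 *  Assumes symmetric scaling (no zero point); the real exporter may differ. */
public final class RtnInt4Sketch {
    static final int BLOCK = 32;

    /** One quantized block: 32 signed int4 codes plus the shared scale. */
    record QuantBlock(byte[] codes, float scale) {}

    /** Quantize 32 consecutive weights starting at offset. */
    static QuantBlock quantizeBlock(float[] w, int offset) {
        float absMax = 0f;
        for (int i = 0; i < BLOCK; i++) {
            absMax = Math.max(absMax, Math.abs(w[offset + i]));
        }
        float scale = absMax / 7f; // maps [-absMax, absMax] onto the int4 range [-7, 7]
        byte[] codes = new byte[BLOCK];
        for (int i = 0; i < BLOCK; i++) {
            int q = scale == 0f ? 0 : Math.round(w[offset + i] / scale);
            codes[i] = (byte) Math.max(-8, Math.min(7, q)); // clamp to signed int4
        }
        return new QuantBlock(codes, scale);
    }

    /** Dequantize back to floats: each weight is code * scale. */
    static void dequantizeBlock(QuantBlock b, float[] out, int offset) {
        for (int i = 0; i < BLOCK; i++) {
            out[offset + i] = b.codes()[i] * b.scale();
        }
    }
}
```

One scale per 32 weights keeps the damage from an outlier local: a single large value only coarsens the 31 weights that share its block, not the whole tensor.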
## Original Source
- Repository: [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)
- License: MIT
## Usage with inference4j
```java
// Load this INT4 export, describe an image, and print the generated text.
// try-with-resources releases the model's native resources on exit.
try (VisionLanguageModel vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .build()) {
    GenerationResult result = vision.describe(Path.of("photo.jpg"));
    System.out.println(result.text());
}
```
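
Under the hood, the upstream model card documents a Phi-3 chat template in which attached images are referenced by numbered placeholder tokens. The sketch below shows the prompt a describe-style call could reduce to; whether inference4j formats prompts exactly this way is an assumption.

```java
// Hedged sketch: the Phi-3 chat template documented for the upstream model,
// where attached images are referenced by numbered placeholders. Whether
// inference4j builds prompts exactly this way is an assumption.
static String buildPrompt(String question) {
    return "<|user|>\n"
         + "<|image_1|>\n"          // placeholder bound to the first attached image
         + question + "<|end|>\n"
         + "<|assistant|>\n";       // generation continues from here
}
```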
## Model Details
| Property | Value |
|---|---|
| Architecture | Phi-3.5 Vision (4.2B parameters: CLIP ViT encoder + MLP projector + Phi-3 decoder) |
| Task | Image description, visual Q&A, multimodal chat |
| Context length | 128K tokens |
| Quantization | INT4 RTN block-32 acc-level-4 |
| ONNX files | 3 models (vision encoder, embedding projector, text decoder) |
| Original framework | PyTorch (transformers) |
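
At inference time the three ONNX files compose into one pipeline: the vision encoder maps image patches to visual features, the projector lifts those features into the decoder's embedding space, and the decoder generates tokens autoregressively. Below is a minimal sketch of holding the three sessions with the ONNX Runtime Java API; the file names are assumptions, not necessarily the names shipped in this repository.

```java
import java.nio.file.Path;

import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

/** Hedged sketch: loading the three sessions that make up this export
 *  with the ONNX Runtime Java API. File names below are assumptions. */
public final class Phi3VisionSessions implements AutoCloseable {
    private final OrtEnvironment env = OrtEnvironment.getEnvironment();
    final OrtSession visionEncoder;      // image patches -> visual features
    final OrtSession embeddingProjector; // visual features -> decoder embedding space
    final OrtSession textDecoder;        // autoregressive next-token prediction

    Phi3VisionSessions(Path dir) throws OrtException {
        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
        visionEncoder = env.createSession(dir.resolve("vision_encoder.onnx").toString(), opts);
        embeddingProjector = env.createSession(dir.resolve("embedding_projector.onnx").toString(), opts);
        textDecoder = env.createSession(dir.resolve("text_decoder.onnx").toString(), opts);
    }

    @Override
    public void close() throws OrtException {
        textDecoder.close();
        embeddingProjector.close();
        visionEncoder.close();
    }
}
```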
## License
This model is licensed under the MIT License. Original model by Microsoft.