---
library_name: onnx
tags:
- image-text-to-text
- phi-3
- vision
- multimodal
- onnx
- int4
- cpu
- inference4j
license: mit
pipeline_tag: image-text-to-text
---

# Phi-3.5-vision-instruct — ONNX (INT4)

INT4-quantized ONNX export of [Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct), a 4.2B-parameter multimodal vision-language model from Microsoft. It accepts images and text prompts and generates text output, and is optimized for CPU inference with INT4 RTN block-32 quantization. Mirrored for use with [inference4j](https://github.com/inference4j/inference4j), an inference-only AI library for Java.

## Original Source

- **Repository:** [microsoft/Phi-3.5-vision-instruct-onnx](https://huggingface.co/microsoft/Phi-3.5-vision-instruct-onnx)
- **License:** MIT

## Usage with inference4j

```java
try (VisionLanguageModel vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .build()) {
    GenerationResult result = vision.describe(Path.of("photo.jpg"));
    System.out.println(result.text());
}
```

## Model Details

| Property | Value |
|----------|-------|
| Architecture | Phi-3.5 Vision (4.2B parameters — CLIP ViT encoder + MLP projector + Phi-3 decoder) |
| Task | Image description, visual Q&A, multimodal chat |
| Context length | 128K tokens |
| Quantization | INT4 RTN block-32, acc-level-4 |
| ONNX files | 3 models (vision encoder, embedding projector, text decoder) |
| Original framework | PyTorch (transformers) |

## License

This model is licensed under the [MIT License](https://opensource.org/licenses/MIT). Original model by [Microsoft](https://huggingface.co/microsoft).
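
## Background: what "INT4 RTN block-32" means

The quantization scheme named above works by splitting each weight tensor into blocks of 32 values, storing one float scale per block, and rounding each weight to the nearest signed 4-bit integer. The sketch below is purely illustrative — class and method names are made up for the example, and the real export is produced by ONNX Runtime's quantization tooling, not by code like this:

```java
// Illustrative sketch of round-to-nearest (RTN) block quantization.
// Hypothetical names; the actual ONNX export uses ONNX Runtime's quantizer.
public final class RtnInt4Demo {
    static final int BLOCK = 32; // "block-32": one scale per 32 weights

    // Quantize one block of floats to signed int4 values in [-8, 7].
    // The per-block scale is written into scaleOut[0].
    static byte[] quantize(float[] block, float[] scaleOut) {
        float maxAbs = 0f;
        for (float w : block) maxAbs = Math.max(maxAbs, Math.abs(w));
        float scale = maxAbs / 7f; // map the largest magnitude to +/-7
        scaleOut[0] = scale;
        byte[] q = new byte[block.length];
        for (int i = 0; i < block.length; i++) {
            int v = Math.round(block[i] / scale); // round to nearest integer
            q[i] = (byte) Math.max(-8, Math.min(7, v));
        }
        return q;
    }

    // Dequantize at inference time: w ~= q * scale, per block.
    static float[] dequantize(byte[] q, float scale) {
        float[] w = new float[q.length];
        for (int i = 0; i < q.length; i++) w[i] = q[i] * scale;
        return w;
    }

    public static void main(String[] args) {
        float[] block = new float[BLOCK];
        for (int i = 0; i < BLOCK; i++) block[i] = (i - 16) / 10f; // -1.6 .. 1.5
        float[] scale = new float[1];
        byte[] q = quantize(block, scale);
        float[] restored = dequantize(q, scale[0]);
        System.out.printf("scale=%.4f original=%.3f restored=%.3f%n",
                scale[0], block[0], restored[0]);
    }
}
```

Each 4-bit weight costs half a byte plus a shared per-block scale, which is where most of the size reduction over FP16 comes from; the acc-level-4 setting in the table refers to an ONNX Runtime accuracy/performance trade-off knob for the quantized matmuls.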