---
library_name: onnx
tags:
- image-text-to-text
- phi-3
- vision
- multimodal
- onnx
- int4
- cpu
- inference4j
license: mit
pipeline_tag: image-text-to-text
---

# Phi-3.5-vision-instruct - ONNX (INT4)

INT4-quantized ONNX export of [Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct), a 4.2B-parameter multimodal vision-language model from Microsoft. The model accepts images and text prompts and generates text output. It is optimized for CPU inference using INT4 round-to-nearest (RTN) quantization with a block size of 32.

Mirrored for use with [inference4j](https://github.com/inference4j/inference4j), an inference-only AI library for Java.

## Original Source

- **Repository:** [microsoft/Phi-3.5-vision-instruct-onnx](https://huggingface.co/microsoft/Phi-3.5-vision-instruct-onnx)
- **License:** MIT

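## Usage with inference4j

```java
import java.nio.file.Path;
// Plus the inference4j imports for VisionLanguageModel, ModelSources, and GenerationResult.

// try-with-resources closes the model when the block exits.
try (VisionLanguageModel vision = VisionLanguageModel.builder()
        .model(ModelSources.phi3Vision())
        .build()) {
    // Generate a text description of a local image.
    GenerationResult result = vision.describe(Path.of("photo.jpg"));
    System.out.println(result.text());
}
```
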
## Model Details

| Property | Value |
|----------|-------|
| Architecture | Phi-3.5 Vision (4.2B parameters: CLIP ViT encoder + MLP projector + Phi-3 decoder) |
| Task | Image description, visual Q&A, multimodal chat |
| Context length | 128K tokens |
| Quantization | INT4 RTN (round-to-nearest), block size 32, accuracy level 4 |
| ONNX files | 3 models (vision encoder, embedding projector, text decoder) |
| Original framework | PyTorch (transformers) |

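To make the Quantization row above more concrete, here is a minimal sketch of block-wise round-to-nearest INT4 quantization. It is an illustration of the general technique, not the code used to produce this export. The class and method names (`RtnInt4Sketch`, `scaleFor`, `quantize`, `dequantize`, `estimateBytes`) are hypothetical, and the sketch assumes symmetric scaling with no zero point and one 16-bit scale per block of 32 weights; the real export may differ on both counts.

```java
// Illustrative sketch only: symmetric RTN quantization to int4 with one scale per block.
// All names here are hypothetical and not part of inference4j or ONNX Runtime.
class RtnInt4Sketch {
    // "block-32": each run of 32 weights shares one scale.
    static final int BLOCK_SIZE = 32;

    /** Picks a scale so the largest magnitude in the block maps to code 7. */
    static float scaleFor(float[] block) {
        float maxAbs = 0f;
        for (float w : block) {
            maxAbs = Math.max(maxAbs, Math.abs(w));
        }
        return maxAbs == 0f ? 1f : maxAbs / 7f;
    }

    /** Round-to-nearest: divide by the scale, round, clamp to the int4 range [-8, 7]. */
    static byte[] quantize(float[] block, float scale) {
        // One code per byte for readability; packed formats store two codes per byte.
        byte[] codes = new byte[block.length];
        for (int i = 0; i < block.length; i++) {
            int q = Math.round(block[i] / scale);
            codes[i] = (byte) Math.max(-8, Math.min(7, q));
        }
        return codes;
    }

    /** Reconstructs approximate weights by multiplying each code by the block scale. */
    static float[] dequantize(byte[] codes, float scale) {
        float[] out = new float[codes.length];
        for (int i = 0; i < codes.length; i++) {
            out[i] = codes[i] * scale;
        }
        return out;
    }

    /** Bytes for `params` weights at 4 bits each, plus one scale of `scaleBytes` per block. */
    static long estimateBytes(long params, int blockSize, int scaleBytes) {
        long weightBytes = (params * 4 + 7) / 8;
        long blocks = (params + blockSize - 1) / blockSize;
        return weightBytes + blocks * scaleBytes;
    }

    public static void main(String[] args) {
        float[] block = new float[BLOCK_SIZE];
        for (int i = 0; i < block.length; i++) {
            block[i] = (float) Math.sin(i * 0.37) * 0.05f; // made-up weights
        }
        float scale = scaleFor(block);
        float[] restored = dequantize(quantize(block, scale), scale);

        float maxError = 0f;
        for (int i = 0; i < block.length; i++) {
            maxError = Math.max(maxError, Math.abs(block[i] - restored[i]));
        }
        System.out.printf("scale=%.6f maxError=%.6f%n", scale, maxError);
        System.out.printf("estimated bytes for 4.2B weights: %d%n",
                estimateBytes(4_200_000_000L, BLOCK_SIZE, 2));
    }
}
```

Under these assumptions the per-weight reconstruction error is bounded by half the block scale, which is why a smaller block size tends to preserve accuracy at the cost of storing more scales. The size estimate treats all 4.2B parameters as quantized, giving roughly 2.36 GB; the actual files may differ, since an export can leave some tensors at higher precision.
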
## License

This model is licensed under the [MIT License](https://opensource.org/licenses/MIT). Original model by [Microsoft](https://huggingface.co/microsoft).