nexaml's picture
Update README.md
44155f2 verified
---
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3-VL-8B-Instruct
tags:
- mlx
---
# Qwen3-VL-8B-Instruct
Run **Qwen3-VL-8B-Instruct** optimized for **Apple Silicon** on MLX with [NexaSDK](https://github.com/NexaAI/nexa-sdk).
## Quickstart
1. **Install [NexaSDK](https://github.com/NexaAI/nexa-sdk)**
2. Run the model locally with one line of code:
```bash
nexa infer NexaAI/qwen3vl-8B-Instruct-4bit-mlx
```
## Model Description
**Qwen3-VL-8B-Instruct** is an 8-billion-parameter instruction-tuned multimodal large language model developed by the Qwen team at Alibaba Cloud.
It belongs to the **Qwen3-VL** series, designed for seamless understanding and reasoning across text, image, and video. This version combines the visual intelligence of Qwen3-VL with the instruction-following capabilities of Qwen3-LM, enabling natural, grounded conversations around complex visual content.
Compared to the 4B variant, the **8B** model delivers stronger reasoning, richer context retention, and improved performance on visual and multilingual benchmarks while maintaining efficiency for deployment.
## Features
- **Enhanced Visual Understanding**: Handles complex scenes, documents, and multi-image inputs.
- **Instruction-Tuned Dialogue**: Produces coherent and context-aware responses aligned with user intent.
- **Multilingual Support**: Capable of understanding and generating in multiple languages.
- **Extended Context Window**: Supports longer text and multimodal contexts for better reasoning continuity.
- **Optimized Performance**: Balances large-scale reasoning capability with deployability for high-end edge or server environments.
## Use Cases
- Visual chatbots and multimodal assistants
- Document and chart interpretation
- Image-grounded content generation and summarization
- Video frame reasoning and analysis
- Multilingual multimodal tutoring or knowledge assistants
## Inputs and Outputs
**Input:**
- Text, images, or combined multimodal prompts
- Optional video frames or sequential image sets
**Output:**
- Natural-language answers, summaries, captions, or structured reasoning outputs
- Can provide visual explanations or reasoning narratives when prompted
## License
See the [official Qwen license](https://huggingface.co/Qwen) for details on usage and redistribution.