---
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3-VL-8B-Instruct
tags:
- mlx
---

# Qwen3-VL-8B-Instruct

Run **Qwen3-VL-8B-Instruct** optimized for **Apple Silicon** on MLX with [NexaSDK](https://github.com/NexaAI/nexa-sdk).

## Quickstart

1. **Install [NexaSDK](https://github.com/NexaAI/nexa-sdk)**
2. Run the model locally with one line of code:

   ```bash
   nexa infer NexaAI/qwen3vl-8B-Instruct-4bit-mlx
   ```

## Model Description

**Qwen3-VL-8B-Instruct** is an 8-billion-parameter instruction-tuned multimodal large language model developed by the Qwen team at Alibaba Cloud. It belongs to the **Qwen3-VL** series, designed for seamless understanding and reasoning across text, images, and video. This version combines the visual intelligence of Qwen3-VL with the instruction-following capabilities of the Qwen3 language model, enabling natural, grounded conversations about complex visual content.

Compared to the 4B variant, the **8B** model delivers stronger reasoning, richer context retention, and improved performance on visual and multilingual benchmarks, while remaining efficient enough for practical deployment.

## Features

- **Enhanced Visual Understanding**: Handles complex scenes, documents, and multi-image inputs.
- **Instruction-Tuned Dialogue**: Produces coherent, context-aware responses aligned with user intent.
- **Multilingual Support**: Understands and generates text in multiple languages.
- **Extended Context Window**: Supports longer text and multimodal contexts for better reasoning continuity.
- **Optimized Performance**: Balances large-scale reasoning capability with deployability on high-end edge or server hardware.

## Use Cases

- Visual chatbots and multimodal assistants
- Document and chart interpretation
- Image-grounded content generation and summarization
- Video frame reasoning and analysis
- Multilingual multimodal tutoring or knowledge assistants

## Inputs and Outputs

**Input:**

- Text, images, or combined multimodal prompts
- Optional video frames or sequential image sets

**Output:**

- Natural-language answers, summaries, captions, or structured reasoning outputs
- Visual explanations or reasoning narratives when prompted

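When driving the model programmatically rather than through `nexa infer`, multimodal prompts like those above are commonly packaged as OpenAI-style chat messages, with images inlined as base64 data URLs. Below is a minimal sketch of that payload shape, assuming an OpenAI-compatible endpoint; the helper function name is hypothetical, and whether your NexaSDK version exposes such an endpoint should be checked against its documentation.

```python
import base64
import json

def build_multimodal_message(prompt: str, image_bytes: bytes) -> list[dict]:
    """Build one OpenAI-style user message combining text and an inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    # Image is embedded directly as a base64 data URL
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }
    ]

# Example: wrap a text prompt and (placeholder) PNG bytes into a request payload
messages = build_multimodal_message("Describe this chart.", b"\x89PNG placeholder")
print(json.dumps(messages)[:60])
```

The resulting list would be sent as the `messages` field of a chat-completion request.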
## License

See the [official Qwen license](https://huggingface.co/Qwen) for details on usage and redistribution.