---
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3-VL-8B-Instruct
tags:
- mlx
---

# Qwen3-VL-8B-Instruct

Run **Qwen3-VL-8B-Instruct** optimized for **Apple Silicon** on MLX with [NexaSDK](https://github.com/NexaAI/nexa-sdk).

## Quickstart

1. **Install [NexaSDK](https://github.com/NexaAI/nexa-sdk)**
2. Run the model locally with one line of code:

   ```bash
   nexa infer NexaAI/qwen3vl-8B-Instruct-4bit-mlx
   ```

## Model Description

**Qwen3-VL-8B-Instruct** is an 8-billion-parameter instruction-tuned multimodal large language model developed by the Qwen team at Alibaba Cloud. It belongs to the **Qwen3-VL** series, designed for seamless understanding and reasoning across text, images, and video. This version combines the visual intelligence of Qwen3-VL with the instruction-following capabilities of the Qwen3 language model, enabling natural, grounded conversations about complex visual content.

Compared to the 4B variant, the **8B** model delivers stronger reasoning, richer context retention, and improved performance on visual and multilingual benchmarks, while remaining efficient enough for practical deployment.

## Features

- **Enhanced Visual Understanding**: Handles complex scenes, documents, and multi-image inputs.
- **Instruction-Tuned Dialogue**: Produces coherent, context-aware responses aligned with user intent.
- **Multilingual Support**: Understands and generates text in multiple languages.
- **Extended Context Window**: Supports longer text and multimodal contexts for better reasoning continuity.
- **Optimized Performance**: Balances large-scale reasoning capability with deployability on high-end edge or server hardware.

## Use Cases

- Visual chatbots and multimodal assistants
- Document and chart interpretation
- Image-grounded content generation and summarization
- Video frame reasoning and analysis
- Multilingual multimodal tutoring or knowledge assistants

## Inputs and Outputs

**Input:**

- Text, images, or combined multimodal prompts
- Optional video frames or sequential image sets

**Output:**

- Natural-language answers, summaries, captions, or structured reasoning outputs
- Visual explanations or reasoning narratives when prompted

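When driving the model programmatically rather than through `nexa infer`, multimodal prompts like those above are commonly packaged as OpenAI-style chat messages, with images inlined as base64 data URLs. Below is a minimal sketch of that payload shape, assuming an OpenAI-compatible endpoint; the helper function name is hypothetical, and whether your NexaSDK version exposes such an endpoint should be checked against its documentation.

```python
import base64
import json

def build_multimodal_message(prompt: str, image_bytes: bytes) -> list[dict]:
    """Build one OpenAI-style user message combining text and an inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    # Image is embedded directly as a base64 data URL
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }
    ]

# Example: wrap a text prompt and (placeholder) PNG bytes into a request payload
messages = build_multimodal_message("Describe this chart.", b"\x89PNG placeholder")
print(json.dumps(messages)[:60])
```

The resulting list would be sent as the `messages` field of a chat-completion request.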
## License

See the [official Qwen license](https://huggingface.co/Qwen) for details on usage and redistribution.