# SmolVLA
Run **SmolVLA**, optimized for the NPU on **Qualcomm Dragonwing IQ9** devices, with [nexaSDK](https://sdk.nexa.ai).
## Quickstart
1. **Install NexaSDK** and create a free account at [sdk.nexa.ai](https://sdk.nexa.ai)
2. **Activate your device** with your access token:
```bash
nexa config set license '<access_token>'
```
3. Run the model on the Qualcomm NPU in one line:
```bash
nexa infer NexaAI/smolVLA-npu
```
- **Input:** the path to a folder containing the required input files
- **Output:** results saved as an `.npy` file (see the loading sketch below), or an error report if any required input cannot be found
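For a quick sanity check, the sketch below loads the saved result with NumPy. The file name `output.npy` is an assumption for illustration; substitute whatever file the run actually writes to your output folder.

```python
# Minimal sketch: inspect the .npy result written by `nexa infer`.
import numpy as np

# "output.npy" is a hypothetical file name -- use the file actually
# produced in your output folder.
actions = np.load("output.npy")
print(actions.shape, actions.dtype)  # e.g. the action/control vector
print(actions)
```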
## Model Description
**SmolVLA** is a lightweight **Vision-Language-Action (VLA)** model built for efficient multimodal understanding and real-time control.
Developed by the **Hugging Face Smol team**, it unifies **vision**, **language**, and **action** into one coherent model that can perceive, reason, and act — enabling autonomous agents and robotics to run entirely on local hardware.
## Features
- 🧠 **Unified Perception-to-Action** — Combines visual understanding, natural language reasoning, and control generation.
- ⚡ **Lightweight & Fast** — Designed for real-time inference on laptops, edge boards, and NPUs.
- 👁️ **Grounded Visual Reasoning** — Links language instructions with specific visual elements and spatial context.
- 🧩 **Zero-Shot Multimodal Tasks** — Performs visual question answering, task planning, and grounding without retraining.
- 🔧 **Extensible & Open** — Compatible with robotics frameworks and multimodal datasets for custom fine-tuning.
## Use Cases
- **Embodied AI**: End-to-end perception-action loops for robotics and simulation.
- **On-Device Agents**: Multimodal assistants that process camera feeds locally.
- **Autonomous Systems**: Real-time visual reasoning in automotive or IoT devices.
- **Research**: Alignment studies and grounded reasoning experiments.
- **Simulation Control**: Vision-driven policy generation for digital twins or VR.
## Inputs and Outputs
**Input**
- Image(s) or video frames
- Optional text instruction or query
**Output**
- Action vector or control command
- Optional textual reasoning or visual grounding map
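As a rough illustration of how these inputs might be staged on disk for `nexa infer`, here is a minimal sketch. The folder layout and file names (`frame_000.png`, `instruction.txt`) are assumptions for illustration, not the runtime's documented schema; check the nexaSDK documentation for the exact expected format.

```python
# Hedged sketch: stage an image frame and a text instruction in an
# input folder. File names and layout here are assumptions, not the
# documented nexaSDK schema.
from pathlib import Path
from PIL import Image

input_dir = Path("smolvla_inputs")
input_dir.mkdir(exist_ok=True)

# Save a camera frame (placeholder image here) as the visual input.
Image.new("RGB", (256, 256)).save(input_dir / "frame_000.png")

# Save the language instruction alongside it.
(input_dir / "instruction.txt").write_text("pick up the red cube")

print(f"Inputs written to {input_dir.resolve()}")
```

After staging the folder, pass its path when prompted by `nexa infer NexaAI/smolVLA-npu`.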
## License
This repo is licensed under the **Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)** license, which allows use, sharing, and modification only for non-commercial purposes with proper attribution.
All NPU-related models, runtimes, and code in this project are protected under this non-commercial license and cannot be used in any commercial or revenue-generating applications.
Commercial licensing or enterprise usage requires a separate agreement.
For inquiries, please contact `dev@nexa.ai`.