	# Overview

	AutoNeural is an NPU-native multimodal vision–language model co-designed from the ground up for real-time, on-device inference. Instead of adapting GPU-first architectures, AutoNeural redesigns both vision encoding and language modeling around the constraints and capabilities of NPUs. On a Qualcomm SA8295P NPU it achieves up to 14× lower latency and 7× lower quantization error than an InternVL 2B baseline, and sustains real-time automotive performance even under aggressive low-precision settings.

	AutoNeural integrates:

	* A MobileNetV5-based vision encoder with depthwise separable convolutions.
	* A Liquid AI hybrid Transformer-SSM language backbone that dramatically reduces KV-cache overhead.
	* A normalization-free MLP connector tailored for quantization stability.
	* Mixed-precision W8A16 (vision) and W4A16 (language) inference validated on real Qualcomm NPUs.

	AutoNeural powers real-time cockpit intelligence, including in-cabin safety, out-of-cabin awareness, HMI understanding, and visual + conversational function calls, as demonstrated in the on-device results (Page 6 figure).

	---

	# Key Features

	### 🔍 MobileNetV5 Vision Encoder (300M)

	Optimized for edge hardware, with:

	* Depthwise separable convolutions for low compute and bounded activations.
	* Local attention bottlenecks only in late stages for efficient long-range reasoning.
	* Multi-Scale Fusion Adapter (MSFA) producing a compact 16×16×2048 feature map.
	* Stable INT8/16 behavior with minimal post-quantization degradation.

	Yields 5.8×–14× speedups over ViT baselines across 256–768 px inputs.
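
The W8A16 and W4A16 labels used in this README denote 8-bit or 4-bit weights paired with 16-bit activations. As a rough illustration of why the weight bit-width matters, the Python sketch below applies symmetric per-tensor quantization to synthetic Gaussian weights and compares the relative RMS error at 8 and 4 bits. The helper names, the per-tensor scheme, and the synthetic weights are assumptions made for illustration; this is not AutoNeural's calibration pipeline and it does not reproduce the reported error figures.

```python
import math
import random


def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


def relative_rms_error(reference, approx):
    """RMS of the quantization error divided by RMS of the reference weights."""
    n = len(reference)
    err = math.sqrt(sum((r - a) ** 2 for r, a in zip(reference, approx)) / n)
    ref = math.sqrt(sum(r * r for r in reference) / n)
    return err / ref


random.seed(0)
# Synthetic stand-in for a weight tensor (hypothetical, not real model weights).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {}
for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    errors[bits] = relative_rms_error(weights, dequantize(q, scale))
    print(f"W{bits}: relative RMS error = {errors[bits]:.2%}")
```

The 4-bit error comes out well above the 8-bit error, which is the gap that quantization-aware architecture choices aim to keep small.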

	---

	### 🧠 Hybrid Transformer-SSM Language Backbone (1.2B)

	Designed for NPU memory hierarchies:

	* 5:1 ratio of SSM layers to Transformer attention layers
	* Linear-time gated convolution layers for most steps
	* Tiny rolling state instead of KV-cache → up to 60% lower memory bandwidth
	* Stable W4A16 quantization across layers
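
To make the memory argument concrete, the Python sketch below compares the decode-time state of an all-attention stack with a 5:1 hybrid stack. Every dimension (layer count, head count, head size, hidden width, kernel width) and the placement of the attention layer at the end of each block are hypothetical values chosen for the arithmetic, not AutoNeural's actual configuration. The sketch measures cached state size at a 4096-token context; it does not derive the 60% bandwidth figure quoted above.

```python
def hybrid_layout(n_layers, conv_per_attn=5):
    """Repeat a block of `conv_per_attn` gated-conv layers followed by one attention layer."""
    block = ["conv"] * conv_per_attn + ["attn"]
    return [block[i % len(block)] for i in range(n_layers)]


def kv_cache_bytes(n_attn_layers, seq_len, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values: two (seq_len, n_kv_heads, head_dim) tensors per attention layer.
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem


def conv_state_bytes(n_conv_layers, kernel_size, hidden, bytes_per_elem=2):
    # A causal convolution of width k only needs the last k - 1 inputs per channel,
    # so this state does not grow with sequence length.
    return n_conv_layers * (kernel_size - 1) * hidden * bytes_per_elem


# Hypothetical dimensions, chosen only to make the arithmetic concrete.
N_LAYERS, SEQ_LEN = 12, 4096
N_KV_HEADS, HEAD_DIM, HIDDEN, KERNEL = 8, 64, 2048, 3

layout = hybrid_layout(N_LAYERS)
n_attn = layout.count("attn")
n_conv = layout.count("conv")

all_attention = kv_cache_bytes(N_LAYERS, SEQ_LEN, N_KV_HEADS, HEAD_DIM)
hybrid = (kv_cache_bytes(n_attn, SEQ_LEN, N_KV_HEADS, HEAD_DIM)
          + conv_state_bytes(n_conv, KERNEL, HIDDEN))

print(f"layers: {n_conv} conv, {n_attn} attention")
print(f"all-attention cache: {all_attention / 2**20:.1f} MiB")
print(f"hybrid cache + state: {hybrid / 2**20:.1f} MiB")
```

Under these assumptions the rolling convolution state is negligible next to the remaining KV-cache, so the saving tracks the fraction of attention layers removed.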

	---

	### 🔗 Normalization-Free Vision–Language Connector

	A compact 2-layer MLP with SiLU activation. RMSNorm is deliberately omitted to avoid unstable activation ranges during static quantization.

	Ensures reliable deployment on W8A16/W4A16 pipelines.
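
A minimal pure-Python sketch of such a connector is shown below: Linear, SiLU, Linear, with no normalization layer. The function names, the toy dimensions, and the random weights are hypothetical. Applying the connector to one feature vector at a time (for example, each position of the 16×16×2048 feature map) is also an assumption rather than something this README states.

```python
import math
import random


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def linear(vec, weight, bias):
    """Dense layer: `weight` is a list of rows, one row per output feature."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def connector(token, w1, b1, w2, b2):
    """Linear -> SiLU -> Linear, with no normalization layer anywhere."""
    hidden = [silu(h) for h in linear(token, w1, b1)]
    return linear(hidden, w2, b2)


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical); a real connector would map the vision channel
# width to the language model's hidden size.
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 4
rng = random.Random(0)
w1, b1 = random_matrix(HIDDEN_DIM, VISION_DIM, rng), [0.0] * HIDDEN_DIM
w2, b2 = random_matrix(LLM_DIM, HIDDEN_DIM, rng), [0.0] * LLM_DIM

vision_token = [rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
llm_token = connector(vision_token, w1, b1, w2, b2)
print(len(llm_token))
```

Because nothing in this path rescales activations at runtime, the output range is determined by the weights and the input alone, which is the property that makes fixed (static) quantization ranges easier to calibrate.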

	---

	### 🚗 Automotive-Grade Multimodal Intelligence

	Trained on 10M Infinity-MM samples plus 200k automotive cockpit samples, covering:

	* AI Sentinel (vehicle security)
	* AI Greeter (identity recognition)
	* Car Finder (parking localization)
	* Passenger safety monitoring

	Ensures robust performance across lighting, demographics, weather, and motion scenarios.

	---

	### ⚡ Real NPU Benchmarks

	Validated on Qualcomm SA8295P NPU:

	| Metric | Baseline (InternVL 2B) | AutoNeural-VL |
	| --- | --- | --- |
	| TTFT | ~1.4 s | ~100 ms |
	| Max Vision Resolution | 448×448 | 768×768 |
	| RMS Quant Error | 3.98% | 0.56% |
	| Decode Throughput | 15 tok/s | 44 tok/s |
	| Context Length | 1024 | 4096 |
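
The headline ratios can be checked directly from the table. The short Python snippet below recomputes them; the dictionary keys are illustrative names, and because the TTFT values are approximate in the table, the resulting ratios are approximate too.

```python
# (baseline, AutoNeural-VL) pairs copied from the benchmark table.
benchmarks = {
    "ttft_seconds": (1.4, 0.100),          # lower is better
    "rms_quant_error_pct": (3.98, 0.56),   # lower is better
    "decode_tok_per_s": (15, 44),          # higher is better
    "context_length": (1024, 4096),        # higher is better
}
lower_is_better = {"ttft_seconds", "rms_quant_error_pct"}

improvement = {}
for name, (baseline, autoneural) in benchmarks.items():
    if name in lower_is_better:
        improvement[name] = baseline / autoneural
    else:
        improvement[name] = autoneural / baseline
    print(f"{name}: {improvement[name]:.1f}x")
```

This gives roughly 14× for TTFT, 7.1× for RMS quantization error, 2.9× for decode throughput, and 4× for context length.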

	---

	# How to Use

	> ⚠️ Hardware requirement: AutoNeural is optimized for Qualcomm NPUs.

	### 1) Install Nexa-SDK

	Download the SDK and follow the installation steps provided on the model page.

	---

	### 2) Configure authentication

	Create an access token in the Model Hub, then run:

	```bash
	nexa config set license '<access_token>'
	```

	---

	### 3) Run the model

	```bash
	nexa infer NexaAI/AutoNeural
	```

	### Image input

	Drag and drop one or more image files into the terminal window.
	Multiple images can be processed with a single query.

	---

	# License

	The AutoNeural model is released under the Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0) license.

	You may:

	* Use the model for non-commercial purposes
	* Modify and redistribute it with attribution

	For commercial licensing, please contact:
	[dev@nexa.ai](mailto:dev@nexa.ai)
