---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
---
<p align="center">
  <a href="https://arxiv.org/abs/2512.02924"><img src="https://img.shields.io/badge/📄%20arXiv-2512.02924-b31b1b?style=for-the-badge" alt="arXiv"></a>
  <a href="https://discord.com/invite/nexa-ai"><img src="https://img.shields.io/badge/💬%20Discord-Nexa%20AI-5865F2?style=for-the-badge" alt="Discord"></a>
  <a href="https://x.com/nexa_ai"><img src="https://img.shields.io/badge/𝕏%20Twitter-nexa__ai-000000?style=for-the-badge" alt="Twitter"></a>
</p>

<p align="center">
  <a href="https://github.com/NexaAI/nexa-sdk/edit/main/solutions/autoneural/README.md"><b>🌟 Github</b></a> |
  <a href="https://nexa.ai/solution/intelligent-cockpit"><b>📄 Webpage</b></a>
</p>
# AutoNeural-VL-1.5B

## **Introduction**

**AutoNeural** is an NPU-native vision–language model for in-car assistants, co-designed with a MobileNetV5 encoder and a hybrid Liquid AI 1.2B backbone to deliver **real-time multimodal understanding on the Qualcomm SA8295P NPU**. It processes 768×768 images, cuts end-to-end latency by up to **14×**, and reduces quantization error by **7×** compared with ViT–Transformer baselines on the same hardware.
Key Features:

- **NPU-native co-design** – MobileNet-based vision encoder + hybrid Transformer–SSM backbone, built for INT4/8/16 and NPU operator sets.
- **Real-time cockpit performance** – Up to **14× lower TTFT**, ~3× faster decode, and 4× longer context (4096 vs. 1024) on the Qualcomm SA8295P NPU.
- **High-resolution multimodal perception** – Supports **768×768** images with ~45 dB SQNR under mixed-precision quantization (W8A16 vision, W4A16 language).
- **Automotive-tuned dataset** – Trained on **200k** proprietary cockpit samples (AI Sentinel, Greeter, Car Finder, Safety) plus large-scale Infinity-MM instruction data.
- **Production-focused** – Designed for always-on, low-power, privacy-preserving deployment in real vehicles.
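The context gain comes in large part from the hybrid backbone: only the self-attention layers need a KV cache, while the SSM layers keep a fixed-size state. A back-of-envelope sketch of the cache savings (the hidden size and FP16 precision here are assumed values for illustration, not published model dimensions):

```python
def kv_cache_bytes(attn_layers, seq_len, hidden=2048, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per attention layer, each seq_len x hidden."""
    return 2 * attn_layers * seq_len * hidden * bytes_per_elem

# Hypothetical all-attention 16-layer model vs. the hybrid's 6 attention layers
full = kv_cache_bytes(attn_layers=16, seq_len=4096)
hybrid = kv_cache_bytes(attn_layers=6, seq_len=4096)
print(f"full: {full / 2**20:.0f} MiB, hybrid: {hybrid / 2**20:.0f} MiB")
# → full: 512 MiB, hybrid: 192 MiB
```

Under these assumptions the hybrid design caches roughly 6/16 of the K/V tensors, which is one reason a 4× longer context fits in the same on-device memory budget.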
## Use Cases

AutoNeural powers real-time cockpit intelligence, including **in-cabin detection**, **out-cabin awareness**, **HMI understanding**, and a **visual + conversational agent**.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/a-Rd-eFETHPgf82wOPr4S.png" alt="Use Case" style="width:700px;"/>
---

## ⚡ **Benchmarks**

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/_zzNwQpFsGehf-_ASRupM.png" alt="Benchmark" style="width:700px;"/>

| Metric | InternVL 2B (baseline) | AutoNeural-VL |
| :--------------------- | :--------------------: | :-----------: |
| TTFT (1× 512² image) | ~1.4 s | **~100 ms** |
| Max image size | 448×448 | **768×768** |
| SQNR | 28 dB | **45 dB** |
| RMS quantization error | 3.98% | **0.562%** |
| Decode throughput | ~15 tok/s | **~44 tok/s** |
| Context length | 1024 | **4096** |

> 📝 These numbers are measured on-device with mixed precision (vision: W8A16; language: W4A16), not in simulation.
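For reference, the SQNR and RMS error metrics in the table can be computed from a float tensor and its quantized reconstruction. A minimal NumPy sketch with synthetic data (the random tensor and symmetric INT8 fake-quantization below are illustrative, not the model's actual calibration pipeline):

```python
import numpy as np

def sqnr_db(x_float, x_quant):
    """Signal-to-quantization-noise ratio in dB: signal power over error power."""
    noise = x_float - x_quant
    return 10.0 * np.log10(np.sum(x_float ** 2) / np.sum(noise ** 2))

def rms_error_pct(x_float, x_quant):
    """RMS quantization error as a percentage of the signal RMS."""
    noise = x_float - x_quant
    return 100.0 * np.sqrt(np.mean(noise ** 2)) / np.sqrt(np.mean(x_float ** 2))

# Synthetic example: symmetric INT8 fake-quantization of a random activation tensor
rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
scale = np.abs(x).max() / 127.0
x_q = np.clip(np.round(x / scale), -127, 127) * scale

print(f"SQNR: {sqnr_db(x, x_q):.1f} dB, RMS error: {rms_error_pct(x, x_q):.3f}%")
```

Higher SQNR (and lower RMS error) means the quantized activations track the float reference more closely, which is what the W8A16/W4A16 mixed-precision split is tuned for.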
---

# **How to Use**

> ⚠️ **Hardware requirement:** AutoNeural is available only for **Qualcomm NPUs**.

### 1) Install Nexa-SDK

Download the SDK and follow the installation steps provided on the model page.
### 2) Configure authentication

Create an access token in the Model Hub, then run:

```bash
nexa config set license '<access_token>'
```
### 3) Run the model

```bash
nexa infer NexaAI/AutoNeural
```

### 4) Image input

Drag and drop one or more image files into the terminal window. Multiple images can be processed with a single query.
---

## Model architecture

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/eHNdopWWaoir2IP3Cu_AF.png" alt="Model Architecture" style="width:700px;"/>

AutoNeural is an NPU-native vision–language model co-designed for integer-only inference on edge devices (e.g., the Qualcomm SA8295P).

- **Vision encoder.** A MobileNetV5-style CNN initialized from Gemma 3n-E4B, taking 768×768 images and producing a 16×16×2048 feature map. A Multi-Scale Fusion Adapter (MSFA) fuses the last stages and flattens them into **256 visual tokens**, giving strong inductive bias and stable INT8/16 quantization.
- **Vision–language connector.** A lightweight 2-layer MLP projects visual tokens into the language embedding space. We deliberately remove normalization from the projector to make activation ranges easier to calibrate for static NPU quantization.
- **Language backbone.** A 1.2B-parameter **hybrid Transformer–SSM (“Liquid AI”)** model with 16 layers, interleaving 10 gated-convolution SSM layers with 6 self-attention layers. The SSM layers provide linear-time inference and a compact state instead of a full KV cache, cutting memory I/O, while the attention layers preserve strong reasoning and in-context learning.
- **Quantization.** The deployed model uses mixed precision (e.g., W8A16 for vision, W4A16 for language) and NPU-aware graph partitioning to meet tight latency and memory budgets without sacrificing accuracy.
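The token arithmetic above can be checked with a shapes-only sketch (the connector hidden size and the zero-valued weights are placeholders for illustration, not the actual MSFA or projector implementation):

```python
import numpy as np

# 768x768 RGB input; the MobileNetV5-style encoder downsamples to a 16x16 grid
image = np.zeros((1, 768, 768, 3), dtype=np.float32)
grid, channels = 16, 2048
features = np.zeros((1, grid, grid, channels), dtype=np.float32)  # encoder output

# The MSFA flattens the fused 16x16 map into a sequence of visual tokens
visual_tokens = features.reshape(1, grid * grid, channels)
assert visual_tokens.shape == (1, 256, 2048)  # 256 tokens, as stated above

# A 2-layer MLP connector (no normalization) projects into the LM embedding space;
# hidden_dim is an assumed value for this sketch.
hidden_dim = 2048
w1 = np.zeros((channels, hidden_dim), dtype=np.float32)
w2 = np.zeros((hidden_dim, hidden_dim), dtype=np.float32)
projected = np.maximum(visual_tokens @ w1, 0.0) @ w2  # MLP with a ReLU-style nonlinearity
print(projected.shape)  # → (1, 256, 2048)
```

A fixed budget of 256 visual tokens keeps prefill cost constant per image, which matters for the TTFT numbers reported above.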
---

## Training

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/GPFXmoOXaF-4M-nne6GPJ.png" alt="Training" style="width:700px;"/>

AutoNeural follows a four-stage curriculum on large-scale multimodal data plus a proprietary automotive dataset.

1. **Image–text alignment.** Freeze the vision and language backbones; train only the projector on image–caption pairs to learn basic visual grounding.
2. **General visual understanding.** Unfreeze the full model and train on broad VQA-style tasks (object/scene understanding, basic reasoning) from the Infinity-MM dataset to build strong general multimodal capability.
3. **Instruction tuning.** Continue training on diverse instruction-following data (documents, charts, OCR, multi-turn dialogue, specialized domains), using a mixture of task weights for balanced performance.
4. **Automotive domain finetuning.** Finetune on ~200k curated cockpit samples (AI Sentinel, Greeter, Car Finder, Safety during entry/exit) plus high-quality synthetic data, with an NPU-aware recipe that combines quantization-aware training, mixed-precision constraints, and calibration to keep post-quantization drift low on real hardware.
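Stage 1's projector-only training amounts to a freezing schedule over named parameter groups. A framework-agnostic sketch in plain Python (the module names are hypothetical placeholders, not the actual training code, which would use a real deep-learning framework):

```python
# Toy parameter registry keyed by module path; only the flag matters here.
model = {
    "vision_encoder.stem":   {"trainable": True},
    "vision_encoder.msfa":   {"trainable": True},
    "projector.fc1":         {"trainable": True},
    "projector.fc2":         {"trainable": True},
    "language_model.layer0": {"trainable": True},
}

def stage1_freeze(params):
    """Stage 1: freeze vision and language backbones, train only the projector."""
    for name, p in params.items():
        p["trainable"] = name.startswith("projector.")
    return [name for name, p in params.items() if p["trainable"]]

print(stage1_freeze(model))  # → ['projector.fc1', 'projector.fc2']
```

Stage 2 would simply flip every flag back to trainable before continuing on the broader VQA mixture.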
---

## **License**

This model is licensed under the **Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0)** license, which allows use, sharing, and modification only for non-commercial purposes with proper attribution.

All NPU-related models, runtimes, and code in this project are protected under this non-commercial license and cannot be used in any commercial or revenue-generating applications.

## **Enterprise Deployment**

For enterprise deployment, custom integrations, or licensing inquiries:

📅 **[Book a Call with Us](https://nexa.ai/book-a-call)**