NexaAI/AutoNeural

Image-Text-to-Text

nexaml committed on Dec 2, 2025

Commit 6eda227 (verified) · 1 parent: f683126

Update README.md

Files changed (1)

README.md +54 -46

README.md CHANGED

@@ -1,66 +1,27 @@
-# **Overview**
 **AutoNeural** is a next-generation, **NPU-native multimodal vision–language model** co-designed from the ground up for real-time, on-device inference. Instead of adapting GPU-first architectures, AutoNeural redesigns both **vision encoding** and **language modeling** for the constraints and capabilities of NPUs, achieving up to **14× lower latency**, **7× lower quantization error**, and **real-time automotive performance** even under aggressive low-precision settings.
 AutoNeural integrates:
 * A **MobileNetV5-based vision encoder** with depthwise separable convolutions.
-* A **Liquid AI hybrid Transformer-SSM language backbone** that dramatically reduces KV-cache overhead.
 * A **normalization-free MLP connector** tailored for quantization stability.
 * Mixed-precision **W8A16 (vision)** and **W4A16 (language)** inference validated on real Qualcomm NPUs.
-AutoNeural powers real-time cockpit intelligence including **in-cabin safety**, **out-of-cabin awareness**, **HMI understanding**, and **visual + conversational function calls**, as demonstrated in the on-device results (Page 6 figure).
----
-# **Key Features**
-### 🔍 **MobileNetV5 Vision Encoder (300M)**
-Optimized for edge hardware, with:
-* **Depthwise separable convolutions** for low compute and bounded activations.
-* **Local attention bottlenecks** only in late stages for efficient long-range reasoning.
-* **Multi-Scale Fusion Adapter (MSFA)** producing a compact **16×16×2048** feature map.
-* Stable **INT8/16** behavior with minimal post-quantization degradation.
-Yields **5.8× – 14× speedups** over ViT baselines across 256–768 px inputs.
----
-### 🧠 **Hybrid Transformer-SSM Language Backbone (1.2B)**
-Designed for NPU memory hierarchies:
-* **5:1 ratio of SSM layers to Transformer attention layers**
-* **Linear-time gated convolution layers** for most steps
-* **Tiny rolling state** instead of KV-cache → up to **60% lower memory bandwidth**
-* **W4A16 stable quantization** across layers
----
-### 🔗 **Normalization-Free Vision–Language Connector**
-A compact 2-layer MLP using **SiLU**, deliberately **removing RMSNorm** to avoid unstable activation ranges during static quantization.
-Ensures reliable deployment on W8A16/W4A16 pipelines.
 ---
-### 🚗 **Automotive-Grade Multimodal Intelligence**
-Trained on **10M Infinity-MM samples** plus **200k automotive cockpit samples**, covering:
-* AI Sentinel (vehicle security)
-* AI Greeter (identity recognition)
-* Car Finder (parking localization)
-* Passenger safety monitoring
-Ensures robust performance across lighting, demographics, weather, and motion scenarios.
 ---
-### ⚡ **Real NPU Benchmarks**
 Validated on **Qualcomm SA8295P NPU**:
@@ -107,6 +68,53 @@ Multiple images can be processed with a single query.
 ---
 # **License**
 The AutoNeural model is released under the **Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0)** license.

+# AutoNeural-VL-1.5B
+## **Introduction**
 **AutoNeural** is a next-generation, **NPU-native multimodal vision–language model** co-designed from the ground up for real-time, on-device inference. Instead of adapting GPU-first architectures, AutoNeural redesigns both **vision encoding** and **language modeling** for the constraints and capabilities of NPUs, achieving up to **14× lower latency**, **7× lower quantization error**, and **real-time automotive performance** even under aggressive low-precision settings.
 AutoNeural integrates:
 * A **MobileNetV5-based vision encoder** with depthwise separable convolutions.
+* A **hybrid Transformer-SSM language backbone** that dramatically reduces KV-cache overhead.
 * A **normalization-free MLP connector** tailored for quantization stability.
 * Mixed-precision **W8A16 (vision)** and **W4A16 (language)** inference validated on real Qualcomm NPUs.
+AutoNeural powers real-time cockpit intelligence including **in-cabin safety**, **out-of-cabin awareness**, **HMI understanding**, and **visual + conversational function calls**.
 ---
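For readers unfamiliar with the W4A16 notation used in the introduction: it means weights stored in 4 bits with activations kept in 16 bits. The README does not describe the quantization scheme itself, so the following is only a generic sketch of symmetric 4-bit weight quantization; the function names and the sample weights are made up for illustration.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7 so no value needs clamping.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point weights from the 4-bit codes."""
    return [v * scale for v in q]

weights = [0.42, -0.07, 0.13, -0.35, 0.02]  # made-up values, not model weights
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.4f}", f"max_err={max_err:.4f}")
```

In this scheme the rounding error per weight is at most half the scale, which is why tensors with a few large outliers (and therefore a large scale) tend to quantize poorly.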
+## Use Cases
 ---
+## ⚡ **Benchmarks on NPU**
 Validated on **Qualcomm SA8295P NPU**:
 ---
+## **Key Features**
+### 🔍 **MobileNetV5 Vision Encoder (300M)**
+Optimized for edge hardware, with:
+* **Depthwise separable convolutions** for low compute and bounded activations.
+* **Local attention bottlenecks** only in late stages for efficient long-range reasoning.
+* **Multi-Scale Fusion Adapter (MSFA)** producing a compact **16×16×2048** feature map.
+* Stable **INT8/16** behavior with minimal post-quantization degradation.
+Yields **5.8× – 14× speedups** over ViT baselines across 256–768 px inputs.
+---
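The "low compute" claim for depthwise separable convolutions follows from a standard operation count, sketched below. The layer shape is a hypothetical example, not one taken from the MobileNetV5 encoder.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a k x k standard convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    """A k x k depthwise conv (one filter per input channel) plus a 1 x 1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Hypothetical layer shape, chosen only for illustration.
h, w, c_in, c_out, k = 32, 32, 256, 256, 3
std = standard_conv_macs(h, w, c_in, c_out, k)
sep = depthwise_separable_macs(h, w, c_in, c_out, k)
ratio = sep / std  # algebraically equal to 1/c_out + 1/k**2
print(f"standard: {std:,} MACs, separable: {sep:,} MACs, ratio: {ratio:.3f}")
```

For a 3×3 kernel and a wide layer, the ratio approaches 1/9, so the separable form needs roughly an order of magnitude fewer multiply-accumulates.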
+### 🧠 **Hybrid Transformer-SSM Language Backbone (1.2B)**
+Designed for NPU memory hierarchies:
+* **5:1 ratio of SSM layers to Transformer attention layers**
+* **Linear-time gated convolution layers** for most steps
+* **Tiny rolling state** instead of KV-cache → up to **60% lower memory bandwidth**
+* **W4A16 stable quantization** across layers
+---
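To illustrate why replacing most attention layers with fixed-state layers reduces cache traffic, here is a rough memory estimate. Every size below (layer count, head count, dimensions, state length) is a hypothetical placeholder, since the README does not give the backbone's configuration; only the 5:1 layer split comes from the text above.

```python
def kv_cache_bytes(n_attn_layers, seq_len, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (factor 2) are kept for every cached token in every attention layer.
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

def rolling_state_bytes(n_conv_layers, state_len, hidden_dim, bytes_per_elem=2):
    # Assumption: a gated-convolution layer keeps a fixed-size state,
    # independent of how many tokens have been processed.
    return n_conv_layers * state_len * hidden_dim * bytes_per_elem

# Hypothetical sizes, not AutoNeural's real configuration.
total_layers, n_kv_heads, head_dim, hidden_dim, state_len = 18, 8, 64, 2048, 3
attn_layers = total_layers // 6            # 5:1 split: one attention layer in every six
conv_layers = total_layers - attn_layers

for seq_len in (512, 4096):
    full = kv_cache_bytes(total_layers, seq_len, n_kv_heads, head_dim)
    hybrid = (kv_cache_bytes(attn_layers, seq_len, n_kv_heads, head_dim)
              + rolling_state_bytes(conv_layers, state_len, hidden_dim))
    print(f"seq_len={seq_len}: all-attention {full / 2**20:.1f} MiB, "
          f"hybrid {hybrid / 2**20:.1f} MiB")
```

The all-attention cache grows linearly with sequence length in every layer, while in the hybrid layout only the attention layers grow and the remaining layers contribute a constant term.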
+### 🔗 **Normalization-Free Vision–Language Connector**
+A compact 2-layer MLP using **SiLU**, deliberately **removing RMSNorm** to avoid unstable activation ranges during static quantization.
+Ensures reliable deployment on W8A16/W4A16 pipelines.
+---
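A minimal sketch of what a normalization-free, 2-layer SiLU connector looks like, written in plain Python so it runs without dependencies. The toy dimensions, the random initialization, and all helper names are assumptions for illustration; the actual connector weights and widths are not given in the README.

```python
import math
import random

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def linear(vec, weight, bias):
    # `weight` is a list of rows, one row per output feature.
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def connector(token, w1, b1, w2, b2):
    """Linear -> SiLU -> Linear, with no RMSNorm or LayerNorm anywhere."""
    hidden = [silu(h) for h in linear(token, w1, b1)]
    return linear(hidden, w2, b2)

def init(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes for illustration; real vision features are far wider (e.g. 2048 channels).
rng = random.Random(0)
vision_dim, hidden_dim, text_dim = 8, 16, 12
w1, b1 = init(hidden_dim, vision_dim, rng), [0.0] * hidden_dim
w2, b2 = init(text_dim, hidden_dim, rng), [0.0] * text_dim

vision_token = [rng.uniform(-1.0, 1.0) for _ in range(vision_dim)]
text_token = connector(vision_token, w1, b1, w2, b2)
print(len(text_token))  # 12, the language-side embedding width in this toy setup
```

Because nothing in this path rescales activations at run time, the activation ranges seen during calibration are the ones seen at inference, which is the property the README associates with stable static quantization.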
+### 🚗 **Automotive-Grade Multimodal Intelligence**
+Trained on **10M Infinity-MM samples** plus **200k automotive cockpit samples**, covering:
+* AI Sentinel (vehicle security)
+* AI Greeter (identity recognition)
+* Car Finder (parking localization)
+* Passenger safety monitoring
+Ensures robust performance across lighting, demographics, weather, and motion scenarios.
+---
 # **License**
 The AutoNeural model is released under the **Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0)** license.