Image-Text-to-Text
nexaml committed
Commit 6eda227 · verified · 1 Parent(s): f683126

Update README.md

Files changed (1)
  1. README.md +54 -46
README.md CHANGED
@@ -1,66 +1,27 @@
1
- # **Overview**
2
 
3
  **AutoNeural** is a next-generation, **NPU-native multimodal vision–language model** co-designed from the ground up for real-time, on-device inference. Instead of adapting GPU-first architectures, AutoNeural redesigns both **vision encoding** and **language modeling** for the constraints and capabilities of NPUs—achieving **14× faster latency**, **7× lower quantization error**, and **real-time automotive performance** even under aggressive low-precision settings.
4
 
5
  AutoNeural integrates:
6
 
7
  * A **MobileNetV5-based vision encoder** with depthwise separable convolutions.
8
- * A **Liquid AI hybrid Transformer-SSM language backbone** that dramatically reduces KV-cache overhead.
9
  * A **normalization-free MLP connector** tailored for quantization stability.
10
  * Mixed-precision **W8A16 (vision)** and **W4A16 (language)** inference validated on real Qualcomm NPUs.
11
 
12
- AutoNeural powers real-time cockpit intelligence including **in-cabin safety**, **out-of-cabin awareness**, **HMI understanding**, and **visual + conversational function calls**, as demonstrated in the on-device results (Page 6 figure) .
13
-
14
- ---
15
-
16
- # **Key Features**
17
-
18
- ### 🔍 **MobileNetV5 Vision Encoder (300M)**
19
-
20
- Optimized for edge hardware, with:
21
-
22
- * **Depthwise separable convolutions** for low compute and bounded activations.
23
- * **Local attention bottlenecks** only in late stages for efficient long-range reasoning.
24
- * **Multi-Scale Fusion Adapter (MSFA)** producing a compact **16×16×2048** feature map.
25
- * Stable **INT8/16** behavior with minimal post-quantization degradation.
26
-
27
- Yields **5.8× – 14× speedups** over ViT baselines across 256–768 px inputs.
28
-
29
- ---
30
-
31
- ### 🧠 **Hybrid Transformer-SSM Language Backbone (1.2B)**
32
-
33
- Designed for NPU memory hierarchies:
34
-
35
- * **5:1 ratio of SSM layers to Transformer attention layers**
36
- * **Linear-time gated convolution layers** for most steps
37
- * **Tiny rolling state** instead of KV-cache → up to **60% lower memory bandwidth**
38
- * **W4A16 stable quantization** across layers
39
-
40
- ---
41
-
42
- ### 🔗 **Normalization-Free Vision–Language Connector**
43
-
44
- A compact 2-layer MLP using **SiLU**, deliberately **removing RMSNorm** to avoid unstable activation ranges during static quantization.
45
-
46
- Ensures reliable deployment on W8A16/W4A16 pipelines.
47
 
48
  ---
49
 
50
- ### 🚗 **Automotive-Grade Multimodal Intelligence**
51
 
52
- Trained on **10M Infinity-MM samples** plus **200k automotive cockpit samples**, covering:
53
 
54
- * AI Sentinel (vehicle security)
55
- * AI Greeter (identity recognition)
56
- * Car Finder (parking localization)
57
- * Passenger safety monitoring
58
-
59
- Ensures robust performance across lighting, demographics, weather, and motion scenarios.
60
 
61
  ---
62
 
63
- ### ⚡ **Real NPU Benchmarks**
64
 
65
  Validated on **Qualcomm SA8295P NPU**:
66
 
@@ -107,6 +68,53 @@ Multiple images can be processed with a single query.
107
 
108
  ---
109
 
110
  # **License**
111
 
112
  The AutoNeural model is released under the **Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0)** license.
 
1
+ # AutoNeural-VL-1.5B
2
+
3
+ ## **Introduction**
4
 
5
  **AutoNeural** is a next-generation, **NPU-native multimodal vision–language model** co-designed from the ground up for real-time, on-device inference. Instead of adapting GPU-first architectures, AutoNeural redesigns both **vision encoding** and **language modeling** for the constraints and capabilities of NPUs—achieving **14× faster latency**, **7× lower quantization error**, and **real-time automotive performance** even under aggressive low-precision settings.
6
 
7
  AutoNeural integrates:
8
 
9
  * A **MobileNetV5-based vision encoder** with depthwise separable convolutions.
10
+ * A **hybrid Transformer-SSM language backbone** that dramatically reduces KV-cache overhead.
11
  * A **normalization-free MLP connector** tailored for quantization stability.
12
  * Mixed-precision **W8A16 (vision)** and **W4A16 (language)** inference validated on real Qualcomm NPUs.
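
As a rough illustration of what W4A16 means in the list above (4-bit weights, 16-bit activations), the sketch below applies symmetric per-group 4-bit quantization to a small weight vector in plain Python. The group size, the function names `quantize_w4` and `dequantize_w4`, and the symmetric scheme itself are assumptions made for illustration; the README does not say which quantization scheme AutoNeural uses.

```python
def quantize_w4(weights, group_size=4):
    """Symmetric per-group 4-bit quantization (illustrative assumption, not AutoNeural's scheme)."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7, the top of the signed 4-bit range [-8, 7].
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_w4(codes, scales, group_size=4):
    """Recover approximate weights; activations would stay at 16-bit in a W4A16 pipeline."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]


# Hypothetical weights, chosen only to exercise two groups with different ranges.
weights = [0.7, -0.35, 0.1, 0.0, 1.4, -1.4, 0.2, 0.6]
codes, scales = quantize_w4(weights)
restored = dequantize_w4(codes, scales)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scales, max_error)
```

With round-to-nearest and no clipping, the reconstruction error of each weight is bounded by half of its group's scale, which is why bounded activation and weight ranges matter for low-bit deployment.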
13
 
14
+ AutoNeural powers real-time cockpit intelligence including **in-cabin safety**, **out-of-cabin awareness**, **HMI understanding**, and **visual + conversational function calls**.
15
 
16
  ---
17
 
18
+ ## Use Cases
19
 
20
 
21
 
22
  ---
23
 
24
+ ## ⚡ **Benchmarks on NPU**
25
 
26
  Validated on **Qualcomm SA8295P NPU**:
27
 
 
68
 
69
  ---
70
 
71
+ ## **Key Features**
72
+
73
+ ### 🔍 **MobileNetV5 Vision Encoder (300M)**
74
+
75
+ Optimized for edge hardware, with:
76
+
77
+ * **Depthwise separable convolutions** for low compute and bounded activations.
78
+ * **Local attention bottlenecks** only in late stages for efficient long-range reasoning.
79
+ * **Multi-Scale Fusion Adapter (MSFA)** producing a compact **16×16×2048** feature map.
80
+ * Stable **INT8/16** behavior with minimal post-quantization degradation.
81
+
82
+ Yields **5.8× – 14× speedups** over ViT baselines across 256–768 px inputs.
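
To make the "low compute" claim for depthwise separable convolutions concrete, the following sketch compares weight counts against a standard convolution. The layer shape (3×3 kernel, 256 input and 256 output channels) is a hypothetical example, not a layer taken from the AutoNeural encoder, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: every output channel sees every input channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise (per-channel) convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer shape, for illustration only.
kernel, c_in, c_out = 3, 256, 256
standard = standard_conv_params(kernel, c_in, c_out)
separable = depthwise_separable_params(kernel, c_in, c_out)
reduction = standard / separable
print(standard, separable, round(reduction, 2))  # 589824 67840 8.69
```

For this shape the separable form needs roughly 8.7× fewer weights. This is weight-count arithmetic only and is unrelated to the measured 5.8× to 14× latency figures quoted above.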
83
+
84
+ ---
85
+
86
+ ### 🧠 **Hybrid Transformer-SSM Language Backbone (1.2B)**
87
+
88
+ Designed for NPU memory hierarchies:
89
+
90
+ * **5:1 ratio of SSM layers to Transformer attention layers**
91
+ * **Linear-time gated convolution layers** for most steps
92
+ * **Tiny rolling state** instead of KV-cache → up to **60% lower memory bandwidth**
93
+ * **W4A16 stable quantization** across layers
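
The 5:1 layer ratio and its effect on KV-cache size can be sketched as follows. The total layer count, the placement of the attention layer at the end of each group of six, and the cache dimensions (`seq_len`, `kv_heads`, `head_dim`) are all hypothetical; the README only gives the ratio.

```python
def hybrid_schedule(num_layers, ssm_per_attention=5):
    """Return a list of layer types with `ssm_per_attention` SSM layers per attention layer."""
    period = ssm_per_attention + 1
    # Assumption: the attention layer closes each group of `period` layers.
    return ["attention" if (i + 1) % period == 0 else "ssm" for i in range(num_layers)]


def kv_cache_bytes(num_attention_layers, seq_len, kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values cached for every attention layer, at 16-bit precision by default."""
    return 2 * num_attention_layers * seq_len * kv_heads * head_dim * bytes_per_value


layers = hybrid_schedule(12)  # hypothetical 12-layer stack
attention_layers = layers.count("attention")
hybrid_cache = kv_cache_bytes(attention_layers, seq_len=4096, kv_heads=8, head_dim=64)
full_cache = kv_cache_bytes(len(layers), seq_len=4096, kv_heads=8, head_dim=64)
print(layers)
print(hybrid_cache, full_cache, full_cache // hybrid_cache)  # 16777216 100663296 6
```

Only the attention layers keep a cache that grows with sequence length, so under these assumptions the cache is six times smaller than in an all-attention stack of the same depth. The fixed-size state of the SSM layers is not counted, and the numbers do not attempt to reproduce the 60% bandwidth figure above.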
94
+
95
+ ---
96
+
97
+ ### 🔗 **Normalization-Free Vision–Language Connector**
98
+
99
+ A compact 2-layer MLP using **SiLU**, deliberately **removing RMSNorm** to avoid unstable activation ranges during static quantization.
100
+
101
+ Ensures reliable deployment on W8A16/W4A16 pipelines.
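
A minimal sketch of such a connector, assuming a plain linear → SiLU → linear stack with no normalization layer anywhere in the path. The toy dimensions, the random initialization, and the function names are illustrative assumptions rather than AutoNeural's actual configuration.

```python
import math
import random


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def linear(vec, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def connector(vision_token, w1, b1, w2, b2):
    """Two linear layers with SiLU in between; deliberately no RMSNorm or other normalization."""
    hidden = [silu(h) for h in linear(vision_token, w1, b1)]
    return linear(hidden, w2, b2)


def init(rows, cols):
    """Small random weights, for demonstration only."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_vision, d_hidden, d_text = 8, 16, 4  # toy sizes, not the real model widths
w1, b1 = init(d_hidden, d_vision), [0.0] * d_hidden
w2, b2 = init(d_text, d_hidden), [0.0] * d_text
token = [random.uniform(-1.0, 1.0) for _ in range(d_vision)]
projected = connector(token, w1, b1, w2, b2)
print(len(projected))  # 4
```

Because nothing in this path rescales activations at run time, the output range is fixed by the weights and the input range, which is the property that makes static quantization ranges easier to calibrate.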
102
+
103
+ ---
104
+
105
+ ### 🚗 **Automotive-Grade Multimodal Intelligence**
106
+
107
+ Trained on **10M Infinity-MM samples** plus **200k automotive cockpit samples**, covering:
108
+
109
+ * AI Sentinel (vehicle security)
110
+ * AI Greeter (identity recognition)
111
+ * Car Finder (parking localization)
112
+ * Passenger safety monitoring
113
+
114
+ Ensures robust performance across lighting, demographics, weather, and motion scenarios.
115
+
116
+ ---
117
+
118
  # **License**
119
 
120
  The AutoNeural model is released under the **Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0)** license.