Update README.md
README.md CHANGED

AutoNeural powers real-time cockpit intelligence, including **in-cabin detection**.

## ⚡ **Benchmarks**

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/_zzNwQpFsGehf-_ASRupM.png" alt="Benchmark" style="width:700px;"/>

| Metric                 | InternVL 2B (baseline) | AutoNeural-VL   |
| :--------------------- | :--------------------: | :-------------: |
| TTFT (1× 512² image)   | ~1.4 s                 | **~100 ms**     |
| Max image size         | 448×448                | **768×768**     |
| SQNR                   | 28 dB                  | **45 dB**       |
| RMS quantization error | 3.98%                  | **0.562%**      |
| Decode throughput      | ~15 tok/s              | **~44 tok/s**   |
| Context length         | 1024 tokens            | **4096 tokens** |

> 📝 These numbers were measured on-device with mixed precision (vision: W8A16; language: W4A16), not simulated.
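
If the RMS quantization error is read as a relative RMS error, the SQNR and RMS rows describe the same noise on two scales: SQNR(dB) ≈ −20·log₁₀(RMS error). A quick sanity check (the helper below is ours, not part of any release):

```python
import math

def sqnr_db(rms_relative_error: float) -> float:
    """Signal-to-quantization-noise ratio in dB from a relative RMS error."""
    return -20.0 * math.log10(rms_relative_error)

print(f"InternVL 2B:   {sqnr_db(0.0398):.1f} dB")   # ≈ 28.0 dB
print(f"AutoNeural-VL: {sqnr_db(0.00562):.1f} dB")  # ≈ 45.0 dB
```

Both values land on the table's 28 dB and 45 dB rows, so the two metrics are mutually consistent.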

Multiple images can be processed with a single query.

---

## Model architecture

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/eHNdopWWaoir2IP3Cu_AF.png" alt="Model Architecture" style="width:700px;"/>

AutoNeural is an NPU-native vision–language model co-designed for integer-only inference on edge devices (e.g. the Qualcomm SA8295P).

- **Vision encoder.** A MobileNetV5-style CNN initialized from Gemma 3n-E4B takes 768×768 images and produces a 16×16×2048 feature map. A Multi-Scale Fusion Adapter (MSFA) fuses the last stages and flattens them into **256 visual tokens**, giving strong inductive bias and stable INT8/INT16 quantization.
- **Vision–language connector.** A lightweight 2-layer MLP projects visual tokens into the language embedding space. We deliberately remove normalization from the projector to make activation ranges easier to calibrate for static NPU quantization.
- **Language backbone.** A 1.2B-parameter **hybrid Transformer–SSM ("Liquid AI")** model with 16 layers, interleaving 10 gated-convolution SSM layers with 6 self-attention layers. The SSM layers provide linear-time inference and a compact recurrent state instead of a full KV cache, cutting memory I/O, while the attention layers preserve strong reasoning and in-context learning.
- **Quantization.** The deployed model uses mixed precision (e.g. W8A16 for vision, W4A16 for language) and NPU-aware graph partitioning to meet tight latency and memory budgets without sacrificing accuracy. A sketch of this layout follows the list.
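
The numbers above pin down the data flow: a 768×768 input yields a 16×16×2048 grid, i.e. 256 visual tokens; the norm-free 2-layer MLP lifts them into the language space; and 10 SSM layers interleave with 6 attention layers. Below is a minimal PyTorch sketch of that skeleton; the activation function, the language hidden width, the interleaving pattern, and the projector precision are our illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

NUM_VISUAL_TOKENS = 16 * 16   # 768×768 input -> 16×16×2048 feature map -> 256 tokens
VISION_DIM = 2048
LM_DIM = 2048                 # assumption: language hidden width is not published

class Projector(nn.Module):
    """2-layer MLP connector with no normalization, so activation ranges
    stay easy to calibrate for static NPU quantization."""
    def __init__(self) -> None:
        super().__init__()
        self.fc1 = nn.Linear(VISION_DIM, LM_DIM)
        self.act = nn.GELU()  # assumption: activation is not published
        self.fc2 = nn.Linear(LM_DIM, LM_DIM)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, 16, 16, 2048) -> (batch, 256, LM_DIM)
        tokens = feats.flatten(1, 2)
        return self.fc2(self.act(self.fc1(tokens)))

# 16 backbone layers: 10 gated-convolution SSM + 6 self-attention.
# One plausible interleaving (the real pattern is not published):
LAYER_PATTERN = ["ssm", "ssm", "attn"] * 4 + ["ssm", "ssm", "attn", "attn"]
assert LAYER_PATTERN.count("ssm") == 10 and LAYER_PATTERN.count("attn") == 6

# Deployed precision per subgraph, per the benchmark note above
# (projector precision is our assumption):
PRECISION = {"vision_encoder": "W8A16", "projector": "W8A16", "language_model": "W4A16"}

if __name__ == "__main__":
    feats = torch.randn(1, 16, 16, VISION_DIM)
    print(Projector()(feats).shape)  # torch.Size([1, 256, 2048])
```
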
---

## Training

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/GPFXmoOXaF-4M-nne6GPJ.png" alt="Training" style="width:700px;"/>

AutoNeural follows a four-stage curriculum on large-scale multimodal data plus a proprietary automotive dataset.

1. **Image–text alignment.** Freeze the vision and language backbones; train only the projector on image–caption pairs to learn basic visual grounding (see the sketch after this list).
2. **General visual understanding.** Unfreeze the full model and train on broad VQA-style tasks (object/scene understanding, basic reasoning) from the Infinity-MM dataset to build strong general multimodal capability.
3. **Instruction tuning.** Continue training on diverse instruction-following data (documents, charts, OCR, multi-turn dialogue, specialized domains) using a mixture of task weights for balanced performance.
4. **Automotive domain finetuning.** Finetune on ~200k curated cockpit samples (AI Sentinel, Greeter, Car Finder, safety when getting on/off the vehicle) plus high-quality synthetic data, with an NPU-aware recipe that combines quantization-aware training, mixed-precision constraints, and calibration to keep post-quantization drift low on real hardware.
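
Stage 1 updates only the connector. A minimal sketch of that freezing step, assuming a model object with `vision_encoder`, `projector`, and `language_model` submodules as in the architecture sketch above (attribute names are illustrative):

```python
import torch
import torch.nn as nn

def configure_stage1(model: nn.Module) -> list[nn.Parameter]:
    """Stage 1: freeze the vision and language backbones; train only the projector."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    # Hand the optimizer only the trainable (projector) parameters.
    return [p for p in model.parameters() if p.requires_grad]

# Usage with a hypothetical assembled model:
# optimizer = torch.optim.AdamW(configure_stage1(model), lr=1e-4)
```
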
---

## **License**

This model is licensed under the **Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0)** license, which allows use, sharing, and modification only for non-commercial purposes with proper attribution.

All NPU-related models, runtimes, and code in this project are covered by this non-commercial license and may not be used in any commercial or revenue-generating application.

## **Enterprise Deployment**

For enterprise deployment, custom integrations, or licensing inquiries:

📅 **[Book a Call with Us](https://nexa.ai/book-a-call)**