## ⚡ **Benchmarks**

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/_zzNwQpFsGehf-_ASRupM.png" alt="Benchmark" style="width:700px;"/>

Validated on the **Qualcomm SA8295P NPU**:

| Metric                 | InternVL 2B (baseline) | AutoNeural-VL |
| :--------------------- | :--------------------: | :-----------: |
| TTFT (1× 512² image)   | ~1.4 s                 | **~100 ms**   |
| Max image size         | 448×448                | **768×768**   |
| SQNR                   | 28 dB                  | **45 dB**     |
| RMS quantization error | 3.98%                  | **0.562%**    |
| Decode throughput      | ~15 tok/s              | **~44 tok/s** |
| Context length         | 1024                   | **4096**      |

> 📝 These numbers are measured on-device with mixed precision (vision: W8A16, language: W4A16), not in simulation.
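
Since the table reports both SQNR and RMS quantization error, it may help to note that they are two views of the same quantity: SQNR (dB) = -20 · log10(relative RMS error), so 0.562% corresponds to ~45 dB and 3.98% to ~28 dB, matching the table. Below is a minimal sketch of computing both for a tensor under symmetric per-tensor INT8 quantization; the helper names and the use of NumPy are illustrative, not part of the AutoNeural toolchain.

```python
import numpy as np

def fake_quant_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT8 quantize-dequantize (illustrative)."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def sqnr_db(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Signal-to-quantization-noise ratio in dB."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

def rms_error_pct(x: np.ndarray, x_hat: np.ndarray) -> float:
    """RMS quantization error relative to the signal RMS, in percent."""
    return 100.0 * np.sqrt(np.mean((x - x_hat) ** 2) / np.mean(x ** 2))

w = np.random.randn(1024, 1024).astype(np.float32)
w_q = fake_quant_int8(w)
print(f"SQNR: {sqnr_db(w, w_q):.1f} dB, RMS error: {rms_error_pct(w, w_q):.3f}%")
```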

---

## **Model Architecture**

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/eHNdopWWaoir2IP3Cu_AF.png" alt="Model Architecture" style="width:700px;"/>

AutoNeural is an NPU-native vision–language model co-designed for integer-only inference on edge devices such as the Qualcomm SA8295P.

- **Vision encoder.** A MobileNetV5-style CNN, initialized from Gemma 3n-E4B, takes 768×768 images and produces a 16×16×2048 feature map. A Multi-Scale Fusion Adapter (MSFA) fuses the last stages and flattens them into **256 visual tokens**, giving strong inductive bias and stable INT8/INT16 quantization.
- **Vision–language connector.** A lightweight two-layer MLP projects visual tokens into the language embedding space (sketched after this list). We deliberately remove normalization from the projector so that activation ranges are easier to calibrate for static NPU quantization.
- **Language backbone.** A 1.2B-parameter **hybrid Transformer–SSM (“Liquid AI”)** model with 16 layers, interleaving 10 gated-convolution SSM layers with 6 self-attention layers. The SSM layers provide linear-time inference and a compact state instead of a full KV cache, cutting memory I/O, while the attention layers preserve strong reasoning and in-context learning.
- **Quantization.** The deployed model uses mixed precision (e.g. W8A16 for vision, W4A16 for language) and NPU-aware graph partitioning to meet tight latency and memory budgets without sacrificing accuracy.
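
To make the shapes concrete, here is a minimal PyTorch sketch of the connector stage under the dimensions stated above. The class name, the GELU activation, and the language hidden size are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP projector with no normalization layers, so
    activation ranges stay easy to calibrate for static NPU
    quantization. All dimensions here are illustrative."""

    def __init__(self, vision_dim: int = 2048, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),  # activation choice is an assumption
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, 16, 16, 2048) from the vision encoder + MSFA
        b, h, w, c = feat_map.shape
        tokens = feat_map.reshape(b, h * w, c)  # (B, 256, 2048): 16×16 = 256 visual tokens
        return self.proj(tokens)                # (B, 256, lm_dim) in the LM embedding space

print(VisionLanguageConnector()(torch.randn(1, 16, 16, 2048)).shape)  # torch.Size([1, 256, 2048])
```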

---

## **Training**

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/GPFXmoOXaF-4M-nne6GPJ.png" alt="Training" style="width:700px;"/>

AutoNeural follows a four-stage curriculum on large-scale multimodal data plus a proprietary automotive dataset.

1. **Image–text alignment.** Freeze the vision and language backbones and train only the projector on image–caption pairs to learn basic visual grounding (a minimal sketch of this freezing recipe follows the list).
2. **General visual understanding.** Unfreeze the full model and train on broad VQA-style tasks (object/scene understanding, basic reasoning) from the Infinity-MM dataset to build strong general multimodal capability.
3. **Instruction tuning.** Continue training on diverse instruction-following data (documents, charts, OCR, multi-turn dialogue, specialized domains) with a mixture of task weights for balanced performance.
4. **Automotive domain finetuning.** Finetune on ~200k curated cockpit samples (AI Sentinel, Greeter, Car Finder, Safety when getting on/off) plus high-quality synthetic data, using an NPU-aware recipe that combines quantization-aware training, mixed-precision constraints, and calibration to keep post-quantization drift low on real hardware.
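
A hedged sketch of the stage-1 setup in PyTorch; the model structure and attribute names (`vision_encoder`, `connector`, `language_model`) and the optimizer settings are assumptions for illustration, not AutoNeural's actual training code.

```python
import torch
import torch.nn as nn

class DummyVLM(nn.Module):
    """Stand-in model; attribute names are assumptions, not AutoNeural's API."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Conv2d(3, 8, 3)
        self.connector = nn.Linear(8, 8)
        self.language_model = nn.Linear(8, 8)

def configure_stage1(model: nn.Module) -> torch.optim.Optimizer:
    """Stage 1: freeze both backbones, train only the projector."""
    for p in model.parameters():
        p.requires_grad = False               # freeze everything first
    for p in model.connector.parameters():
        p.requires_grad = True                # then unfreeze only the projector
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)  # lr is a placeholder

opt = configure_stage1(DummyVLM())
print(sum(p.numel() for g in opt.param_groups for p in g["params"]))  # projector params only
```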

---

## **License**

This model is released under the **Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0)** license, which allows use, sharing, and modification for non-commercial purposes only, with proper attribution.

All NPU-related models, runtimes, and code in this project are covered by this non-commercial license and may not be used in any commercial or revenue-generating application.

## **Enterprise Deployment**

For enterprise deployment, custom integrations, or licensing inquiries:

📅 **[Book a Call with Us](https://nexa.ai/book-a-call)**