## ⚡ **Benchmarks**

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/_zzNwQpFsGehf-_ASRupM.png" alt="Benchmark" style="width:700px;"/>

Validated on the **Qualcomm SA8295P NPU**:

| Metric                 | InternVL 2B (baseline) | AutoNeural-VL |
| :--------------------- | :--------------------: | :-----------: |
| TTFT (1× 512² image)   | ~1.4 s                 | **~100 ms**   |
| Max image size         | 448×448                | **768×768**   |
| SQNR                   | 28 dB                  | **45 dB**     |
| RMS quantization error | 3.98%                  | **0.562%**    |
| Decode throughput      | ~15 tok/s              | **~44 tok/s** |
| Context length         | 1024                   | **4096**      |

> 📝 These numbers are measured on-device with mixed precision (vision: W8A16, language: W4A16), not in simulation.
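
Since the table reports both SQNR and RMS quantization error, it may help to note that they are two views of the same quantity: SQNR (dB) = -20 · log10(relative RMS error), so 0.562% corresponds to ~45 dB and 3.98% to ~28 dB, matching the table. Below is a minimal sketch of computing both for a tensor under symmetric per-tensor INT8 quantization; the helper names and the use of NumPy are illustrative, not part of the AutoNeural toolchain.

```python
import numpy as np

def fake_quant_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT8 quantize-dequantize (illustrative)."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def sqnr_db(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Signal-to-quantization-noise ratio in dB."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

def rms_error_pct(x: np.ndarray, x_hat: np.ndarray) -> float:
    """RMS quantization error relative to the signal RMS, in percent."""
    return 100.0 * np.sqrt(np.mean((x - x_hat) ** 2) / np.mean(x ** 2))

w = np.random.randn(1024, 1024).astype(np.float32)
w_q = fake_quant_int8(w)
print(f"SQNR: {sqnr_db(w, w_q):.1f} dB, RMS error: {rms_error_pct(w, w_q):.3f}%")
```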

---

## **Model Architecture**

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/eHNdopWWaoir2IP3Cu_AF.png" alt="Model Architecture" style="width:700px;"/>

AutoNeural is an NPU-native vision–language model co-designed for integer-only inference on edge devices such as the Qualcomm SA8295P.

- **Vision encoder.** A MobileNetV5-style CNN, initialized from Gemma 3n-E4B, takes 768×768 images and produces a 16×16×2048 feature map. A Multi-Scale Fusion Adapter (MSFA) fuses the last stages and flattens them into **256 visual tokens**, giving strong inductive bias and stable INT8/INT16 quantization.
- **Vision–language connector.** A lightweight two-layer MLP projects visual tokens into the language embedding space (sketched after this list). We deliberately remove normalization from the projector so that activation ranges are easier to calibrate for static NPU quantization.
- **Language backbone.** A 1.2B-parameter **hybrid Transformer–SSM (“Liquid AI”)** model with 16 layers, interleaving 10 gated-convolution SSM layers with 6 self-attention layers. The SSM layers provide linear-time inference and a compact state instead of a full KV cache, cutting memory I/O, while the attention layers preserve strong reasoning and in-context learning.
- **Quantization.** The deployed model uses mixed precision (e.g. W8A16 for vision, W4A16 for language) and NPU-aware graph partitioning to meet tight latency and memory budgets without sacrificing accuracy.
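
To make the shapes concrete, here is a minimal PyTorch sketch of the connector stage under the dimensions stated above. The class name, the GELU activation, and the language hidden size are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP projector with no normalization layers, so
    activation ranges stay easy to calibrate for static NPU
    quantization. All dimensions here are illustrative."""

    def __init__(self, vision_dim: int = 2048, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),  # activation choice is an assumption
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, 16, 16, 2048) from the vision encoder + MSFA
        b, h, w, c = feat_map.shape
        tokens = feat_map.reshape(b, h * w, c)  # (B, 256, 2048): 16×16 = 256 visual tokens
        return self.proj(tokens)                # (B, 256, lm_dim) in the LM embedding space

print(VisionLanguageConnector()(torch.randn(1, 16, 16, 2048)).shape)  # torch.Size([1, 256, 2048])
```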

---

## **Training**

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/GPFXmoOXaF-4M-nne6GPJ.png" alt="Training" style="width:700px;"/>

AutoNeural follows a four-stage curriculum on large-scale multimodal data plus a proprietary automotive dataset.

1. **Image–text alignment.** Freeze the vision and language backbones and train only the projector on image–caption pairs to learn basic visual grounding (a minimal sketch of this freezing recipe follows the list).
2. **General visual understanding.** Unfreeze the full model and train on broad VQA-style tasks (object/scene understanding, basic reasoning) from the Infinity-MM dataset to build strong general multimodal capability.
3. **Instruction tuning.** Continue training on diverse instruction-following data (documents, charts, OCR, multi-turn dialogue, specialized domains) with a mixture of task weights for balanced performance.
4. **Automotive domain finetuning.** Finetune on ~200k curated cockpit samples (AI Sentinel, Greeter, Car Finder, Safety when getting on/off) plus high-quality synthetic data, using an NPU-aware recipe that combines quantization-aware training, mixed-precision constraints, and calibration to keep post-quantization drift low on real hardware.
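
A hedged sketch of the stage-1 setup in PyTorch; the model structure and attribute names (`vision_encoder`, `connector`, `language_model`) and the optimizer settings are assumptions for illustration, not AutoNeural's actual training code.

```python
import torch
import torch.nn as nn

class DummyVLM(nn.Module):
    """Stand-in model; attribute names are assumptions, not AutoNeural's API."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Conv2d(3, 8, 3)
        self.connector = nn.Linear(8, 8)
        self.language_model = nn.Linear(8, 8)

def configure_stage1(model: nn.Module) -> torch.optim.Optimizer:
    """Stage 1: freeze both backbones, train only the projector."""
    for p in model.parameters():
        p.requires_grad = False               # freeze everything first
    for p in model.connector.parameters():
        p.requires_grad = True                # then unfreeze only the projector
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)  # lr is a placeholder

opt = configure_stage1(DummyVLM())
print(sum(p.numel() for g in opt.param_groups for p in g["params"]))  # projector params only
```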

---

## **License**

This model is released under the **Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0)** license, which allows use, sharing, and modification for non-commercial purposes only, with proper attribution.

All NPU-related models, runtimes, and code in this project are covered by this non-commercial license and may not be used in any commercial or revenue-generating application.

## **Enterprise Deployment**

For enterprise deployment, custom integrations, or licensing inquiries:

📅 **[Book a Call with Us](https://nexa.ai/book-a-call)**