---
license: mit
language:
- en
base_model:
- zhiqi-li/BEVFormer
pipeline_tag: object-detection
tags:
- Axera
- NPU
- Pulsar2
- BEVFormer
- Autonomous-Driving
- Bird-Eye-View
---
# BEVFormer on Axera NPU
This repository contains the [BEVFormer](https://arxiv.org/abs/2203.17270) model converted for high-performance inference on the Axera NPU. BEVFormer is a transformer-based framework for 3D object detection that learns unified spatio-temporal bird's-eye-view (BEV) representations from multi-camera inputs.
This version is quantized to **w8a16** (8-bit weights, 16-bit activations) and is compatible with **Pulsar2 version 4.2**.
## Conversion Tool Links
For model conversion and deployment guidance:
- [Axera Platform GitHub Repo](https://github.com/AXERA-TECH/bevformer.axera): Sample code and optimization guides for the Axera NPU.
- [Pulsar2 Documentation](https://pulsar2-docs.readthedocs.io/en/latest/pulsar2/introduction.html): Guide for converting ONNX models to `.axmodel`.
## Supported Platforms
- **AX650**
- [M4N-Dock (爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
| Chip | Model Variant | NPU1 Latency (per frame) | NPU3 Latency (per frame) |
|---|---|---|---|
| AX650 | BEVFormer-Tiny | 253.966 ms | 91.209 ms |
## How to Use
BEVFormer requires multi-view camera inputs (typically 6 views: front, front-left, front-right, back, back-left, back-right).
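As a rough illustration of that input layout, the sketch below stacks six camera frames into one batched image tensor. The file names, resolution, and bare-bones normalization are placeholders of our own; the real preprocessing (resize, mean/std, padding) must match what the `.axmodel` was compiled with, so consult `inference_config.json` and `inference_axmodel.py` from the repo above.

```python
import cv2
import numpy as np

# Hypothetical file layout: one JPEG per camera view for a single frame.
VIEWS = ["front", "front_left", "front_right", "back", "back_left", "back_right"]

def load_multiview_frame(frame_dir, height=480, width=800):
    """Stack 6 camera views into a (1, 6, 3, H, W) float32 tensor.

    The resolution and the trivial normalization below are placeholders;
    take the actual values from the model's inference configuration.
    """
    imgs = []
    for view in VIEWS:
        img = cv2.imread(f"{frame_dir}/{view}.jpg")   # BGR, HWC, uint8
        img = cv2.resize(img, (width, height))
        img = img.astype(np.float32)                  # placeholder normalization
        imgs.append(img.transpose(2, 0, 1))           # HWC -> CHW
    return np.stack(imgs)[np.newaxis, ...]            # (1, 6, 3, H, W)
```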
### Prerequisites
1. **Environment:** Ensure the required Python environment is activated (e.g., via Conda or a virtual environment) with the following core packages installed (a quick sanity check is sketched after this list):
    * **NPU Runtime:** `axengine` (PyAXEngine)
    * **Core Libraries:** `numpy` (>= 1.22.0), `opencv-python` (`cv2`), `tqdm`, and `cffi`

    *(Recommended: use a dedicated Conda environment to manage these dependencies.)*
2. **Model/Data:** Ensure the compiled `.axmodel`, `inference_config.json`, and input data (`inference_data/`) are available on the host.
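As an optional sanity check, the minimal snippet below confirms that the prerequisite packages import and that the expected files are present (the file names follow the inference command in the next section):

```python
import importlib
from pathlib import Path

# Core packages listed in the prerequisites above.
for pkg in ("axengine", "numpy", "cv2", "tqdm", "cffi"):
    mod = importlib.import_module(pkg)
    print(f"{pkg}: {getattr(mod, '__version__', 'ok')}")

# Model, config, and data expected by the inference command below.
for path in ("compiled.axmodel", "inference_config.json", "inference_data"):
    print(f"{path}: {'found' if Path(path).exists() else 'MISSING'}")
```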
### Inference Command
Run the inference script by providing the compiled model, configuration, and data directory.
```bash
python inference_axmodel.py compiled.axmodel inference_config.json inference_data/ --output-dir inference_results
```
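Under the hood the script drives the compiled model through `axengine`. The sketch below is a minimal single-pass example assuming PyAXEngine's onnxruntime-style `InferenceSession` API; the input names, shapes, and dtypes are not guessed here but read from the session metadata, and the zero-filled feed is only a stand-in for properly preprocessed camera frames.

```python
import numpy as np
import axengine as axe  # PyAXEngine

# Load the compiled model; the available providers mirror the log output below.
session = axe.InferenceSession("compiled.axmodel")

# Inspect the real input names and shapes instead of hard-coding them.
for inp in session.get_inputs():
    print(inp.name, inp.shape)

# Dummy feed for illustration only; a compiled .axmodel has fixed shapes, and
# the real dtypes/preprocessing must match the model's inference configuration.
feed = {inp.name: np.zeros(inp.shape, dtype=np.float32) for inp in session.get_inputs()}
outputs = session.run(None, feed)
print([o.shape for o in outputs])
```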
### Example Inference Log on AX650 Host
```
(base) root@ax650:~/data# python inference_axmodel.py compiled.axmodel inference_config.json inference_data/ --output-dir ./inference_results
[INFO] Available providers: ['AXCLRTExecutionProvider', 'AxEngineExecutionProvider']
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.12.0s
[INFO] Model type: 0 (single core)
[INFO] Compiler version: 5.1-patch1 82190926
Processing scene 1/2: fcbccedd61424f1b85dcbf8f897f9754 (40 frames)
Scene fcbccedd61424f1b85dcbf8f897f9754:  28%|████████████             | 11/40 [00:12<00:33,  1.15s/it]
/root/guofangming/inference_axmodel.py:389: RuntimeWarning: invalid value encountered in cast
  corners = imgfov_pts_2d[i].astype(np.int32)
Scene fcbccedd61424f1b85dcbf8f897f9754: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 40/40 [00:47<00:00,  1.18s/it]
Processing scene 2/2: 325cef682f064c55a255f2625c533b75 (41 frames)
Scene 325cef682f064c55a255f2625c533b75: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 41/41 [00:48<00:00,  1.18s/it]
Creating video: fcbccedd61424f1b85dcbf8f897f9754_result.mp4: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 40/40 [00:08<00:00,  4.83it/s]
βœ“ Scene fcbccedd61424f1b85dcbf8f897f9754: 40 frames, video: ./inference_results/fcbccedd61424f1b85dcbf8f897f9754/fcbccedd61424f1b85dcbf8f897f9754_result.mp4
Save scene results:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ             | 1/2 [00:23<00:23, 23.05s/it]
Creating video: 325cef682f064c55a255f2625c533b75_result.mp4: 7%|β–ˆβ–‹ | 3/41 [00:00<00:07, 4.92it/s]
Creating video: 325cef682f064c55a255f2625c533b75_result.mp4: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 41/41 [00:08<00:00, 4.78it/s]
βœ“ Scene 325cef682f064c55a255f2625c533b75: 41 frames, video: ./inference_results/325cef682f064c55a255f2625c533b75/325cef682f064c55a255f2625c533b75_result.mp4
Save scene results: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:47<00:00, 23.79s/it]
```
### Results
The model generates 3D detections projected onto the bird's-eye-view plane. Results are saved as images and videos that visualize the ego vehicle and the detected surrounding objects.
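Per the log above, each scene gets its own sub-directory under the output directory containing a `<scene_token>_result.mp4` video alongside the per-frame images; a short glob is enough to collect the videos:

```python
from pathlib import Path

# Output layout inferred from the example log:
#   inference_results/<scene_token>/<scene_token>_result.mp4
for video in sorted(Path("inference_results").glob("*/*_result.mp4")):
    print(video)
```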
**Example Visualization:**
![BEVFormer Detection Result GIF](output.gif)