# MonoIA — Towards Intrinsic-Aware Monocular 3D Object Detection
[CVPR 2026] | Paper | Project Page | GitHub
Zhihao Zhang, Abhinav Kumar, Xiaoming Liu

Michigan State University · University of North Carolina at Chapel Hill
## Model Summary
MonoIA is a unified intrinsic-aware monocular 3D object detection framework that achieves state-of-the-art performance on KITTI, Waymo, and nuScenes benchmarks. The core innovation is treating camera intrinsic variation as a semantic (rather than numeric) signal: focal lengths and camera parameters are encoded into language-grounded embeddings via large language models and vision–language models, then hierarchically injected into the detection network to adapt its features to the specific camera configuration at inference time.
This allows a single model to generalize across cameras with different focal lengths and fields of view, without retraining or per-camera fine-tuning.
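To make the injection mechanism concrete, here is a minimal, hypothetical sketch (not the released MonoIA code): the camera intrinsics are verbalized into a prompt, a stand-in text encoder produces an embedding (in the actual framework this would be an LLM/VLM encoder), and the embedding conditions the detector's feature maps via a FiLM-style scale-and-shift. All function names and the modulation form are illustrative assumptions.

```python
import hashlib
import numpy as np

def intrinsics_to_prompt(fx: float, fy: float, cx: float, cy: float) -> str:
    """Describe the camera semantically, the way a language model would see it."""
    return (f"a camera with focal lengths ({fx:.0f}, {fy:.0f}) px "
            f"and principal point ({cx:.0f}, {cy:.0f})")

def embed_prompt(prompt: str, dim: int = 16) -> np.ndarray:
    """Stand-in for an LLM/VLM text encoder: a deterministic pseudo-embedding.
    (The real framework would run the prompt through a pretrained encoder.)"""
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(dim)

def film_inject(features: np.ndarray, embedding: np.ndarray) -> np.ndarray:
    """FiLM-style injection: per-channel scale and shift conditioned on the
    intrinsic embedding, adapting features to the camera configuration."""
    c = features.shape[0]
    gamma = np.tanh(embedding[:c])           # per-channel scale offset
    beta = 0.1 * embedding[:c]               # per-channel shift
    return features * (1.0 + gamma[:, None, None]) + beta[:, None, None]

# Toy C x H x W feature map modulated by KITTI-like intrinsics.
feats = np.ones((16, 4, 4))
emb = embed_prompt(intrinsics_to_prompt(721.5, 721.5, 609.6, 172.9))
adapted = film_inject(feats, emb)
print(adapted.shape)  # (16, 4, 4)
```

Because the embedding is a function of the intrinsics alone, a camera with a different focal length yields a different embedding and hence a different feature modulation, which is what lets one model serve many cameras without fine-tuning.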
## Performance
### KITTI Val (Car, AP3D R40)
| AP3D Easy | AP3D Mod. | AP3D Hard |
|---|---|---|
| 33.61 | 24.40 | 20.80 |
### KITTI Test Leaderboard (Car, AP3D R40)
| AP3D Easy | AP3D Mod. | AP3D Hard |
|---|---|---|
| 29.52 | 19.11 | 17.93 |
Relative to prior state of the art, MonoIA improves AP3D Mod. by +1.18% on the KITTI test leaderboard, and by +4.46% on KITTI Val under multi-dataset training.
## Citation
```bibtex
@inproceedings{zhang2026towards,
  title={Towards Intrinsic-Aware Monocular 3D Object Detection},
  author={Zhang, Zhihao and Kumar, Abhinav and Liu, Xiaoming},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```