MonoIA — Towards Intrinsic-Aware Monocular 3D Object Detection

[CVPR 2026] | Paper | Project Page | GitHub

Zhihao Zhang, Abhinav Kumar, Xiaoming Liu
Michigan State University · University of North Carolina at Chapel Hill


Model Summary

MonoIA is a unified intrinsic-aware monocular 3D object detection framework that achieves state-of-the-art performance on KITTI, Waymo, and nuScenes benchmarks. The core innovation is treating camera intrinsic variation as a semantic (rather than numeric) signal: focal lengths and camera parameters are encoded into language-grounded embeddings via large language models and vision–language models, then hierarchically injected into the detection network to adapt its features to the specific camera configuration at inference time.

This allows a single model to generalize across cameras with different focal lengths and fields of view, without retraining or per-camera fine-tuning.
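To make the idea concrete, here is a minimal, hypothetical sketch of the intrinsic-as-semantic-signal concept: the camera intrinsics are described in natural language, embedded, and the embedding modulates detector features FiLM-style. The prompt wording, the toy hash-based text encoder, and the `film_modulate` helper are all illustrative stand-ins, not the paper's actual LLM/VLM encoders or injection modules.

```python
import numpy as np

def intrinsic_prompt(focal_px: float, width: int) -> str:
    """Describe camera intrinsics as text (the 'semantic' signal)."""
    fov = 2 * np.degrees(np.arctan(width / (2 * focal_px)))
    return (f"a camera with focal length {focal_px:.0f} px "
            f"and a {fov:.0f} degree horizontal field of view")

def toy_text_embed(prompt: str, dim: int = 8) -> np.ndarray:
    """Placeholder for an LLM/VLM text encoder (assumption, not MonoIA's)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def film_modulate(features: np.ndarray, embed: np.ndarray) -> np.ndarray:
    """FiLM-style injection: per-channel scale/shift from the embedding.

    features: (C, H, W) feature map; embed must have at least C entries.
    """
    c = features.shape[0]
    gamma = 1.0 + embed[:c, None, None]  # per-channel scale
    beta = embed[:c, None, None]         # per-channel shift
    return gamma * features + beta

# Example: condition a (4, 2, 2) feature map on KITTI-like intrinsics.
emb = toy_text_embed(intrinsic_prompt(focal_px=721.5, width=1242))
out = film_modulate(np.zeros((4, 2, 2)), emb)
```

Because the conditioning is computed from the intrinsics at inference time, the same weights can adapt to a new camera without retraining, which is the generalization property the paper targets.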


Performance

KITTI Val (Car, AP3D R40)

| AP3D Easy | AP3D Mod. | AP3D Hard |
|-----------|-----------|-----------|
| 33.61     | 24.40     | 20.80     |

KITTI Test Leaderboard (Car, AP3D R40)

| AP3D Easy | AP3D Mod. | AP3D Hard |
|-----------|-----------|-----------|
| 29.52     | 19.11     | 17.93     |

MonoIA improves AP3D Mod. by +1.18% on the KITTI test leaderboard, and by +4.46% on KITTI Val under multi-dataset training.


Citation

@inproceedings{zhang2026towards,
    title={Towards Intrinsic-Aware Monocular 3D Object Detection},
    author={Zhang, Zhihao and Kumar, Abhinav and Liu, Xiaoming},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2026}
}