Real-time monocular 3D object detection remains challenging due to severe depth ambiguity, viewpoint shifts, and the high computational cost of 3D reasoning. Existing approaches either rely on LiDAR or geometric priors to compensate for missing depth, or sacrifice efficiency to achieve competitive accuracy.
We introduce LeAD-M3D, a monocular 3D detector that achieves state-of-the-art accuracy and real-time inference without extra modalities. Our method is powered by three key components. Asymmetric Augmentation Denoising Distillation (A2D2) transfers geometric knowledge from a clean-image teacher to a mixup-noised student via a quality- and importance-weighted depth-feature loss, enabling stronger depth reasoning without LiDAR supervision. 3D-aware Consistent Matching (CM3D) improves prediction-to-ground truth assignment by integrating 3D MGIoU into the matching score, yielding more stable and precise supervision. Finally, Confidence-Gated 3D Inference (CGI3D) accelerates detection by restricting expensive 3D regression to top-confidence regions.
Together, these components set a new Pareto frontier for monocular 3D detection: LeAD-M3D achieves state-of-the-art accuracy on KITTI and Waymo, and the best reported car AP on Rope3D, while running up to 3.6× faster than prior high-accuracy methods. Our results demonstrate that high fidelity and real-time efficiency in monocular 3D detection are simultaneously attainable—without LiDAR, stereo, or geometric assumptions.
Asymmetric Augmentation Denoising Distillation (A2D2) is a novel, LiDAR-free knowledge distillation scheme that transfers robust geometric understanding
from a strong teacher model to a compact student model.
It creates an information asymmetry by feeding the teacher a clean image while giving the smaller student a mixup-augmented version of the same image.
This forces the student to learn a feature denoising task, guided by the teacher's precise object-level depth features.
The distillation is further enhanced by a novel feature loss that is dynamically weighted by the teacher's prediction quality and
the feature's importance, ensuring the student focuses on the most reliable and informative cues.
This process strengthens depth reasoning without needing privileged depth information.
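To make the weighting concrete, below is a minimal PyTorch sketch of such a quality- and importance-weighted feature distillation loss. The tensor shapes, the `teacher_quality` and `importance` inputs, and the multiplicative weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the A2D2 idea (assumed interfaces; not the authors' code).
# The teacher sees the clean image, the student sees the mixup-augmented one, and the
# student's object-level depth features are pulled toward the teacher's, weighted by
# teacher prediction quality and feature importance.
import torch
import torch.nn.functional as F

def a2d2_distill_loss(student_feats, teacher_feats, teacher_quality, importance):
    """
    student_feats:   (N, C) object-level depth features from the student (mixup input)
    teacher_feats:   (N, C) object-level depth features from the teacher (clean input)
    teacher_quality: (N,)   e.g. the teacher's per-object confidence/IoU, assumed in [0, 1]
    importance:      (N,)   per-feature importance weight, assumed in [0, 1]
    """
    per_obj = F.mse_loss(student_feats, teacher_feats.detach(), reduction="none").mean(dim=1)
    weights = teacher_quality * importance  # down-weight unreliable or uninformative cues
    return (weights * per_obj).sum() / weights.sum().clamp(min=1e-6)

# Usage (shapes only):
# with torch.no_grad():
#     t_feats, t_quality = teacher(clean_images)
# s_feats = student(mixup(clean_images))
# loss_distill = a2d2_distill_loss(s_feats, t_feats, t_quality, importance)
```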
3D-aware Consistent Matching (CM3D) significantly improves the crucial task of assigning model predictions to ground truth objects during training.
It enhances the matching score by explicitly integrating a 3D bounding box overlap term, specifically the 3D Marginalized Generalized IoU (MGIoU).
This joint 2D and 3D alignment criterion yields more stable and precise supervision, particularly in crowded scenes or during complex data augmentation like mixup.
By directly incorporating 3D geometric quality into the assignment, CM3D enables the model to learn better object localization in 3D space.
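The sketch below illustrates one way such a joint 2D/3D matching score could be formed, in the style of task-aligned assignment. The exponents and the assumption that the 3D MGIoU term is rescaled to [0, 1] are illustrative choices, not the paper's exact formula.

```python
# Illustrative sketch of a CM3D-style matching score (assumed weighting, not the paper's formula).
# A standard classification x 2D-IoU alignment metric is multiplied by a 3D overlap term so that
# assignments also reward good 3D box alignment; `mgiou_3d` stands in for the 3D MGIoU from the text.
import torch

def cm3d_matching_score(cls_prob, iou_2d, mgiou_3d, alpha=0.5, beta=6.0, gamma=2.0):
    """
    cls_prob: (P, G) predicted probability of the ground-truth class per prediction/GT pair
    iou_2d:   (P, G) 2D IoU between predicted and ground-truth boxes
    mgiou_3d: (P, G) 3D MGIoU, assumed rescaled to [0, 1]
    Returns a (P, G) score; higher means a better candidate assignment.
    """
    score_2d = cls_prob.pow(alpha) * iou_2d.pow(beta)  # 2D task-alignment term
    return score_2d * mgiou_3d.pow(gamma)              # gate assignment on 3D box quality as well

# Usage: for each ground truth, take the top-k predictions by this score as positives.
# scores = cm3d_matching_score(cls_prob, iou_2d, mgiou_3d)
# positives = scores.topk(k=10, dim=0).indices
```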
Confidence-Gated 3D Inference (CGI3D) is a lightweight inference strategy designed to drastically accelerate detection without sacrificing accuracy.
The method first runs the lightweight classification head across the entire feature map to quickly identify regions with high object confidence.
It then restricts the computationally expensive 3D regression head to only these top-confidence regions, processing them as small local patches.
This process reduces redundant computation across background areas and effectively cuts head-level FLOPs.
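A minimal sketch of this gating is given below, assuming a (1, C, H, W) feature map and separate classification and 3D regression heads; the patch size and top-k budget are placeholder values, not the released implementation.

```python
# Minimal sketch of confidence-gated 3D inference (assumed tensor layout; not the official code).
# The cheap classification head scores every location; the expensive 3D regression head then runs
# only on small patches around the top-confidence locations instead of on the full feature map.
import torch
import torch.nn.functional as F

def cgi3d(feat, cls_head, reg3d_head, top_k=100, patch=3):
    """
    feat:       (1, C, H, W) backbone/neck feature map
    cls_head:   lightweight module, (1, C, H, W) -> (1, num_classes, H, W) logits
    reg3d_head: heavy module, (K, C, patch, patch) -> (K, reg_dim) 3D box parameters
    """
    _, C, H, W = feat.shape
    conf = cls_head(feat).sigmoid().amax(dim=1).flatten()   # (H*W,) max class confidence per location
    idx = conf.topk(min(top_k, conf.numel())).indices       # keep only the top-confidence locations
    ys, xs = idx // W, idx % W

    pad = patch // 2
    padded = F.pad(feat, (pad, pad, pad, pad))               # so border patches stay in bounds
    patches = torch.stack([padded[0, :, y:y + patch, x:x + patch]
                           for y, x in zip(ys.tolist(), xs.tolist())])  # (K, C, patch, patch)
    return reg3d_head(patches), ys, xs                       # 3D parameters only for kept regions
```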
Comparison with lightweight M3D methods (< 30M parameters) on the KITTI test set for the car category. Extra: Highlights methods that utilize auxiliary data during training to improve accuracy. Params: Model parameters in millions. GFLOPs: Giga FLOPs per image at inference. Time: For a fair comparison, inference time is reported in ms for batch size 1 without TensorRT (hardware: NVIDIA RTX 8000).
Left: Comparison with SOTA monocular methods on the KITTI test set for the car category. Extra: Highlights methods that utilize auxiliary data during training to improve accuracy. Right: Comparison with SOTA monocular methods on the Waymo validation set for the vehicle category. We compare with methods that use the same training and validation set as in DEVIANT.
Qualitative results of LeAD-M3D X on the KITTI dataset. KITTI considers objects with high occlusion or truncation levels or with a 2D height below 25 pixels as background. We follow best practices and only learn the car, pedestrian and cyclist categories.
Qualitative results of LeAD-M3D X on the Waymo validation dataset. Waymo considers vehicles with less than 100 LiDAR points or whose projected 3D center is outside the image as background (DEVIANT style).
Qualitative results on the Rope3D dataset.
Top: Bird's-eye-view representation showing Ground Truth, LeAD-M3D X (Ours), and YOLOv10-3D X (Baseline). Bottom: Predictions of LeAD-M3D (Ours).
@article{meier2025lead-m3d,
author = {Meier, Johannes and Michel, Jonathan and Dhaouadi, Oussema and Yang, Yung-Hsu and Reich, Christoph and Bauer, Zuria and Roth, Stefan and Pollefeys, Marc and Kaiser, Jacques and Cremers, Daniel},
title = {{LeAD-M3D:} Leveraging Asymmetric Distillation for Real-time Monocular 3D Detection},
journal = {arXiv},
year = {2025},
}