LeAD-M3D:
Leveraging Asymmetric Distillation for
Real-Time Monocular 3D Detection

Johannes Meier1,2,3,4    Jonathan Michel3,4    Oussema Dhaouadi1,2,3,4    Yung-Hsu Yang2    Christoph Reich3,4,5,6    Zuria Bauer2    Stefan Roth5,6,7    Marc Pollefeys2,8    Jacques Kaiser1    Daniel Cremers3,4,6

1 DeepScenario     2 ETH Zurich     3 TU Munich     4 MCML     5 TU Darmstadt    
6 ELIZA     7 hessian.AI     8 Microsoft Research
Runtime vs. Accuracy trade-off.

Runtime vs. accuracy on the KITTI test set, using AP3D|R40 Mod (in %, ↑) and runtime (in ms, ↓). We provide a model family (sizes N to X) to balance runtime and 3D detection accuracy. LeAD-M3D establishes a new Pareto frontier over existing approaches. Using TensorRT further improves the runtime, enabling >60 FPS real-time inference even for our largest model size (X). Runtime is reported on the same hardware (NVIDIA RTX 8000) wherever possible (i.e., where code is publicly available).

Abstract

Real-time monocular 3D object detection remains challenging due to severe depth ambiguity, viewpoint shifts, and the high computational cost of 3D reasoning. Existing approaches either rely on LiDAR or geometric priors to compensate for missing depth or sacrifice efficiency to achieve competitive accuracy.

We introduce LeAD-M3D, a monocular 3D detector that achieves state-of-the-art accuracy and real-time inference without extra modalities. Our method is enabled by three key components. Asymmetric Augmentation Denoising Distillation (A2D2) transfers geometric knowledge from a clean-image teacher to a MixUp-noised student via a quality- and importance-weighted depth-feature loss, enabling stronger depth reasoning without LiDAR. 3D-aware Consistent Matching (CM3D) improves prediction-to-ground-truth assignment by integrating 3D MGIoU into the matching score, yielding stable and precise supervision. Finally, Confidence-Gated 3D Inference (CGI3D) accelerates inference by restricting expensive 3D regression to confident regions.

Together, these contributions set a new Pareto frontier for monocular 3D detection: LeAD-M3D achieves state-of-the-art accuracy on KITTI and Waymo, and the best reported car AP on Rope3D, while running up to 3.6× faster than prior high-accuracy models (e.g., MonoDiff). LeAD-M3D demonstrates that high-fidelity and real-time monocular 3D detection are simultaneously attainable, without LiDAR, stereo, or strong geometric assumptions. We will release the code and weights of LeAD-M3D.


Method

LeAD-M3D Asymmetric Augmentation Denoising Distillation (A2D2)

Asymmetric Augmentation Denoising Distillation (A2D2) is a novel, LiDAR-free knowledge distillation scheme that transfers robust geometric understanding from a strong teacher model to a compact student model. It creates an information asymmetry by feeding clean images to the teacher while giving the smaller student a MixUp-augmented image. This forces the student to solve a feature-denoising task, guided by the teacher's precise object-level depth features. The distillation is further enhanced by a feature loss that is dynamically weighted by the teacher's prediction quality and each feature's importance, ensuring the student focuses on the most reliable and informative cues. This process strengthens depth reasoning without needing privileged depth information.
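The asymmetric-input distillation above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function names, the MixUp coefficient, and the exact weighting scheme (here a normalized product of a teacher-quality score and a per-object importance score, applied to a per-object squared feature error) are illustrative assumptions.

```python
import numpy as np

def mixup_images(img_a, img_b, lam=0.7):
    """Student input: a convex combination of two training images.
    The teacher sees img_a unmodified (the 'clean' branch)."""
    return lam * img_a + (1.0 - lam) * img_b

def a2d2_feature_loss(student_feats, teacher_feats, teacher_quality, importance):
    """Quality- and importance-weighted L2 distillation on per-object depth
    features. student_feats/teacher_feats: (N, C) arrays for N objects;
    teacher_quality/importance: (N,) weights that down-weight objects where
    the teacher itself is unreliable or the feature is uninformative."""
    sq_err = (student_feats - teacher_feats) ** 2      # (N, C) per-object error
    w = teacher_quality * importance                   # (N,) combined weight
    w = w / (w.sum() + 1e-8)                           # normalize over objects
    return float((w[:, None] * sq_err).sum())
```

With this form, an object the teacher predicts poorly contributes little to the loss, so the student is never pushed toward the teacher's mistakes.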


LeAD-M3D 3D-aware Consistent Matching (CM3D)


3D-aware Consistent Matching (CM3D) significantly improves the crucial task of assigning model predictions to ground-truth objects during training. It enhances the matching score by explicitly integrating a 3D bounding-box overlap term, specifically the 3D Marginalized Generalized IoU (MGIoU). This joint 2D and 3D alignment criterion yields more stable and precise supervision, particularly in crowded scenes or under complex data augmentation such as MixUp. By directly incorporating 3D geometric quality into the assignment, CM3D enables the model to learn better object localization in 3D space.
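The matching score can be illustrated with a simplified, axis-aligned stand-in for 3D MGIoU (the full MGIoU handles rotated boxes; here each box is reduced to its per-axis extents and the 1D generalized IoUs are averaged). The score function, box parameterization, and exponent weights below are all illustrative assumptions, not the paper's exact formulation.

```python
def giou_1d(a_min, a_max, b_min, b_max):
    """Generalized IoU of two 1D intervals, in [-1, 1]."""
    inter = max(0.0, min(a_max, b_max) - max(a_min, b_min))
    union = (a_max - a_min) + (b_max - b_min) - inter
    hull = max(a_max, b_max) - min(a_min, b_min)       # smallest enclosing interval
    iou = inter / union if union > 0 else 0.0
    return iou - (hull - union) / hull if hull > 0 else iou

def mgiou3d(box_a, box_b):
    """Axis-aligned stand-in for 3D MGIoU: average the 1D GIoU of the two
    boxes' extents along x, y, z. Boxes: (cx, cy, cz, dx, dy, dz)."""
    total = 0.0
    for i in range(3):
        ca, da = box_a[i], box_a[3 + i]
        cb, db = box_b[i], box_b[3 + i]
        total += giou_1d(ca - da / 2, ca + da / 2, cb - db / 2, cb + db / 2)
    return total / 3.0

def cm3d_score(cls_prob, iou_2d, box_pred, box_gt, w2d=0.25, w3d=0.25):
    """Joint matching score combining classification confidence, 2D IoU,
    and the 3D overlap term (mapped from [-1, 1] to [0, 1])."""
    g3d = 0.5 * (mgiou3d(box_pred, box_gt) + 1.0)
    return (cls_prob ** (1.0 - w2d - w3d)) * (iou_2d ** w2d) * (g3d ** w3d)
```

Because the 3D term enters the assignment itself, a prediction with a good 2D box but a badly placed 3D box is ranked below one that agrees with the ground truth in both spaces.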


LeAD-M3D Confidence-Gated 3D Inference (CGI3D)


Confidence-Gated 3D Inference (CGI3D) is a lightweight inference strategy designed to drastically accelerate detection without sacrificing accuracy. The method first runs the inexpensive classification head across the entire feature map to quickly identify regions with high object confidence. It then restricts the computationally expensive 3D regression head to these top-confidence regions, processing them as small local patches. This reduces redundant computation on background areas and effectively cuts head-level FLOPs.
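The gating logic can be sketched as follows. This is a minimal NumPy sketch under assumed interfaces: the heads are passed in as plain callables over flattened feature cells, and the threshold and top-k values are illustrative defaults, not the paper's settings.

```python
import numpy as np

def confidence_gated_inference(feat_map, cls_head, reg_head_3d,
                               conf_thresh=0.3, top_k=100):
    """Run the cheap classification head densely over the feature map,
    then apply the expensive 3D regression head only to the cells that
    survive the confidence gate.

    feat_map: (H, W, C) feature map.
    cls_head: callable (M, C) -> (M,) objectness scores (cheap).
    reg_head_3d: callable (M, C) -> (M, 7) 3D box parameters (expensive).
    """
    H, W, C = feat_map.shape
    flat = feat_map.reshape(-1, C)
    scores = cls_head(flat)                       # cheap pass over every cell
    order = np.argsort(scores)[::-1][:top_k]      # top-k candidate cells
    keep = order[scores[order] >= conf_thresh]    # confidence gate
    boxes3d = reg_head_3d(flat[keep])             # expensive head, kept cells only
    return keep, scores[keep], boxes3d
```

Since typical scenes contain only a handful of objects, the regression head runs on a small fraction of the H × W cells, which is where the FLOP savings come from.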

Quantitative Results

LeAD-M3D lightweight comparison

Comparison with lightweight M3D methods (<30 M parameters) on the KITTI test set for the category "Car" using AP3D and APBEV (both in %). Extra indicates the use of auxiliary training data. Params reports the number of model parameters in millions. GFLOPs are measured for single-image inference. Time is reported in ms for single-image inference without TensorRT on an NVIDIA RTX 8000 GPU.

LeAD-M3D SOTA comparison

Left: Comparison with state-of-the-art M3D methods on the KITTI test set for the category "Car" using AP3D|R40 (in %). Extra indicates the use of auxiliary training data. Right: Comparison with state-of-the-art M3D methods on the Waymo validation set. We report AP3D & APBEV (both in %, ↑) for the "Vehicle" category. We compare with methods following the DEVIANT protocol.

Qualitative Results

Qualitative results of LeAD-M3D X on the KITTI dataset. KITTI considers objects with high occlusion or truncation levels, or with a 2D height below 25 pixels, as background. We follow best practice and only learn the car, pedestrian, and cyclist categories.

Qualitative results of LeAD-M3D X on the Waymo validation dataset. Waymo considers vehicles with less than 100 LiDAR points or whose projected 3D center is outside the image as background (DEVIANT style).

Qualitative results on the Rope3D dataset. Top: Bird's-eye-view representation showing the ground truth, LeAD-M3D X (Ours), and YOLOv10-3D X (Baseline). Bottom: Predictions of LeAD-M3D X (Ours).

References

BibTeX

@article{meier2025lead-m3d,
  author    = {Meier, Johannes and Michel, Jonathan and Dhaouadi, Oussema and Yang, Yung-Hsu and Reich, Christoph and Bauer, Zuria and Roth, Stefan and Pollefeys, Marc and Kaiser, Jacques and Cremers, Daniel},
  title     = {{LeAD-M3D:} Leveraging Asymmetric Distillation for Real-Time Monocular 3D Detection},
  journal   = {arXiv},
  year      = {2026},
}