arXiv preprint, 2026
Modality Forcing turns a pretrained text-to-image (T2I) model into a joint image-depth generator with a simple post-training recipe: a single diffusion transformer (DiT), separate noise levels per modality, and per-modality decoders that allow training on sparse real-world depth. Depth accuracy scales with T2I pretraining (300M → 3B), and our strongest model is competitive with state-of-the-art monocular depth estimators, reducing AbsRel by 57% relative to prior joint image-depth generative models.
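
To make two ingredients of the recipe concrete, the following is a minimal sketch, not the authors' implementation. It illustrates (a) drawing an independent noise level for each modality and (b) restricting a depth loss and the AbsRel metric to pixels that have ground truth, which is one way sparse depth can be used for supervision. All function names are hypothetical, and the linear noising rule and uniform noise-level sampling are assumptions that the text above does not specify. AbsRel is used in its standard sense: the mean of |prediction − target| / target over valid pixels.

```python
import random


def sample_noise_levels(rng):
    """Draw independent noise levels in [0, 1) for image and depth.

    Assumption: levels are uniform; the actual schedule is not given in the text.
    """
    return {"image": rng.random(), "depth": rng.random()}


def add_noise(clean, noise, t):
    """Blend clean values with noise at level t (t=0 keeps clean, t=1 is pure noise).

    Assumption: a linear interpolation, chosen only for illustration.
    """
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


def masked_mse(pred, target, valid):
    """Mean squared error over pixels flagged valid; 0.0 if none are valid."""
    errs = [(p - t) ** 2 for p, t, v in zip(pred, target, valid) if v]
    return sum(errs) / len(errs) if errs else 0.0


def abs_rel(pred, target, valid):
    """Absolute relative error: mean of |pred - target| / target over valid pixels."""
    errs = [abs(p - t) / t for p, t, v in zip(pred, target, valid) if v]
    if not errs:
        raise ValueError("no valid depth pixels to evaluate")
    return sum(errs) / len(errs)


if __name__ == "__main__":
    rng = random.Random(0)
    levels = sample_noise_levels(rng)  # one level per modality
    depth = [2.0, 4.0, 0.0]            # 0.0 marks a missing measurement
    valid = [d > 0.0 for d in depth]   # sparse-depth validity mask
    noise = [rng.gauss(0.0, 1.0) for _ in depth]
    noisy_depth = add_noise(depth, noise, levels["depth"])
    print(levels, noisy_depth, masked_mse(noisy_depth, depth, valid))
```

Because the image and depth noise levels are sampled separately, setting one level to 0 and the other to 1 corresponds to conditioning on one clean modality while generating the other, whereas equal levels correspond to joint generation. The validity mask ensures that pixels without a depth measurement contribute nothing to either the loss or the metric.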