
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Source: arXiv
Original Author: Shaocong Xu et al.


Researchers have developed TransPhy3D, a synthetic video dataset of 11,000 sequences showcasing transparent and reflective scenes using Blender/Cycles. This dataset aids in training DKT, a video-to-video translator that improves depth and normal estimation for transparent objects. DKT achieves state-of-the-art performance on benchmarks like ClearPose and enhances grasping success rates on complex surfaces, demonstrating the potential of repurposing diffusion models for advanced perception tasks in robotics.

Advancements in Depth Estimation for Transparent Objects Using Video Diffusion

A research team has developed a new model, DKT, which enhances depth and normal estimation for transparent objects using modern video diffusion techniques. This advancement addresses challenges in perception systems that struggle with transparent materials due to refraction and reflection.

To support this, the researchers created TransPhy3D, a synthetic video dataset with 11,000 sequences rendered using Blender/Cycles, featuring static and procedural assets like glass and plastic. The dataset was produced using physically based ray tracing and OptiX denoising to generate RGB images along with depth and normal maps for training DKT.
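To make the dataset layout concrete, here is a minimal sketch of what one TransPhy3D-style training sample could look like: per-frame RGB images paired with aligned depth and normal maps. The arrays are random placeholders (the real dataset is rendered with Blender/Cycles), and all names and dimensions here are hypothetical, not taken from the paper.

```python
import numpy as np

def make_synthetic_sample(num_frames=8, height=120, width=160, seed=0):
    """Illustrative stand-in for one video sample: RGB frames plus
    aligned per-pixel depth and surface-normal maps. Values are random
    here; the actual dataset renders them with physically based ray
    tracing."""
    rng = np.random.default_rng(seed)
    rgb = rng.uniform(0.0, 1.0, size=(num_frames, height, width, 3)).astype(np.float32)
    depth = rng.uniform(0.1, 10.0, size=(num_frames, height, width, 1)).astype(np.float32)
    normals = rng.normal(size=(num_frames, height, width, 3)).astype(np.float32)
    # Unit-normalize the normal vectors, as a normal map would store them.
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    return {"rgb": rgb, "depth": depth, "normal": normals}

sample = make_synthetic_sample()
print(sample["rgb"].shape)  # (8, 120, 160, 3)
```

A real loader would read these modalities from rendered files rather than sample them, but the per-frame pairing of RGB, depth, and normals is the essential structure for supervised training.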

The DKT model adopts a video-to-video translation approach, adapting a pretrained video diffusion backbone with lightweight LoRA adapters. Trained on both TransPhy3D and existing synthetic datasets, DKT conditions its DiT backbone by concatenating RGB latents with noisy depth latents, producing temporally consistent predictions across video frames.
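The conditioning step described above can be sketched with plain numpy: the clean RGB latent is concatenated channel-wise with a noised depth latent before being fed to the denoiser. The tensor shapes, noise schedule, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def concat_conditioning(rgb_latent, noisy_depth_latent):
    """Channel-wise concatenation of the RGB latent with the noisy depth
    latent; shapes are (frames, channels, h, w), chosen for illustration."""
    assert rgb_latent.shape[0] == noisy_depth_latent.shape[0]
    return np.concatenate([rgb_latent, noisy_depth_latent], axis=1)

rng = np.random.default_rng(0)
rgb_lat = rng.standard_normal((8, 4, 30, 40))    # 8 frames, 4 latent channels
depth_lat = rng.standard_normal((8, 4, 30, 40))

# Toy forward-noising step (illustrative interpolation, not the real schedule).
t = 0.7
noisy_depth = (1.0 - t) * depth_lat + t * rng.standard_normal(depth_lat.shape)

x = concat_conditioning(rgb_lat, noisy_depth)
print(x.shape)  # (8, 8, 30, 40)
```

Because the whole clip passes through the backbone jointly, the denoiser can attend across frames, which is what makes the depth predictions temporally consistent.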

In testing, DKT achieved state-of-the-art results in zero-shot scenarios on several benchmarks involving transparent objects, including ClearPose and TransPhy3D-Test. The model demonstrated improved accuracy and temporal consistency compared to established methods, setting new records in video normal estimation on ClearPose.

The compact DKT model operates at approximately 0.17 seconds per frame, showing promise for practical applications. Integrated into a grasping system, DKT enhances success rates for manipulating translucent and reflective surfaces, outperforming previous depth estimation models.
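The reported runtime translates directly into throughput; a quick check (clip length below is a hypothetical example, not from the paper):

```python
seconds_per_frame = 0.17          # reported DKT runtime per frame
fps = 1.0 / seconds_per_frame
clip_frames = 49                  # hypothetical clip length for illustration
clip_seconds = clip_frames * seconds_per_frame

print(f"{fps:.1f} frames/s")      # ~5.9 frames/s
print(f"{clip_seconds:.2f} s per {clip_frames}-frame clip")
```

Roughly six frames per second is below real-time video rates, but it is workable for grasp planning, where perception runs once per manipulation attempt rather than continuously.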

Related Topics:

video diffusion, transparent object depth, normal estimation, TransPhy3D, DKT

📰 Original Source: https://arxiv.org/abs/2512.23705v1

All rights and credit belong to the original publisher.
