Google DeepMind presents D4RT, a model that lets AI see the world in four dimensions

Google DeepMind announced D4RT (Dynamic 4D Reconstruction and Tracking), a new AI model that reconstructs and interprets dynamic scenes in four dimensions (three-dimensional space plus time) from ordinary 2D video. The goal is to bring machine perception closer to how humans understand the world, maintaining a continuous representation of reality despite motion, occlusions, and viewpoint changes.

From a technical standpoint, D4RT addresses one of the hardest problems in computer vision: reconstructing geometry and motion simultaneously without a fragmented pipeline. Instead of chaining multiple specialized models, the system uses a single Transformer encoder-decoder architecture that learns a global scene representation and then answers specific queries about position, depth, and motion at different points in time.

D4RT's central differentiator is its flexible query mechanism. The model answers questions such as "where is this pixel in 3D space at a given instant and viewpoint?", which allows tasks like point tracking, 3D point cloud reconstruction, and camera pose estimation to be solved within a single framework. Because queries are independent of one another, they can be processed in parallel, which makes the system highly scalable.

On benchmarks, D4RT showed significant performance gains. According to DeepMind, the model is between 18 and 300 times faster than prior methods and processes a one-minute video in about five seconds on a single TPU chip, roughly 12 times faster than real time. Beyond speed, the results show higher fidelity in complex scenes with fast-moving objects, non-rigid deformations, and strong motion blur.

These characteristics make D4RT especially relevant for practical applications such as robotics, augmented reality, and spatial computing. Robots can gain more reliable spatial awareness in dynamic environments, while AR devices can gain an immediate understanding of real-world geometry.
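To make the query mechanism described above more concrete, here is a minimal Python sketch of the idea. It is not DeepMind's code or API: the `Query` fields, the `ToyScene` class, and the `decode` function are hypothetical names invented for illustration, and the learned Transformer decoder is replaced by a hand-written toy scene in which everything moves at constant velocity and the camera only translates. The sketch only illustrates two points from the article: different tasks reduce to different sets of queries against one shared scene representation, and the queries can be answered independently.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical query: where is this pixel in 3D at a given instant and viewpoint?"""
    u: float    # pixel column in the source frame, normalized to [0, 1]
    v: float    # pixel row in the source frame, normalized to [0, 1]
    t_src: int  # frame in which the pixel was picked
    t_tgt: int  # instant at which the 3D position is requested
    t_cam: int  # frame whose camera defines the output coordinates


@dataclass(frozen=True)
class ToyScene:
    """Hypothetical stand-in for the encoder's global scene representation."""
    depth: float = 2.0                  # every pixel sits at this depth
    velocity: tuple = (0.1, 0.0, 0.0)   # scene motion per frame
    cam_step: tuple = (0.0, 0.0, 0.05)  # camera translation per frame


def decode(scene, q):
    """Answer one query from the shared scene and the query alone."""
    # Unproject the pixel into the source camera's coordinates.
    point = ((q.u - 0.5) * scene.depth, (q.v - 0.5) * scene.depth, scene.depth)
    dt = q.t_tgt - q.t_src  # frames of scene motion to apply
    dc = q.t_cam - q.t_src  # frames of camera motion to compensate
    return tuple(p + vel * dt - cam * dc
                 for p, vel, cam in zip(point, scene.velocity, scene.cam_step))


def decode_batch(scene, queries):
    """Queries share no state, so they can be answered in parallel."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda q: decode(scene, q), queries))


def track_queries(u, v, t_src, num_frames):
    """Point tracking: one pixel, asked about at every instant."""
    return [Query(u, v, t_src, t, t) for t in range(num_frames)]


def cloud_queries(t, grid=4):
    """Point cloud: a grid of pixels from frame t, expressed in frame 0's camera."""
    steps = [i / (grid - 1) for i in range(grid)]
    return [Query(u, v, t, t, 0) for v in steps for u in steps]


scene = ToyScene()
track = decode_batch(scene, track_queries(0.5, 0.5, t_src=0, num_frames=5))
cloud = decode_batch(scene, cloud_queries(t=2))
```

In this toy setup, `track` holds the 3D path of the center pixel across five frames and `cloud` holds a 4x4 grid of points from frame 2 in a common coordinate system. Both come from the same `decode` function; only the list of queries changes. Camera pose estimation is left out of the sketch.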
More broadly, the model represents a concrete step toward "world models": AI systems capable of understanding and predicting the dynamics of the physical world. By combining accuracy, efficiency, and generalization, D4RT signals a structural shift in how AI perceives reality.