A Deep Learning Approach to Improve Depth Prediction

Existing methods for recovering the depth of dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and may only recover sparse depth.

Google AI researchers say they have collected about 2,000 YouTube videos in which people imitate mannequins and used them as a data set to train an AI model capable of predicting depth from videos in motion. Such a capability could help developers craft augmented reality experiences in scenes shot with hand-held cameras, as well as 3D video.

The researchers believe these videos provide exactly the data needed to learn depth in footage where both the camera and the people in the scene are moving. In these clips, people imitate mannequins by freezing in a wide variety of natural poses while a hand-held camera tours the scene. Because the entire scene is stationary (only the camera moves), triangulation-based methods, such as multi-view stereo (MVS), work, yielding accurate depth maps for the entire scene, including the people in it.
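The triangulation idea the article refers to can be illustrated with its simplest instance, depth from stereo disparity. This is an assumption-level sketch for intuition, not the paper's actual MVS pipeline; the focal length, baseline, and disparity values below are hypothetical:

```python
# Minimal triangulation illustration: with two camera views of a *static*
# point, its depth follows directly from the disparity (horizontal shift)
# between the two projections. This is why a frozen scene filmed by a
# moving camera yields dense, accurate depth.
f = 500.0         # focal length in pixels (hypothetical)
B = 0.1           # baseline between the two camera positions, in meters (hypothetical)
disparity = 25.0  # shift of the point between the two views, in pixels (hypothetical)

Z = f * B / disparity  # depth of the point in meters
print(Z)  # 2.0
```

If the point itself moved between the two views, the disparity would mix camera motion with object motion and this formula would break down, which is exactly the case the paper's learning-based approach is designed to handle.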

Depth prediction network

In the paper 'Learning the Depths of Moving People by Watching Frozen People', the research team describes a deep learning-based approach that can generate depth maps from ordinary video in which both the camera and the subjects move freely.

"The model avoids direct 3D triangulation by learning priors on human pose and shape from data," the team explained in the blog post. "While there is a recent surge in using machine learning for depth prediction, this work is the first to tailor a learning-based approach to the case of simultaneous camera and human motion."

The team focuses specifically on humans because they are an interesting target for augmented reality and 3D video. Google researchers say the approach outperforms state-of-the-art tools for producing depth maps.

“To the extent that people succeed in staying still during the videos, we can assume the scenes are static and obtain accurate camera poses and depth information by processing them with structure-from-motion (SfM) and multi-view stereo (MVS) algorithms,” the paper reads. “Because the entire scene, including the people, is stationary, we estimate camera poses and depth using SfM and MVS, and use this derived 3D data as supervision for training.”
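The quote above describes using SfM/MVS-derived depth from the frozen scenes as training supervision. One common way to apply such supervision (an assumption for illustration, not necessarily the paper's exact loss) is a scale-invariant loss on log depth, computed only over pixels where the MVS depth is valid:

```python
import numpy as np

# Hedged sketch: supervise predicted depth with MVS depth where it is valid.
# A scale-invariant log-depth loss is a common choice because SfM/MVS depth
# is only known up to a global scale.
def scale_invariant_loss(pred_depth, mvs_depth, valid_mask):
    """Scale-invariant MSE in log-depth over valid MVS pixels."""
    d = np.log(pred_depth[valid_mask]) - np.log(mvs_depth[valid_mask])
    return float((d ** 2).mean() - d.mean() ** 2)

pred = np.array([[1.0, 2.0], [4.0, 8.0]])
target = 2.0 * pred                     # prediction off by a global scale only
valid = np.ones_like(pred, dtype=bool)  # pretend MVS depth is valid everywhere
loss = scale_invariant_loss(pred, target, valid)
print(loss)  # ~0.0: a pure global-scale error is not penalized
```

The valid mask matters because MVS typically fails to reconstruct some pixels (textureless or occluded regions), so supervision is applied only where the derived 3D data exists.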

The team trained a neural network that takes as input an RGB image, a mask of human regions, and an initial depth estimate of the non-human environment, and produces a full depth map along with predictions of human shape and pose.
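The input/output interface described above can be sketched as a small encoder-decoder network. This is a minimal hypothetical sketch of that interface, not the paper's actual architecture; the layer sizes and the specific way of combining inputs are assumptions:

```python
import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Hypothetical sketch: RGB + human mask + environment depth -> full depth map."""
    def __init__(self):
        super().__init__()
        # 5 input channels: 3 (RGB) + 1 (human mask) + 1 (masked initial depth)
        self.encoder = nn.Sequential(
            nn.Conv2d(5, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),  # 1-channel depth map
        )

    def forward(self, rgb, human_mask, env_depth):
        # Zero out the initial depth inside human regions, so the network
        # must infer depth for the people from learned priors.
        masked_depth = env_depth * (1 - human_mask)
        x = torch.cat([rgb, human_mask, masked_depth], dim=1)
        return self.decoder(self.encoder(x))

net = DepthNet()
rgb = torch.randn(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64)   # no humans in this toy frame
depth0 = torch.rand(1, 1, 64, 64)  # stand-in for the environment depth estimate
pred = net(rgb, mask, depth0)
print(pred.shape)  # torch.Size([1, 1, 64, 64])
```

The key design point the article highlights is that the human regions arrive with no depth at all, forcing the network to fill them in from priors on human pose and shape learned during training.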
