Niantic has deep roots in research, creating and applying state-of-the-art technologies in our products and discovering new solutions that push the industry forward. The European Conference on Computer Vision (ECCV) is one of the preeminent gatherings of computer vision researchers and software engineers in the world, and this year we are honored to spotlight three papers from our Research team. These papers present different approaches to moving beyond AR’s reliance on flat planes, helping our technology understand the shapes and locations of the real world well enough for AR characters to move around and make use of the terrain they inhabit. This research will ultimately lead to more realistic and compelling AR experiences for explorers of mixed reality.
As one of the Program Co-Chairs, I’ve worked with fellow organizers to marshal the peer review process for over 5,800 papers submitted by talented researchers all over the world. The human effort to find the best validated ideas is enormous, involving 18,000 authors and 4,700 peer reviewers. What I love about ECCV is the perfect mix of cooperation and competition: it moves science forward, keeps our industry on its toes, and pushes us to innovate and adopt new technologies, improving year over year. And as one of the first opportunities for our global community to gather in person since the pandemic began, this year’s conference in Tel Aviv promises to be more inspiring than ever.
Below is a summary of Niantic’s contributions to ECCV this year. I’m proud and excited for my colleagues to share their groundbreaking work at the conference. Some of this work will directly impact Niantic products and services, while some of it will push us to think differently about where AR can go. We believe all of it will drive innovation and conversations across the industry with our peers, especially at ECCV. We will share more details about our team’s research work in future engineering blog posts.
– Gabriel J. Brostow, Chief Research Scientist
SimpleRecon: 3D Reconstruction Without 3D Convolutions
by Mohamed Sayed, John Gibson, Jamie Watson, Victor Adrian Prisacariu, Michael Firman, and Clément Godard
Traditionally, 3D indoor scene reconstruction from posed images happens in two phases: per-image depth estimation, followed by depth merging and surface reconstruction. Recently, a family of methods has emerged that performs reconstruction directly in a final 3D volumetric feature space. While these methods have shown impressive reconstruction results, they rely on expensive 3D convolutional layers, limiting their application in resource-constrained environments. In this work, we instead return to the traditional route and show how focusing on high-quality multi-view depth prediction leads to highly accurate 3D reconstructions using simple, off-the-shelf depth fusion. We propose a simple state-of-the-art multi-view depth estimator with two main contributions: 1) a carefully designed 2D CNN that utilizes strong image priors alongside a plane-sweep feature volume and geometric losses, combined with 2) the integration of keyframe and geometric metadata into the cost volume, which allows informed depth-plane scoring. Our method achieves a significant lead over the current state of the art for depth estimation, and close or better performance for 3D reconstruction, on ScanNet and 7-Scenes, while still allowing online, real-time, low-memory reconstruction.
SimpleRecon is also fast: at a batch size of one, it processes a frame in 70 ms, making accurate reconstruction via fast depth fusion possible.
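At the heart of the depth estimator is a plane-sweep cost volume: source-view features are warped into the reference view at a set of depth hypotheses and scored against the reference features. A minimal NumPy sketch of this matching geometry (with toy inputs, and not Niantic's implementation) might look like:

```python
import numpy as np

def plane_sweep_cost_volume(f_ref, f_src, K, R, t, depths):
    """Toy plane-sweep matching (illustrative, not SimpleRecon itself).

    For each fronto-parallel depth hypothesis, warp source-view features
    into the reference view via the plane-induced homography and score
    them against the reference features with a per-pixel dot product."""
    C, H, W = f_ref.shape
    n = np.array([0.0, 0.0, 1.0])  # plane normal (fronto-parallel sweep)
    K_inv = np.linalg.inv(K)
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(float)
    cost = np.zeros((len(depths), H, W))
    for i, d in enumerate(depths):
        # homography mapping reference pixels to source pixels
        # for a plane at depth d
        H_d = K @ (R + np.outer(t, n) / d) @ K_inv
        p = H_d @ pix
        u = np.round(p[0] / p[2]).astype(int)
        v = np.round(p[1] / p[2]).astype(int)
        ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        warped = np.zeros((C, H * W))
        warped[:, ok] = f_src[:, v[ok], u[ok]]  # nearest-neighbor sampling
        cost[i] = (f_ref.reshape(C, -1) * warped).sum(0).reshape(H, W)
    return cost  # (D, H, W); argmax over D gives a crude depth map
```

In SimpleRecon proper, this volume is augmented with keyframe and geometric metadata channels and reduced by a 2D CNN rather than read out by a simple argmax; the sketch only illustrates the underlying geometry.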
Learn more on GitHub.
Map-free Visual Relocalization: Metric Pose Relative to a Single Image
by Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Áron Monszpart, Victor Adrian Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann
Can we relocalize in a scene represented by a single reference image? Standard visual relocalization requires hundreds of images and scale calibration to build a scene-specific 3D map. In contrast, we propose Map-free Relocalization, i.e., using only one photo of a scene to enable instant, metric-scaled relocalization. Existing datasets are not suitable for benchmarking map-free relocalization, due to their focus on large scenes or their limited variability. Thus, we have constructed a new dataset of 655 small places of interest, such as sculptures, murals and fountains, collected worldwide. Each place comes with a reference image to serve as a relocalization anchor, and dozens of query images with known, metric camera poses. The dataset features changing conditions, stark viewpoint changes, high variability across places, and queries with low to no visual overlap with the reference image. We identify two viable families of existing methods to provide baseline results: relative pose regression, and feature matching combined with single-image depth prediction. While these methods show reasonable performance on some favorable scenes in our dataset, map-free relocalization proves to be a challenge that requires new, innovative solutions.
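The second baseline family works because monocular depth prediction lifts 2D feature matches into metric 3D, at which point a rigid transform gives a metric relative pose. A minimal sketch of that geometric core, assuming matched keypoints and per-point depths are already given (the function names are illustrative, not an API from the paper):

```python
import numpy as np

def backproject(uv, depth, K):
    """Lift pixel coordinates (N, 2) with per-point metric depths (N,)
    into 3D points in the camera frame."""
    ones = np.ones((uv.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([uv, ones]).T).T
    return rays * depth[:, None]

def rigid_align(p, q):
    """Kabsch/Procrustes: least-squares R, t such that q ~= R @ p + t,
    e.g. aligning reference-frame points p to query-frame points q."""
    cp, cq = p.mean(0), q.mean(0)
    U, _, Vt = np.linalg.svd((p - cp).T @ (q - cq))
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp
```

In practice such a baseline also needs a feature matcher, an outlier-robust estimator such as RANSAC, and a depth network; the sketch shows only how metric scale enters through the back-projected points.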
Learn more on GitHub.
Camera Pose Estimation and Localization with Active Audio Sensing
by Karren D. Yang, Michael Firman, Eric Brachmann, and Clément Godard
In this work, we show how to estimate a device’s position and orientation indoors by echolocation, i.e., by interpreting the echoes of an audio signal that the device itself emits. Established visual localization methods rely on the device’s camera and yield excellent accuracy if unique visual features are in view and depicted clearly. We argue that audio sensing can offer complementary information to vision for device localization, since audio is invariant to adverse visual conditions and can reveal scene information beyond a camera’s field of view. We first propose a strategy for learning an audio representation that captures the scene geometry around a device using supervision transfer from vision. Subsequently, we leverage this audio representation to complement vision in three device localization tasks: relative pose estimation, place recognition, and absolute pose regression. Our proposed methods outperform state-of-the-art vision models on new audio-visual benchmarks for the Replica and Matterport3D datasets.
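The paper learns audio representations via supervision transfer rather than classical signal processing, but the physical cue it exploits, that echoes of a device-emitted signal encode scene geometry, can be illustrated with a toy cross-correlation range estimate (all names and numbers here are illustrative):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature
FS = 16_000             # sample rate, Hz

def echo_distance(emitted, recorded, fs=FS):
    """Estimate the distance to a reflector from the round-trip delay
    between an emitted signal and its strongest echo in the recording."""
    corr = np.correlate(recorded, emitted, mode="full")
    lag = int(np.argmax(corr)) - (len(emitted) - 1)  # delay in samples
    return (lag / fs) * SPEED_OF_SOUND / 2.0         # halve the round trip

# simulate a short chirp and a single echo delayed by 200 samples
t = np.arange(0, 0.01, 1 / FS)
chirp = np.sin(2 * np.pi * (2_000 + 2e5 * t) * t)
recorded = np.zeros(len(chirp) + 400)
recorded[200:200 + len(chirp)] = 0.5 * chirp  # attenuated, delayed copy
```

A single delay like this only yields one range; the learned representation in the paper instead captures richer geometry from the full echo response, which is what makes it useful for pose estimation, place recognition, and pose regression.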
More to come!
Gabriel J. Brostow is Chief Research Scientist at Niantic and a Professor of Computer Science at University College London. He is a Program Co-Chair of ECCV 2022.