September 26, 2024
Niantic Presents New Research and Hosts Map-free Visual Relocalization Workshop & Challenge at ECCV 2024

Presented and shared by members of the Niantic Research and Mapping teams the week of 29 September in Milan
  • Researchers at Niantic will present four papers at ECCV, including an oral presentation of “ACE Zero”, a new 3D reconstruction method that uses a neural network to efficiently reconstruct high-quality scenes.

  • We are also proud to present DoubleTake: Geometry-Guided Depth Estimation, a method for computing state-of-the-art depth maps and meshes; GroundUp: Rapid Sketch-Based 3D City Massing, a sketch-based 3D city modeling method; and HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning, a method for accurate real-time estimation of a 3D hand mesh from a camera feed.

  • The Niantic Research and Mapping teams are organizing a workshop and challenge at ECCV 2024 on the problem of Map-free Visual Relocalization, in which algorithms must estimate the metric camera pose of a query image given a single reference image. The workshop includes presentations from leading researchers at Meta, Google, Naver Labs Europe and prestigious universities such as CMU, CTU Prague and Oxford University, and the awarding of cash prizes based on submissions received since the competition began in May.

Computer vision is at the heart of many Niantic products. Digital characters and environments appear to fit naturally into the real world because Niantic algorithms interpret what mobile phone cameras see and layer in virtual content with high precision, as if those characters and objects truly existed there. These AR experiences are enabled by computer vision algorithms, including depth estimation, semantic segmentation, object detection and a visual positioning system (VPS), that are developed at Niantic and freely available in Lightship ARDK, our augmented reality developer kit for Unity. Gaussian splatting, a breakthrough in computer vision and graphics, allows Niantic’s Scaniverse to generate a realistic 3D reconstruction of a scene on your phone and add it to the map, driving our mission of mapping the world.

This is why it is important for us to advance the field of computer vision: by conducting fundamental research, adapting and applying state-of-the-art algorithms to AR, publishing scientific articles, and serving as area chairs and reviewers at top computer vision conferences. This year, we are proud to have four accepted papers at the biennial European Conference on Computer Vision (ECCV), a premier international venue for top researchers in the field. In addition, this is the first year Niantic researchers have organized a workshop and competition at ECCV, on the problem of Map-free Visual Relocalization.

MicKey, trained on Niantic’s Map-free Relocalization dataset, learns to estimate the relative camera pose of two images captured from opposing views.

Map-free Visual Relocalization Workshop at ECCV 2024

Visual positioning systems (VPS) that localize a camera in the real world require an accurate 3D reconstruction, or map, of the world. Building this map is a chicken-and-egg problem. To add a new location to the map, one needs to know where one is relative to already-mapped areas and which parts are missing, which requires being able to relocalize. However, to train relocalization algorithms for a location, one needs a map of that location. So, to extend the “edge of the map”, we need to relocalize from as little information as possible. In the limit, only a single image has captured the edge of the map, and relocalizing against this single image is the Map-free Visual Relocalization task. The task was initially proposed in our ECCV 2022 paper together with the dataset used for the challenge, Niantic’s Map-free Relocalization Dataset, which contains crowd-sourced (and anonymized) scans of points of interest from all over the world, representing a large step forward in user-like realism compared to other academic datasets.
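To make the task concrete, here is a minimal sketch (our own illustration using classical two-view geometry via OpenCV, not a challenge method): feature matching between the reference and query image recovers the relative rotation and only the direction of translation, so a map-free method must additionally recover the metric scale, for example from predicted depth or learned metric keypoints.

```python
# Sketch of the map-free setting, not Niantic's method: classical two-view
# geometry recovers R and a unit-norm translation direction only; the metric
# scale is the missing ingredient that map-free methods must supply.
import cv2
import numpy as np

def relative_pose_up_to_scale(img_ref, img_query, K):
    """Estimate R and unit-norm t between a reference and a query image (grayscale)."""
    orb = cv2.ORB_create(4000)
    k0, d0 = orb.detectAndCompute(img_ref, None)
    k1, d1 = orb.detectAndCompute(img_query, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d0, d1)
    pts0 = np.float32([k0[m.queryIdx].pt for m in matches])
    pts1 = np.float32([k1[m.trainIdx].pt for m in matches])
    E, inliers = cv2.findEssentialMat(pts0, pts1, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts0, pts1, K, mask=inliers)
    return R, t  # t has unit norm: no metric scale, unlike a map-free relocalizer
```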

Besides the AR focus of the dataset, we also reflect the quality of the AR experience in how we rank the accuracy of the visual localizers. Algorithms are ranked by the VCRE (Virtual Correspondence Reprojection Error) score, which measures what an array of virtual objects placed at varying distances in front of a “user” would look like under the estimated camera pose. The error measures how far the virtual objects appear on screen from where they should be. As we evaluate many scenes, we report the fraction of query images for which the VCRE is within an acceptable threshold.
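The sketch below illustrates a VCRE-style evaluation. The grid layout, depths and threshold are illustrative assumptions rather than the official challenge values; only the principle follows the description above: project virtual points with both the ground-truth and the estimated pose, compare on screen, then threshold per query image.

```python
# Illustrative VCRE-style score; constants (grid, depths, 90 px) are assumptions.
import numpy as np

def project(K, T_cam_from_world, pts_world):
    """Project world points into a camera with 4x4 pose T and intrinsics K."""
    pts_h = np.concatenate([pts_world, np.ones((len(pts_world), 1))], axis=1)
    pts_cam = (T_cam_from_world @ pts_h.T).T[:, :3]
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]

def vcre(K, T_gt, T_est, grid_depths=(1.0, 2.0, 3.0)):
    """Mean pixel offset of a grid of virtual objects placed in front of the user."""
    xs, ys = np.meshgrid(np.linspace(-0.5, 0.5, 3), np.linspace(-0.5, 0.5, 3))
    pts_cam = np.stack([np.stack([xs.ravel(), ys.ravel(), np.full(xs.size, d)], axis=1)
                        for d in grid_depths]).reshape(-1, 3)
    # Lift the virtual points to world coordinates via the ground-truth camera pose.
    pts_world = (np.linalg.inv(T_gt) @ np.concatenate(
        [pts_cam, np.ones((len(pts_cam), 1))], axis=1).T).T[:, :3]
    err = np.linalg.norm(project(K, T_gt, pts_world) - project(K, T_est, pts_world), axis=1)
    return err.mean()

def precision_at_threshold(errors_px, thresh_px=90.0):
    """Fraction of query images whose VCRE is within the acceptance threshold."""
    return float(np.mean(np.asarray(errors_px) < thresh_px))
```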

Based on the growing interest in map-free relocalization approaches, we decided to create a workshop to gather the research community and challenge them to compete on the leaderboard. We also extended the competition with a second task (Multi-Frame), where multiple query images and their tracking poses, captured over a short time span and a narrow baseline, can be used; this is a more realistic relocalization scenario on a mobile device.

Since we submitted the workshop proposal in January, we have seen tremendous progress on the leaderboard, first from our own method MicKey earlier this year, and then from MASt3R by Naver Labs Europe, which will also be presented at ECCV. Both represent a new generation of 3D-aware image matching methods that achieve accurate pairwise relocalization in scenarios that might have been deemed impossible as little as one year ago. Indeed, the new methods can even relocalize two images that capture an object from opposite views. A large part of this breakthrough is due to the map-free relocalization task, dataset and challenge, which created the environment these methods needed to emerge and show their promise.

The workshop also features invited keynote presentations from leading researchers at Meta, Google, Naver Labs Europe and prestigious universities such as CMU, CTU Prague and Oxford University. The challenge winners will be awarded cash prizes co-sponsored by Niantic and Naver Labs Europe.

ACE Zero: “Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer” to be presented at ECCV 2024

ACE Zero

ACE Zero is a novel 3D reconstruction method that estimates accurate camera poses for a collection of images capturing a scene. It builds on Niantic’s ACE method, which trains a neural network on images and their camera poses and, in a few minutes, produces an accurate visual relocalizer. ACE Zero rethinks the traditional 3D map-building process and leverages ACE to estimate camera poses for thousands of images in an hour or less on a single GPU machine. We start by training a visual relocalizer on a single seed image. The relocalizer is then used to estimate camera poses for the remaining images; images with high-confidence pose estimates are added to the reconstruction, their poses are refined, and the relocalizer is retrained on the enlarged set. This is repeated incrementally until all images are relocalized. ACE Zero estimates poses on par with state-of-the-art 3D reconstruction methods, but in a significantly shorter time.
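The loop below is a structural sketch of that incremental scheme. The callables train_relocalizer, estimate_pose and refine_poses are hypothetical placeholders for ACE Zero's components, passed in as parameters for illustration; they are not its actual API.

```python
# Structural sketch of an ACE Zero-style incremental registration loop.
# train_relocalizer, estimate_pose, refine_poses are hypothetical stand-ins.
import numpy as np

def incremental_registration(image_ids, train_relocalizer, estimate_pose,
                             refine_poses, confidence_threshold=0.8):
    """Register all images to a common frame by growing a relocalizer from one seed."""
    registered = {image_ids[0]: np.eye(4)}        # the seed image defines the coordinate frame
    relocalizer = train_relocalizer(registered)   # ACE-style network, trained in minutes

    while len(registered) < len(image_ids):
        # Relocalize every not-yet-registered image against the current relocalizer.
        new_poses = {}
        for img_id in image_ids:
            if img_id in registered:
                continue
            pose, confidence = estimate_pose(relocalizer, img_id)
            if confidence > confidence_threshold:
                new_poses[img_id] = pose
        if not new_poses:
            break                                 # nothing could be registered confidently
        registered.update(new_poses)
        registered = refine_poses(registered)     # jointly refine poses of registered images
        relocalizer = train_relocalizer(registered)  # retrain on the grown image set
    return registered
```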

"DoubleTake: Geometry Guided Depth Estimation” to be presented at ECCV 2024

DoubleTake: Geometry-Guided Depth Estimation introduces a novel approach for estimating the geometry of 3D scenes. Building on Niantic’s earlier work, SimpleRecon, DoubleTake first estimates a depth map for each image and then integrates these maps into a coherent 3D reconstruction. The key innovation in DoubleTake is its ability to use the current 3D reconstruction to improve the accuracy of subsequent depth estimates. This method establishes a new state-of-the-art in both depth estimation and 3D scene reconstruction accuracy.
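A structural sketch of that feedback loop is below; depth_network, render_hint and fuse are hypothetical stand-ins passed as parameters, not DoubleTake's actual interfaces.

```python
# Sketch of a geometry-guided depth loop: each new depth prediction is
# conditioned on the reconstruction built so far (hypothetical callables).
def geometry_guided_reconstruction(frames, depth_network, render_hint, fuse, recon=None):
    for frame in frames:
        # Render the current partial reconstruction into this view as a geometric hint.
        hint = render_hint(recon, frame) if recon is not None else None
        # Predict a depth map for the frame, guided by the hint when available.
        depth = depth_network(frame, hint)
        # Fuse the new depth map into the evolving 3D reconstruction (e.g., TSDF fusion).
        recon = fuse(recon, depth, frame)
    return recon
```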

“GroundUp: Rapid Sketch-Based 3D City Massing” to be presented at ECCV 2024

GroundUp
GroundUp: Rapid Sketch-Based 3D City Massing enables users to create new 3D building designs simply by sketching. GroundUp uses a generative diffusion model to create plausible reconstructions that respect the user’s perspective sketches of building blocks. It is the first such work to allow interactive, iterative sketch-based reconstruction for use by both architects and novices.

“HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning” to be presented at ECCV 2024

HandDGP

HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning enables accurate 3D hand mesh predictions in the camera’s 3D space. While most state-of-the-art methods focus on making hands look correct in 2D, they often neglect the global positioning of the hand in the 3D scene; realistic virtual object interactions in AR/VR, however, require precise 3D predictions. HandDGP addresses the scale-depth ambiguity of estimating 3D geometry from 2D images by introducing an end-to-end differentiable module that lets neural networks learn directly in camera space rather than in relative spaces. HandDGP is the current state of the art in camera-space 3D hand mesh prediction and is the key component of the real-time hand mesh tracking feature in Niantic’s 8th Wall WebAR development kit.
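As a hedged illustration of how a differentiable global positioning step can resolve the scale-depth ambiguity (a generic formulation, not the exact HandDGP module): given root-relative 3D joints and their 2D detections in normalized image coordinates, the unknown camera-space root translation is the solution of a small linear least-squares problem, which is differentiable and can therefore sit inside an end-to-end trained network.

```python
# Generic differentiable lifting of root-relative joints into camera space,
# not HandDGP's exact module: solve for the root translation t such that the
# translated 3D joints project onto the observed 2D keypoints.
import numpy as np

def solve_root_translation(joints_rel, joints_2d_norm):
    """joints_rel: (N,3) root-relative 3D joints; joints_2d_norm: (N,2) keypoints
    in normalized camera coordinates (pixels pre-multiplied by K^-1)."""
    X, Y, Z = joints_rel[:, 0], joints_rel[:, 1], joints_rel[:, 2]
    u, v = joints_2d_norm[:, 0], joints_2d_norm[:, 1]
    # Projection constraint: (X + tx)/(Z + tz) = u and (Y + ty)/(Z + tz) = v,
    # rearranged into a linear system A @ [tx, ty, tz] = b.
    A = np.zeros((2 * len(X), 3))
    b = np.zeros(2 * len(X))
    A[0::2, 0] = 1.0
    A[0::2, 2] = -u
    b[0::2] = u * Z - X
    A[1::2, 1] = 1.0
    A[1::2, 2] = -v
    b[1::2] = v * Z - Y
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t  # camera-space root translation; joints_rel + t is the camera-space hand
```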

