State of the ARt in Relocalization with Machine Learning
by Eric Brachmann
Niantic Research lets neural networks learn what the world looks like to enable camera-based relocalization with high accuracy and reduced cost.
The ACE relocalizer improves on DSAC* by 300x, making it production-ready. How do we know? It’s been in production in Lightship VPS for more than a year.
ACE is one of five papers from Niantic Research to be presented at CVPR 2023. To explore the code and research in depth, see our project page.
At Niantic, we strive to make the real world more magical. By constantly pushing computer vision technology forward and wrapping it in platform tools and services that leverage ubiquitous smartphone cameras, we unleash the creativity of developers worldwide to build believable augmented reality (AR) experiences.
With the ACE relocalizer, we present one crucial building block for such systems. Believable AR needs a highly precise understanding of the location and point of view of the end user’s device. When we alter and augment the world with virtual content, we want that content to stick in place and blend in perfectly - even if a different user revisits the location months later. Phones are equipped with GPS and IMU sensors which, under good conditions, locate a device to within a few meters. That’s not good enough for immersive AR. We require estimates precise to the centimeter in position and to the degree in orientation. This is where visual relocalization using the phone’s camera comes into play.
ACE (Accelerated Coordinate Encoding) is a new visual relocalization algorithm, powered by machine learning, that creates maps in minutes, relocalizes in milliseconds, all while pushing the limits of accuracy beyond industry standards.
Visual relocalization works in two stages. Firstly, a 3D map of an environment is built from a collection of images with known poses. This stage is called the mapping stage. Secondly, a new query image comes in, potentially from a different user, potentially a long time after the map has been built. This query image is matched against the 3D map to infer its precise pose. This stage is called the relocalization stage.
Visual relocalization has been around for decades. Traditional approaches build maps by identifying recognizable key points, such as corners, across the mapping images and projecting them to 3D. The resulting maps are sparse 3D point clouds, akin to a 3D model of the environment. At the relocalization stage, traditional approaches identify key points of the map in the query image. The pose is then calculated such that the image-to-map correspondences align.
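To make the pose-solving step concrete, here is a minimal sketch of how a camera pose can be recovered from 2D-3D correspondences with PnP and RANSAC. It uses OpenCV and synthetic stand-in data, and illustrates the general principle rather than the exact pipeline running in any Niantic system.

```python
# Minimal PnP + RANSAC sketch with synthetic stand-in correspondences.
# In a real relocalizer, points_3d come from the map and points_2d from
# key points detected and matched in the query image.
import numpy as np
import cv2

points_3d = np.random.uniform(-1, 1, (100, 3)).astype(np.float32)        # map points (meters)
camera_matrix = np.array([[500, 0, 320], [0, 500, 240], [0, 0, 1]], np.float32)
dist_coeffs = np.zeros(4, np.float32)                                     # assume no lens distortion
rvec_gt = np.array([0.1, -0.2, 0.05], np.float32)                         # "true" rotation (axis-angle)
tvec_gt = np.array([0.0, 0.0, 2.0], np.float32)                           # "true" translation
points_2d, _ = cv2.projectPoints(points_3d, rvec_gt, tvec_gt, camera_matrix, dist_coeffs)

# Recover the pose so that the image-to-map correspondences align; RANSAC rejects outliers.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(points_3d, points_2d, camera_matrix, dist_coeffs)
print("recovered rotation:", rvec.ravel(), "translation:", tvec.ravel())
```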
Machine learning and neural networks are ubiquitous in computer vision today, and visual relocalization is no exception. Neural networks are frequently used to find better key points, or to improve image-to-map matching.
ACE is different. ACE learns the map.
ACE follows a more radical approach that replaces the 3D map with a neural network altogether. Instead of reconstructing a point cloud, we ask the neural network to come up with an internal model of the 3D world that is consistent with all our mapping images. Given a new query image, the neural network can tell us precisely, for each individual pixel, what the corresponding point in scene space is, and we can infer the camera pose as before by aligning correspondences.
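As a rough illustration of this idea of scene coordinate regression (and not the actual ACE architecture), the sketch below shows a tiny regression head that maps per-pixel image features to 3D scene coordinates. Pairing each prediction with its pixel location yields the dense 2D-3D correspondences from which the pose can be solved, as in the PnP sketch above.

```python
# Conceptual sketch of scene coordinate regression (illustrative stand-in network).
import torch
import torch.nn as nn

class SceneCoordinateHead(nn.Module):
    def __init__(self, feature_dim=512):
        super().__init__()
        # 1x1 convolutions: each pixel's feature is regressed to an (x, y, z) scene point.
        self.regressor = nn.Sequential(
            nn.Conv2d(feature_dim, 256, 1), nn.ReLU(),
            nn.Conv2d(256, 3, 1),
        )

    def forward(self, features):          # features: B x C x H x W
        return self.regressor(features)   # scene coordinates: B x 3 x H x W

features = torch.randn(1, 512, 60, 80)    # backbone features for one query image (stand-in)
coords = SceneCoordinateHead()(features)  # one predicted 3D scene point per feature location
# Each predicted (x, y, z), paired with its pixel, is a 2D-3D correspondence;
# the camera pose then follows from PnP + RANSAC as in the snippet above.
```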
The neural network is extremely lightweight, with a memory footprint of merely 4MB to represent the entire map. The network is also privacy preserving. Given only the network, it is currently impossible to tell what the mapped scene looks like - not even whether it was recorded indoors or outdoors. All visualizations we show here are only possible because we have access to the original mapping images. ACE is not only fast in mapping but also in relocalization: it runs at 40 frames per second on a GPU, or at up to 20 frames per second on modern smartphones.
ACE builds on a previous learning-based approach: DSAC* (Differentiable Sample Consensus, with the asterisk signifying the latest version). DSAC* shines in terms of accuracy on academic benchmarks, but training its network takes hours or days. Clearly this makes it impractical for most AR applications, and very costly to scale. Niantic researchers found a way to speed up the network training by a factor of 300, while keeping accuracy on par. Where DSAC* teaches the neural network one image at a time, ACE includes information from all mapping images in every single training iteration. After a few iterations, the network has already built a good representation of the scene.
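The sketch below illustrates that training idea in a simplified form, with stand-in features, stand-in 3D targets and a stand-in loss: per-pixel features from all mapping images are pooled once into a shuffled buffer, so every gradient step mixes pixels from many images. The real ACE head and objective differ; in particular, ACE supervises its predictions with a reprojection loss against the known mapping poses rather than explicit 3D targets.

```python
# Simplified sketch of buffer-based training (stand-in data, not the real ACE code).
import torch

num_samples, feature_dim = 100_000, 512
buffer_features = torch.randn(num_samples, feature_dim)   # per-pixel features from ALL mapping images
buffer_targets = torch.randn(num_samples, 3)               # stand-in 3D supervision targets

head = torch.nn.Sequential(                                 # small regression head (stand-in)
    torch.nn.Linear(feature_dim, 256), torch.nn.ReLU(), torch.nn.Linear(256, 3))
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

for step in range(1000):                                    # training finishes in minutes, not hours
    idx = torch.randint(0, num_samples, (1024,))            # each batch mixes pixels from many images
    loss = torch.nn.functional.l1_loss(head(buffer_features[idx]), buffer_targets[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```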
ACE has already been in production for more than one year.
ACE has been in production since the launch of Lightship VPS (Visual Positioning System) one year ago (May 2022), and by now supports relocalization at almost 200,000 VPS locations worldwide and counting. ACE is deployed alongside a traditional relocalizer to combine the strengths of both approaches: VPS returns the result of whichever algorithm produces a confident response first. Our analysis shows that ACE has been responsible for the majority of VPS relocalizations to date.
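The snippet below is a hedged sketch of that serving strategy, with hypothetical function names and thresholds (the actual VPS service logic is not public): run both relocalizers concurrently and return whichever produces a confident pose first.

```python
# Hypothetical sketch: race two relocalizers and keep the first confident answer.
from concurrent.futures import ThreadPoolExecutor, as_completed

def relocalize_hybrid(query_image, ace_relocalizer, traditional_relocalizer,
                      confidence_threshold=0.9):
    # Each relocalizer is assumed to return a (pose, confidence) pair.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(ace_relocalizer, query_image),
                   pool.submit(traditional_relocalizer, query_image)]
        for future in as_completed(futures):
            pose, confidence = future.result()
            if confidence >= confidence_threshold:
                return pose            # first confident response wins
    return None                        # neither relocalizer was confident
```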
We are excited to share the ACE source code on GitHub following our commitment to open science. We greatly welcome feedback from the research community to further improve ACE. In particular, we hope that ACE gives smaller research groups with limited computational resources the opportunity to participate in state-of-the-art learning-based relocalization research.
We will officially present ACE to the research community at CVPR in June. CVPR (the Conference on Computer Vision and Pattern Recognition) is the premier computer vision conference and the 4th-ranked scientific publication across all fields of science. The ACE paper was selected as a “highlight” by CVPR, a distinction awarded to only 2.6% of all submissions. Niantic Research has a strong presence at this year’s CVPR with no fewer than five accepted papers, spanning diverse topics such as depth estimation, neural radiance fields, image matching and visual relocalization, which we will describe in another blog post coming soon.
Eric Brachmann is a staff scientist at Niantic. He developed ACE in collaboration with Tommaso Cavallari, a staff scientist at Niantic, and Victor Prisacariu, Niantic Chief Scientist.