November 10, 2022
Lightship VPS: Building Our 3D Map From Crowdsourced Scans

In Part 1, we covered the beginning steps: data collection and processing. Here, we explain how Niantic processes that data, how we build a human-readable map with computer vision, and how it will all ultimately lead to the 3D AR map of the future.

Creating a 3D map of the world and putting virtual objects into real-world settings is complex, entailing many stages that require a network of data pipelines, AI, algorithms and servers. It also requires reams of up-to-date data to create an accurate, useful and appealing platform for the real-world metaverse.

As we discussed in Part 1, the data for our map comes from millions of scans contributed to the VPS platform by Pokémon GO and Ingress players and by developers who use our one-of-a-kind community mapping program, Niantic Wayfarer. They add scans all the time at Wayspots, which currently include 17 million points of interest worldwide chosen by players. Each scan is a 15- to 30-second video clip, equivalent to 300 images we use to build the 3D map. We’ve built out more than 120,000 VPS-activated Wayspots so far.

Once the videos are uploaded from a user’s phone, we take a number of steps to protect the privacy of individuals, including automatically anonymizing the scans as soon as they reach our secure servers. The videos then go into our AI mapping system and are broken into individual frames. From the moment of capture, we collect information including the image, the GPS location and sensor data, and the camera’s position relative to the objects in the frame. At this step in the process, there is not enough information to accurately determine the position of the scan; that happens later, toward the end of building the map.
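As a rough illustration of that frame-splitting step, here is a minimal sketch that decodes an already-anonymized clip into individual frames with OpenCV. The function name and sampling interval are illustrative, not part of Niantic’s actual pipeline.

```python
import cv2  # pip install opencv-python

def split_video_into_frames(video_path: str, every_nth: int = 3) -> list:
    """Decode an (already anonymized) scan video into individual frames.

    A 15- to 30-second clip sampled this way yields a few hundred frames,
    on the order of the 300 images per scan mentioned above. The sampling
    interval here is illustrative only.
    """
    frames = []
    capture = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_nth == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```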

Process for developing the VPS map

Indexing visual information

Creating 3D maps is a complex endeavor, so below we walk through the five stages our system goes through to produce the building blocks of our 3D maps.

Stage 1 - Splitting scans

To index visual information, the scans go through an algorithm-specific mapping pipeline, which helps us handle a complex problem that arises when the position estimate from the mobile phone is incorrect. When a Wayspot is large, a user’s scan can cover a lot of ground, which makes it prone to position-estimation drift (errors accumulate over time as the phone tracks its own location). To contain those failures, such scans are split into multiple smaller nodes, as sketched below.
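For intuition, here is a minimal sketch of one way such a split could be implemented, assuming the on-device tracking trajectory is available as a list of camera positions. The threshold and function name are hypothetical, not Niantic’s actual values.

```python
import numpy as np

def split_scan_by_travel(poses: np.ndarray, max_path_length_m: float = 15.0) -> list:
    """Split a scan's camera trajectory into smaller nodes.

    `poses` is an (N, 3) array of camera positions from on-device tracking.
    Tracking drift tends to grow with distance traveled, so a new node is
    started whenever the accumulated path length exceeds a threshold.
    The 15 m threshold is illustrative only.
    """
    nodes, current, travelled = [], [0], 0.0
    for i in range(1, len(poses)):
        travelled += float(np.linalg.norm(poses[i] - poses[i - 1]))
        current.append(i)
        if travelled > max_path_length_m:
            nodes.append(current)
            current, travelled = [i], 0.0  # one-frame overlap keeps nodes connected
    if len(current) > 1 or not nodes:
        nodes.append(current)
    return nodes  # each node is a list of frame indices that gets mapped independently
```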

Stage 2 - Location mapping

Determining location is one of the most complex computer vision problems we are solving: GPS provides only a rough position, whereas VPS determines location with centimeter-level accuracy. In this processing stage, we transform each split scan into specialized 3D maps that serve our various localization approaches. Each of these maps can be used by our localizers to determine the position and orientation of images taken from similar viewpoints.
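To illustrate what a localizer does with such a map, the sketch below takes the standard textbook route: match 2D image keypoints to 3D map points, then solve the Perspective-n-Point problem with RANSAC using OpenCV. Niantic’s production localizers are not necessarily implemented this way.

```python
import cv2
import numpy as np

def localize_image(keypoints_2d: np.ndarray,
                   matched_points_3d: np.ndarray,
                   camera_matrix: np.ndarray):
    """Estimate a camera's position and orientation against a 3D map.

    Assumes the 2D keypoints have already been matched to 3D map points
    (e.g. via feature descriptors). RANSAC rejects bad matches while
    solving for the pose.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        matched_points_3d.astype(np.float64),
        keypoints_2d.astype(np.float64),
        camera_matrix,
        None,                      # assume no lens distortion for simplicity
        reprojectionError=4.0,     # pixel threshold; illustrative value
    )
    if not ok:
        return None
    rotation, _ = cv2.Rodrigues(rvec)          # 3x3 camera orientation
    position = (-rotation.T @ tvec).ravel()    # camera center in map coordinates
    return position, rotation
```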

Stage 3 - Connected components

We then determine how the 3D-mapped scans relate to one another. To keep the combinatorial complexity of this task manageable, we restrict the search radius using the original GPS information.
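A minimal sketch of that GPS gating, with an illustrative 50-meter radius:

```python
import math

def gps_distance_m(a, b):
    """Approximate distance in meters between two (lat, lon) fixes (haversine)."""
    earth_radius = 6_371_000.0
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius * math.asin(math.sqrt(h))

def candidate_pairs(scan_gps, radius_m=50.0):
    """Return the scan pairs worth testing for a geometric match.

    Only scans whose GPS fixes lie within `radius_m` of each other are
    considered, which keeps the number of pairwise comparisons manageable.
    The radius is illustrative, not Niantic's actual value.
    """
    n = len(scan_gps)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if gps_distance_m(scan_gps[i], scan_gps[j]) <= radius_m]
```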

At this point, we have the equivalent of one very long sequence, roughly 5 or 10 minutes of capture made by combining the scans users contributed at that location. We then minimize error and reject outliers via a global bundle adjustment to ensure each connected component is geometrically coherent.
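For readers unfamiliar with bundle adjustment, the sketch below shows the core idea: jointly refine camera poses and 3D points by minimizing reprojection error, with a robust loss that down-weights outlier observations. It follows the standard SciPy least-squares formulation and is not Niantic’s production code; the pinhole model here ignores distortion and the principal point.

```python
import numpy as np
from scipy.optimize import least_squares

def rotate(points, rvecs):
    """Rotate `points` by per-observation Rodrigues vectors `rvecs` (both (N, 3))."""
    theta = np.linalg.norm(rvecs, axis=1, keepdims=True)
    with np.errstate(invalid="ignore"):
        axis = np.nan_to_num(rvecs / theta)
    dot = np.sum(points * axis, axis=1, keepdims=True)
    cos, sin = np.cos(theta), np.sin(theta)
    return cos * points + sin * np.cross(axis, points) + dot * (1 - cos) * axis

def residuals(params, n_cams, n_pts, cam_idx, pt_idx, observed_2d, focal):
    """Reprojection error for every 2D observation under a simple pinhole model."""
    cams = params[: n_cams * 6].reshape(n_cams, 6)   # per camera: rvec (3) + tvec (3)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    p = rotate(pts[pt_idx], cams[cam_idx, :3]) + cams[cam_idx, 3:]
    projected = focal * p[:, :2] / p[:, 2:3]         # u = f*x/z, v = f*y/z
    return (projected - observed_2d).ravel()

# x0 stacks the initial camera parameters and 3D points; the 'huber' loss
# down-weights outlier observations during the fit (outlier rejection in its
# simplest form).
# result = least_squares(residuals, x0, loss="huber", f_scale=1.0,
#                        args=(n_cams, n_pts, cam_idx, pt_idx, observed_2d, focal))
```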

The point cloud seen here is a map generated by Niantic. The colors of the point cloud correspond to unique user scans submitted by players and developers.

Before we move on to stages 4 and 5, it’s important to understand that we concurrently build two kinds of maps that serve different purposes. One allows you to position yourself in the world captured by the user scans, but it is data, not a visualization. The other allows you to see the world (the geometry, the texture, and what’s in front of the camera) and is described below in stages 4 and 5. Just as architects work from blueprints that are more complicated than the floor plans they show their clients, the first map is a highly technical artifact that only computers can read, while the second is consumable by everyone. Yet both kinds of maps are necessary to produce the final result.

Stage 4 - 3D dense reconstruction

For every image in our long, coherent sequence, we compute a depth map using ManyDepth. This operation estimates the distance from the camera to the objects at every pixel. Using the redundancy in our sequence, we can filter out noise and obtain a fused 3D mesh, composed of a set of triangles along with a high-resolution texture.
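As a small illustration of the first step, the sketch below back-projects a single depth map into a 3D point cloud using the camera intrinsics; the fusion of many such clouds into a textured mesh is omitted, and the function is a generic example rather than Niantic’s code.

```python
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) depth map into a 3D point cloud.

    `depth` holds the per-pixel metric distance along the camera's optical
    axis, as predicted by a network such as ManyDepth. Posing many of these
    clouds with the localization maps above and fusing them is what yields
    the final triangle mesh.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    points = np.stack([x, y, depth], axis=-1)   # (H, W, 3) in camera coordinates
    return points.reshape(-1, 3)
```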

It’s worth noting here that this data is available via the 8th Wall WebAR Platform, which means developers don’t have to build a 3D map from scratch; they can focus on building apps that work within the same mapped world as Pokémon GO. Developers can also create their own 3D maps of different locations, and if someone needs a new Wayspot, they can add it to the database via Wayfarer.

Images depict a 3D model of a Niantic Wayspot in Lawrence, Kansas, generated from user scans. On the left we have a textured 3D model which can be used to author content. On the right is our semantic segmentation (Blue: ground, Gray: unlabelled).

Stage 5 - Context + semantics

In addition to depth, we can estimate semantic information, meaning we can classify whether a pixel in an image corresponds to a tree, the ground, or a building. Using similar fusion techniques, we create semantic data layers for our 3D model that are coherent across the sequence. This helps developers determine how their applications should interact with the real world.
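A toy sketch of that kind of fusion, assuming a hypothetical label set and per-frame class predictions already associated with each 3D map point; the real fusion is certainly more involved.

```python
from collections import Counter

# Hypothetical label ids; the real class taxonomy may differ.
LABELS = {0: "unlabelled", 1: "ground", 2: "building", 3: "tree"}

def fuse_semantics(votes_per_point: list) -> list:
    """Fuse per-frame semantic predictions onto 3D map points.

    Each map point is observed in many frames. A simple majority vote over
    those per-frame predictions yields a label that is coherent across the
    whole sequence.
    """
    fused = []
    for votes in votes_per_point:
        fused.append(Counter(votes).most_common(1)[0][0] if votes else 0)
    return fused

# A point seen as "ground" in three frames and "unlabelled" in one:
# fuse_semantics([[1, 1, 0, 1]]) -> [1], i.e. LABELS[1] == "ground"
```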

Cloud computing for 3D maps to be used every day

We rely on cloud computing for everything. It allows us to do large-scale parallel processing of millions of scans: we can run all nodes at once and, within each node, push every pixel through a GPU.
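Conceptually, the per-node work fans out like the stand-in below, which uses a local Python process pool; in production this would be a cloud batch or queue system with GPU workers, and the function names are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

def process_node(node_id: str) -> str:
    """Stand-in for one mapping job: split scans, build localization maps,
    and run dense reconstruction and semantics for a single node."""
    return f"{node_id}: done"

def process_all_nodes(node_ids: list, workers: int = 8) -> list:
    """Run the independent per-node jobs in parallel.

    Locally this is a process pool; the same fan-out pattern is what lets
    millions of scans be processed at once on cloud infrastructure.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_node, node_ids))

if __name__ == "__main__":
    print(process_all_nodes([f"wayspot-{i}" for i in range(4)]))
```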

In addition, many of these operations are highly parallelizable and benefit from running on modern GPUs, which has given us significant performance boosts. We need that heavy processing power to build the world’s most dynamic 3D map, one that will change the stakes for game and app developers for years to come.

The stages you see above ensure everyday people can use the maps. They also show the complex process that is necessary to create a 3D map of the world that’s accurate, useful and appealing for developers. In the finale of our three-part blog series, we’ll describe our API Service and detail how it works. Stay tuned!

We’re excited about our progress so far, but we are just getting started. If you want to join in the effort, check out our jobs page.

For more information about how Niantic protects personal data, please see our privacy policy.

– by Chief Scientist Victor Prisacariu and Director of Engineering Pierre Fite-Georgel

