Evolving Niantic AR Mapping Infrastructures

September 6, 2023

Ray, an open-source framework for scaling complex distributed workloads, offers straightforward yet flexible programming abstractions that make it easy to build and scale distributed applications.
Niantic achieved improvements in its scan processing pipeline, mapping pipeline and algorithm benchmark system.
Niantic will present at the Ray Summit in San Francisco, September 18-20.

At Niantic, we build and leverage emerging technologies to enrich and enhance our shared experiences as human beings in the physical world. Our games and applications incentivize people to explore the world, and our development of new tools, services and workflows evolves to support our growing platform business. What makes our work truly unique are the contributions our players and users make by scanning locations of their favorite places to play around the globe. Those millions of scans allow us to map the world for everyone, creating human-centric 3D maps of the most interesting places to explore.

With a large number of scans (short videos of public locations; please refer to our blog post for more details) submitted daily from all over the world, we need scalable, high-performance, efficient and reliable pipelines to process these scans and build 3D maps from them.

The point cloud of a POI (point of interest) in San Francisco

Engineering Challenges

Research to production
Our computer vision researchers and engineers are actively engaged in researching and developing innovative mapping and localization algorithms. Once a new algorithm is ready, server engineers deploy it in production, ensuring optimal efficiency and performance.

Often these three groups of people use different environments and toolchains for development. For instance, researchers and CV engineers might use their local laptops or desktops, while server engineers work with cloud infrastructures. These inconsistencies have significantly slowed down the journey from research to production. We’ve observed considerable back-and-forth when deploying and debugging new algorithms, resulting in a waste of engineering time and resources. Therefore it is essential to shorten the feedback loop and streamline the research-to-production process, making it easier and more efficient.

Highly distributed workloads with heterogeneous compute

To create AR maps from user scans, we rely on two pipelines: scan processing and map building.

The scan processing pipeline consists of several steps, including metadata validation, running AI models to blur faces and license plates to preserve privacy, scan quality assessment, scan indexing and preview generation.

As different steps have distinct resource requirements, various numbers of CPU and GPU workers may be necessary.

The mapping pipeline consists of a series of steps that create individual maps from anonymized scans, create connected components from the individual maps, apply optimizations, and merge maps for best performance and efficiency.

Both pipelines operate as distributed systems, with certain steps further subdivided into sub-distributed systems. For example, each alignment job requires multiple workers.

Due to algorithm requirements and efficiency and performance considerations, the pipeline steps run on various hardware platforms, e.g. Skylake and Cascade Lake CPUs and NVIDIA T4 and A100 GPUs. Each pipeline instance may utilize hundreds of Kubernetes nodes, making scalability a crucial factor.

The pipelines are updated frequently due to new requirements from algorithms and products. Consequently, streamlining the development of the distributed pipelines and making it easy for engineers to validate their work on local machines saves engineering time and cost.

Performance and efficiency

Usually AR developers scan a location, upload the scans to the cloud, and await the AR mapping completion before creating AR experiences. The faster scan processing and map building are, the sooner developers can proceed.

Mapping and localization algorithms tend to be resource-intensive. To reduce cost we need fast and efficient resource allocation and scheduling to improve overall utilization.

Scan data sizes typically range from tens to hundreds of MBs, and the data needs to be passed between machines, so efficient data transfer is important. We noticed that GPU machine utilization was not high because a significant amount of time was spent waiting for data to be ready.

Creating Highly Scalable, Performant, Developer-friendly Distributed Systems

As we process more scans, build more maps and support additional use cases, we’ve found the existing systems insufficient and decided to seek novel solutions. A key requirement is to create high-performance, scalable, and developer-friendly distributed systems, ideally allowing our CV engineers to test their improvements using our production framework without the current deployment burden. Ray, an open-source framework for scaling complex distributed workloads, has proven to be a standout option. Ray offers straightforward yet flexible programming abstractions that make it easy to build and scale distributed applications.

Over the past few months we’ve revamped both the scan processing and mapping pipelines, as well as an algorithm benchmark system used by algorithm developers, which performs both mapping and localization. Compared to the legacy systems, there are a few major improvements:

One code for all

By leveraging Ray’s programming abstractions and serverless capabilities the new systems are designed to be highly composable. For each system we created a few basic building blocks which can be run both locally and remotely, and easily assembled to achieve different goals.

With this design a developer can debug a component locally and later verify it in the cloud with only configuration changes, which was not possible before.

Composability
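To make the "one code for all" idea concrete, here is a minimal sketch of a building block that runs either in-process or as Ray tasks depending on a single configuration value. `RunConfig`, `run_step`, `run_pipeline` and the two toy steps are hypothetical names invented for this illustration and are not Niantic's actual code; only `ray.init`, `ray.is_initialized`, `ray.remote`, `.options(...)`, `.remote(...)` and `ray.get` are real Ray calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class RunConfig:
    """Hypothetical config: the only thing that changes between local and cloud runs."""
    backend: str = "local"  # "local" (plain function calls) or "ray" (remote tasks)
    num_cpus: float = 1.0   # logical CPUs requested per task (Ray backend only)
    num_gpus: float = 0.0   # logical GPUs requested per task (Ray backend only)


def run_step(step: Callable, items: Iterable, cfg: RunConfig) -> List:
    """Apply one building block to every item, in-process or as Ray tasks."""
    if cfg.backend == "local":
        return [step(item) for item in items]
    if cfg.backend == "ray":
        import ray  # imported lazily so local debugging does not need Ray

        if not ray.is_initialized():
            ray.init()  # start or connect to a Ray runtime
        remote_step = ray.remote(step).options(
            num_cpus=cfg.num_cpus, num_gpus=cfg.num_gpus
        )
        refs = [remote_step.remote(item) for item in items]
        return ray.get(refs)
    raise ValueError(f"unknown backend: {cfg.backend!r}")


def run_pipeline(items: Iterable, steps: List[Callable], cfg: RunConfig) -> List:
    """Chain building blocks; each step consumes the previous step's output."""
    results = list(items)
    for step in steps:
        results = run_step(step, results, cfg)
    return results


# Toy stand-ins for real pipeline steps (assumptions for illustration only).
def validate_metadata(scan: dict) -> dict:
    return {**scan, "valid": scan.get("frames", 0) > 0}


def assess_quality(scan: dict) -> dict:
    return {**scan, "quality": min(1.0, scan["frames"] / 100)}


scans = [{"id": "a", "frames": 50}, {"id": "b", "frames": 0}]
local_results = run_pipeline(scans, [validate_metadata, assess_quality], RunConfig())
```

Switching to `RunConfig(backend="ray")` would send the same step functions to a Ray cluster without touching the step code, which is the property the paragraph above describes.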

Distributed computing made easy

With Ray’s programming abstractions we can effortlessly develop distributed applications. For example, setting up a cluster for the alignment step, which previously required hundreds of lines of YAML configuration, can now be accomplished with a simple Python for loop in fewer than 30 lines of code.

Thanks to this simplicity, enhancing parallelism becomes a straightforward task. For example, when conducting algorithm benchmarks in the legacy system, we passed in a set of POIs and assigned each POI an individual compute group with a similar amount of computational resources. However, different POIs often have varying amounts of data, and this imbalance in workload distribution increased the overall completion time.

Legacy benchmark system

In the new systems a developer thinks about cluster management in terms of resources while Ray handles all the scheduling and autoscaling based on these resource requests. Therefore we are able to utilize a single compute group and also significantly reduce the total time benchmarks take, as larger POIs now receive the necessary resources when needed.

New benchmark system
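The effect of moving from per-POI compute groups to a single shared pool can be illustrated with a toy scheduling model. This is a simplified sketch, not Ray's scheduler: it ignores autoscaling, task dependencies and startup costs, and the POI names, task costs and worker counts are made-up assumptions.

```python
import heapq
from typing import Dict, List


def makespan(task_costs: List[float], workers: int) -> float:
    """Finish time when each task goes to whichever worker becomes free first."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for cost in task_costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


def per_poi_groups(pois: Dict[str, List[float]], workers_per_poi: int) -> float:
    """Legacy-style: every POI gets its own fixed-size group; the slowest POI decides."""
    return max(makespan(tasks, workers_per_poi) for tasks in pois.values())


def shared_pool(pois: Dict[str, List[float]], total_workers: int) -> float:
    """New-style: all tasks draw from one pool holding the same total number of workers."""
    all_tasks = sorted((c for tasks in pois.values() for c in tasks), reverse=True)
    return makespan(all_tasks, total_workers)


# Three hypothetical POIs with 4, 8 and 24 unit-cost tasks.
pois = {"small": [1.0] * 4, "medium": [1.0] * 8, "large": [1.0] * 24}

legacy_time = per_poi_groups(pois, workers_per_poi=4)  # 3 groups x 4 workers
pooled_time = shared_pool(pois, total_workers=12)      # same 12 workers, shared
print(legacy_time, pooled_time)  # 6.0 3.0
```

In this model the large POI keeps its four workers busy for six time units while the other eight workers sit idle, whereas the shared pool spreads the same 36 tasks evenly across all twelve.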

Non-performance-critical workloads are often scheduled onto spot instances to optimize cost savings. To reduce the impact of potential spot instance failures, we leverage the new systems’ flexibility to easily tune the size of each task. By breaking down tasks into smaller, more manageable parts, we can better adapt to the unpredictable nature of spot instances and ensure minimal disruption in case of a failure.
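As a rough worked example of why smaller tasks help on spot instances, the sketch below counts how many scans are processed in total when one task is preempted and rerun. The function names and numbers are illustrative assumptions, and the model takes the worst case in which the preemption happens just before the task finishes.

```python
from typing import List, Sequence, Set


def split_into_tasks(scan_ids: Sequence[str], task_size: int) -> List[List[str]]:
    """Split a batch of scans into tasks of at most `task_size` scans each."""
    if task_size < 1:
        raise ValueError("task_size must be at least 1")
    return [list(scan_ids[i:i + task_size]) for i in range(0, len(scan_ids), task_size)]


def scans_processed(tasks: List[List[str]], preempted: Set[int]) -> int:
    """Total scans processed when each task index in `preempted` is lost once and rerun."""
    total = 0
    for index, task in enumerate(tasks):
        attempts = 2 if index in preempted else 1  # a preempted task is redone in full
        total += attempts * len(task)
    return total


scan_ids = [f"scan-{i}" for i in range(100)]
coarse = split_into_tasks(scan_ids, task_size=50)
fine = split_into_tasks(scan_ids, task_size=10)

coarse_cost = scans_processed(coarse, preempted={0})
fine_cost = scans_processed(fine, preempted={0})
print(coarse_cost, fine_cost)  # 150 110
```

With a single preemption, the coarse split redoes 50 scans while the fine split redoes only 10. Smaller tasks do add per-task scheduling overhead, so the right size is a trade-off.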

Efficient data transfer

In the legacy systems, due to the absence of a more efficient communication protocol, components were connected through pub/sub, file systems, databases, and cloud storage, which tend to be slow and inefficient.

In contrast, the new system employs Ray with gRPC for seamless communication between components. For processes that demand less availability, such as low-cost steps less susceptible to machine failures, we store scan data in Ray’s in-memory distributed object store and transfer data between machines in a P2P manner. Cloud VMs typically have network bandwidth of 10-20 Gbps or more, so a transfer between machines is much faster than retrieving a scan from cloud storage, which usually takes more than 10 seconds per scan. For processes requiring higher availability, we download scans to NFS for sharing among data readers.

The more efficient data transfer also reduces machine idle time and improves overall resource utilization.

Fewer downloads from cloud storage and faster data transfer
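A back-of-the-envelope calculation shows the size of the gap. The sketch below assumes a 200 MB scan (an illustrative value within the "hundreds of MBs" range mentioned earlier), decimal units, and an ideal 10 Gbps link with no protocol or serialization overhead, so real transfers will be somewhat slower.

```python
def transfer_seconds(size_mb: float, bandwidth_gbps: float) -> float:
    """Ideal time to move `size_mb` megabytes over a `bandwidth_gbps` link."""
    megabits = size_mb * 8                     # 1 byte = 8 bits
    megabits_per_second = bandwidth_gbps * 1000
    return megabits / megabits_per_second


SCAN_SIZE_MB = 200          # assumed scan size
CLOUD_STORAGE_SECONDS = 10  # lower bound quoted above for a cloud storage fetch

p2p_seconds = transfer_seconds(SCAN_SIZE_MB, bandwidth_gbps=10)
speedup = CLOUD_STORAGE_SECONDS / p2p_seconds

print(f"P2P transfer: {p2p_seconds:.2f} s")      # P2P transfer: 0.16 s
print(f"Speedup vs. cloud storage: {speedup:.1f}x")  # Speedup vs. cloud storage: 62.5x
```

Even allowing for generous overhead, a machine-to-machine transfer at these bandwidths finishes well under a second, against the 10 seconds or more quoted for cloud storage.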

More computing / IO overlapping

Often steps are not entirely compute-bound. For instance, algorithms may need to download maps or scans during processing, which can be IO-intensive. In the new systems, multiple tasks can be scheduled onto the same Kubernetes Pod (Ray node), rather than one task per Pod, increasing the chance of computing / IO overlapping. This approach has led to noticeable improvements in resource utilization. Note that the logical resource requirements for each task need to be carefully chosen to avoid oversubscription.

A simplified example
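The overlap effect can be reproduced in miniature with the Python standard library. In this sketch a thread pool stands in for several tasks sharing one Pod; it is not how Ray runs tasks (Ray uses separate worker processes), and the 0.2-second sleep is an assumed stand-in for a scan download.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Tuple


def process_scan(scan_id: int) -> int:
    """One task: wait on simulated I/O, then do a small amount of compute."""
    time.sleep(0.2)  # simulated download; the CPU is idle while waiting
    return sum(i * i for i in range(50_000)) + scan_id


def run_one_at_a_time(scan_ids: Iterable[int]) -> Tuple[List[int], float]:
    """One task per machine slot: each download blocks everything behind it."""
    start = time.perf_counter()
    results = [process_scan(s) for s in scan_ids]
    return results, time.perf_counter() - start


def run_colocated(scan_ids: Iterable[int], slots: int) -> Tuple[List[int], float]:
    """Several tasks share the machine, so one task computes while others wait on I/O."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=slots) as pool:
        results = list(pool.map(process_scan, scan_ids))
    return results, time.perf_counter() - start


sequential_results, sequential_time = run_one_at_a_time(range(4))
colocated_results, colocated_time = run_colocated(range(4), slots=4)
print(f"sequential: {sequential_time:.2f} s, co-located: {colocated_time:.2f} s")
```

The `slots` parameter plays the role of the logical resource request: set it too high for the real CPU and memory available and the tasks start competing with each other, which is the oversubscription risk noted above.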

Lower framework overhead

In our legacy systems each step requires a separate Kubernetes Pod. Allocating, scheduling and running a Kubernetes Pod has considerable overhead, and the pipelines themselves also have substantial overhead. In some cases, this overhead might surpass the actual work being performed. With Ray, task allocation and scheduling become significantly more efficient, as most of the time they involve merely starting and stopping Linux processes.

Performance gains throughout the systems

The new scan processing pipeline has been successfully deployed to production. On average, scan processing time has been reduced by ~75%, cost per scan by ~60-70%, and the number of lines of code by more than 85%. The maintenance cost is also significantly lower.

While the benchmark system and mapping pipeline are still under development, we have already observed very promising results: for a large benchmark dataset and an expensive algorithm, the total benchmark time has been reduced from 5 days to 1 day, accompanied by a cost reduction of more than 30%. For mapping, we tested a list of POIs with varying numbers of scans and observed a ~20-50% reduction in mapping time.

The new systems have also saved a substantial amount of engineering time.

A big thank you to Shiny Singh, Bipeng Zhang, Tianqi Liu and Projit Bandyopadhyay for their invaluable contributions to the projects, making these achievements possible, and to the Ray team at Anyscale for the collaboration and support!

To learn more about how we are using Ray, please join the keynote speech by Niantic SVP Brian McClendon and our deep dive session at the Ray Summit 2023.

We’re excited about our progress so far, but we are just getting started. If you want to join in the effort, check out our jobs page.

Qi Zhou

Qi Zhou is a staff software engineer at Niantic, leading AR mapping infrastructure and generative AI initiatives.