November 12, 2024
Building a Large Geospatial Model to Achieve Spatial Intelligence
By Eric Brachmann and Victor Adrian Prisacariu

At Niantic, we are pioneering the concept of a Large Geospatial Model that will use large-scale machine learning to understand a scene and connect it to millions of other scenes globally.

When you look at a familiar type of structure, whether it’s a church, a statue, or a town square, it’s fairly easy to imagine what it might look like from other angles, even if you haven’t seen it from all sides. As humans, we have a “spatial understanding” that lets us fill in these details based on countless similar scenes we’ve encountered before. But for machines, this task is extraordinarily difficult. Even the most advanced AI models today struggle to visualize and infer missing parts of a scene, or to imagine a place from a new angle. This is about to change: spatial intelligence is the next frontier of AI models.

As part of Niantic’s Visual Positioning System (VPS), we have trained more than 50 million neural networks, with more than 150 trillion parameters, enabling operation in over a million locations. In our vision for a Large Geospatial Model (LGM), each of these local networks would contribute to a global large model that implements a shared understanding of geographic locations and can comprehend places yet to be fully scanned.

The LGM will enable computers not only to perceive and understand physical spaces, but also to interact with them in new ways, forming a critical component of AR glasses and fields beyond, including robotics, content creation and autonomous systems. As we move from phones to wearable technology linked to the real world, spatial intelligence will become the world’s future operating system.


What is a Large Geospatial Model?

Large Language Models (LLMs) are having an undeniable impact on our everyday lives and across multiple industries. Trained on internet-scale collections of text, LLMs can understand and generate written language in a way that challenges our understanding of “intelligence”. 

Large Geospatial Models will help computers perceive, comprehend, and navigate the physical world in a way that will seem equally advanced. Analogous to LLMs, geospatial models are built using vast amounts of raw data: billions of images of the world, all anchored to precise locations on the globe, are distilled into a large model that enables a location-based understanding of space, structures, and physical interactions.

The shift from text-based models to those based on 3D data mirrors the broader trajectory of AI’s growth in recent years: from understanding and generating language, to interpreting and creating static and moving images (2D vision models), and, with current research efforts increasing, towards modeling the 3D appearance of objects (3D vision models).

Geospatial models are a step beyond even 3D vision models in that they capture 3D entities that are rooted in specific geographic locations and have a metric quality to them. Unlike typical 3D generative models, which produce unscaled assets, a Large Geospatial Model is bound to metric space, ensuring precise estimates in metric units. These entities therefore represent next-generation maps, rather than arbitrary 3D assets. While a 3D vision model may be able to create and understand a 3D scene, a geospatial model understands how that scene relates to millions of other scenes, geographically, around the world. A geospatial model implements a form of geospatial intelligence, where the model learns from its previous observations and is able to transfer knowledge to new locations, even if those are observed only partially.

While AR glasses with 3D graphics are still several years away from the mass market, there are opportunities for geospatial models to be integrated with audio-only or 2D display glasses. These models could guide users through the world, answer questions, provide personalized recommendations, help with navigation, and enhance real-world interactions. Large language models could be integrated so that language understanding and spatial understanding come together, giving people the opportunity to be more informed and engaged with their surroundings and neighborhoods. Geospatial intelligence, as emerging from a large geospatial model, could also enable the generation, completion, or manipulation of 3D representations of the world to help build the next generation of AR experiences. Beyond gaming, Large Geospatial Models will have widespread applications, ranging from spatial planning and design to logistics, audience engagement, and remote collaboration.


Our work so far

Over the past five years, Niantic has focused on building our Visual Positioning System (VPS), which determines the position and orientation of a phone from a single image, using a 3D map built from people scanning interesting locations in our games and in Scaniverse.
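
In practice, determining a camera’s position and orientation from one photo comes down to matching pixels in the query image against points in a prebuilt 3D map and solving a perspective-n-point (PnP) problem. The sketch below is illustrative only: it uses off-the-shelf OpenCV rather than Niantic’s actual VPS pipeline, and the correspondence matching and map format are assumptions, not details from this post.

```python
# Illustrative only: a standard PnP-based pose solve, not Niantic's VPS code.
import numpy as np
import cv2


def estimate_camera_pose(points_3d, points_2d, K):
    """Recover a 6-DoF camera pose from 2D-3D correspondences.

    points_3d : (N, 3) map points in the map's metric coordinate frame
    points_2d : (N, 2) matching pixel locations in the query image
    K         : (3, 3) intrinsics of the phone camera
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32),
        points_2d.astype(np.float32),
        K.astype(np.float32),
        None,                       # assume an undistorted image
        reprojectionError=3.0,      # pixel tolerance for outlier matches
        iterationsCount=1000,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                 # rotation of the camera
    camera_position = (-R.T @ tvec).ravel()    # camera center in map coordinates
    return R, camera_position, inliers
```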

With VPS, users can position themselves in the world with centimeter-level accuracy. That means they can see digital content placed against the physical environment precisely and realistically. This content is persistent in that it stays in a location after you’ve left, and it’s then shareable with others. For example, we recently started rolling out an experimental feature in Pokémon GO, called Pokémon Playgrounds, where the user can place Pokémon at a specific location, and they will remain there for others to see and interact with.

Niantic’s VPS is built from user scans, taken from different perspectives, at various times of day, and across many points during the year, with positioning information attached, creating a highly detailed understanding of the world. This data is unique because it is taken from a pedestrian perspective and includes places inaccessible to cars.

Today we have 10 million scanned locations around the world, and over 1 million of those are activated and available for use with our VPS service. We receive about 1 million fresh scans each week, each containing hundreds of discrete images. 

As part of the VPS, we build classical 3D vision maps using structure-from-motion techniques, but also a new type of neural map for each place. These neural models, based on our research papers ACE (2023) and ACE Zero (2024), no longer represent locations using classical 3D data structures, but encode them implicitly in the learnable parameters of a neural network. These networks can swiftly compress thousands of mapping images into a lean neural representation. Given a new query image, they offer precise positioning for that location with centimeter-level accuracy.
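
As a rough illustration of what it means to encode a place in the learnable parameters of a network, the sketch below shows a scene-coordinate-regression head in the spirit of ACE: image features in, 3D scene coordinates out, with the resulting 2D-3D correspondences handed to a PnP solver as above. The layer sizes and names are invented for this example and are not Niantic’s actual architecture.

```python
# Illustrative sketch of a scene-coordinate-regression head in the spirit of ACE.
# Layer sizes and names are made up for this example, not Niantic's model.
import torch
import torch.nn as nn


class SceneCoordinateHead(nn.Module):
    """Maps per-patch image features to 3D scene coordinates.

    The map of a place lives entirely in these weights: given features of a
    query image, the network predicts where each patch lies in the scene's
    metric 3D coordinate frame. Those 2D-3D correspondences then feed a
    PnP/RANSAC solver to obtain the camera pose.
    """

    def __init__(self, feature_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(feature_dim, hidden_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, 3, kernel_size=1),  # (x, y, z) per feature location
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (B, feature_dim, H, W) from a shared image encoder
        return self.mlp(features)  # (B, 3, H, W) predicted scene coordinates
```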

Niantic has trained more than 50 million neural nets to date, with multiple networks often contributing to a single location. Combined, these networks comprise over 150 trillion parameters optimized using machine learning.


From Local Systems to Shared Understanding

Our current neural map is a viable geospatial model, active and usable right now as part of Niantic’s VPS. It is also most certainly “large”. However, our vision of a “Large Geospatial Model” goes beyond the current system of independent local maps.

An entirely local model might lack complete coverage of its location. No matter how much data we have available on a global scale, locally it will often be sparse. The main failure mode of a local model is its inability to extrapolate beyond what it has already seen and beyond the viewpoints from which it has seen it. Therefore, local models can only position camera views similar to the views they have already been trained on.

Imagine yourself standing behind a church. Assume the closest local model has seen only the front entrance of that church; it will not be able to tell you where you are, because it has never seen the back of the building. But on a global scale, we have seen thousands of churches, all captured by their respective local models at other places worldwide. No two churches are exactly the same, but many share common characteristics. An LGM is a way to access that distributed knowledge.

An LGM distills common information in a global large-scale model that enables communication and data sharing across local models. An LGM would be able to internalize the concept of a church, and, furthermore, how these buildings are commonly structured. Even if, for a specific location, we have only mapped the entrance of a church, an LGM would be able to make an intelligent guess about what the back of the building looks like, based on thousands of churches it has seen before. Therefore, the LGM allows for unprecedented robustness in positioning, even from viewpoints and angles that the VPS has never seen.

The global model implements a centralized understanding of the world, entirely derived from geospatial and visual data. The LGM extrapolates locally by interpolating globally.


Human-Like Understanding

The process described above is similar to how humans perceive and imagine the world. As humans, we naturally recognize something we’ve seen before, even from a different angle. For example, it takes us relatively little effort to backtrack our way through the winding streets of a European old town. We identify all the right junctions even though we have seen them only once, and from the opposite direction. This takes a level of understanding of the physical world, and of cultural spaces, that is natural to us but extremely difficult to achieve with classical machine vision technology. It requires knowledge of some basic laws of nature: the world is composed of objects which consist of solid matter and therefore have a front and a back. Appearance changes with the time of day and the season. It also requires a considerable amount of cultural knowledge: the shapes of many man-made objects follow specific rules of symmetry or other generic types of layout, often dependent on the geographic region.

While early computer vision research tried to decipher some of these rules in order to hard-code them into hand-crafted systems, it is now the consensus that the degree of understanding we aspire to can realistically only be achieved via large-scale machine learning. This is what we aim for with our LGM. We saw a first glimpse of the impressive camera positioning capabilities that emerge from our data in our recent research paper MicKey (2024). MicKey is a neural network able to position two camera views relative to each other, even under drastic viewpoint changes.

MicKey can handle even opposing shots that would take a human some effort to figure out. MicKey was trained on only a tiny fraction of our data, which we released to the academic community to encourage this type of research. MicKey is limited to two-view inputs and was trained on comparatively little data, but it still represents a proof of concept for the potential of an LGM. Evidently, to accomplish geospatial intelligence as outlined in this text, an immense influx of geospatial data is needed, a kind of data not many organizations have access to. Therefore, Niantic is in a unique position to lead the way in making a Large Geospatial Model a reality, supported by the more than one million user-contributed scans of real-world places we receive each week.
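
For context, the classical baseline for MicKey’s task is two-view relative pose from matched keypoints via the essential matrix, sketched below with OpenCV. This recipe recovers translation only up to an unknown scale, whereas MicKey predicts metric 3D keypoints and therefore a metric relative pose; the function below is an illustrative stand-in, not MicKey’s method.

```python
# Illustrative classical baseline for two-view relative pose. MicKey itself
# predicts metric 3D keypoints with a neural network, which this sketch does not.
import numpy as np
import cv2


def relative_pose(pts_a, pts_b, K):
    """Relative camera pose from matched keypoints in two images.

    pts_a, pts_b : (N, 2) float arrays of corresponding pixel coordinates
    K            : (3, 3) shared camera intrinsics
    Returns rotation R and a unit-length translation direction t; classical
    two-view geometry cannot recover the metric scale of the translation.
    """
    E, inlier_mask = cv2.findEssentialMat(
        pts_a, pts_b, K, method=cv2.RANSAC, prob=0.999, threshold=1.0
    )
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=inlier_mask)
    return R, t
```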


Towards Complementary Foundation Models

An LGM will be useful for more than mere positioning. In order to solve positioning well, the LGM has to encode rich geometric, appearance, and cultural information into scene-level features. These features will enable new ways of scene representation, manipulation, and creation. Versatile large AI models like the LGM, which are useful for a multitude of downstream applications, are commonly referred to as “foundation models”.

Different types of foundation models will complement each other. LLMs will interact with multimodal models, which will, in turn, communicate with LGMs. These systems, working together, will make sense of the world in ways that no single model can achieve on its own. This interconnection is the future of spatial computing – intelligent systems that perceive, understand, and act upon the physical world.

As we move toward more scalable models, Niantic’s goal remains to lead in the development of a large geospatial model that operates wherever we can deliver novel, fun, enriching experiences to our users. And, as noted, beyond gaming, Large Geospatial Models will have widespread applications, including spatial planning and design, logistics, audience engagement, and remote collaboration.

The path from LLMs to LGMs is another step in AI’s evolution. As wearable devices like AR glasses become more prevalent, the world’s future operating system will depend on the blending of physical and digital realities to create a system for spatial computing that will put people at the center.
