What is Gaussian splatting?
Today, if someone asked you, a regular Jane or Joe, to create an immersive 3D environment to explore with a virtual reality headset, you might throw your hands up in the air and laugh. Even if you knew where to start, you say, you don’t have the camera equipment or computing power to build something like that.

All that is about to change: The era of Gaussian splatting is here. Anyone, anywhere with a decent smartphone can quickly create highly detailed, photorealistic 3D scenery to share with anyone else in the world. Yes, the technology has a somewhat comical-sounding name—which comes from how it involves digitally tossing a 3D blob at a flat, 2D surface, thereby “splatting” it.
But don’t underestimate splats because they sound silly. They’re taking over 3D computer graphics at breakneck speed, and they’re about to change our relationship to the virtual world.
What are Gaussian splats? On the most basic level, they are a way of representing a scene as tiny, elliptical blobs that you can stretch or squish. You also give each blob a color, change how see-through it is, and blend it with neighboring splats until they form a continuous surface. When a machine-learning algorithm does this for millions of splats per scene, they come together to create 3D scenes much more realistic than any other form of 3D imagery.
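For readers who like to see things as data: here is a minimal sketch, in Python, of what one splat boils down to. The field names are illustrative, not the exact format any particular app or paper uses, but every splat carries roughly this handful of numbers.

```python
from dataclasses import dataclass

@dataclass
class GaussianSplat:
    """One fuzzy blob in a splatted scene (field names are illustrative)."""
    position: tuple[float, float, float]          # where the blob sits in 3D space
    scale: tuple[float, float, float]             # how far it stretches along each axis
    rotation: tuple[float, float, float, float]   # its orientation, as a quaternion
    color: tuple[float, float, float]             # RGB values between 0.0 and 1.0
    opacity: float                                # 0.0 = invisible, 1.0 = fully solid

# A scene is simply millions of these records, blended together at render time.
splat = GaussianSplat(
    position=(0.0, 1.2, -3.5),
    scale=(0.02, 0.05, 0.01),
    rotation=(1.0, 0.0, 0.0, 0.0),
    color=(0.8, 0.3, 0.2),
    opacity=0.7,
)
```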
Why are splats so revolutionary? That’s what this story aims to explain, but the most important part is this: Splats allow a computer to render photorealistic 3D scenes in real time, a feat that hasn’t truly been possible before.
On its own, this is exciting for many spatial graphics applications, but splats are opening up new ways for us normies to create and share experiences in 3D, too. There are several smartphone apps available now for creating splats, but this story will focus on Niantic’s Scaniverse app. Unlike other 3D scanning apps, it lets you capture and process splatted 3D scenes right on your smartphone. In addition, Niantic’s new WebXR app Into the Scaniverse, for the Meta Quest headset, allows you to step inside the splats you capture in Scaniverse and walk around.
It doesn’t take long to imagine what you might do with the ability to quickly and easily scan full scenes in 3D, especially as XR headsets gain wider adoption. For example:
- On vacation, you scan the Kitano Tenmangu Shrine in Kyoto to share with friends. At home, they put on a headset and walk around in the digital twin of the spot where you were just standing.
- Before you move away from your childhood home, you make a 360-degree scan of your hangouts so that you can put on your headset and step back inside the memory or show it to your own kids one day.
- You build an augmented reality doorway that teleports you from your sunset view of the beach to a foggy morning at Stonehenge.
And at the risk of serious understatement: This is just the beginning. As the technology around splats improves, you will be able to add new layers to your 3D environments. Imagine it: you invite your friend in Istanbul to search for clues to a scavenger hunt you placed near a statue you scanned in Mexico City. You combine splatted scenes with generative AI to add snow, or a horde of elves, or a Picasso filter to the real world. You build a 3D interactive tour of a place that is meaningful to you. The physical and virtual worlds are blending together in new ways, without any special equipment or training necessary.
How did these futuristic-sounding ideas suddenly become technology that’s just around the corner? Why are computer graphics researchers—a group not usually prone to hyperbole—using phrases like “new era” and “paradigm shift” to talk about Gaussian splatting?
This four-part story will explore these questions by looking at the history of splatting, how machine learning is changing computer graphics, how Niantic is making use of this incredible new technology, and what it means for how we relate to the virtual world.
Part One: Blobs vs. triangles
For the first part of our story, we delve into the history of computer graphics and some of the problems that Gaussian splats solve.
Very broadly speaking, there are two basic approaches to rendering in computer graphics: polygons (typically represented as triangles) and points (or blobs, as they later become with splats). They are the Coca-Cola and Pepsi, the Capulets and the Montagues, the Barça and Real Madrid of the graphics world.
Okay, maybe that’s overstating it a bit, but Marc Levoy, Stanford computer science professor emeritus and creator of Google Street View, playfully described these two approaches as existing in a “creative tension” dating all the way back to the beginning of computer graphics.
Whether you’re using polygons or points, these basic shapes are called primitives: the building blocks the computer uses to render the more complex final image. When you connect polygons into a surface, the result is called a “mesh”; when you scatter points in space, you get a “point cloud.”
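To make the two primitives concrete, here is a minimal sketch in Python with made-up toy data (not from any real model): a mesh stores shared vertices plus triangles that connect them, while a point cloud is just a pile of positions, usually with a color attached to each point.

```python
import numpy as np

# A triangle mesh: shared vertices, plus faces that index into them.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
faces = np.array([[0, 1, 2],   # each row is one triangle, by vertex index
                  [0, 1, 3],
                  [0, 2, 3]])

# A point cloud: free-floating positions, typically with a color per point.
points = np.random.rand(1_000, 3)   # 1,000 random 3D positions
colors = np.random.rand(1_000, 3)   # an RGB color for each point
```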
Polygons and points each have their advantages. Points are better at representing curved or detailed surfaces and things without defined edges, like smoke or clouds. They’re also better at representing dynamic scenes with lots of different objects.
A classic problem with points, however, is that rendering millions of points for complex scenes can get pretty computationally expensive. Another is that they can leave holes in surfaces after rendering if there aren’t enough points in a particular area. Splatting tries to address these holes. The fuzzy blobs do a better job covering object surfaces than individual points, because you can smoosh them to cover gaps.
Splatting, which was invented in 1989, mostly sat on the margins of computer graphics until Matthias Zwicker and his colleagues invented something called EWA splatting in 2001.
Polygons, on the other hand, are easier and more efficient to render and can do a pretty good job at detailed surfaces when you make the polygons small enough—Toy Story 4 had one shot with a trillion polygons in it. Because of these efficiency advantages, polygons came to dominate the computer graphics industry from the 1990s onward. Professional graphics hardware was optimized for triangle-based rendering. EWA splatting was a neat idea, but since industry hardware didn’t support it, it didn’t really catch on.
Point-based rendering didn’t disappear, but it took a back seat. Computer-animated movies, video games, product design, medical imaging, and architecture applications of computer graphics have mostly used polygon meshes for decades.
Triangles fall short
In the last several years, however, meshes have started to run into some ceilings: real-time rendering and novel-view synthesis.

Real-time rendering is the term for images that a computer renders—that is, creates an image from the raw data—immediately as you’re interacting, so the experience feels seamless. Meshes are great for real-time rendering, but there’s a catch: The more polygons in your meshes, the slower your render. Video games will never look as detailed as a Pixar movie, because each scene in a video game has to be rendered as a player walks into it. Pixar movies can be rendered over the course of years in a giant server farm, because they aren’t interactive.
Meshes also have some shortcomings in synthesizing novel views, which is to say reconstructing a new camera view of a scene using only 2D images taken from other camera viewpoints. Novel-view synthesis is also sometimes called image-based rendering (IBR), because it builds 3D representations using only still images as input. (Object-based rendering, in contrast, requires that you create a 3D model using individual 3D assets.)
Meshes are adequate for novel-view synthesis of things with smooth surfaces under consistent lighting, but points often do a better job with real-world images because they capture more fine details and information about opacity, color, and reflections—all of which can change a lot from one angle to the next, especially if the lighting is uneven. These are called view-dependent effects.
Novel-view synthesis allows computers to do something that humans do almost without effort: form a 3D model of what they’re seeing based on incomplete input. Because it’s so useful for computer vision, it’s a nut researchers have been trying to crack for decades. Without it, a computer can never “see” around a corner or make predictions about objects—it only knows what it sees directly. It’s an important capability for computers to have in applications like robotics, autonomous driving, and extended reality.
We’ve seen that meshes are great, but there’s a limit to the detail they can provide in real-time renders, and they aren’t ideal for tackling this big problem in computer vision that researchers have wanted to solve for a long time.
So, how did long-neglected blobs come from behind to do what triangles can’t? The next part of our story chronicles the triumphant return of splats as 3D graphics enters the machine learning age.
Part Two: Blobs to the rescue
For more than two decades, point-based rendering was less popular than meshes. In recent years, however, researchers got a new tool: deep learning. In the context of machine learning algorithms, Gaussians turn out to have one very important advantage over triangles: They are differentiable. In the simplest terms, this means a learning algorithm can calculate exactly how a tiny change to each blob would change the final image, so it can adjust blobs much more easily than it can adjust triangles.
At first, because meshes were still the dominant way of doing graphics, researchers tried using deep learning on meshes. In 2018, Niantic Chief Research Scientist Gabriel Brostow and his team thought it might be interesting to see if a neural network could help with the artifacts meshes created during novel-view synthesis. “This was one of the first steps into neural networks being used to help with image-based rendering,” Brostow said in an interview. “Image-based rendering ignored machine learning for a long time.”
Brostow, who also works as a computer science professor at University College London, teamed up with French National Institute for Research in Digital Science and Technology (INRIA) senior researcher George Drettakis, along with their graduate students. The team devised a new deep learning approach using meshes to create an immersive 3D environment you could “look around” inside. They had some success, but there were issues with blurring, memory, and preprocessing times. Plus, it took a lot of computational power.
Two years later, in 2020, a new deep learning approach changed computer graphics overnight: Researchers at UC Berkeley invented an algorithm that allowed a neural network to represent a scene as a neural radiance field (NeRF).
Radiance fields are a mathematical way of describing the color and brightness of light at any arbitrary point in a scene, as seen from any direction. Unlike meshes, a neural radiance field doesn’t need explicit geometry; rather, it encodes appearance information in a volumetric mathematical function. In the Berkeley researchers’ view, meshes didn’t work as a useful format, because you couldn’t easily represent triangle meshes in a way the neural network could tweak.
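In code terms, a radiance field is simply a function: hand it a 3D position and a viewing direction, and it hands back a color and a density. In a NeRF, that function is a trained neural network; the toy stand-in below is hand-written purely to show the inputs and outputs, not to resemble any real model.

```python
import numpy as np

def radiance_field(position, view_direction):
    """Toy radiance field: (x, y, z) plus a view direction in, color and density out.

    In a real NeRF this mapping is a neural network learned from photos;
    this placeholder just illustrates the shape of the function.
    """
    x, y, z = position
    density = np.exp(-(x**2 + y**2 + z**2))       # arbitrarily denser near the origin
    color = np.clip([abs(view_direction[0]),      # color shifts with viewing angle,
                     abs(view_direction[1]),      # i.e. a view-dependent effect
                     abs(view_direction[2])], 0.0, 1.0)
    return color, density

color, density = radiance_field(position=(0.1, 0.0, -0.2), view_direction=(0.0, 0.0, 1.0))
```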
This insight led to spectacular, photorealistic novel views using only 20–50 training images with any ordinary camera—a new way of doing IBR. Overnight, the brightest minds in computer graphics at companies like Microsoft, Google and NVIDIA started racing to improve on this initial algorithm, and papers on NeRFs continue to come out every month. One such researcher was Brostow’s former PhD student Peter Hedman, who worked on an immediate successor to NeRFs, called Mip-NeRF 360, at Google.
NeRFs have a few big problems. They are computationally very expensive, because you have to retrain a neural network for every new scene. It’s also very difficult to do real-time rendering with NeRFs, which makes them impractical for applications that respond to user input like video games or extended reality.
Still, this new approach and its promising results immediately grabbed the attention of the 3D graphics industry. This included Keith Ito, the engineer who built Scaniverse. He saw the potential to create immersive 3D scenes from ordinary smartphone pictures, if only you could find a way to render them more efficiently. He began thinking about how to create a “NeRF-ish” pipeline that could render scans on a mobile phone. More on him in part three.
Machine "learning" without the neural network
Meanwhile, Drettakis had spent a few years figuring out how to use machine learning algorithms to dramatically simplify novel-view synthesis for casual users. As it turns out, much of modern AI runs on graphics processing units (GPUs). Drettakis already knew more about the capabilities of GPUs than many of the AI researchers new to the field, because he’d been using them for 25 years as a graphics researcher.

“They were unfamiliar with such equipment and therefore unable to exploit its full potential,” he told an INRIA interviewer as part of the announcement of a major prize he won this year. “My colleagues and I were able to adapt their algorithms for our own purposes and optimize them by taking full advantage of the GPUs’ capabilities.”
He gathered his postdoc Bernhard Kerbl and PhD student Georgios Kopanas, who were part of his GraphDeco team at INRIA, as well as Thomas Leimkühler, now of the Max-Planck-Institut für Informatik. Kopanas had been developing point-based rendering methods as part of his PhD dissertation, because he’d found they could represent radiance fields faster and with better quality than NeRFs.
Crucially, his point-based method was differentiable, meaning that an algorithm could make incremental changes over several rounds of training to better match the original camera images. This point-based method evolved into using 3D Gaussian splats, those adaptable, fuzzy blobs you can stretch and squish and rotate. Unlike NeRFs, which bury a scene’s appearance inside a neural network and have to query it to reconstruct how the scene looks from any angle, Gaussian splats directly store visual details for each point in 3D space.
But they weren’t done there.
Here’s the really important part: Drettakis used his years of experience with GPUs to help craft an algorithm that used the techniques of machine learning without the neural network. Their algorithm uses ML techniques like gradient steps and loss functions, but it avoids the complexity and computational demands of hidden layers, which are a hallmark of neural networks. “Our method is not strictly speaking machine learning, but the method uses machine-learning techniques to train and improve rendering quality,” Drettakis said.
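As an illustration of that distinction, here is a hedged sketch, in Python with PyTorch, of what training with machine-learning techniques but no neural network can look like: the splat parameters themselves are adjusted directly by gradient descent against the training photos, with no hidden layers in between. The toy “renderer” below just blends colors by opacity, which is far simpler than the real 3D Gaussian splatting renderer, but the render-compare-step loop is the general shape of the idea.

```python
import torch

# Splat parameters optimized directly -- no neural network, just the blobs themselves.
colors = torch.rand(500, 3, requires_grad=True)     # one RGB color per splat
opacities = torch.rand(500, 1, requires_grad=True)  # one opacity per splat

# Stand-in for a pixel sampled from a real training photograph.
target_pixel = torch.rand(3)

optimizer = torch.optim.Adam([colors, opacities], lr=0.01)

for step in range(200):
    optimizer.zero_grad()
    # Toy "renderer": blend every splat's color, weighted by its opacity.
    weights = torch.sigmoid(opacities)
    rendered = (weights * colors).sum(dim=0) / weights.sum()
    # Loss function: how far is the rendered pixel from the training photo?
    loss = torch.nn.functional.mse_loss(rendered, target_pixel)
    loss.backward()   # gradients flow straight into the splat parameters
    optimizer.step()  # a gradient step nudges the blobs toward the photo
```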
This fundamental innovation instantly made 3D Gaussian splats much faster and less computationally costly to process than NeRFs, and they had higher visual quality to boot. “This was revisiting an old representation,” Brostow said, referring to the invention of point-based rendering more than 30 years earlier, “but coming at it with a very new and clever approach.”
When Drettakis and his colleagues published their paper on 3D Gaussian splatting in the summer of 2023, it became an instant worldwide hit in a way that few computer graphics research papers have ever been. A few months after its release, it won Best Paper at SIGGRAPH, the main conference for the world’s computer graphics community. Less than a year after it came out, there are hundreds of new papers coming out each month that cite it.
Still, you would have needed to be a computer graphics professional, or at least a pretty intense computer graphics nerd, to understand what this all meant in 2023, let alone make any use of the innovation.
The next part of this story is about how Niantic saw what this algorithm could mean for bringing 3D photography to billions.
Part Three: You, Too, Can Be a 3D Photographer
While all the research on radiance fields was happening, Niantic had been busy thinking about how to build a 3D map of the world. That would be a wild ambition for most, but not CEO John Hanke and SVP of Engineering Brian McClendon, who co-founded Keyhole, which became Google Earth, and then launched and grew Google Maps.
A detailed, photorealistic 3D version of the world forms the foundation for many of their wide-ranging future plans. These include: creating a spatial platform for XR game designers and businesses to build on top of; making a base layer for visual content creators to add AI to real-world scenes; giving 3D cartographers a place to share their maps of world heritage sites or special places of interest; and improving virtual accessibility to difficult-to-reach locations.
Creating photorealistic views requires a whole lot of photography, and from experience with Google Maps, they knew that crowdsourcing was the best way to get the required breadth. For this to work, creating those 3D scans and sharing them to the map has to be incredibly easy, and doable on devices people already own. If the steps get too complicated, casual users give up quickly.
That’s why, in 2021, Niantic acquired Scaniverse, an app offering a simple way to capture, edit, and share 3D content with just a smartphone.
Scaniverse was built to create 3D scans with meshes, and it still does this extremely well. As discussed in part one, meshes have their limitations. In particular, they are not ideal for reconstructing entire 3D scenes with reflective or highly detailed surfaces, and they work best under consistent lighting. They’re better for isolated objects. But of course, if you are building a map of the world, you want people to create 3D images with lots of background details. That’s what makes for a full scene.
So, Niantic went on the hunt for a technology that would make it possible to get better background information in 3D scans. But any such new technology had to meet another crucial requirement: It had to offer real-time rendering. Without it, Scaniverse would slow down and become unusable. One of the people on the hunt for the technology that would unlock full-scene scanning was Keith Ito, who, prior to joining Niantic with the Scaniverse acquisition, worked on several innovations including Google Maps Navigation and Terrain Mode.
Ito wanted to make sure that any algorithm he used for Scaniverse would be something that could process 3D scans on a user’s device. All the other 3D scanning apps available required you to process your scans in the cloud, which meant you had to wait for your raw data to upload, join a processing queue, and then download the final scan. This could take anywhere from 20 minutes to days, depending on the app, and if you didn’t have a stable internet connection during that time, you couldn’t see your scans until you got service. These apps also tended to cost money, because server processing power is expensive. For those of us old enough to remember film cameras, this is the equivalent of dropping off a roll at the photo store, both in terms of the delay and the expense.
As with film, there’s also a practical downside to server-side processing: How frustrating would it be to finally get your processed scan, long after you’ve left the location, only to see that it’s missing a crucial part of the scene, or you didn’t take enough photos to get the quality you wanted? This delay is especially discouraging for people who are new to 3D scanning.
From computer science lab to smartphone
In the summer of 2023, after NeRFs and NVIDIA’s follow-up called Instant NGP came out, Ito decided to start work on a “NeRF-ish” pipeline that could render scans on a user’s device instead of requiring an expensive GPU. The superior background detail they promised was attractive, but NeRFs had major drawbacks, including the computing power they required to train the neural network on each scene. They were also difficult to render, and it wasn’t clear how to make them into useful game assets.

Meanwhile, in London, Niantic spatial computing researcher Charlie Houseago saw the 3D Gaussian splatting paper. He presented it to the Niantic research group in London and shared it on a Niantic internal Slack channel. It immediately captured attention across the company.
According to Nicholas Butko, senior director of engineering, Ito wasn’t initially all that impressed by splats. They required about 20GB of RAM during training, more than a smartphone could handle, making them impractical for on-device processing. The output files were also hundreds of megabytes, which would quickly eat up storage on a user’s phone. Still, they were much easier to render than NeRFs and used graphics hardware pipelines that were similar to what Scaniverse was already using for its meshes.
Ito and his team sat down to try to see if they could use some of Scaniverse’s existing features to make the Gaussian splats work more efficiently. He set a goal of training a splat with only 1GB of RAM, which would make it possible to process on a smartphone.
Within a few months, the team had found numerous ways to cut back on memory usage and processing time. For example, instead of the power-hungry preprocessing step that the paper used to create a set of Gaussians for training, they repurposed Scaniverse’s feature detection and multi-view stereo depth mapping to quickly seed positions and colors for the fuzzy blobs. They also managed to build a compressed format (now open-sourced as SPZ) that drastically reduced the file size of each 3D scan without reducing the visual quality of the scan.
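As a rough illustration of the seeding idea (the function and field names here are hypothetical, not Scaniverse’s actual code): rather than running a heavy preprocessing step, each Gaussian starts out at a point you already have, with a position and color taken from a depth-mapped point cloud, and training refines everything from there.

```python
import numpy as np

def seed_gaussians(point_positions, point_colors, initial_scale=0.01):
    """Initialize Gaussian parameters from an existing point cloud.

    point_positions: (N, 3) array of 3D points (e.g. from multi-view stereo depth maps)
    point_colors:    (N, 3) array of RGB colors sampled from the source photos
    """
    n = point_positions.shape[0]
    return {
        "positions": point_positions.copy(),          # start each blob at a known point
        "colors": point_colors.copy(),                # start with the photo's color
        "scales": np.full((n, 3), initial_scale),     # small, roughly round blobs
        "rotations": np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),  # identity quaternions
        "opacities": np.full((n, 1), 0.5),            # semi-transparent to begin with
    }

# Toy point cloud standing in for real depth-mapped points from a phone scan.
positions = np.random.rand(10_000, 3)
colors = np.random.rand(10_000, 3)
gaussians = seed_gaussians(positions, colors)
```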
By Thanksgiving 2023, they were able to fully process splats on a smartphone in about a minute. (If server-side processing is film, on-device is Polaroid.) When Scaniverse launched version 3.0 in March 2024, just seven months after the initial paper came out, they gave the world the power to process 3D Gaussian splats locally on iOS devices.
The two pieces of the 3D Gaussian splat story, the theory behind the splats and their practical implementation, have now come together. Our next section is about how these innovations are changing the way all of us relate to the virtual world.
Part Four: The World According to Splats
Charles Wheatstone invented stereoscopy in 1838 to demonstrate how humans perceive depth. Stereoscopy uses two photos, taken simultaneously at slightly different angles, to trick the brain into thinking it sees a 3D image. The stereo-viewer that American physician-poet Oliver Wendell Holmes invented in 1861 bears a striking resemblance to the XR headsets we use today.
Holmes found stereoscopy to be so enthralling, he devoted an entire essay to it in The Atlantic in 1859. His words could just as easily apply to the experience of looking at hyper-real 3D imagery in an XR headset for the first time: “There is such a frightful amount of detail, that we have the same sense of infinite complexity which Nature gives us.”
Imagine what Holmes would have thought of splats! Stereoscopy swiftly became a popular entertainment after its invention. By the end of the 19th century, millions in Europe and the United States had access to a stereoscope and stereo cards with photographs from locations around the world.
In the early 20th century, door-to-door salesmen from the Pennsylvania-based Keystone View Company sold a box set of stereo cards called “Tour of the World.” Keystone spent decades building a collection that included more than 50,000 images from six continents.
It was complicated to produce this vast collection. The Keystone View Company employed a staff of stereoscopic photographers who lugged around large wooden cameras with two lenses and glass plates. Good stereographs also required considerable skill on the part of the photographer; many things could go wrong with the depth cues or field of view and ruin the 3D effect.
In contrast, Scaniverse users went out and made 38,000 splats in the first six days after the feature launched.
3D photography, like 2D photography, takes on a whole new level of fun when you get to be the one going out to decide what to capture, instead of waiting for the salesman to show up at your door.
Right now, you can open the Scaniverse app on your phone and, with no photography background at all (except maybe watching the quick tutorial for capturing good splats), create a 360-degree scan of the street you are walking down that will look stunning. You can even create a movie of your splat with different camera moves that mimic tracking and crane shots—visuals that used to require thousands of dollars and years of experience to create.
And now that Into the Scaniverse—Niantic’s new WebXR app for the Meta Quest—is out, you can walk around inside your scan on a VR headset. What will this mean for how we relate to the virtual world?
The future of splats
One person who has thought a lot about how everyday people using 3D scans will change the world is Michael Rubloff, the 3D scanning enthusiast behind RadianceFields.com. It’s a website dedicated to news about advances in 3D scene reconstruction done with radiance fields.

In March 2024, he was driving with a friend from San Jose after NVIDIA GTC, the chipmaker’s annual conference on artificial intelligence, when Scaniverse released version 3.0 with on-device splat processing. When he saw the update, he asked his friend who was driving to pull over. “I tried to go scan his car,” Rubloff said in an interview. “It was the first thing I did.”
Later, he used photographs he’d taken a couple years earlier to create NeRFs and reprocessed them as splats. He was astonished at the new level of detail they produced—the scenes suddenly had more information in them than before. “And it was like, ‘Wow! This is insane,’” he said.
Rubloff, who has made more than 6,000 scans, proselytizes about how 3D scanning will enrich our everyday experiences. He often uploads scans to the Scaniverse map, in part because he wants to share interesting things around where he lives in Manhattan, but also because he hopes posting awesome scans will inspire others to try it out for themselves.
“If we’re looking at imaging as a way to document human life,” he said, “this provides a better and more accurate experience of a point in time than a 2D photographer could ever capture.”
For him, the advent of radiance field rendering in general, and 3D Gaussian splats in particular, has marked a turning point. “Even understanding that this is possible probably feels like science fiction to most people,” he said. “It doesn’t feel like it should be possible right now to have a hyper-real 3D version of the planet.”
He’s not alone in his excitement. The Scaniverse map already has the world’s largest collection of Gaussian splats. It features more than 10,000 splats from more than 100 countries, each shared by a person who found something interesting enough to capture and share. It’s the Tour of the World, but built by people everywhere adding their own unique perspective.
Niantic keeps pushing the boundaries of what’s possible with this technology, the latest frontier being VR headsets. Into the Scaniverse is a web app that, for the first time, renders user-created splats on Meta Quest. That means you can scan and see your own splats, or view any others shared on the map.
3D Gaussian splats are a new technology, and the incredible images they produce today are just the beginning. For now, the 3D scans they create aren’t able to show motion, but researchers are working on 4D scans that can show motion and change over time. Someday soon, you will be able to play a game in your headset that takes place in a 3D simulation of a plaza halfway across the world—with a person who is actually standing in the real plaza.
More than anything, this technology has the potential to connect people in a physical way over vast distances, including beyond Earth. Imagine a NASA rover splatting the Moon to create an immersive 3D simulation that Artemis astronauts at Johnson Space Center use to train for building a lunar base. Using the same scan, at an elementary school in the next town over, third graders survey the same scene in headsets to look out at the vastness of space from the surface of the Moon—then dream about where they will explore next.
Kirsten M. Johnson is a freelance technology writer and former Associated Press journalist.