By Maoyang Li, Sil Tian, and Mingzhou Lyu
Pokémon GO is a game that is best experienced in real time, especially when it comes to engaging and connecting players. After the raid invite feature launched in 2020, we saw a rise in complaints about notification latency: an invite could take more than a few seconds to arrive, causing players to miss raid battles and compromising their ability to play in real time.
Building a better real-time notification system
Prior to PushGateway, the Pokémon GO client polled the player’s inbox to retrieve new in-app notifications (including raid invites, gifting, etc.) at a fixed interval of 15 seconds. This behavior had some drawbacks, including extra load on servers due to redundant polling (the actual rate of in-app notifications is much lower than the polling rate) and high latency for in-game notifications. For example, if a notification was sent to the player right after the client had queried, the player wouldn’t see it until the next query, up to 15 seconds later. The average notification latency was around nine seconds, which was quite noticeable for raid invites.
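To make the polling cost concrete, here is a minimal sketch that estimates the average wait caused by the polling interval alone. It assumes notifications arrive at a uniformly random moment within a 15-second polling window; the function name and the simulation itself are illustrative and are not taken from the Pokémon GO codebase. Under that assumption the interval contributes about 7.5 seconds on its own, and the rest of the roughly nine seconds we measured would come from other parts of the path that this sketch does not model.

```python
import random


def mean_polling_wait(interval_s: float, samples: int = 100_000, seed: int = 0) -> float:
    """Estimate the average wait between a notification being sent and the next poll.

    Assumption: the notification arrives at a uniformly random point
    inside one polling interval.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        arrival = rng.uniform(0.0, interval_s)  # when the notification is sent
        total += interval_s - arrival           # time left until the next poll
    return total / samples


if __name__ == "__main__":
    print(f"mean wait for a 15 s interval: {mean_polling_wait(15.0):.2f} s")
```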
Instead of having each client constantly poll for a player’s new in-app notifications, we wanted to build a system that would allow the server to send data to the game client in real time. This would reduce latency for players and help reduce server load and cost by avoiding redundant traffic. In the end, we created a new publish-subscribe service called PushGateway that sends data from Niantic backend servers to game clients in real time over a WebSocket connection.
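The publish-subscribe idea can be illustrated with a small in-memory sketch. Everything here is hypothetical: `InMemoryBroker`, its methods, and the use of plain callbacks in place of WebSocket connections are assumptions made for illustration, not PushGateway's actual interface.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class InMemoryBroker:
    """Toy publish-subscribe broker keyed by player ID (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, player_id: str, on_message: Callable[[str], None]) -> None:
        # In a real deployment the callback would write to the player's open connection.
        self._subscribers[player_id].append(on_message)

    def publish(self, player_id: str, message: str) -> int:
        # Push the message to every subscriber right away; no polling involved.
        callbacks = self._subscribers.get(player_id, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


if __name__ == "__main__":
    broker = InMemoryBroker()
    received: List[str] = []
    broker.subscribe("player-1", received.append)
    broker.publish("player-1", "raid invite")
    print(received)
```

The point of the sketch is the direction of the data flow: the backend calls `publish` when something happens, and a subscribed client receives it without asking.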
The Problem
Managing a large number of concurrent WebSocket connections proved difficult. We needed to build a WebSocket client that could recover from network errors and disconnections, so that it could keep receiving downstream messages even if the player had a bad network connection or the app had been sent to the background and therefore disconnected. We had to develop a cost-effective, low-latency system that could successfully reconnect to our servers and catch up on anything missed while the player was offline, so that we could provide a seamless real-time player experience.
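One common way to meet both requirements is to retry with capped exponential backoff and to have the client remember the last message it saw, so that it can request only what it missed after reconnecting. The sketch below assumes that approach; the article does not describe the actual protocol, and `backoff_delays`, `MessageLog`, and `Client` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def backoff_delays(base_s: float, cap_s: float, attempts: int) -> List[float]:
    """Delay before each reconnect attempt: doubles every time, capped at cap_s."""
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]


@dataclass
class MessageLog:
    """Server-side log of messages for one player, numbered from 1 (hypothetical)."""
    _messages: List[Tuple[int, str]] = field(default_factory=list)

    def append(self, payload: str) -> int:
        seq = len(self._messages) + 1
        self._messages.append((seq, payload))
        return seq

    def since(self, last_seen_seq: int) -> List[Tuple[int, str]]:
        return [m for m in self._messages if m[0] > last_seen_seq]


@dataclass
class Client:
    """Client that tracks the last sequence number it has processed (hypothetical)."""
    last_seen_seq: int = 0
    inbox: List[str] = field(default_factory=list)

    def catch_up(self, log: MessageLog) -> int:
        # After reconnecting, fetch only the messages missed while offline.
        missed = log.since(self.last_seen_seq)
        for seq, payload in missed:
            self.inbox.append(payload)
            self.last_seen_seq = seq
        return len(missed)


if __name__ == "__main__":
    log, client = MessageLog(), Client()
    log.append("raid invite")
    client.catch_up(log)      # client is online and receives the invite
    log.append("gift")        # sent while the client is disconnected
    print(backoff_delays(1.0, 8.0, 5), client.catch_up(log), client.inbox)
```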
From stateful to stateless to automatically scaling
Originally, we designed PushGateway as a stateful service, with each server handling different kinds of data, which was complicated to manage and scale. We later added a Redis (in-memory data store) component and moved the state there, turning the service into a stateless deployment that we could scale automatically based on load.
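The effect of moving state out of the servers can be shown with a small sketch. It assumes, purely for illustration, that the shared state is a list of pending notifications per player; `SharedStore` is a plain in-process stand-in rather than a Redis client, and `StatelessServer` and the `pending:` key format are hypothetical.

```python
from typing import Dict, List


class SharedStore:
    """In-process stand-in for an external in-memory store (not a Redis client)."""

    def __init__(self) -> None:
        self._lists: Dict[str, List[str]] = {}

    def push(self, key: str, value: str) -> None:
        self._lists.setdefault(key, []).append(value)

    def drain(self, key: str) -> List[str]:
        # Return and remove everything stored under the key.
        return self._lists.pop(key, [])


class StatelessServer:
    """Server instance that keeps no per-player state of its own (hypothetical)."""

    def __init__(self, name: str, store: SharedStore) -> None:
        self.name = name
        self._store = store

    def enqueue(self, player_id: str, notification: str) -> None:
        self._store.push(f"pending:{player_id}", notification)

    def deliver(self, player_id: str) -> List[str]:
        return self._store.drain(f"pending:{player_id}")


if __name__ == "__main__":
    store = SharedStore()
    server_a = StatelessServer("a", store)
    server_b = StatelessServer("b", store)
    server_a.enqueue("player-1", "raid invite")
    # A different instance can serve the same player, so instances are interchangeable.
    print(server_b.deliver("player-1"))
```

Because any instance can serve any player, instances can be added or removed as load changes, which is what makes automatic scaling straightforward.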
We ran into some exceptions with WebSocket connections after we launched the feature. We track many metrics for performance and error rates on both the server and the client, so we were able to see the issues as soon as they came up. Some errors came from multithreading, some from authentication. We were able to hunt down the edge cases and fix the bugs in the following release.
PushGateway is used by Pokémon GO and other services, and is deployed on Google Kubernetes Engine (GKE). We use Google Cloud Platform (GCP) Private Service Connect (PSC) to connect the different services. At times we observed traffic issues whose cause we couldn’t find in our own system; working with GCP support, we determined that they were caused by events or issues on the GCP side. We were able to further improve service reliability by working with Google’s support team to update several traffic settings for GKE and the load balancer.
To avoid downtime and enable progressive rollout of new versions during upgrades, we introduced blue-green deployment. It helped us detect bugs and performance degradation in a new version, and it let us roll back quickly and easily.
Improving in-app latency from 9 seconds to 1 second
PushGateway allowed us to improve in-app notification latency from an average of nine seconds to one second. We are also able to show more accurate online status by checking clients’ WebSocket connections. This brings a more real-time experience to our players. On our backend, it also helped us reduce the load from querying the player inbox by 85%, making the system more efficient and reliable.
Pokémon GO clients use a similar polling mechanism to get updates for a player’s nearby map objects, so we recently applied this push model to some map updates as well. This now allows players to see nearby lures, raid spawns, and gym team changes sooner. We are continuing to look for other features that can benefit from this push service.
If you’re interested in building the infrastructure for games like Pokémon GO, join us!
Maoyang Li
Maoyang is a Senior Software Engineer working with the Pokémon GO team. He primarily focuses on game server optimization.
Sil Tian
Sil is a Senior Software Engineer working on the Pokémon GO client. His interests involve game development and player experience improvements.
Mingzhou Lyu
Mingzhou is a Staff Software Site Reliability Engineer based in Tokyo, Japan. He makes sure that systems at Niantic are reliable and fast.