On Tuesday, May 19th 2026, at 16:25 UTC, we were alerted that game servers were not being scheduled in cloud locations. Across all cloud sites, new game server workloads were stuck in the scheduling phase, i.e., game servers were not fully starting up, because the underlying nodes were not being marked eligible for receiving workloads.
The cause was a Kubernetes node taint that signals node readiness and was not being removed. A taint is a key-value pair, combined with an effect, on a node object. Such taints can act as barriers, and we use them to prevent premature workload scheduling onto nodes while internal services and daemons are still starting up on new hosts. Conversely, for a workload to run on a tainted node, the workload needs to tolerate the taint. The selectors for such tolerations may be more or less strict, i.e., equality-based (key and value must match) or existence-based (only the key must match), where the former is stricter than the latter. In the past, we have kept those selectors as strict as possible to prevent leaking tolerations for overlapping taint keys, as some taints arise from the underlying platform (GCP, AWS, or baremetal cluster provider) and are thus beyond our immediate control.
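The matching rules described above can be sketched in a few lines of Python. This is a simplified model of the Kubernetes semantics for illustration, not the scheduler's actual code, and the taint key `example.com/node-ready` is a hypothetical placeholder, not one of our real taint names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str = ""
    effect: str = "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes default; the alternative is "Exists"
    value: str = ""
    effect: str = ""         # an empty effect matches every taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches this single taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # Existence-based: the value is ignored; an empty key matches any key.
        return toleration.key == "" or toleration.key == taint.key
    # Equality-based: both key and value must match exactly.
    return toleration.key == taint.key and toleration.value == taint.value


def schedulable(tolerations, taints) -> bool:
    """A workload fits a node only if every hard taint is tolerated."""
    hard = [t for t in taints if t.effect != "PreferNoSchedule"]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Hypothetical example: a strict and a lenient toleration for the same taint.
ready_taint = Taint("example.com/node-ready", "false")
strict = Toleration("example.com/node-ready", "Equal", "false", "NoSchedule")
lenient = Toleration("example.com/node-ready", "Exists")
print(tolerates(strict, ready_taint), tolerates(lenient, ready_taint))
```

The strict form stops matching as soon as the taint's value changes, while the lenient form keeps matching any value under the same key, which is the trade-off between coupling and leaking tolerations.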
That strictness couples service availability strongly to very specific taint names: minuscule changes to a taint’s key or value can prevent the workload from being scheduled, and that dependency is not immediately visible during admission into the cluster. Such scheduling taints are also only scheduling-time constraints, so already running workloads are not affected by changes to taints. Changes to tolerations, however, mutate the workload’s specification, which in turn causes a new rollout to start.
In this particular instance, one such daemon was itself blocked from starting up because it did not tolerate another taint that was being added as part of internal platform maintenance in preparation for a GameFabric feature. Due to the convoluted nature of such taint-toleration pairs, we only identified the specific changed taint, and by extension the missing toleration, at 16:37 UTC, after which a fix was prepared and pushed to production systems at 16:41 UTC. Here, our migration to pull-based GitOps over the last months showed its true value once more, because it saved us from having to apply the fix individually per cluster. Game servers started up shortly afterwards, as nodes became ready to serve game servers.
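One way to make such a hidden dependency visible before it reaches production is a small pre-flight check that compares the taints expected on new nodes against each daemon's tolerations. The following is a minimal sketch under stated assumptions: all taint keys and the daemon name are hypothetical, effects are ignored for brevity, and this is not the check we actually run.

```python
def untolerated(taints, tolerations):
    """Return the taints that none of the given tolerations match.

    taints:      list of (key, value) pairs
    tolerations: list of (key, operator, value) triples,
                 where operator is "Equal" or "Exists"
    """
    def matches(toleration, taint):
        key, operator, value = toleration
        return key == taint[0] and (operator == "Exists" or value == taint[1])

    return [t for t in taints if not any(matches(tol, t) for tol in tolerations)]


def lint(node_taints, daemon_tolerations):
    """Map each daemon to the taints that would block it; empty means OK."""
    report = {}
    for name, tolerations in daemon_tolerations.items():
        missing = untolerated(node_taints, tolerations)
        if missing:
            report[name] = missing
    return report


# Hypothetical names throughout.
NODE_TAINTS = [
    ("example.com/node-not-ready", "true"),  # removed once daemons are up
    ("example.com/maintenance", "zones"),    # newly introduced taint
]
DAEMON_TOLERATIONS = {
    "registry-cache": [("example.com/node-not-ready", "Equal", "true")],
}

print(lint(NODE_TAINTS, DAEMON_TOLERATIONS))
```

A non-empty report for a daemon that gates node readiness would flag exactly the kind of chain described above: the daemon cannot start, so the readiness taint stays in place, so game servers stay pending.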
That specific daemon is a container image registry cache that runs on game server clusters to speed up game server image pulls: other nodes in the cluster may already be running the specific image revision and have the respective layers much closer to the place where the game server should run. This saves roundtrips to the upstream image registries, which might incur high latency and potentially costs, depending on the size of the image layers. A game server cluster in Australia or Japan pulling images from a registry in Europe will benefit substantially from the cache’s locality through improved game server startup time.
The cache, however, is an optimistic add-on, i.e., it is not guaranteed to retain all layers indefinitely, because its storage is still coupled to its host’s image storage: if the host runs out of disk storage due to having to store too many image layers, layers are pruned from the cache. Moreover, in environments like cloud clusters, where nodes dynamically scale up and down, the lifetime of a node’s cache is coupled to the lifetime of the node. Nodes scaling up and down dynamically will reduce the effectiveness of the cache in proportion to the churn rate. Conversely, the longer a node remains in a cluster, the more efficiently it provides cache hits for image pulls across nodes. The cache is therefore most effective on (static) baremetal clusters, or cloud clusters with a non-zero resting footprint of game server nodes.
The internal platform maintenance introduced a configuration change to the zones that game server nodes are allowed to run in, which propagated to the GCP API. The effective change was nil: previously, leaving the list of zones empty defaulted to all zones, whereas now we were explicitly setting the zones to prevent a singular, nasty edge case in GCP regions with more zones than our clusters there support. Nevertheless, the change spawned a new GCE instance template with virtually no changes. GKE then promoted the new instance template and replaced all nodes to use it. That churn replaced all the nodes that were already running, and the new ones starting up were blocked by the conflicting constraints explained before, preventing new game servers from being scheduled.
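The gap between "no effective change" and "no change" can be illustrated with a short sketch. It assumes a hypothetical region with three placeholder zones and uses a hash of the raw specification as a stand-in for whatever identifies a template; it does not describe how GCP or GKE actually compute instance templates.

```python
import hashlib
import json

# Hypothetical zone names for one region; placeholders only.
ALL_ZONES = ["zone-a", "zone-b", "zone-c"]


def effective_zones(zones):
    """An empty list means 'all zones', matching the previous default."""
    return sorted(zones) if zones else sorted(ALL_ZONES)


def fingerprint(spec):
    """Hash of the raw spec, standing in for a template's identity."""
    raw = json.dumps(spec, sort_keys=True).encode()
    return hashlib.sha256(raw).hexdigest()


# Hypothetical node pool specs before and after the maintenance change.
old_spec = {"machine_type": "example-type", "zones": []}
new_spec = {"machine_type": "example-type", "zones": ["zone-a", "zone-b", "zone-c"]}

# The effective placement is identical ...
same_effect = effective_zones(old_spec["zones"]) == effective_zones(new_spec["zones"])
# ... but the raw specs differ, so anything keyed on the raw spec sees a change.
same_raw = fingerprint(old_spec) == fingerprint(new_spec)

print(same_effect, same_raw)
```

Comparing normalized, effective configuration during review is one way to spot changes that look like no-ops to a human but not to the platform.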
Additionally, the node churn diminished the image cache’s effectiveness to practically zero. Once the scheduling issues were resolved, the image cache misses meant that starting game servers had to pull their images from the upstream registries, which, in some locations, took visibly longer than usual, also due to thundering-herd effects. How quickly the cache warmed up depended on the number of game server replicas, i.e., single-replica (development) Armadas generally took longer than large-scale production Armadas, because the probability of other nodes already having pulled the image rises with the number of game servers running that specific image. At around 17:00 UTC, most game servers had returned to operating as usual from our vantage point.
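The relationship between replica count and cache warm-up can be made concrete with a deliberately crude toy model. It assumes that all replicas of an Armada share one image, that each replica lands on a different, cold node, and that replicas start in waves of a fixed size: the first wave finds an empty cache and pulls from upstream (the thundering herd), while every later wave is served by peers. The function name, the wave size, and the model itself are illustrative assumptions, not measurements from the incident.

```python
def upstream_pull_fraction(replicas: int, wave_size: int) -> float:
    """Fraction of replicas that must pull from the upstream registry.

    Toy model: only the first wave of concurrent starts misses the cache;
    all later starts find the image on a peer node.
    """
    if replicas <= 0 or wave_size <= 0:
        raise ValueError("replicas and wave_size must be positive")
    cold_pulls = min(replicas, wave_size)
    return cold_pulls / replicas


# A single-replica Armada always pays the full upstream pull.
print(upstream_pull_fraction(replicas=1, wave_size=5))
# A large Armada amortizes the first cold wave over many replicas.
print(upstream_pull_fraction(replicas=100, wave_size=5))
```

Under these assumptions, the share of slow upstream pulls shrinks as the replica count grows, which matches the observation that development Armadas recovered more slowly per game server than large production Armadas.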
This issue only affected cloud clusters; baremetal clusters were not affected. We’ve taken immediate countermeasures to ensure such convoluted scheduling violations do not recur, and are working on improving our resilience to such failure modes and on alerting engineers to such complex errors.