Scaling Kubernetes to 7,500 Nodes


We have scaled Kubernetes clusters to 7,500 nodes, producing a scalable infrastructure for large models like GPT-3, CLIP, and DALL·E, but also for rapid small-scale iterative research such as Scaling Laws for Neural Language Models. Scaling a single Kubernetes cluster to this size is rarely done and requires some special care, but the upside is a simple infrastructure that allows our machine learning research teams to move faster and scale up without changing their code.

Since our last post on Scaling to 2,500 Nodes we've continued to grow our infrastructure to meet researcher needs, in the process learning many additional lessons. This post summarizes those lessons so that others in the Kubernetes community can benefit from them, and ends with problems we still face that we'll be tackling next.

Our Workload

Before we get too far, it's important to describe our workload. The applications and hardware we run with Kubernetes are quite different from what you may encounter at a typical company. Our problems and corresponding solutions may, or may not, be a good match for your own setup!

A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to directly communicate with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node. NUMA, CPU, and PCIe resource contention aren't factors for scheduling. Bin-packing and fragmentation are not common problems. Our current clusters have full bisection bandwidth, so we also don't make any rack or network topology considerations. All of this means that, while we have many nodes, there's relatively low strain on the scheduler.

That said, strain on the kube-scheduler is spiky. A new job may consist of many hundreds of pods all being created at once, then return to a relatively low rate of churn.

Our biggest jobs run MPI, and all pods within the job participate in a single MPI communicator. If any of the participating pods die, the entire job halts and needs to be restarted. The job checkpoints regularly, and when restarted it resumes from the last checkpoint. Thus we consider the pods to be semi-stateful—killed pods can be replaced and work can continue, but doing so is disruptive and should be kept to a minimum.

We don't rely on Kubernetes load balancing all that much. We have very little HTTPS traffic, with no need for A/B testing, blue/green, or canaries. Pods communicate directly with one another on their pod IP addresses with MPI via SSH, not service endpoints. Service "discovery" is limited; we just do a one-time lookup for which pods are participating in MPI at job startup time.

Most jobs interact with some form of blob storage. They usually either stream some shards of a dataset or checkpoint directly from blob storage, or cache it to a fast local ephemeral disk. We have a few PersistentVolumes for cases where POSIX semantics are useful, but blob storage is far more scalable and doesn't require slow detach/attach operations.

Lastly, the nature of our work is fundamentally research, which means the workloads themselves are ever-changing. While the Supercomputing team strives to provide what we'd consider a "production" quality level of compute infrastructure, the applications that run on that cluster are short-lived and their developers iterate quickly. New usage patterns may emerge at any time that challenge our assumptions about trends and appropriate tradeoffs. We need a sustainable system that also allows us to respond quickly when things change.


Networking

As the number of nodes and pods within our clusters increased, we found that Flannel had difficulties scaling up the throughput required. We switched to using the native pod networking technologies for our IP configurations for Azure VMSSes and the associated CNI plugins. This allowed us to get host-level network throughput on our pods.

Another reason we've switched to using alias-based IP addressing is that on our largest clusters, we could have approximately 200,000 IP addresses in use at any one time. When we tested route-based pod networking, we found there were significant limitations in the number of routes we could effectively use.

Avoiding encapsulation increases the demands on the underlying SDN or routing engine, but it keeps our networking setup simple. Adding VPN or tunneling can be done without any additional adapters. We don't need to worry about packet fragmentation due to some portion of the network having a lower MTU. Network policies and traffic monitoring are straightforward; there's no ambiguity about the source and destination of packets.

We use iptables tagging on the host to track network resource usage per Namespace and pod. This lets researchers visualize their network usage patterns. In particular, since a lot of our experiments have distinct Internet and intra-pod communication patterns, it's often useful to be able to investigate where any bottlenecks might be occurring.

iptables mangle rules can be used to arbitrarily mark packets that match particular criteria. Here are our rules to detect whether traffic is internal or internet-bound. The FORWARD rules cover traffic from pods, vs INPUT and OUTPUT traffic from the host. The internal CIDR was elided from the original rules; 10.0.0.0/8 stands in here as a placeholder for the cluster's internal range:

iptables -t mangle -A INPUT ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A FORWARD ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A OUTPUT ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"
iptables -t mangle -A FORWARD ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"

Once marked, iptables will start counters to track the number of bytes and packets that match the rule. You can eyeball these counters by using iptables itself:

% iptables -t mangle -L -v
Chain FORWARD (policy ACCEPT 50M packets, 334G bytes)
 pkts bytes target     prot opt in     out     source               destination
1253K  555M            all  --  any    any     anywhere            !10.0.0.0/8           /* iptables-exporter openai traffic=internet-out */
1161K 7937M            all  --  any    any    !10.0.0.0/8           anywhere             /* iptables-exporter openai traffic=internet-in */

We use an open-source Prometheus exporter called iptables-exporter to get these counters tracked into our monitoring system. This is a simple way to track packets matching a variety of different types of conditions.

One somewhat unique aspect of our network model is that we fully expose the node, pod, and service network CIDR ranges to our researchers. We have a hub-and-spoke network model, and use the native node and pod CIDR ranges to route that traffic. Researchers connect to the hub, and from there have access to any of the individual clusters (the spokes). But the clusters themselves cannot talk to one another. This ensures that clusters remain isolated, with no cross-cluster dependencies that could break failure isolation.

We use a "NAT" host to translate the service network CIDR range for traffic coming from outside of the cluster. This setup gives our researchers significant flexibility in the kinds of network configurations they can choose for their experiments.

API Servers

Kubernetes API Servers and etcd are critical components of a healthy working cluster, so we pay special attention to the stress on these systems. We use the Grafana dashboards provided by kube-prometheus, as well as additional in-house dashboards. We've found it useful to alert on the rate of HTTP status 429 (Too Many Requests) and 5xx (Server Error) responses from the API Servers as a high-level signal of problems.
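As a sketch of what such an alert might look like, here is a Prometheus alerting rule on the standard `apiserver_request_total` metric exposed by kube-apiserver. The threshold and durations are illustrative assumptions, not our production values:

```yaml
# Hypothetical Prometheus alerting rule: fire when API servers return
# a sustained rate of 429 or 5xx responses. Threshold is illustrative.
groups:
  - name: apiserver-health
    rules:
      - alert: APIServerErrorRateHigh
        expr: sum(rate(apiserver_request_total{code=~"429|5.."}[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API servers returning elevated 429/5xx responses"
```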

While some folks run API Servers within kube, we've always run them outside the cluster itself. Both etcd and the API servers run on their own dedicated nodes. Our largest clusters run 5 API servers and 5 etcd nodes to spread the load and minimize impact if one were ever to go down. We've had no notable trouble with etcd since splitting out Kubernetes Events into their own etcd cluster back in our last blog post. API Servers are stateless and generally easy to run in a self-healing instance group or scaleset. We haven't yet tried to build any self-healing automation of etcd clusters because incidents have been extremely rare.

API Servers can take up a fair bit of memory, and that tends to scale linearly with the number of nodes in the cluster. For our cluster with 7,500 nodes we observe up to 70GB of heap being used per API Server, so fortunately this should continue to be well within hardware capabilities into the future.

One big strain on API Servers was WATCHes on Endpoints. There are a few services, such as 'kubelet' and 'node-exporter', of which every node in the cluster is a member. When a node was added to or removed from the cluster, this WATCH would fire. And because typically each node itself was watching the kubelet service via kube-proxy, the number of responses and bandwidth required would be $N^2$ and enormous, occasionally 1GB/s or more. EndpointSlices, launched in Kubernetes 1.17, were a huge benefit that brought this load down 1000x.

In general, we're very mindful of any API Server requests that scale with the size of the cluster. We try to avoid having any DaemonSets interact with the API Server. In cases where you do need each node to watch for changes, introducing an intermediary caching service, such as the Datadog Cluster Agent, seems to be a good pattern to avoid cluster-wide bottlenecks.

As our clusters have grown, we do less actual autoscaling of our clusters. But we have occasionally run into trouble when autoscaling too much at once. Many requests are generated when a new node joins a cluster, and adding hundreds of nodes at once can overload API server capacity. Smoothing this out, even just by a few seconds, has helped avoid outages.

Time-Series Metrics with Prometheus and Grafana

We use Prometheus to collect time-series metrics and Grafana for graphs, dashboards, and alerts. We started with a deployment of kube-prometheus, which collects a wide variety of metrics and includes good dashboards for visualization. Over time we've added many of our own dashboards, metrics, and alerts.

As we added more and more nodes, we struggled with the sheer quantity of metrics being collected by Prometheus. While kube-prometheus exposes a lot of useful data, some of it we weren't actually ever looking at, and some was just too granular to collect, store, and query effectively. We use Prometheus rules to "drop" some of these metrics from being ingested.
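As a minimal sketch of this dropping technique, a `metric_relabel_configs` block in a scrape config can discard series before ingestion. The job name and metric-name regex here are illustrative, not our actual drop list:

```yaml
# Drop selected high-cardinality series at scrape time, before they are
# written to the TSDB. Goes under a scrape_config in prometheus.yml.
scrape_configs:
  - job_name: kubelet-cadvisor
    metric_relabel_configs:
      # Discard any series whose metric name matches the regex.
      - source_labels: [__name__]
        regex: "container_(network_tcp_usage_total|tasks_state)"
        action: drop
```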

For a while we struggled with a problem where Prometheus would consume more and more memory until eventually crashing the container with an Out-Of-Memory (OOM) error. This seemed to occur even after throwing enormous amounts of memory capacity at the application. What's worse, when it did crash, it would take many hours on startup replaying write-ahead-log files before it was usable again.

Eventually we tracked down the source of these OOMs to an interaction between Grafana and Prometheus, where Grafana would use the /api/v1/series API on Prometheus with a query of {le!=""} (basically, "give me all the histogram metrics"). The implementation of /api/v1/series was unbounded in both time and space—for a query with a lot of results, it would continue to consume ever-more memory and time. It also continues to grow even after the requester has given up and closed the connection. For us, there was never enough memory, and Prometheus would eventually crash. We patched Prometheus to contain this API within a Context to enforce a timeout, which fixed it entirely.

While Prometheus crashed far less often, in the times when we did need to restart it, WAL replay remained an issue. It would often take many hours to replay through all the WAL logs before Prometheus was up, collecting new metrics, and servicing queries. With help from Robust Perception, we found that applying GOMAXPROCS=24 made a big improvement. Prometheus tries to use all cores during WAL replay, and for servers with a large number of cores, the contention kills all performance.
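If Prometheus runs as a pod, the GOMAXPROCS cap can be applied via an environment variable in the container spec. This is a sketch, not our actual manifest; the image tag is illustrative, and the value 24 comes from the text above (the right number depends on your hardware):

```yaml
# Cap the Go runtime to 24 OS threads so WAL replay doesn't thrash
# on many-core machines.
containers:
  - name: prometheus
    image: prom/prometheus:v2.26.0
    env:
      - name: GOMAXPROCS
        value: "24"
```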

We're exploring new options to increase our monitoring capacity, described in the "Unsolved Problems" section below.


Healthchecks

With a cluster this large, we of course rely on automation to detect and remove misbehaving nodes from the cluster. Over time we have built up a number of healthcheck systems.

Passive Healthchecks

Some healthchecks are passive, always running on all nodes. These monitor basic system resources such as network reachability, bad or full disks, or GPU errors. GPUs exhibit problems in a number of different ways, but an easy common one is an "Uncorrectable ECC error." Nvidia's Data Center GPU Manager (DCGM) tools make it easy to query for this and a number of other "Xid" errors. One way we track these errors is via dcgm-exporter, which ingests the metrics into Prometheus, our monitoring system. This appears as the DCGM_FI_DEV_XID_ERRORS metric, set to the error code that has most recently occurred. Additionally, the NVML Device Query API exposes more detailed information about the health and operation of a GPU.
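One way to act on that metric is a Prometheus alerting rule. This is a hedged sketch, assuming dcgm-exporter is being scraped; the label names in the annotation and the severity are assumptions about a typical dcgm-exporter setup:

```yaml
# Fire whenever a GPU reports any nonzero Xid error code via dcgm-exporter.
groups:
  - name: gpu-health
    rules:
      - alert: GPUXidError
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} reported Xid error code {{ $value }}"
```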

Once we detect an error, it can often be fixed by resetting the GPU or system, though in some cases it does lead to the underlying GPU needing to be physically replaced.

Another form of healthcheck tracks maintenance events from the upstream cloud provider. Each of the major cloud providers exposes a way to know if the current VM is due for an upcoming maintenance event that will eventually cause a disruption. The VM may need to be rebooted so an underlying hypervisor patch can be applied, or the physical node swapped out for other hardware.

These passive healthchecks run constantly in the background on all nodes. If a healthcheck starts failing, the node is automatically cordoned so no new pods will be scheduled on it. For more serious healthcheck failures, we will also attempt a pod eviction to request that all currently-running pods exit immediately. It's still up to the pod itself, configurable via a Pod Disruption Budget, to decide whether it wants to allow this eviction to occur. Eventually, either after all pods have terminated or 7 days have elapsed (part of our SLA), we will forcibly terminate the VM.
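The Pod Disruption Budget mentioned above might look like the following sketch. The name, selector label, and the choice of refusing all voluntary evictions are illustrative:

```yaml
# Refuse voluntary evictions for a hypothetical training job's pods,
# forcing the 7-day SLA timer rather than immediate disruption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: training-job-pdb
spec:
  maxUnavailable: 0          # block all voluntary evictions
  selector:
    matchLabels:
      job-name: my-training-job
```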

Active GPU Tests

Unfortunately, not all GPU problems manifest as error codes visible through DCGM. We've built up our own library of tests that exercise GPUs to catch additional problems and ensure that the hardware and driver are behaving as expected. These tests can't be run in the background—they require exclusive use of a GPU for several seconds or minutes to run.

We first run these tests on nodes upon boot, in a system we call "preflight." All nodes join the cluster with a "preflight" taint and label applied. This taint prevents normal pods from being scheduled on the node. A DaemonSet is configured to run preflight test pods on all nodes with this label. Upon successful completion of the test, the test itself removes the taint and label, and the node is then available for general use.
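The preflight pattern can be sketched as a DaemonSet that tolerates the taint keeping everyone else off. The taint key, label, and image here are hypothetical—the post doesn't give the real names:

```yaml
# DaemonSet that lands only on freshly-booted, still-tainted nodes and
# runs the GPU test suite. The test pod itself clears the taint on success.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: preflight-tests
spec:
  selector:
    matchLabels:
      app: preflight-tests
  template:
    metadata:
      labels:
        app: preflight-tests
    spec:
      nodeSelector:
        preflight: "pending"       # only not-yet-verified nodes
      tolerations:
        - key: preflight
          operator: Exists
          effect: NoSchedule       # tolerate the taint normal pods can't
      containers:
        - name: gpu-tests
          image: example.com/gpu-preflight:latest
```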

We also run these tests periodically during the lifetime of a node. We run this as a CronJob, allowing it to land on any available node in the cluster. This is admittedly a bit random and uncontrolled about which nodes get tested, but we've found that over time it provides sufficient coverage with minimal coordination or disruption.

Quotas & Resource Usage

As we scaled up our clusters, researchers started to find themselves having difficulty getting all of the capacity they were allocated. Traditional job scheduling systems have a lot of different features available to fairly run work between competing teams, which Kubernetes does not have. Over time, we took inspiration from those job scheduling systems and built several capabilities in a Kubernetes-native way.

Team Taints

We have a service in each cluster, "team-resource-manager," that has several functions. Its data source is a ConfigMap that specifies tuples of (node selector, team label to apply, allocation amount) for all of the research teams that have capacity in a given cluster. It reconciles this with the current nodes in the cluster, tainting the appropriate number of nodes with openai.com/team=teamname:NoSchedule.
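A ConfigMap along these lines might hold the tuples described above. The schema, key names, and team names here are all invented for illustration—the post doesn't specify the real format:

```yaml
# Hypothetical data source for team-resource-manager: per-team node
# selectors, the team taint/label to apply, and how many nodes each
# team is allocated.
apiVersion: v1
kind: ConfigMap
metadata:
  name: team-resource-manager
data:
  allocations.yaml: |
    - nodeSelector: "node.kubernetes.io/instance-type=ND96asr_v4"
      teamLabel: "openai.com/team=alpha"
      allocation: 64
    - nodeSelector: "node.kubernetes.io/instance-type=ND96asr_v4"
      teamLabel: "openai.com/team=beta"
      allocation: 32
```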

"team-resource-manager" also has an admission webhook service, so that as each job is submitted, a corresponding toleration is applied based on the submitter's team membership. Using taints allows us to constrain the Kubernetes pod scheduler flexibly, such as allowing an "any" toleration for lower-priority pods, which lets teams borrow each other's capacity without requiring heavyweight coordination.
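The tolerations such a webhook might inject could look like this sketch (the taint key and team name are hypothetical, matching the assumed taint above):

```yaml
# A team-specific toleration pins a pod to its own team's nodes; the
# commented-out wildcard form lets a low-priority pod borrow any
# team's idle capacity.
tolerations:
  - key: "openai.com/team"
    operator: Equal
    value: "alpha"
    effect: NoSchedule
  # For low-priority, preemptible work:
  # - key: "openai.com/team"
  #   operator: Exists
  #   effect: NoSchedule
```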

CPU & GPU Balloons

In addition to using cluster-autoscaler to dynamically scale our VM-backed clusters, we use it to remediate (remove and re-add) unhealthy members within the cluster. We do this by setting the "min size" of the cluster to zero, and the "max size" of the cluster to the capacity available. However, cluster-autoscaler, if it sees idle nodes, will attempt to scale down to only the needed capacity. For multiple reasons (VM spin-up latency, pre-allocated costs, the API server impacts mentioned above) this idle-scaling isn't ideal.

So, we introduced a balloon Deployment for both our CPU-only and GPU hosts. This Deployment contains a ReplicaSet with "max size" number of low-priority pods. These pods occupy resources within a node, so the autoscaler doesn't consider them idle. But since they're low priority, the scheduler can evict them immediately to make room for actual work. (We chose to use a Deployment instead of a DaemonSet, to avoid the DaemonSet being considered idle workload on a node.)

One thing of note: we use pod anti-affinity to ensure the pods distribute evenly across the nodes. Earlier versions of the Kubernetes scheduler had an $O(N^2)$ performance issue with pod anti-affinity. This has been corrected since Kubernetes 1.18.
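Putting the pieces together, a balloon setup might look like the following sketch. The names, replica count, and GPU request are illustrative assumptions; a negative PriorityClass value puts these pods first in line for preemption, and the anti-affinity spreads one balloon per node:

```yaml
# A priority class lower than any real workload.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon
value: -1000
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-balloon
spec:
  replicas: 100              # the node group's "max size" (illustrative)
  selector:
    matchLabels:
      app: gpu-balloon
  template:
    metadata:
      labels:
        app: gpu-balloon
    spec:
      priorityClassName: balloon
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: gpu-balloon
              topologyKey: kubernetes.io/hostname
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              nvidia.com/gpu: 8   # occupy a full node's GPUs (illustrative)
```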

Gang Scheduling

Our experiments often involve one or more StatefulSets, each operating a different portion of the training effort. For optimizers, researchers need all members of the StatefulSet to be scheduled before any training can be done (as we often use MPI to coordinate between optimizer members, and MPI is sensitive to group membership changes).

However, Kubernetes by default won't necessarily prioritize fulfilling all requests from one StatefulSet over another. For example, if two experiments each requested 100% of the cluster's capacity, instead of scheduling all of one experiment or the other, Kubernetes might schedule only half of each experiment's pods, leading to a deadlock where neither experiment can make progress.

We tried a few things that required a custom scheduler, but ran into edge cases that caused conflicts with how normal pods were scheduled. Kubernetes 1.18 introduced a plugin architecture for the core Kubernetes scheduler, making it much easier to add features like this natively. We recently landed on the Coscheduling plugin as a good way to solve this problem.
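With the upstream scheduler-plugins project's Coscheduling plugin, gang scheduling is expressed via a PodGroup that declares how many pods must be schedulable together. This sketch follows one version of that project's API; the group/version, label key, and counts may differ in other releases and are assumptions here:

```yaml
# All-or-nothing scheduling: no pod in the group is placed unless
# minMember pods can all be scheduled.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: experiment-a
spec:
  minMember: 128
---
# Member pods opt in via a label in their metadata, e.g.:
# labels:
#   scheduling.x-k8s.io/pod-group: experiment-a
```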

Unsolved Problems

There are many problems still to address as we scale up our Kubernetes clusters. A few of them include:


Metrics

At our scale we've had many difficulties with Prometheus's built-in TSDB storage engine being slow to compact, and with the long times needed to replay the WAL (Write-Ahead Log) any time it restarts. Queries also tend to result in "query processing would load too many samples" errors. We're in the process of migrating to a different Prometheus-compatible storage and query engine. Look forward to a future blog post about how it goes!

Pod Network Traffic Shaping

As we scale up our clusters, each pod is calculated to have a certain amount of Internet bandwidth available. The aggregate Internet bandwidth requirements per person have become substantial, and our researchers now have the ability to unintentionally put a significant resource strain on other locations on the Internet, such as datasets for download and software packages to install.


Conclusion

We've found Kubernetes to be an exceptionally flexible platform for our research needs. It has the ability to scale up to meet the most demanding workloads we've put on it. There are many areas where it still needs improvement, though, and the Supercomputing team at OpenAI will continue to explore how Kubernetes can scale. If this kind of work seems interesting, you should consider applying at OpenAI!

