General and Scalable Parallelization for Neural Networks


Scaling neural networks, whether it is the amount of training data used, the model size, or the computation being employed, has been critical for improving model quality in many real-world machine learning applications, such as computer vision, language understanding, and neural machine translation. This, in turn, has motivated recent studies to scrutinize the factors that play a critical role in the success of scaling a neural model. Although increasing model capacity can be a sound approach to improve model quality, doing so presents a number of systems and software engineering challenges that must be overcome. For instance, in order to train large models that exceed the memory capacity of an accelerator, it becomes necessary to partition the weights and the computation of the model across multiple accelerators. This process of parallelization increases the network communication overhead and can result in device under-utilization. Moreover, a given algorithm for parallelization, which typically requires a significant amount of engineering effort, may not work with different model architectures.

To address these scaling challenges, we present "GSPMD: General and Scalable Parallelization for ML Computation Graphs", in which we describe an open-source automatic parallelization system based on the XLA compiler. GSPMD is capable of scaling most deep learning network architectures and has already been applied to many deep learning models, such as GShard-M4, LaMDA, BigSSL, ViT, and MetNet-2, leading to state-of-the-art results across several domains. GSPMD has also been integrated into multiple ML frameworks, including TensorFlow and JAX, which use XLA as a shared compiler.


GSPMD separates the task of programming an ML model from the challenge of parallelization. It allows model developers to write programs as if they were run on a single device with very high memory and computation capacity — the user only needs to add a few lines of annotation code to a subset of critical tensors in the model code to indicate how to partition the tensors. For example, to train a large model-parallel Transformer, one may only need to annotate fewer than 10 tensors (less than 1% of all tensors in the entire computation graph), at one line of additional code per tensor. GSPMD then runs a compiler pass that determines the parallelization plan for the entire graph and transforms it into a mathematically equivalent, parallelized computation that can be executed on each device. This allows users to focus on model building instead of parallelization implementation, and enables easy porting of existing single-device programs to run at a much larger scale.
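To make the "few lines of annotation" idea concrete, here is a minimal pure-Python sketch of what a partitioning annotation conveys: a mapping from tensor dimensions to mesh axes. The `mesh_split` helper and all names below are hypothetical stand-ins, not GSPMD's actual API (in JAX, for example, this role is played by sharding constraints attached to real arrays).

```python
# Hypothetical stand-in for a GSPMD-style sharding annotation.
# Real frameworks attach this metadata to tensors inside the computation
# graph; here we simply record it in a dict for illustration.

def mesh_split(tensor_name, tensor_dims, mesh_axes, annotations):
    """Annotate one tensor: map each tensor dimension to a mesh axis (or None)."""
    assert len(tensor_dims) == len(mesh_axes)
    annotations[tensor_name] = dict(zip(tensor_dims, mesh_axes))
    return annotations

annotations = {}
# Partition a feedforward weight [hidden, ffn] over a 2D mesh ("x", "y"):
mesh_split("ffn_w1", ("hidden", "ffn"), ("x", "y"), annotations)
# Activations [batch, hidden]: batch sharded over "x", hidden over "y".
mesh_split("activation", ("batch", "hidden"), ("x", "y"), annotations)

print(annotations["ffn_w1"])  # {'hidden': 'x', 'ffn': 'y'}
```

One such line per annotated tensor is all the user supplies; the compiler pass propagates the partitioning decisions to every other tensor in the graph.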

The separation of model programming and parallelism also allows developers to minimize code duplication. With GSPMD, developers may employ different parallelism algorithms for different use cases without the need to reimplement the model. For example, the model code that powered the GShard-M4 and LaMDA models can apply a variety of parallelization strategies appropriate for different models and cluster sizes with the same model implementation. Similarly, by applying GSPMD, the BigSSL large speech models can share the same implementation with earlier, smaller models.

Generality and Flexibility

Because different model architectures may be better suited to different parallelization strategies, GSPMD is designed to support a large variety of parallelism algorithms appropriate for different use cases. For example, with smaller models that fit within the memory of a single accelerator, data parallelism is preferred, in which devices train the same model using different input data. In contrast, models that are larger than a single accelerator's memory capacity are better suited for a pipelining algorithm (like that employed by GPipe) that partitions the model into multiple, sequential stages, or for operator-level parallelism (e.g., Mesh-TensorFlow), in which individual computation operators in the model are split into smaller, parallel operators.
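The memory trade-off behind this choice can be sketched with a back-of-the-envelope calculation (all numbers below are illustrative, not measurements): data parallelism replicates every weight on every device, while operator-level parallelism divides the weights, so only the latter reduces per-device weight memory as the cluster grows.

```python
def per_device_weight_bytes(param_count, bytes_per_param, num_devices, strategy):
    """Illustrative per-device weight memory under two parallelism strategies."""
    total = param_count * bytes_per_param
    if strategy == "data":       # data parallelism: full replica on every device
        return total
    if strategy == "operator":   # operator-level: weights evenly sharded
        return total // num_devices
    raise ValueError(strategy)

params = 2_000_000_000  # a hypothetical 2B-parameter model, 4 bytes per weight
print(per_device_weight_bytes(params, 4, 64, "data") / 1e9)      # 8.0 (GB/device)
print(per_device_weight_bytes(params, 4, 64, "operator") / 1e9)  # 0.125 (GB/device)
```

Under these toy numbers, the replicated model needs 8 GB of weight memory on every device regardless of cluster size, while the sharded version needs only 1/64 of that — which is why larger-than-memory models push toward pipelining or operator-level partitioning.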

GSPMD supports all of the above parallelization algorithms with a uniform abstraction and implementation. Moreover, GSPMD supports nested patterns of parallelism. For example, it can be used to partition models into individual pipeline stages, each of which can be further partitioned using operator-level parallelism.

GSPMD also facilitates innovation on parallelism algorithms by allowing performance experts to focus on algorithms that best utilize the hardware, instead of on implementations that involve many cross-device communications. For example, for large Transformer models, we found a novel operator-level parallelism algorithm that partitions multiple dimensions of tensors on a 2D mesh of devices. It reduces peak accelerator memory usage linearly with the number of training devices, while maintaining high utilization of accelerator compute due to its balanced data distribution over multiple dimensions.
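The linear memory reduction follows directly from the shard arithmetic: splitting dimension i of a tensor over a mesh axis of size m_i shrinks the shard by a factor of the product of the mesh axis sizes, i.e., the total device count. A small sketch (tensor shapes are hypothetical):

```python
def shard_shape(shape, mesh):
    """Shape of one shard when dim i of `shape` is split over mesh axis size mesh[i]."""
    assert all(d % m == 0 for d, m in zip(shape, mesh))
    return tuple(d // m for d, m in zip(shape, mesh))

full = (8192, 32768)  # hypothetical feedforward weight [hidden, ffn]
for mesh in [(1, 1), (2, 2), (4, 4), (8, 8)]:
    devices = mesh[0] * mesh[1]
    shard = shard_shape(full, mesh)
    fraction = (shard[0] * shard[1]) / (full[0] * full[1])
    print(devices, shard, fraction)  # shard fraction = 1 / devices
```

With both dimensions sharded, a 4×4 mesh leaves each device holding 1/16 of the tensor; doubling each mesh axis quarters the shard again, so peak per-device memory falls linearly in the total number of devices.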

To illustrate this, consider a simplified feedforward layer in a Transformer model that has been annotated in the above way. To execute the first matrix multiply on fully partitioned input data, GSPMD applies an MPI-style AllGather communication operator to partially merge partitioned data from the other devices. It then executes the matrix multiply locally and produces a partitioned result. Before the second matrix multiply, GSPMD adds another AllGather on the right-hand-side input and executes the matrix multiply locally, yielding intermediate results that then need to be combined and re-partitioned. For this, GSPMD adds an MPI-style ReduceScatter communication operator that accumulates and partitions these intermediate results. While the tensors generated by the AllGather operator at each stage are larger than the original partition size, they are short-lived and the corresponding memory buffers are freed after use, so they do not affect peak memory usage in training.
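The AllGather/ReduceScatter pattern can be checked on toy matrices in a single-process simulation (no real communication; shapes and the two-device setup are illustrative). Each "device" holds one column of the input and the matching row of the weight, computes a full-size partial product locally, and ReduceScatter sums and re-shards the partials — reproducing exactly the unpartitioned matmul:

```python
# Single-process simulation of AllGather and ReduceScatter for a sharded
# matrix multiply y = x @ w, with the contracting dimension split over 2 devices.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def all_gather(shards):
    # Concatenate row-shards from all devices into the full tensor.
    return [row for shard in shards for row in shard]

def reduce_scatter(partials):
    # Sum full-size partial results across devices, then give each
    # device one row-shard of the sum.
    rows, cols, n = len(partials[0]), len(partials[0][0]), len(partials)
    summed = [[sum(p[i][j] for p in partials) for j in range(cols)]
              for i in range(rows)]
    per = rows // n
    return [summed[d * per:(d + 1) * per] for d in range(n)]

x = [[1, 2], [3, 4]]
w = [[5, 6], [7, 8]]
# Device d holds column d of x and row d of w (contracting dim sharded).
partials = [matmul([[x[0][d]], [x[1][d]]], [w[d]]) for d in range(2)]
y_shards = reduce_scatter(partials)          # each device ends with one row of y
assert all_gather(y_shards) == matmul(x, w)  # matches the unpartitioned result
print(all_gather(y_shards))                  # [[19, 22], [43, 50]]
```

Note that `partials` is the short-lived full-size buffer the text describes: it exists only between the local matmul and the ReduceScatter, after which each device again holds only a shard.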

Left: A simplified feedforward layer of a Transformer model. Blue rectangles represent tensors, with dashed red and blue lines overlaid representing the desired partitioning across a 2×2 mesh of devices. Right: A single partition, after GSPMD has been applied.

A Transformer Example with Nested Parallelism

As a shared, robust mechanism for different parallelism modes, GSPMD allows users to conveniently switch between modes in different parts of a model. This is particularly valuable for models that may have different components with distinct performance characteristics, for example, multimodal models that handle both images and audio. Consider a model with the Transformer encoder-decoder architecture, which has an embedding layer, an encoder stack with Mixture-of-Experts layers, a decoder stack with dense feedforward layers, and a final softmax layer. In GSPMD, a complex combination of several parallelism modes that treats each layer separately can be achieved with simple configurations.

In the figure below, we show a partitioning strategy over 16 devices organized as a logical 4×4 mesh. Blue represents partitioning along the first mesh dimension X, and yellow represents partitioning along the second mesh dimension Y. X and Y are repurposed for different model components to achieve different parallelism modes. For example, the X dimension is used for data parallelism in the embedding and softmax layers, but for pipeline parallelism in the encoder and decoder. The Y dimension is likewise used in different ways to partition the vocabulary, batch, or model expert dimensions.
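The idea of repurposing the same mesh axes per layer can be sketched as a simple table over device coordinates. The layer names and axis assignments below are a hypothetical illustration in the spirit of the figure, not GSPMD's actual configuration syntax:

```python
import itertools

# A logical 4x4 mesh: device d has coordinates (x, y).
mesh = list(itertools.product(range(4), range(4)))

# Hypothetical per-layer reuse of the same two mesh axes:
layer_axis_use = {
    "embedding": {"x": "data (batch)",   "y": "vocab"},
    "encoder":   {"x": "pipeline stage", "y": "expert"},
    "decoder":   {"x": "pipeline stage", "y": "model (feedforward)"},
    "softmax":   {"x": "data (batch)",   "y": "vocab"},
}

# E.g., in the encoder, device (x, y) runs pipeline stage x with expert shard y:
x, y = mesh[6]
print(x, y)  # 1 2  -> pipeline stage 1, expert shard 2
```

The key point is that no tensor or device moves between layers; only the interpretation of each mesh axis changes, which is what lets one configuration mix data, pipeline, and expert parallelism in a single model.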

Computation Efficiency

GSPMD provides industry-leading performance in large model training. Parallel models require extra communication to coordinate multiple devices to do the computation, so parallel model efficiency can be estimated by examining the fraction of time spent on communication overhead — the higher the percentage utilization and the less time spent on communication, the better. In the recent MLPerf set of performance benchmarks, a BERT-like encoder-only model with ~500 billion parameters, to which we applied GSPMD for parallelization over 2048 TPU-V4 chips, yielded highly competitive results (see table below), utilizing up to 63% of the peak FLOPS that the TPU-V4s offer. We also provide efficiency benchmarks for some representative large models in the table below. These example model configs are open sourced in the Lingvo framework, along with instructions to run them on Google Cloud. More benchmark results can be found in the experiment section of our paper.

| Model Family | Parameter Count | % of model activated* | No. of Experts** | No. of Layers | No. of TPU chips | FLOPS utilization |
|---|---|---|---|---|---|---|
| Dense Decoder (LaMDA) | 137B | 100% | 1 | 64 | 1024 TPUv3 | 56.5% |
| Dense Encoder (MLPerf-Bert) | 480B | 100% | 1 | 64 | 2048 TPUv4 | 63% |
| Sparsely Activated Encoder-Decoder (GShard-M4) | 577B | 0.25% | 2048 | 32 | 1024 TPUv3 | 46.8% |
| Sparsely Activated Decoder | 1.2T | 8% | 64 | 64 | 1024 TPUv3 | 53.8% |
*The fraction of the model activated during inference, which is a measure of model sparsity.

**Number of experts included in the Mixture-of-Experts layer. A value of 1 corresponds to a standard Transformer, without a Mixture-of-Experts layer.
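FLOPS utilization of the kind reported in the table is computed as achieved model FLOP/s divided by the cluster's aggregate peak FLOP/s. A small calculator makes the definition explicit; the step time, FLOP count, and per-chip peak below are made-up illustrative numbers, not figures from the paper:

```python
def flops_utilization(model_flops_per_step, step_time_s, num_chips, peak_flops_per_chip):
    """Fraction of the cluster's peak FLOP/s actually achieved."""
    achieved = model_flops_per_step / step_time_s
    return achieved / (num_chips * peak_flops_per_chip)

# Illustrative: a step performing 1e18 model FLOPs in 2 s on 1024 chips,
# each with a hypothetical peak of 1e15 FLOP/s:
print(flops_utilization(1e18, 2.0, 1024, 1e15))  # 0.48828125, i.e. ~48.8%
```

Communication overhead shows up in this metric through `step_time_s`: time spent waiting on AllGather/ReduceScatter lengthens the step without adding model FLOPs, which is why utilization falls as communication grows.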


The ongoing development and success of many useful machine learning applications, such as NLP, speech recognition, machine translation, and autonomous driving, depend on achieving the highest accuracy possible. As this often requires building larger and ever more complex models, we are happy to share the GSPMD paper and the corresponding open-source library with the broader research community, and we hope it is useful for efficient training of large-scale deep neural networks.


We would like to thank Claire Cui, Zhifeng Chen, Yonghui Wu, Naveen Kumar, Macduff Hughes, Zoubin Ghahramani and Jeff Dean for their support and invaluable input. Special thanks to our collaborators Dmitry Lepikhin, HyoukJoong Lee, Dehao Chen, Orhan Firat, Maxim Krikun, Blake Hechtman, Rahul Joshi, Andy Li, Tao Wang, Marcello Maggioni, David Majnemer, Noam Shazeer, Ankur Bapna, Sneha Kudugunta, Quoc Le, Mia Chen, Shibo Wang, Jinliang Wei, Ruoming Pang, Zongwei Zhou, David So, Yanqi Zhou, Ben Lee, Jonathan Shen, James Qin, Yu Zhang, Wei Han, Anmol Gulati, Laurent El Shafey, Andrew Dai, Kun Zhang, Nan Du, James Bradbury, Matthew Johnson, Anselm Levskaya, Skye Wanderman-Milne‎, and Qiao Zhang for helpful discussions and inspirations.

