Decentralized Training Looms
Collaborative training of foundation models is closer to actualization than broadly understood. The popular view that low-bandwidth node-to-node connections render it infeasible is incorrect.
Introduction
Decentralized training is the creation of foundation models over loosely connected, heterogeneous swarms of consumer-grade compute. If decentralized training is feasible, the trajectories of foundation model development, model governance, and AI safety research are all altered significantly. It is hard to overstate the impact. True decentralized governance of models created within protocols becomes feasible, model research is accelerated and no longer restricted to the large labs, the emerging oligopoly of foundation model providers is challenged, base model access is guaranteed, and a path is opened towards models several orders of magnitude larger than today's.
The consensus view today is that training foundation-scale models using truly decentralized methods is practically infeasible. Skeptics typically support this view by citing the communication intensity of training and the low-bandwidth connections between nodes that must be used in the decentralized case. This is a valid criticism when existing distributed training approaches are applied unaltered to a different hardware setup: such systems are designed to exploit the specific properties of the infrastructure they are deployed on. However, there is strong recent work in the academic literature demonstrating alterations to such systems that facilitate billion-parameter training within heterogeneous, low-bandwidth swarms of small-capacity nodes. If current distributed training methods are further altered with an explicit focus on low communication bandwidth, a multi-actor, fully decentralized training fabric appears possible.
In this post, we discuss the technical arguments against decentralized training in a volunteer setting. There is no core technical blocker prohibiting decentralized AI training – it is possible today in certain configurations, and there are numerous immediate research directions pointing toward a general implementation.
Myth 1: Low bandwidth interconnects make training too slow
The Steelman: The node-to-node communication intensity of training results in a massive slowdown when performed over the internet, with bandwidth ~100-1000x lower than what is available within a datacenter. Because distributed training methods become more communication intensive as the models get larger, this problem is especially pronounced for foundation models.
The TL;DR: There is no core technical blocker to communication-efficient training, early work has produced extremely promising results, and the research surface has barely been scratched. 1.3B-parameter GPT-style models have been trained over 200 Mb/s networks using pipeline-parallel variants with only a ~2x slowdown. If weight redundancy and extensive compression are introduced, the overhead is negligible. Today, 1000x slower interconnects have less of an effect than algorithmic innovations such as FlashAttention, and approximately the same effect as training on hardware one generation older. There are multiple immediate and appealing research directions to further reduce this gap.
Naïve application of distributed training methods such as FSDP or ZeRO-3 to swarm-based training results in massive slowdowns. This is to be expected, as these methods are designed to exploit the various levels of data transfer available within a datacenter. In a decentralized setup, nodes must communicate over the internet at ~200 Mb/s versus the ~500 Gb/s available within a datacenter. Rematerialization approaches invoke collective communication primitives such as reduce-scatter and all-gather on both the forward and backward passes. Consequently, training is blocked when bandwidth is low, device utilization decreases significantly, and throughput drops. Throughput is the primary metric; low per-device utilization is much less critical in the decentralized case, as low-capacity devices can be expected to be used for everyday tasks simultaneously.
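As a rough illustration of the gap, the sketch below estimates the per-step communication time implied by a fully sharded (ZeRO-3-style) setup at the two bandwidths quoted above. The ~3x-model-size traffic figure and all other numbers are back-of-envelope assumptions, not measurements.

```python
# Back-of-envelope estimate of per-step communication time for a fully sharded
# (ZeRO-3/FSDP-style) setup. All numbers are illustrative assumptions.

PARAMS = 1.3e9          # 1.3B-parameter model
BYTES_PER_PARAM = 2     # FP16/BF16 weights
# ZeRO-3 moves roughly 3x the model size per device per step:
# all-gather (forward) + all-gather (backward) + reduce-scatter (gradients).
bytes_per_step = 3 * PARAMS * BYTES_PER_PARAM

def comm_seconds(bandwidth_bits_per_s):
    return bytes_per_step * 8 / bandwidth_bits_per_s

datacenter = comm_seconds(500e9)  # ~500 Gb/s intra-datacenter
internet = comm_seconds(200e6)    # ~200 Mb/s internet-grade link

print(f"datacenter: {datacenter:.2f} s/step of pure communication")
print(f"internet:   {internet:.0f} s/step of pure communication")
# The ~2500x gap is why unmodified FSDP/ZeRO-3 stalls over the internet.
```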
The problem of expensive collective communication during training has been well studied in the DDP literature, with inner-outer optimization approaches such as Local SGD, Decentralized SGD, or, more recently, DiLoCo achieving similar convergence rates to their standard counterparts with significantly reduced communication requirements. This is achieved by allowing each device to perform multiple local updates prior to a global synchronization. The setting where each device has a full copy of the model and connections are internet-grade is also the basis of many federated learning settings, typically with additional constraints on privacy and the data distribution. Hence, low-bandwidth training of small-to-medium-sized models on reasonably sized nodes, or of large models on large nodes, is well understood today.
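To make the inner-outer idea concrete, here is a minimal Local SGD sketch (the basic mechanism underlying DiLoCo), simulating a handful of peers in a single process; the toy model, data, and hyperparameters are assumptions for illustration only.

```python
# Local SGD sketch: H local steps per peer, then one global synchronization,
# so communication happens once every H steps instead of every step.
import copy
import torch

torch.manual_seed(0)
global_model = torch.nn.Linear(32, 1)
workers = [copy.deepcopy(global_model) for _ in range(4)]   # 4 simulated peers
opts = [torch.optim.SGD(w.parameters(), lr=1e-2) for w in workers]

H = 50  # local steps between global synchronizations
for outer_round in range(10):
    # Inner phase: each peer trains locally with no communication at all.
    for w, opt in zip(workers, opts):
        for _ in range(H):
            x = torch.randn(16, 32)
            y = x.sum(dim=1, keepdim=True)          # toy regression target
            loss = torch.nn.functional.mse_loss(w(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Outer phase: a single all-reduce-style average of the weights.
    # (DiLoCo instead applies an outer Nesterov-momentum step to the average update.)
    with torch.no_grad():
        for name, p in global_model.named_parameters():
            p.copy_(torch.stack(
                [dict(w.named_parameters())[name] for w in workers]).mean(0))
        for w in workers:
            w.load_state_dict(global_model.state_dict())
```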
Extending these approaches to large models is where the research sits today. Currently, successful approaches fall into the pipeline-parallel category, as pipeline parallelism communicates point-to-point rather than via the collectives used in FSDP and similar methods, and is not bottlenecked by the slowest peer. SWARM parallelism, for example, trained a 1.3B-parameter GPT on mixed-capacity nodes over internet-grade connections. It relies on parameter redundancy (a setup where multiple nodes hold copies of the same weights), dynamic routing between these nodes, and significant compression of inter-node communication. It is important to note that weight redundancy would be required in the decentralized case anyway, in order to ensure weight shards are not lost when nodes leave the swarm. Very large (trillion-parameter) models could be trained using this approach today simply by making the models very deep. Interestingly, as parameters per node or batch size grows, pipeline parallelism becomes more communication-efficient. It is not clear how scaling up by increasing depth would affect model performance, although early evidence indicates that depth or width matter significantly less than overall parameter count.
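The sketch below illustrates the routing idea in a toy single-process simulation: each pipeline stage is served by several redundant replicas, and every microbatch goes to the least-loaded live replica, so slow or departing peers are simply routed around. The classes and numbers are hypothetical and not taken from the SWARM implementation.

```python
# Toy simulation of dynamic routing between redundant pipeline-stage replicas.
class StageReplica:
    """One peer serving a copy of a given pipeline stage."""
    def __init__(self, name, throughput):
        self.name = name
        self.throughput = throughput  # microbatches/second this peer can process
        self.queued = 0.0             # seconds of work already enqueued
        self.alive = True

    def eta(self):
        # Estimated time until a newly assigned microbatch would finish.
        return self.queued + 1.0 / self.throughput

def route(replicas):
    live = [r for r in replicas if r.alive]
    best = min(live, key=lambda r: r.eta())  # least-loaded live peer wins
    best.queued += 1.0 / best.throughput
    return best

stage0 = [StageReplica("fast-gpu", 8.0),
          StageReplica("laptop", 2.0),
          StageReplica("desktop", 4.0)]

for step in range(20):
    if step == 10:
        stage0[0].alive = False   # a peer drops out of the swarm mid-training
    print(step, "->", route(stage0).name)
```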
The effectiveness of offloading (where parameters, activations, and gradients are shuffled between VRAM and RAM as needed) in the low-bandwidth setting was also demonstrated in SWARM parallelism. If the per-node VRAM-to-RAM bandwidth is large (much larger than node-to-node bandwidth), this can be effective, and it can be combined with both communication-efficient DDP and model-parallel approaches.
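A minimal sketch of layer-wise offloading, assuming a PyTorch-style setup in which only the layer currently being computed is resident on the accelerator; the layer sizes and counts are illustrative.

```python
# Layer-wise parameter offloading: parameters live in host RAM and are streamed
# into device memory only while their layer is being computed.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
layers = [torch.nn.Linear(4096, 4096) for _ in range(24)]  # parameters stay on CPU

x = torch.randn(8, 4096, device=device)
with torch.no_grad():
    for layer in layers:
        layer.to(device)           # stream this layer's parameters into VRAM
        x = torch.relu(layer(x))   # compute while the rest of the model stays offloaded
        layer.to("cpu")            # evict back to host RAM to free VRAM
print(x.shape)
```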
All of the prior techniques discuss how to replicate known models and training recipes exactly on a heterogeneous swarm with high efficiency. There are two other immediate alternatives: 1) modifications to model architectures such that they do not require communication-efficient training at all, and 2) asynchronous training. With respect to architecture, early work such as mixture-of-depths, sparse mixture-of-experts, and mixture-of-experts with parameter sharing appears to have promise. More speculative are alternative training techniques such as the forward-forward algorithm, which do not require backward passes at all. Architecture modifications are appealing because swarm-based training can provide, at the hardware level, properties that must be simulated in the centralized case. Expert dropout, for example, occurs simply by nodes entering and leaving the swarm.
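As a toy illustration of that last point, the following sketch shows a sparse mixture-of-experts router that simply renormalizes over whichever experts are currently reachable; the shapes, top-2 routing, and availability mask are assumptions for illustration, not a production MoE implementation.

```python
# Sparse MoE routing that tolerates experts appearing and disappearing:
# "expert dropout" falls out of swarm membership for free.
import torch

n_experts, d = 8, 64
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_experts))
router = torch.nn.Linear(d, n_experts)

def moe_forward(x, available):
    # `available` is a boolean mask over experts: False = node not in the swarm.
    logits = router(x)
    logits = logits.masked_fill(~available, float("-inf"))  # mask unreachable experts
    weights = torch.softmax(logits, dim=-1)                 # renormalize over the rest
    top_w, top_i = weights.topk(2, dim=-1)                  # top-2 routing per token
    out = torch.zeros_like(x)
    for k in range(2):
        for e in range(n_experts):
            mask = top_i[:, k] == e
            if mask.any():
                out[mask] += top_w[mask, k:k + 1] * experts[e](x[mask])
    return out

x = torch.randn(16, d)
available = torch.ones(n_experts, dtype=torch.bool)
available[3] = False              # this expert's node just left the swarm
y = moe_forward(x, available)     # routing adapts with no extra machinery
```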
Asynchronous training allows updates computed on older weights to be applied to newer ones, removing the locking imposed by synchronous training and allowing much simpler compute-communication overlap. Changing the weights used to compute the forward and backward passes can clearly lead to instability; however, approaches such as PipeMare have shown significant promise, allowing asynchronous updates while still demonstrating convergence. Such asynchronous approaches are highly communication-efficient and are amenable to gossip-based updates.
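A minimal simulation of gossip-based averaging, the communication pattern such asynchronous approaches lend themselves to: peers occasionally average parameters pairwise with a random neighbor and drift towards consensus without any global barrier. The random vectors standing in for parameters are illustrative only.

```python
# Gossip averaging: repeated pairwise averages replace a synchronous all-reduce.
import random
import torch

torch.manual_seed(0)
random.seed(0)
peers = [torch.randn(1000) for _ in range(8)]   # each peer's "parameter" vector

for _ in range(200):
    i, j = random.sample(range(len(peers)), 2)  # two peers happen to connect
    avg = (peers[i] + peers[j]) / 2             # pairwise average, no global barrier
    peers[i], peers[j] = avg.clone(), avg

spread = torch.stack(peers).std(dim=0).mean()
print(f"per-coordinate spread across peers after gossip: {spread:.4f}")
```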
Myth 2: Small capacity devices cannot produce foundation-scale models
The Steelman: Foundation models are now so large that even individual layers will not fit within the memory of consumer devices, and they are only getting bigger. Rematerialization methods like FSDP and ZeRO-3 hence can't be applied, even if they can be made communication-efficient, because a single layer cannot be computed.
The TL;DR: Most layers, even in large models, can be materialized on consumer GPUs. A single Llama 3 70B layer in FP16 is ~2 GB. Also, consumer devices with unified memory architectures have memory comparable to large datacenter-grade GPUs, albeit with lower memory bandwidth. Techniques such as activation checkpointing and offloading hence remain applicable and allow such nodes to store parameters at the block level. Even if device memory is extremely limited, decomposing layers into smaller sub-operations is well understood.
Consumer-grade hardware is significantly more competitive with datacenter-grade chips than may be expected. The M3 Max, for example, supports 128 GB of unified memory at 400 GB/s, in comparison to an H100 SXM with 80 GB at 3.3 TB/s. Note also that the M3 Max consumes ~100 W at peak, in comparison to the H100's 700 W, and does not require energy expenditure on cooling.
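A quick back-of-envelope check of the per-layer memory claim, assuming an 80-layer, 70B-parameter model with parameters split evenly across blocks:

```python
# Rough per-block memory estimate for a Llama-3-70B-class model.
PARAMS = 70e9       # total parameters
LAYERS = 80         # transformer blocks
BYTES_FP16 = 2

per_layer_gb = PARAMS / LAYERS * BYTES_FP16 / 1e9
print(f"~{per_layer_gb:.2f} GB per transformer block in FP16")  # ~1.75 GB
# Comfortably within a 24 GB consumer GPU, and small relative to a 128 GB
# unified-memory machine such as an M3 Max.
```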
The MLP and attention layers within transformer blocks can be split into parallel streams that have no dependencies on each other; this is the basis of tensor parallelism. If a single layer is too big, the computation can be decomposed and performed either sequentially on the same device (memory-centric tiling) or across multiple devices. Hence, training methods today can support arbitrarily small operations, and a single layer is never required to be fully materialized.
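The following sketch illustrates the idea with a single column-split matmul: two weight shards (standing in for two devices) each compute half of an MLP projection, and concatenating their outputs reproduces the unsplit result exactly. Sizes are illustrative.

```python
# Column-wise tensor parallelism for one MLP projection: no device ever holds W.
import torch

torch.manual_seed(0)
d_model, d_ff = 512, 2048
W = torch.randn(d_model, d_ff)   # first MLP projection
x = torch.randn(4, d_model)

full = x @ W                                        # unsplit computation
W_a, W_b = W[:, :d_ff // 2], W[:, d_ff // 2:]       # column shards, one per "device"
sharded = torch.cat([x @ W_a, x @ W_b], dim=-1)     # each device computes its half

print(torch.allclose(full, sharded, atol=1e-5))     # True: outputs match exactly
```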
The problem of small devices does introduce challenges related to efficiency (in FLOPs/Watt), where collections of small nodes may be less efficient than large datacenter nodes.
Myth 3: It will always cost more, and that makes it pointless
The Steelman: The overhead required to coordinate multiple actors and to tolerate bad actors means a decentralized system can never compete on price. A situation where you control all the hardware and don't have to worry about arbitrary behavior will always be more efficient. Using less efficient hardware also means you can never be as cheap as a centralized model provider. Consequently, decentralized training may be appealing for ideological reasons, but it can never be truly competitive, and that is what matters.
The TL;DR: There are several arguments for the cost-effectiveness of decentralized training, such as the ability to aggregate low-cost, low-capacity power sources, the lack of need for cooling, and the relaxation of the high-utilization constraint. However, the price of training is not the central metric. Far more important is the relative quality of the models that can be trained. Decentralized training can assemble significantly larger computational power than centralized actors, and hence, if scaling continues to produce better models, it will also produce the best models.
Decentralized training has several advantages over centralized training, potentially enough to overcome the efficiency gap between datacenter and consumer hardware (which is itself smaller than may be expected). Not requiring close physical colocation of devices means that cooling cost, typically ~40% of a datacenter's operating expense, is removed entirely. The ability to co-locate compute with low-capacity but very low-cost energy sources such as landfill-gas-to-energy projects or renewables, as well as the ability to absorb generation surplus, means the base input cost, energy, may be much lower. Infrastructure permitting and the associated build-out lead times are also reduced.
Importantly, the importance of utilization (the fraction of time hardware spends actually performing operations rather than waiting on data) changes in the decentralized case. Models may be trained on devices that are simultaneously being used for other, everyday tasks; it may even be beneficial if utilization is lower in this case. It does not matter if a particular device has high idle time, provided the throughput of the swarm as a whole is high, the walltime of training (the physical time spent training) is low, and the models produced are of high utility.
Combined, it would appear that decentralized training can feasibly compete on price with centralized training. However, it is important to note that the cost of training is not the critical factor. More critical are 1) the upper bound on the utility of the models that can be produced, 2) training throughput (steps per second of the system as a whole), 3) the cost of inference, and 4) the latency of inference.
The quality of the model is most critical, as there is a winner-take-all dynamic in the foundation model market due to low switching costs and the fact that better models dominate (i.e. are of higher utility in all respects than) models of lower utility. Hence, we do not care about the price of training (within reason) if we can produce a better model with the more expensive method, because the increased price of training is amortized over the increase in usage that comes from having one of the best models. Given that, today, more parameters means higher utility, the size of the models a training system can produce is the deciding factor. Decentralized training may produce the best models simply because the computational power that can be assembled is several orders of magnitude larger than in the centralized case.
Myth 4: The Swarm Can Never Get Big Enough
The Steelman: The swarm size that would be required to train a foundation model is enormous and only getting bigger. Because there is no value until such a model is produced, it is very unlikely such a swarm can ever be assembled.
The TL;DR: There is intermediate value creation with smaller swarms on the path to foundation model training. The quantity of compute assembled under PoW blockchains is evidence that a single protocol can aggregate computational power significantly beyond what is achievable by a centralized actor.
The scale of centrally controlled compute capacity is indeed perhaps underappreciated. As a single datapoint, Meta announced plans to purchase 350k H100s by the end of 2024. This would achieve on the order of 350 TF32 exaFLOPS and consume 240 MW (devices only) if running at 100% utilization. In contrast, the maximum compute capacity achieved by volunteer computing networks was a temporary peak of 1.2 full-precision exaFLOPS by the Folding@home project in March 2020. In short, volunteer network capacity peaked two orders of magnitude below a single centralized actor's compute purchases in a single year, with normal capacity being much lower.
Compared to volunteer networks and centralized datacenters, incentivized swarms, such as those assembled for Proof-of-Work (PoW) mining in the Bitcoin and Ethereum protocols, have achieved orders of magnitude larger capacity. We compare productive capacity here in terms of Watts rather than FLOPS in order to make a meaningful comparison (PoW mining does not involve floating-point operations). Bitcoin PoW mining consumption was estimated at 150 TWh in 2022, or 17.1 GW on average, approximately 0.5% of total worldwide electricity consumption. While this figure is approximate, the core fact remains striking: compute two orders of magnitude larger than the largest pools of centrally controlled compute has already been assembled under a single protocol.
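For transparency, the back-of-envelope arithmetic behind these comparisons, using public per-device datasheet figures (the TF32 number assumes sparsity) and the estimates quoted above:

```python
# Rough arithmetic behind the capacity comparisons; inputs are approximations.
H100_TF32_TFLOPS = 989    # H100 SXM datasheet, TF32 tensor core, with sparsity
H100_WATTS = 700
N_GPUS = 350_000

meta_exaflops = N_GPUS * H100_TF32_TFLOPS * 1e12 / 1e18
meta_mw = N_GPUS * H100_WATTS / 1e6
print(f"Meta, devices only: ~{meta_exaflops:.0f} TF32 exaFLOPS, ~{meta_mw:.0f} MW")

btc_twh_per_year = 150
btc_gw_avg = btc_twh_per_year * 1e12 / (365 * 24) / 1e9
print(f"Bitcoin PoW: ~{btc_gw_avg:.1f} GW average draw")  # ~70x Meta's ~0.25 GW
```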
Clearly, the path to a swarm capable of facilitating foundation model training is long; however, value creation occurs along a gradient as we move towards it. Finetunes of open-weights models have clear narrow uses: domain-specific image generation models, Whisper forks specialized for voice cloning, image upsamplers, domain-specific image-to-image models (e.g. sketch to diagram), LoRA adapters that fix variations in style or tone or aim to alter the safety profile set by the base model designers, and removal of political censorship are just some examples. Designing models to be maximally truth-seeking is appealing as well. Orthogonal to narrow model utility, there is value in guaranteeing access to models that can be viewed as a public good, providing a use for otherwise idle commodity hardware, and allowing permissionless innovation at the model level. Providing trustless infrastructure for model training coalitions also has clear appeal. A large swarm can be bootstrapped with such narrow but useful models, up to the point where a true foundation model run can be initialized. Importantly, the training infrastructure required is the same in both cases.
Conclusion
Foundation models will create huge economic value. On the current trajectory, this value will accrue to a handful of already enormous companies, which will grow even larger. Our explicit goal is the creation of an open, uncensored, decentralized, publicly owned, maximal-utility competitor to the emerging oligopoly of foundation model providers.
In addition to the governance benefits, there is the clear possibility of training models several orders of magnitude larger than is possible today. There are many more elements, beyond the technical requirements of model training discussed here, that must be solved for such systems to be instantiated. Incentivization, compute verification, economic implications, the problem of training data, the impact on the AI safety landscape, and other risks we leave to future posts.
The future of AI is not just distributed; it is decentralized. The technical barriers to decentralized training are assailable, and the benefits are immense. The path to truly decentralized, publicly owned foundation models, created within the protocols themselves, is clear.
Many thanks to @jbrukh @ai @AntonvdH @pajak_karol @sgould_au for their thoughtful feedback on early drafts of this article.