Based on the papers DiLoCo, Streaming DiLoCo, and Scaling Laws for DiLoCo.
For all methods, the reduction of gradients or outer gradients is overlapped with the backward-pass computation.
Approximate inter-datacenter bandwidth (Gbps) needed to achieve each compute utilization (CU) threshold:
| Method | 25% CU | 50% CU | 75% CU | 90% CU | 95% CU |
|---|---|---|---|---|---|
| Data Parallel | - | - | - | - | - |
| DiLoCo | - | - | - | - | - |
| Streaming DiLoCo | - | - | - | - | - |
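
To make the relationship between bandwidth and compute utilization concrete, here is a minimal sketch of how such a threshold could be estimated. It is a simplified model, not the simulator used in the papers. It assumes CU = compute time / (compute time + exposed communication time), and that communication is hidden only for the part that fits in the overlap window. The function names, parameters, and all numbers below are hypothetical and purely illustrative.

```python
import math


def compute_utilization(payload_bits, bandwidth_gbps, compute_seconds, overlap_seconds):
    """CU under the assumed model: only communication that exceeds the
    overlap window stalls the accelerators."""
    comm_seconds = payload_bits / (bandwidth_gbps * 1e9)
    stall_seconds = max(0.0, comm_seconds - overlap_seconds)
    return compute_seconds / (compute_seconds + stall_seconds)


def required_bandwidth_gbps(payload_bits, compute_seconds, overlap_seconds, target_cu):
    """Smallest bandwidth (Gbps) that reaches target_cu under the same model."""
    if not 0.0 < target_cu <= 1.0:
        raise ValueError("target_cu must be in (0, 1]")
    # Largest stall that still meets the target: CU = compute / (compute + stall).
    allowed_stall = compute_seconds * (1.0 - target_cu) / target_cu
    comm_budget = overlap_seconds + allowed_stall
    if comm_budget <= 0.0:
        return math.inf  # no time at all is available for communication
    return payload_bits / comm_budget / 1e9


if __name__ == "__main__":
    # Hypothetical setup: 1e9 parameters exchanged at 16 bits each,
    # 1.0 s of compute between synchronizations, 0.5 s overlap window.
    payload = 1e9 * 16
    for cu in (0.25, 0.50, 0.75, 0.90, 0.95):
        bw = required_bandwidth_gbps(payload, 1.0, 0.5, cu)
        print(f"{cu:.0%} CU -> {bw:.2f} Gbps")
```

Under this model, a method that synchronizes less often or sends a smaller payload per synchronization has more compute time per bit exchanged, so it needs less bandwidth for the same CU threshold. How the payload and overlap window should be set for each method in the table is left as an assumption to be filled in from the papers.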