DiLoCo Bandwidth Simulator

Based on the papers: DiLoCo, Streaming DiLoCo, Scaling Laws for DiLoCo

Model & Network Configuration

Billion parameters (N)
Seconds
M (≥1)

Data Parallel (H=1)

DiLoCo

Streaming DiLoCo with Overlap

Streaming pattern: H is the synchronization period of each fragment. With P fragments, one fragment is sent every H / P steps (5.0 steps with the current settings). Fragment size = total_size / P.
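The streaming pattern above can be sketched as a small schedule calculator. This is a minimal illustration, not the simulator's own code: the function names, the evenly staggered round-robin ordering of fragments, and the example values H = 20 and P = 4 (chosen only because they give a spacing of 5 steps) are all assumptions.

```python
def fragment_schedule(H: int, P: int, num_steps: int) -> dict[int, int]:
    """Map each step at which a sync happens to the fragment index sent there.

    Assumption: fragments are staggered evenly and sent round-robin, so that
    each individual fragment is synchronized once every H steps.
    """
    if H % P != 0:
        raise ValueError("this sketch assumes H is a multiple of P")
    stride = H // P  # steps between two consecutive fragment transfers
    schedule = {}
    for step in range(1, num_steps + 1):
        if step % stride == 0:
            # Fragment 0 goes first, then 1, ..., P - 1, then wrap around.
            schedule[step] = (step // stride - 1) % P
    return schedule


def fragment_size(total_size: float, P: int) -> float:
    """Size of one fragment, in the same unit as total_size."""
    return total_size / P


if __name__ == "__main__":
    # Hypothetical settings: H = 20, P = 4, so one fragment every 5 steps.
    for step, fragment in fragment_schedule(H=20, P=4, num_steps=40).items():
        print(f"step {step:3d}: send fragment {fragment}")
```

Each fragment index reappears exactly H steps after its previous transfer, while the link carries only total_size / P per transfer.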

Bandwidth Requirements for Target Utilization

Approximate inter-datacenter bandwidth (Gbps) needed to achieve each compute utilization (CU) threshold:

Method              50% CU   75% CU   90% CU   95% CU   99% CU
Data Parallel       -        -        -        -        -
DiLoCo              -        -        -        -        -
Streaming DiLoCo    -        -        -        -        -

Compute Utilization

For all methods, the reduction of gradients (Data Parallel) or outer gradients (DiLoCo and Streaming DiLoCo) is overlapped with the backward-pass computation.
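To make the relationship between bandwidth and compute utilization concrete, here is a rough back-of-the-envelope sketch. It is not the simulator's actual model. It assumes a ring all-reduce (traffic factor 2(M - 1)/M), 16 bits per parameter, a backward pass that takes about two thirds of a step and fully hides communication during that window, and compute utilization defined as compute time divided by compute time plus exposed communication time. All function and parameter names are hypothetical.

```python
def required_bandwidth_gbps(
    n_billion: float,          # model size N, in billions of parameters
    step_seconds: float,       # compute time of one training step
    replicas: int,             # number of replicas M (>= 1)
    target_cu: float,          # target compute utilization, strictly in (0, 1)
    sync_every: int = 1,       # H; 1 corresponds to plain data parallelism
    fragments: int = 1,        # P; 1 means the whole model is sent at once
    extra_overlap_steps: float = 0.0,  # assumed extra steps that hide traffic
    bits_per_param: int = 16,          # assumption: 16-bit parameters
    backward_fraction: float = 2 / 3,  # assumption: backward is ~2/3 of a step
) -> float:
    """Estimate the bandwidth (Gbps) needed to reach target_cu."""
    if not 0.0 < target_cu < 1.0:
        raise ValueError("target_cu must be strictly between 0 and 1")
    if replicas < 2:
        return 0.0  # a single replica has nothing to synchronize

    # Bits that cross the link for one synchronization of one fragment.
    payload_bits = n_billion * 1e9 * bits_per_param / fragments
    traffic_bits = payload_bits * 2 * (replicas - 1) / replicas

    # Compute time between two consecutive transfers.
    compute_seconds = (sync_every / fragments) * step_seconds

    # Exposed communication allowed by the target, plus the hidden window.
    exposed_budget = compute_seconds * (1 - target_cu) / target_cu
    hidden_window = (backward_fraction + extra_overlap_steps) * step_seconds

    return traffic_bits / (exposed_budget + hidden_window) / 1e9


if __name__ == "__main__":
    # Hypothetical settings: N = 1B, 1 s per step, M = 2, H = 100, P = 20.
    for cu in (0.50, 0.75, 0.90, 0.95, 0.99):
        dp = required_bandwidth_gbps(1, 1.0, 2, cu)
        diloco = required_bandwidth_gbps(1, 1.0, 2, cu, sync_every=100)
        streaming = required_bandwidth_gbps(
            1, 1.0, 2, cu, sync_every=100, fragments=20, extra_overlap_steps=1
        )
        print(f"{cu:.0%} CU: DP {dp:.3f}, DiLoCo {diloco:.3f}, "
              f"Streaming {streaming:.3f} Gbps")
```

Under these assumptions, a larger H or a longer overlap window lowers the required bandwidth, which is the qualitative trend the table is meant to show. The absolute numbers depend heavily on the assumed constants.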