Intensity - MoE Architecture

Visualizing Mixture-of-Experts (MoE) model architecture

MoE Block Architecture

Hardware Configuration

Compute Performance: 1.97e14 FLOPs/s (bfloat16)
Memory Bandwidth: 8.2e11 bytes/s (HBM)

Arithmetic Intensity Analysis

Training & Prefilling: Processing full sequences of S tokens. Operations are sorted by intensity from highest to lowest.

Hardware Critical Intensity: ~240 FLOPs/byte (1.97e14 FLOPs/s / 8.2e11 bytes/s)
Operations below this threshold are memory-bound; operations above it are compute-bound.

FAQ

What is the routing assumption for experts?
We assume uniform routing across all experts. With top-K routing, each token is sent to K experts. For a batch of B sequences of S tokens each, this gives B×S×K total expert evaluations during training/prefilling, or B×K during sampling, where each sequence contributes a single new token. Under uniform routing, each of the E experts therefore processes approximately (B×S×K)/E tokens during training/prefilling, or (B×K)/E tokens during sampling.
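The counts above can be sketched in a few lines of Python. The function name `expert_load` and the example sizes are hypothetical, chosen only for illustration; the formulas are the ones stated in the answer.

```python
def expert_load(B: int, S: int, K: int, E: int, sampling: bool = False):
    """Return (total expert evaluations, average tokens per expert).

    Assumes uniform routing: every expert receives the same share of tokens.
    During sampling each sequence contributes one token, so S drops out.
    """
    tokens = B if sampling else B * S
    total_evaluations = tokens * K
    return total_evaluations, total_evaluations / E


# Hypothetical sizes: 8 sequences of 2048 tokens, top-2 routing, 64 experts.
print(expert_load(B=8, S=2048, K=2, E=64))                 # training/prefilling
print(expert_load(B=8, S=2048, K=2, E=64, sampling=True))  # sampling
```

With these sizes each expert sees 512 tokens during prefilling but only a quarter of a token on average during sampling, which suggests why expert matmuls tend to have very low intensity at decode time.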
What does "arithmetic intensity" mean?
Arithmetic intensity is the ratio of FLOPs (floating-point operations) to bytes transferred from memory, measured in FLOPs/byte. Operations with high intensity perform many computations per byte of data moved, making them compute-bound. Operations with low intensity are limited by memory bandwidth and are memory-bound.
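As a rough illustration, the intensity of a single matrix multiplication can be estimated as follows. This is a sketch under an assumed cost model (each input read once, the output written once, 2 bytes per bfloat16 element); the function name `matmul_intensity` is hypothetical and the page's own accounting may differ.

```python
def matmul_intensity(m: int, d: int, n: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOPs/byte) of an [m, d] x [d, n] matmul.

    Assumed cost model: each input is read once and the output is written
    once; bytes_per_elem=2 corresponds to bfloat16.
    """
    flops = 2 * m * d * n  # one multiply and one add per (m, d, n) triple
    bytes_moved = bytes_per_elem * (m * d + d * n + m * n)
    return flops / bytes_moved


# Many tokens against a 4096 x 4096 weight matrix: high intensity.
print(matmul_intensity(4096, 4096, 4096))
# A single token against the same weights: just under 1 FLOP/byte.
print(matmul_intensity(1, 4096, 4096))
```

The second call shows the low-intensity case: the weights dominate the bytes moved, so very little computation is done per byte.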
What is the hardware critical intensity threshold?
The critical intensity is computed as (compute performance) / (memory bandwidth). For example, TPU v5e delivers 197 TFLOPs/s in bfloat16 with 820 GB/s of HBM bandwidth, giving a critical intensity of 1.97e14 / 8.2e11 ≈ 240 FLOPs/byte. Operations below this threshold are memory-bound (limited by bandwidth), while operations above it are compute-bound (limited by FLOPs).
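A minimal sketch of this threshold check, using the TPU v5e figures quoted above. The constant and function names are hypothetical, and treating an operation exactly at the threshold as compute-bound is an arbitrary choice made here.

```python
PEAK_FLOPS = 1.97e14     # FLOPs/s, bfloat16 (TPU v5e figure from the text)
HBM_BANDWIDTH = 8.2e11   # bytes/s (TPU v5e figure from the text)


def critical_intensity(peak_flops: float = PEAK_FLOPS,
                       bandwidth: float = HBM_BANDWIDTH) -> float:
    """Hardware critical intensity in FLOPs/byte."""
    return peak_flops / bandwidth


def classify(intensity: float) -> str:
    """Label an operation by comparing its intensity to the threshold."""
    if intensity >= critical_intensity():
        return "compute-bound"
    return "memory-bound"


print(critical_intensity())  # roughly 240.2
print(classify(1365.0))      # well above the threshold
print(classify(1.0))         # well below the threshold
```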
What assumptions are made about operation fusion?
We assume that the attention computation is fused: QK^T, softmax, and the multiplication with V run as a single kernel. This significantly reduces memory traffic because the full S×S attention matrix is never materialized in HBM. Without fusion, both memory traffic and memory footprint would grow quadratically with sequence length.
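To give a feel for the size of the effect, here is a back-of-the-envelope comparison for one attention head. It is an idealized sketch, not the page's exact accounting: it assumes bfloat16 (2 bytes per element), that the fused kernel reads Q, K, V once and writes the output once, and that the unfused version additionally writes and re-reads the S×S score matrix around the softmax (four extra S×S passes). Causal masking and softmax FLOPs are ignored, and all names are hypothetical.

```python
def attention_bytes(S: int, d: int, bytes_per_elem: int = 2):
    """Return (fused_bytes, unfused_bytes) for one attention head.

    Idealized model: fused traffic is Q, K, V reads plus the output write.
    Unfused adds four S x S passes: write scores, read scores, write
    probabilities, read probabilities.
    """
    fused = bytes_per_elem * 4 * S * d
    unfused = fused + bytes_per_elem * 4 * S * S
    return fused, unfused


def attention_intensity(S: int, d: int):
    """Return (fused, unfused) intensity in FLOPs/byte."""
    flops = 4 * S * S * d  # 2*S*S*d for QK^T plus 2*S*S*d for probs @ V
    fused, unfused = attention_bytes(S, d)
    return flops / fused, flops / unfused


# Hypothetical sizes: 4096-token sequence, head dimension 128.
print(attention_bytes(4096, 128))
print(attention_intensity(4096, 128))
```

Under this model the fused intensity is S/2, while the unfused intensity stays below d/2 regardless of sequence length, so for these sizes the unfused version falls under the ~240 FLOPs/byte threshold discussed above.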
Where can I learn more about roofline analysis?
For a comprehensive introduction to roofline analysis and performance modeling, see the JAX Scaling Book's roofline chapter. It provides detailed explanations of arithmetic intensity, hardware constraints, and optimization strategies.