Intensity - MoE Architecture

Arithmetic Intensity Analysis

Training & Prefilling: Processing full sequences of S tokens. Operations are sorted by intensity from highest to lowest.

Hardware Critical Intensity


Operations below this threshold are memory-bound; those above it are compute-bound.

Model Dimension (D)

Sequence Length (S)

Batch Size (B)

Vocab Size (V)

Num Experts (E)

Top-K Routing (K)

Num Layers (L)

FAQ

What is the routing assumption for experts?

We assume uniform routing across all experts. With top-K routing, each token is sent to K experts, giving B×S×K expert evaluations in total during training/prefilling, or B×K evaluations per decoding step during sampling. Under uniform routing, each of the E experts therefore processes approximately (B×S×K)/E tokens during training/prefilling, or (B×K)/E tokens during sampling.
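
The counts above can be checked with a few lines of Python. This is a minimal sketch under the uniform-routing assumption; the function name `expert_load` and the example sizes (B=8, S=2048, E=64, K=2) are illustrative choices, not values taken from the page.

```python
def expert_load(batch, seq_len, num_experts, top_k, sampling=False):
    """Return (total expert evaluations, average tokens per expert).

    Assumes perfectly uniform routing, so the per-expert figure is an
    average rather than a guarantee for any single expert.
    """
    # Sampling processes one new token per sequence; training/prefill
    # processes all S tokens of every sequence.
    tokens = batch if sampling else batch * seq_len
    total_evals = tokens * top_k            # each token visits K experts
    per_expert = total_evals / num_experts  # uniform split across E experts
    return total_evals, per_expert


# Hypothetical example sizes, chosen only for illustration.
train_total, train_per_expert = expert_load(8, 2048, 64, 2)
sample_total, sample_per_expert = expert_load(8, 2048, 64, 2, sampling=True)

print(train_total, train_per_expert)    # 32768 512.0
print(sample_total, sample_per_expert)  # 16 0.25
```

With these sizes each expert sees about 512 tokens during prefill but only a quarter of a token on average per sampling step, which is why expert matmuls tend to have very low intensity during sampling.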

What does "arithmetic intensity" mean?

Arithmetic intensity is the ratio of FLOPs (floating-point operations) to bytes transferred from memory, measured in FLOPs/byte. Operations with high intensity perform many computations per byte of data moved, making them compute-bound. Operations with low intensity are limited by memory bandwidth and are memory-bound.
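
As a rough illustration, the intensity of a single matrix multiplication can be estimated as follows. This sketch assumes 2-byte (bf16) elements and that each operand is read from memory once and the output written once; the helper name `matmul_intensity` and the shapes are hypothetical.

```python
def matmul_intensity(m, k, n, bytes_per_element=2):
    """Estimate FLOPs/byte for an (m x k) @ (k x n) matmul.

    Assumes each input is read once and the output is written once,
    with no cache reuse beyond that (an idealized model).
    """
    flops = 2 * m * k * n  # one multiply and one add per term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# A single token against a 4096 x 4096 weight matrix: dominated by weight reads.
single_token = matmul_intensity(1, 4096, 4096)

# A 4096-token batch against the same weights: the weights are reused many times.
large_batch = matmul_intensity(4096, 4096, 4096)

print(round(single_token, 3))  # 1.0
print(round(large_batch, 1))   # 1365.3
```

The same weight matrix gives an intensity near 1 FLOP/byte for one token and over 1000 FLOPs/byte for a large batch, because the bytes moved are dominated by the weights in the first case and amortized across tokens in the second.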

What is the hardware critical intensity threshold?

The critical intensity is computed as (peak compute performance) / (memory bandwidth). For example, TPU v5e has a peak of 197 TFLOP/s and 820 GB/s of memory bandwidth, giving a critical intensity of 197e12 / 820e9 ≈ 240 FLOPs/byte. Operations below this threshold are memory-bound (limited by bandwidth), while operations above it are compute-bound (limited by FLOPs).
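
The threshold calculation can be sketched in Python using the TPU v5e figures quoted above. The function names `critical_intensity` and `classify` are illustrative, and the sample intensities passed to `classify` are made-up inputs.

```python
def critical_intensity(peak_flops_per_s, bandwidth_bytes_per_s):
    """Hardware threshold in FLOPs/byte: peak compute over memory bandwidth."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def classify(intensity, threshold):
    """Label an operation by comparing its intensity to the hardware threshold."""
    return "compute-bound" if intensity > threshold else "memory-bound"


# TPU v5e figures from the text: 197 TFLOP/s and 820 GB/s.
tpu_v5e = critical_intensity(197e12, 820e9)

print(round(tpu_v5e, 1))          # 240.2
print(classify(1365.3, tpu_v5e))  # compute-bound
print(classify(1.0, tpu_v5e))     # memory-bound
```

Swapping in another accelerator's peak FLOP/s and bandwidth gives its threshold in the same way.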

What assumptions are made about operation fusion?

We assume that the attention softmax operation is fused, meaning QK^T, softmax, and multiplication with V are combined into a single kernel. This significantly reduces memory traffic by avoiding materialization of the full S×S attention matrix in memory. Without fusion, memory traffic would grow with that intermediate matrix and be much higher for long sequences.
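
To get a feel for the effect of fusion, the sketch below compares a fused and an unfused single-head attention under a deliberately simple traffic model. The assumptions are mine, not the page's: 2-byte elements, a hypothetical head dimension of 128, softmax FLOPs ignored, and an unfused version that writes and re-reads the S×S matrix twice (once for the scores, once for the probabilities).

```python
def attention_intensity(seq_len, head_dim, fused, bytes_per_element=2):
    """Estimate FLOPs/byte for one attention head over seq_len tokens.

    Simplified model: counts only the two matmuls (QK^T and PV) and
    ignores softmax FLOPs, masking, and batching.
    """
    # QK^T and PV each cost 2 * S * S * H FLOPs.
    flops = 4 * seq_len * seq_len * head_dim

    # Both variants read Q, K, V and write the output: four S x H tensors.
    elements = 4 * seq_len * head_dim
    if not fused:
        # Assumed unfused traffic: write scores, read scores,
        # write probabilities, read probabilities (four S x S passes).
        elements += 4 * seq_len * seq_len

    return flops / (bytes_per_element * elements)


fused = attention_intensity(4096, 128, fused=True)
unfused = attention_intensity(4096, 128, fused=False)

print(fused)              # 2048.0
print(round(unfused, 2))  # 62.06
```

Under this model the fused intensity grows as S/2, while the unfused intensity is capped near H/2 for long sequences, so the unfused version stays memory-bound on hardware with a threshold in the hundreds of FLOPs/byte.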

Where can I learn more about roofline analysis?

For a comprehensive introduction to roofline analysis and performance modeling, see the JAX Scaling Book's roofline chapter. It provides detailed explanations of arithmetic intensity, hardware constraints, and optimization strategies.
