# Allocation Mode

This document describes AReaL’s allocation mode system, which controls how GPUs are distributed between inference and training backends during distributed RL training.

## Overview

Each engine component (actor, critic, rollout, ref, teacher) has its own backend configuration field that specifies:

- Which backend to use (SGLang or vLLM for inference; FSDP, Megatron, or Archon for training)
- The parallelization strategy
- The total number of GPUs required

AReaL parses each backend string into a `ModelAllocation` object that drives resource allocation for that specific engine.

## Configuration

### Per-Engine Backend Fields

Each engine in the YAML config has its own backend field:

```yaml
# Rollout (inference) engine
rollout:
  backend: "sglang:d4t2"

# Actor (training) engine
actor:
  backend: "fsdp:d8"

# Critic engine (falls back to actor.backend if empty)
critic:
  backend: ""

# Ref engine (falls back to actor.backend if empty)
ref:
  backend: ""
```

When `critic.backend` or `ref.backend` is empty, it automatically inherits from `actor.backend`.
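
The fallback rule can be sketched in a few lines of Python (the function name here is illustrative, not AReaL's actual API):

```python
# Illustrative sketch of the fallback rule: an empty critic/ref backend
# inherits from actor.backend. Not AReaL's actual implementation.
def resolve_backend(engine_backend: str, actor_backend: str) -> str:
    """Return the engine's backend string, falling back to the actor's."""
    return engine_backend if engine_backend else actor_backend

print(resolve_backend("", "fsdp:d8"))               # critic/ref inherit: fsdp:d8
print(resolve_backend("megatron:d4t2", "fsdp:d8"))  # explicit value wins: megatron:d4t2
```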

Note: The top-level `allocation_mode` config field is deprecated and retained only for backward compatibility with legacy SPMD launchers (local/ray/slurm). It is ignored by the single-controller scheduler; use the per-engine `backend` fields shown above instead.

### Backend String Syntax

```
<backend>:<parallelism_dims>
```

For example, `fsdp:d4t2` means: use the FSDP backend with data parallelism 4 and tensor parallelism 2.

### Parallelism Dimensions

| Dimension | Abbreviation | Description | Valid For |
| --- | --- | --- | --- |
| Data | `d` | Number of model replicas | All backends |
| Tensor | `t` | Split operations across GPUs | All backends |
| Pipeline | `p` | Split layers across GPUs in stages | Megatron, Archon |
| Context | `c` | Split sequence length across GPUs | All backends |
| Expert | `e` | Split MoE experts across GPUs | Megatron, Archon |

Dimensions are specified as `<abbrev><size>`, e.g., `d4t2` means data parallel size 4 and tensor parallel size 2.
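
As a sketch, a backend string can be split into its backend name and dimension sizes like this (a simplified illustration; AReaL's actual `ModelAllocation` parser may differ):

```python
import re

# Illustrative parser for backend strings like "fsdp:d4t2".
# Not AReaL's actual implementation.
def parse_backend(spec: str):
    backend, _, dim_str = spec.partition(":")
    # Each dimension is <abbrev><size>, e.g. "d4" -> {"d": 4}.
    dims = {abbrev: int(size) for abbrev, size in re.findall(r"([dtpce])(\d+)", dim_str)}
    return backend, dims

print(parse_backend("fsdp:d4t2"))         # ('fsdp', {'d': 4, 't': 2})
print(parse_backend("megatron:d2p2t4e4"))  # ('megatron', {'d': 2, 'p': 2, 't': 4, 'e': 4})
```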

### Calculating GPU Requirements

The total number of GPUs for a component is computed as:

```
world_size = dp × tp × pp × cp
```

Expert parallelism (`e`) does not increase the world size; it redistributes how experts are placed within the existing GPU mesh.
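
The formula above can be checked with a short sketch (`world_size` here is an illustrative helper, not AReaL's implementation); note that `e` is deliberately excluded:

```python
# World-size arithmetic: d * t * p * c, with e excluded because expert
# parallelism folds into the existing mesh. Illustrative sketch only.
def world_size(dims: dict) -> int:
    return (dims.get("d", 1) * dims.get("t", 1)
            * dims.get("p", 1) * dims.get("c", 1))

print(world_size({"d": 2, "p": 2, "t": 4}))          # megatron:d2p2t4   -> 16
print(world_size({"d": 2, "p": 2, "t": 4, "e": 4}))  # same mesh with e4 -> 16
```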

### Examples

| Backend String | GPUs per Engine | Notes |
| --- | --- | --- |
| `fsdp:d8` | 8 | 8 data-parallel replicas |
| `sglang:d2t4` | 8 | 2 instances × 4 TP GPUs |
| `megatron:d2p2t4` | 16 | 2 DP × 2 PP × 4 TP |
| `megatron:d2p2t4e4` | 16 | Same mesh, 4-way expert parallelism |

### Full Config Example

```yaml
# 16-GPU setup: 8 inference + 8 training
rollout:
  backend: "sglang:d2t4"    # 2 × 4 = 8 GPUs
actor:
  backend: "fsdp:d4t2"      # 4 × 2 = 8 GPUs
```

## Backend Selection

### Inference Backends

| Backend | Supported Dimensions |
| --- | --- |
| `sglang` | `d`, `t` |
| `vllm` | `d`, `t`, `p` |

For inference, `d` represents the number of independent server instances, and each instance uses `t` × `p` GPUs.
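
This accounting can be sketched as follows (an illustrative helper, not AReaL's scheduler):

```python
# For inference backends, d counts server instances and each instance
# spans t * p GPUs. Illustrative arithmetic only.
def inference_layout(d: int, t: int = 1, p: int = 1) -> dict:
    gpus_per_instance = t * p
    return {
        "instances": d,
        "gpus_per_instance": gpus_per_instance,
        "total_gpus": d * gpus_per_instance,
    }

print(inference_layout(d=2, t=4))  # sglang:d2t4 -> 2 instances x 4 GPUs = 8 total
```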

Note that backend-internal configurations do not affect how AReaL allocates GPUs. Given `rollout.backend: "sglang:d4t4"`, you can also configure `sglang.dp_size=4`, `sglang.ep_size=4`, and `sglang.enable_dp_attention=True`. AReaL still launches 4 model replicas of 4 GPUs each; within each instance, SGLang uses DP attention and expert parallelism to distribute computation across the attention and expert layers.

### Training Backends

| Backend | Supported Dimensions | Use Case |
| --- | --- | --- |
| `fsdp` | `d`, `t`, `c` | Default for simple parallelism |
| `megatron` | `d`, `t`, `p`, `c`, `e` | Required for pipeline or expert parallelism |
| `archon` | `d`, `t`, `p`, `c`, `e` | Alternative to Megatron (experimental) |

Important: An explicit backend prefix is required in all allocation strings. Bare dimension strings (e.g., `d4t2`) are no longer accepted. Always specify the backend explicitly: `fsdp:d4t2`, `megatron:d2p2t4`, `sglang:d4t2`.
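
A minimal sketch of this validation rule (the function, backend set, and error message are illustrative, not AReaL's actual code):

```python
# Hypothetical check for the explicit-prefix rule: bare dimension strings
# such as "d4t2" are rejected. Illustrative only.
KNOWN_BACKENDS = {"fsdp", "megatron", "archon", "sglang", "vllm"}

def require_backend_prefix(spec: str) -> str:
    backend, sep, _ = spec.partition(":")
    if not sep or backend not in KNOWN_BACKENDS:
        raise ValueError(
            f"allocation string {spec!r} must start with an explicit "
            f"backend prefix, e.g. 'fsdp:d4t2'"
        )
    return backend

print(require_backend_prefix("fsdp:d4t2"))  # fsdp
```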

## MoE Hybrid Parallelism

For Mixture-of-Experts (MoE) models, Megatron/Archon supports different parallelism strategies for the attention and FFN (expert) modules using the hybrid syntax:

```
megatron:(attn:<attn_dims>|ffn:<ffn_dims>)
```

This enables MoE Parallel Folding, which reduces the minimum GPU requirement for combined context and expert parallelism.
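
The hybrid string can be split into its `attn` and `ffn` parts with a simple regex (a sketch; AReaL's actual parser may differ):

```python
import re

# Illustrative splitter for the hybrid syntax
# "megatron:(attn:<attn_dims>|ffn:<ffn_dims>)". Not AReaL's actual parser.
def parse_hybrid(spec: str):
    m = re.fullmatch(r"(\w+):\(attn:([a-z0-9]+)\|ffn:([a-z0-9]+)\)", spec)
    if m is None:
        raise ValueError(f"not a hybrid allocation string: {spec!r}")
    backend, attn_dims, ffn_dims = m.groups()
    return backend, attn_dims, ffn_dims

print(parse_hybrid("megatron:(attn:d4p2t2c2|ffn:d2p2t4e2)"))
# ('megatron', 'd4p2t2c2', 'd2p2t4e2')
```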

### Constraints

- Pipeline parallel size (`p`) must be identical for `attn` and `ffn`
- World size must match (if `d` is omitted in `ffn`, it is derived automatically)
- Expert parallel (`e`) is only valid in the `ffn` section
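
These constraints can be sketched as a validation function. One assumption, inferred from the example table in the next section: in the `ffn` section, expert parallelism multiplies into the world size (ffn world size = d × p × t × c × e), while the attn world size is d × p × t × c:

```python
import re
from math import prod

def _dims(s: str) -> dict:
    return {k: int(v) for k, v in re.findall(r"([dtpce])(\d+)", s)}

# Illustrative constraint checks for hybrid attn/ffn dimension strings.
# The ffn world-size formula (including e) is an assumption inferred from
# the worked example in this document, not AReaL's actual code.
def check_hybrid(attn: str, ffn: str) -> None:
    a, f = _dims(attn), _dims(ffn)
    if "e" in a:
        raise ValueError("expert parallel (e) is only valid in the ffn section")
    if a.get("p", 1) != f.get("p", 1):
        raise ValueError("pipeline parallel size (p) must be identical for attn and ffn")
    attn_ws = prod(a.get(k, 1) for k in "dtpc")
    ffn_ws = prod(f.get(k, 1) for k in "dtpce")
    if attn_ws != ffn_ws:
        raise ValueError(f"world sizes differ: attn={attn_ws}, ffn={ffn_ws}")

check_hybrid("d4p2t2c2", "d2p2t4e2")  # passes: both meshes span 32 GPUs
```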

### Example

```yaml
actor:
  backend: "megatron:(attn:d4p2t2c2|ffn:d2p2t4e2)"
```
| Module | dp | pp | tp | cp | ep | World Size |
| --- | --- | --- | --- | --- | --- | --- |
| attn | 4 | 2 | 2 | 2 | - | 32 |
| ffn | 2 | 2 | 4 | - | 2 | 32 |

## See Also