Installation

Installation#

Prerequisites#

Hardware Requirements#

The following hardware configuration has been extensively tested:

GPU: 8x H800 per node
CPU: 64 cores per node
Memory: 1TB per node
Network: NVSwitch + RoCE 3.2 Tbps
Storage:
- 1TB local storage for single-node experiments
- 10TB shared storage (NAS) for distributed experiments

Software Requirements#

Component	Version
Operating System	CentOS 7 / Ubuntu 22.04 or any system meeting the requirements below
NVIDIA Driver	550.127.08
CUDA	12.8
Git LFS	Required for downloading models, datasets, and AReaL code. See installation guide
Docker	27.5.1
NVIDIA Container Toolkit	See installation guide
AReaL Image	`ghcr.io/inclusionai/areal-runtime:v1.0.3-sglang` (default) or `v1.0.3-vllm`. Includes runtime dependencies and Ray components.

Note: This tutorial does not cover the installation of NVIDIA Drivers, CUDA, or shared storage mounting, as these depend on your specific node configuration and system version. Please complete these installations independently.

Runtime Environment#

For multi-node training: Ensure a shared storage path is mounted on every node (and mounted to the container if you are using Docker). This path will be used to save checkpoints and logs.

Option 1: Docker (Recommended)#

We recommend using Docker with our provided image. The Dockerfile is available in the top-level directory of the AReaL repository.

docker pull ghcr.io/inclusionai/areal-runtime:v1.0.3-sglang
docker run -it --name areal-node1 \
   --privileged --gpus all --network host \
   --shm-size 700g -v /path/to/mount:/path/to/mount \
   ghcr.io/inclusionai/areal-runtime:v1.0.3-sglang \
   /bin/bash
git clone https://github.com/inclusionAI/AReaL /path/to/mount/AReaL
cd /path/to/mount/AReaL
uv pip install -e . --no-deps

A vLLM variant of the Docker image is also available at ghcr.io/inclusionai/areal-runtime:v1.0.3-vllm. Replace the image tag in the commands above if you prefer vLLM as the inference backend.

Option 2: Custom Environment Installation#

Install uv.
Clone the repo:

git clone https://github.com/inclusionAI/AReaL
cd AReaL

(Optional) Specify a custom index URL.

Add this section to your pyproject.toml:

[[tool.uv.index]]
url = "https://mirrors.aliyun.com/pypi/simple/"  # Change to your preferred mirror site
default = true

If your network connection is stable, you can skip this step.

Install dependencies using uv sync:

# Use `--extra cuda` on Linux with CUDA for full functionality
uv sync --extra cuda
# Or without CUDA support
# uv sync
# Or with additional packages for development and testing
# uv sync --group dev

After installation, activate the virtual environment:

source .venv/bin/activate

Activation is required before running pre-commit or git commit. If you use uv run <command> instead, activation is not needed as uv run automatically uses the virtual environment.

This installs CUDA-dependent training packages (Megatron, Torch Memory Saver) plus SGLang as the default inference backend. These packages require Linux x86_64 with CUDA 12.x and compatible NVIDIA drivers.

Using vLLM Instead of SGLang#

SGLang and vLLM pin mutually-incompatible torch / torchao versions, so they live in separate pyproject files. The default pyproject.toml uses SGLang; for vLLM, use pyproject.vllm.toml:

cp pyproject.vllm.toml pyproject.toml
cp uv.vllm.lock uv.lock
uv sync --extra cuda

Alternatively, without replacing pyproject.toml:

uv pip install -r pyproject.vllm.toml --extra cuda

Flash Attention Pre-built Wheels#

Flash Attention v2 is included in --extra cuda, but PyPI only ships source distributions that require CUDA compilation (~30 min). To skip compilation, install a pre-built wheel before running uv sync.

Flash Attention wheels are linked against a specific PyTorch version at compile time. SGLang (the default pyproject.toml) uses torch 2.9 and vLLM (pyproject.vllm.toml) uses torch 2.10, so you must pick the matching wheel. Replace cpXYZ with your Python version (cp311 for 3.11, cp312 for 3.12).

SGLang (default, torch 2.9):

# Python 3.12
uv pip install "https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.8.3+cu128torch2.9-cp312-cp312-linux_x86_64.whl"
# Python 3.11
# uv pip install "https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.8.3+cu128torch2.9-cp311-cp311-linux_x86_64.whl"

uv sync --extra cuda

vLLM (torch 2.10, requires pyproject.vllm.toml):

cp pyproject.vllm.toml pyproject.toml
cp uv.vllm.lock uv.lock
# Python 3.12
uv pip install "https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.8.3+cu128torch2.10-cp312-cp312-linux_x86_64.whl"
# Python 3.11
# uv pip install "https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl"

uv sync --extra cuda

Browse all available wheels at mjun0812/flash-attention-prebuild-wheels.

The same uv sync command also works on macOS and Linux without CUDA support. CUDA packages (including flash-attn) are automatically skipped via platform markers. However, training and inference features requiring CUDA will not be available. This configuration is suitable only for development, testing, and non-GPU workflows.

You can also install individual extras instead of the full cuda bundle:

sglang: SGLang inference engine (in pyproject.toml)
vllm: vLLM inference engine (in pyproject.vllm.toml)
megatron: Megatron training backend
tms: Torch Memory Saver
flash-attn: Flash Attention v2
kernels: Hugging Face Kernels runtime
cuda-train: Training packages only (megatron + tms + kernels, no inference backend)
cuda: cuda-train + inference backend + flash-attn

Using Hugging Face Kernels in Training#

Installing the kernels extra only makes the Hugging Face Kernels runtime available. Training keeps the existing defaults unless you opt in through configuration.

Use the following fields on your train engine config (for example actor, critic, or teacher):

attn_impl: Select the attention backend. In addition to built-in backends such as sdpa and flash_attention_2, this can point to a Hugging Face kernels repo ID such as kernels-community/flash-attn or kernels-community/flash-attn@main:flash_attn_varlen_func.
use_kernels: Set to true to kernelize the model after creation.

Example:

actor:
  attn_impl: kernels-community/flash-attn
  use_kernels: true

For predictable behavior, prefer an explicit kernels repo ID instead of relying on automatic fallback from flash_attention_*.

Additional CUDA Packages (Optional, Manual Installation)#

The Docker image includes additional compiled packages that are NOT in pyproject.toml. These packages require CUDA and must be compiled from source. If you are using a custom environment (not Docker) and need optimizations from these packages (e.g., FP8 training, fused Adam kernel), install them manually after running uv sync --extra cuda:

Package	Purpose	Installation Command
grouped_gemm	MoE model support in Megatron	`uv pip install --no-build-isolation git+https://github.com/fanshiqing/grouped_gemm@v1.1.4`
NVIDIA apex	Fused Adam, etc. in Megatron	`NVCC_APPEND_FLAGS="--threads 4" APEX_PARALLEL_BUILD=8 APEX_CPP_EXT=1 APEX_CUDA_EXT=1 uv pip install --no-build-isolation git+https://github.com/NVIDIA/apex.git`
TransformerEngine	FP8 training, optimized GEMM in Megatron	`uv pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@stable`
flash-attn-3	Flash Attention v3 (Hopper)	Built from source, see Dockerfile

Important: These packages require --no-build-isolation because they need access to the already-installed PyTorch for CUDA compilation. Install PyTorch first via uv sync --extra cuda before attempting to install these packages.

DeepSeek-V3 Optimization Packages (Optional)#

For running DeepSeek-V3 style MoE models with optimal performance, the Docker image also includes the following packages. These packages have complex build requirements and GPU architecture constraints. Refer to the Dockerfile for the exact installation commands and environment variables.

Package	Purpose	GPU Requirement
FlashMLA	Multi-head Latent Attention kernels	SM90+ (Hopper)
DeepGEMM	FP8 GEMM library for MoE	SM90+ (Hopper)
DeepEP	Expert Parallelism communication library	SM80+ (Ampere)
flash-linear-attention	Linear attention with Triton kernels	Any GPU
NVSHMEM	Required for DeepEP internode communication	Any GPU

Notes:

FlashMLA and DeepGEMM require Hopper (H100/H800) or newer GPUs.
DeepEP requires NVSHMEM for internode and low-latency features. The Docker image installs nvidia-nvshmem-cu12 via pip, which DeepEP auto-detects.
flash-linear-attention uses pure Triton kernels and works on any GPU.
These packages are cloned from GitHub and compiled during Docker build. For manual installation, ensure git submodules are initialized (git submodule update --init --recursive) for FlashMLA and DeepGEMM as they depend on CUTLASS.

Validate your AReaL installation:

We provide a script to validate your AReaL installation. Simply run:

uv run python3 areal/tools/validate_installation.py

After the validation passes, you are ready to go!

(Optional) Install SkyPilot#

SkyPilot helps you run AReaL easily on 17+ different clouds or your own Kubernetes infrastructure. For more details about SkyPilot, check the SkyPilot Documentation. Below are the minimal steps to set up SkyPilot on GCP or Kubernetes.

Install SkyPilot with pip#

# In your conda environment
# NOTE: SkyPilot requires 3.7 <= python <= 3.13
pip install -U "skypilot[gcp,kubernetes]"

GCP Setup#

# Install Google Cloud SDK
conda install -y -c conda-forge google-cloud-sdk

# Initialize gcloud and select your account/project
gcloud init

# (Optional) Choose a project explicitly
gcloud config set project <PROJECT_ID>

# Create Application Default Credentials
gcloud auth application-default login

Kubernetes Setup#

See the SkyPilot Kubernetes setup guide for a comprehensive guide on how to set up a Kubernetes cluster for SkyPilot.

Verify#

sky check

If GCP: enabled or Kubernetes: enabled are shown, you’re ready to use SkyPilot with AReaL. See the SkyPilot example for a detailed guide on running AReaL with SkyPilot. For more options and details, see the official SkyPilot installation guide.

(Optional) Launch Ray Cluster for Distributed Training#

On the first node, start the Ray Head:

ray start --head

On all other nodes, start Ray Workers:

# Replace with the actual IP address of the first node
RAY_HEAD_IP=xxx.xxx.xxx.xxx
ray start --address $RAY_HEAD_IP

You should see the Ray resource status displayed when running ray status.

Set the n_nodes argument appropriately in AReaL’s training command, and AReaL will automatically detect the resources and allocate workers to the cluster.

Next Steps#

Check the quickstart section to launch your first AReaL job.