Online RL Training#
This guide explains how to train language models in online mode: you first launch an AReaL RL service that exposes a proxy gateway, and external applications (agent runtimes, human evaluators, or any OpenAI-compatible client) interact with the model through this gateway. Each interaction is automatically collected as RL training data.
Disclaimer: This API is experimental and subject to change.
Overview#
AReaL supports three execution modes for agent workflows:
| Mode | Description | Use Case |
|---|---|---|
| `inline` | Agent runs in-process with the rollout worker | Most agent frameworks |
| `subproc` | Agent runs in a subprocess pool | Non-async or isolation-heavy code |
| `online` | External users drive the interaction via HTTP APIs | Human feedback, external runtimes |
This guide focuses on online mode, which is unique because the agent code lives outside of AReaL. AReaL exposes an OpenAI-compatible HTTP API, and any application that speaks the chat completions protocol can connect to it.
For offline training, see the agentic RL guide.
Architecture#
External Application
(ZeroClaw, scripts, etc.)
|
POST /chat/completions
POST /rl/set_reward
|
v
+-------------------+
| Proxy Gateway | (FastAPI, stateless router)
| - Session mgmt |
| - Key auth |
| - Load balancing |
+-------------------+
/ | \
v v v
+---------+ +---------+ +---------+
| Proxy | | Proxy | | Proxy |
| Worker | | Worker | | Worker | (one per rollout worker)
+---------+ +---------+ +---------+
| | |
v v v
+---------+ +---------+ +---------+
| SGLang/ | | SGLang/ | | SGLang/ |
| vLLM | | vLLM | | vLLM | (inference servers)
+---------+ +---------+ +---------+
|
Token-level data collected
|
v
+-------------------+
| RL Trainer |
| (PPOTrainer) |
+-------------------+
Key components:
Proxy Gateway: A lightweight FastAPI server that routes requests from external applications to backend proxy workers. It manages session lifecycle, authentication, and load balancing.
Proxy Workers: Backend servers colocated with rollout workers. Each worker manages sessions, records token-level data (token IDs, log probabilities), and exports trajectories for training.
Inference Servers: SGLang or vLLM servers that perform the actual LLM inference.
Quick Start#
Step 1: Configure Online Mode#
Set rollout.openai.mode to online in your config YAML:
# config.yaml
rollout:
openai:
mode: online
admin_api_key: "my-secret-admin-key" # Protect management endpoints
session_timeout_seconds: 3600 # Session timeout (default: 1 hour)
Step 2: Start the RL Service#
python3 examples/openclaw/train.py --config examples/openclaw/config.yaml \
experiment_name=my-exp trial_name=trial-0 \
rollout.backend=sglang:d1 actor.backend=fsdp:d1 \
actor.path=Qwen/Qwen3-0.6B \
scheduler.type=local \
rollout.openai.admin_api_key=my-secret-admin-key
After initialization, AReaL prints the gateway address:
(AReaL) RLTrainer INFO: Proxy gateway available at http://x.x.x.x:8090
Step 3: Start a Session#
Use the provided helper script or any HTTP client:
curl -X POST http://<gateway>/rl/start_session \
-H "Content-Type: application/json" \
-H "Authorization: Bearer my-secret-admin-key" \
-d '{"task_id": "demo-task-0"}'
You should see the current session ID and the API key for this agent session in the output.
Why a unique API key for each agent session? Many agent applications may run concurrently, and they all invoke the same endpoints (e.g., /chat/completions), so the gateway needs a way to tell trajectories from different agents apart. AReaL therefore issues a unique API key for each agent session, with a one-to-one relationship between keys and trajectories. This lets the gateway group the interactions belonging to the same trajectory and attach rewards to it.
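The client-side bookkeeping this implies is small: keep one key per task. A minimal sketch with the gateway call stubbed out (the `start_session` helper below is illustrative, not an AReaL API; a real client would POST /rl/start_session with the admin key):

```python
# Sketch: one API key per agent session (trajectory).
# start_session() is a stub standing in for POST /rl/start_session.
import itertools

_counter = itertools.count()

def start_session(task_id: str) -> dict:
    """Stub for POST /rl/start_session; returns a fake session record."""
    return {"session_id": task_id, "api_key": f"sk-sess-{next(_counter):012d}"}

def session_headers(api_key: str) -> dict:
    """Headers for all subsequent calls within one session."""
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }

# One key per trajectory: concurrent tasks never share a key.
keys = {tid: start_session(tid)["api_key"] for tid in ["task-0", "task-1"]}
```

Because each concurrent task carries its own key in the Authorization header, the gateway can attribute every completion and reward to the right trajectory.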
Step 4: Interact with the Model#
Use any OpenAI-compatible client. For example, with curl:
curl http://<gateway>/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-sess-xxxxxxxxxxxx" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "What is 12 * 15 + 3?"}],
"temperature": 0.7
}'
Or use the OpenAI Python SDK from any evaluation script:
from openai import OpenAI
client = OpenAI(
base_url="http://<gateway>",
api_key="sk-sess-xxxxxxxxxxxx",
)
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "What is 12 * 15 + 3?"}],
)
print(response.choices[0].message.content)
Step 5: Assign a Reward and End the Session#
After the interaction, assign a reward to provide the RL training signal:
curl http://<gateway>/rl/set_reward \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-sess-xxxxxxxxxxxx" \
-d '{"reward": 1.0}'
You can also use the completion ID during agent rollout to set rewards for intermediate steps.
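A per-step reward call differs from the final one only in its interaction_id field. A minimal sketch that builds the request bodies (the completion ID shown is a placeholder; no HTTP is performed here):

```python
# Sketch: building /rl/set_reward bodies for intermediate and final steps.
# The completion ID "chatcmpl-abc123" is a placeholder for the `id` field
# returned by a chat completion during the rollout.
from typing import Optional

def set_reward_payload(reward: float, interaction_id: Optional[str] = None) -> dict:
    """Body for POST /rl/set_reward; interaction_id=None targets the
    most recent interaction in the session."""
    return {"reward": reward, "interaction_id": interaction_id}

step_payload = set_reward_payload(0.5, "chatcmpl-abc123")  # intermediate step
final_payload = set_reward_payload(1.0)                    # last interaction
```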
Then, finish the session with:
curl http://<gateway>/rl/end_session \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-sess-xxxxxxxxxxxx" \
-d '{}'
Step 6: Batched Sampling#
Integrate Steps 3 through 5 into a single bash script, and then run it concurrently with
tools like sbatch. You must call /rl/start_session again to obtain a new API key
for each agent session.
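The per-episode flow can be sketched as an ordered request sequence; a real script would POST each entry to the gateway (the helper below is illustrative, not an AReaL API, and builds bodies only):

```python
# Sketch: the request sequence for one batched-sampling episode
# (start session -> chat -> set reward -> end session).
# Each episode obtains its own API key via /rl/start_session.

def episode_requests(task_id: str, question: str, reward: float) -> list:
    """Ordered (path, body) pairs for one episode. The first call uses
    the admin key; the rest use the session key it returns."""
    return [
        ("/rl/start_session", {"task_id": task_id}),
        ("/chat/completions", {"model": "default",
                               "messages": [{"role": "user", "content": question}]}),
        ("/rl/set_reward", {"reward": reward}),
        ("/rl/end_session", {}),
    ]

reqs = episode_requests("demo-task-0", "What is 12 * 15 + 3?", 1.0)
```

Running many copies of this sequence concurrently (e.g., via sbatch) is what fills AReaL's trajectory buffer.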
After enough data has been accumulated in AReaL’s buffer, AReaL will automatically enter the training stage.
FAQ#
Q: When will the updated model be loaded for inference?
The updated model is loaded after every training step; in other words, inference always serves the latest weights. For model saving and checkpointing, see the CLI reference.
Q: How do I control the submission rate of agent scripts? Will the RL server be overloaded?
AReaL has an internal rate limit, known as staleness control. If too many concurrent requests are submitted, the gateway returns 429 to the client. See the async RL guide for details on staleness control.
Q: Can I use this approach to train OpenClaw?
The approach in this guide differs from training a personalized agent, because:

- OpenClaw assumes single-threaded interaction with the user, so the user cannot open many concurrent sessions that may mutually interfere.
- OpenClaw requires a one-time setup with a fixed URL and API key.

The core usage difference is that the OpenClaw example uses a single fixed API key throughout the interaction. Calling start_session again with that key automatically ends the old session, exports its trajectory for training, and starts a new session under the same key, so no reconfiguration of your application is needed between episodes.
For details of training the OpenClaw agent, see OpenClaw example.
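The refresh pattern above can be sketched in terms of the /rl/start_session request body; the helper below is illustrative, not an AReaL API, and performs no HTTP:

```python
# Sketch: session refresh with a fixed API key (the OpenClaw pattern).
# Passing a previous api_key to /rl/start_session ends the old session,
# exports its trajectory, and starts a new session under the same key.
from typing import Optional

def start_session_body(task_id: str, api_key: Optional[str] = None) -> dict:
    """Body for POST /rl/start_session. api_key=None starts a fresh
    session; a previous key refreshes (end old + start new)."""
    body = {"task_id": task_id}
    if api_key is not None:
        body["api_key"] = api_key
    return body

fresh = start_session_body("openclaw-ep-0")
refresh = start_session_body("openclaw-ep-1", api_key="sk-sess-fixed")
```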
Authentication#
Online mode uses a two-tier authentication system:
| Auth Type | Token | Used For |
|---|---|---|
| Admin API key | `rollout.openai.admin_api_key` from the config | Management endpoints |
| Session API key | Issued by `/rl/start_session` | Session endpoints |
The admin API key is configured in the YAML and protects management endpoints.
The session API key is unique per session and scoped to that session’s interactions.
API Reference#
All endpoints are served by the proxy gateway.
Management Endpoints (Admin Auth)#
POST /rl/start_session#
Start a new session or refresh an existing one.
Request body:
{
"task_id": "my-task-0",
"api_key": null
}
Pass api_key from a previous session to refresh. Omit or set null for a new session.
Response:
{
"session_id": "my-task-0",
"api_key": "sk-sess-xxxxxxxxxxxx"
}
GET /health#
Health check. Returns the number of backend workers.
Session Endpoints (Session Auth)#
POST /chat/completions#
OpenAI-compatible chat completions endpoint. Tokens and log probabilities are automatically recorded.
POST /responses#
OpenAI Responses API endpoint (alternative to chat completions).
POST /v1/messages#
Anthropic Messages API endpoint for Claude-compatible clients.
POST /rl/set_reward#
Assign a reward to an interaction.
Request body:
{
"reward": 1.0,
"interaction_id": null
}
If interaction_id is null, the reward is assigned to the last interaction.
POST /rl/end_session#
Explicitly end a session and export its trajectory. Used in the batched sampling pattern where each sample has its own API key. Not needed when using session refresh.
Error Handling#
| HTTP Code | Meaning | Action |
|---|---|---|
| 200 | Success | - |
| 401 | Missing or invalid authentication | Check your API key |
| 409 | API key already bound to a session | End the existing session first, or use refresh |
| 429 | No capacity available | Retry after a short delay |
| 502 | Backend worker unreachable | Check that the RL service is running |
For HTTP 429 during refresh, the training pipeline may not have cycled yet. Retry after a few seconds (default timeout is 120 seconds).
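A simple client-side retry loop for 429 responses might look like the following sketch, where `call` stands in for whatever HTTP request you are retrying:

```python
# Sketch: retrying on HTTP 429 with a fixed delay, up to a deadline.
# `call` is any function returning an HTTP status code; a real client
# would issue the gateway request inside it.
import time

def retry_on_429(call, delay: float = 2.0, timeout: float = 120.0) -> int:
    """Retry `call` while it returns 429, for at most `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        status = call()
        if status != 429 or time.monotonic() >= deadline:
            return status
        time.sleep(delay)

# Demo with a stubbed call that succeeds on the third attempt.
attempts = iter([429, 429, 200])
result = retry_on_429(lambda: next(attempts), delay=0.0)
```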
How Training Works#
Training runs asynchronously under the hood:
1. External applications interact with the model through the gateway.
2. Each session's interactions are recorded with token-level data.
3. When a session ends (via refresh or explicit end), its trajectory is exported.
4. Once enough trajectories are collected (controlled by `train_dataset.batch_size`), AReaL performs a training step.
5. Updated model weights are transparently served to subsequent sessions.
The model improves silently as you collect more episodes. For details on asynchronous training and staleness control, see the Asynchronous RL Guide.
Configuration Reference#
All online mode settings live under rollout.openai:
rollout:
openai:
mode: online # Required: set to "online"
admin_api_key: "areal-admin-key" # Admin key for management endpoints
session_timeout_seconds: 3600 # Session timeout in seconds
turn_discount: 1.0 # Reward discount for multi-turn conversations
export_style: individual # "individual" or "concat"
| Field | Default | Description |
|---|---|---|
| `mode` | | Must be `online` to enable online mode |
| `admin_api_key` | | Admin API key (change in production!) |
| `session_timeout_seconds` | `3600` | Auto-cleanup stale sessions after this many seconds |
| `turn_discount` | `1.0` | Geometric discount for multi-turn rewards |
| `export_style` | `individual` | How to export interactions for training (`individual` or `concat`) |
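To illustrate what a geometric turn discount means, the hypothetical sketch below propagates a final reward backwards so that a turn k steps before the last is weighted by `turn_discount ** k`. This is our illustration of the general technique, not AReaL's actual export logic:

```python
# Hypothetical sketch of a geometric multi-turn discount; NOT AReaL's
# actual implementation. A final reward R is propagated backwards so
# that a turn k steps before the last receives R * discount**k.

def discounted_turn_rewards(final_reward: float, num_turns: int,
                            discount: float) -> list:
    """Per-turn rewards for a num_turns conversation, earliest first."""
    return [final_reward * discount ** (num_turns - 1 - t)
            for t in range(num_turns)]

rewards = discounted_turn_rewards(1.0, 3, 0.9)  # earlier turns discounted more
```

With `discount=1.0` (the default shown above), every turn receives the full final reward.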
Limitations#
- Scheduler compatibility: Online mode requires the `local` or `slurm` scheduler. The `ray` scheduler is not supported.
- Single-controller mode: Online mode only works in single-controller mode (`scheduler.type=local` or `scheduler.type=slurm`).
See Also#
OpenClaw Example - Complete end-to-end example with ZeroClaw
Agentic RL Tutorial - Agent framework integration (inline/subproc modes)
Custom Agent Workflows - Creating custom agent workflows
Agent Workflow Reference - Internal architecture details