Lamplight at the third watch, cockcrow at the fifth: those are the hours for study. Black-haired youth neglects to learn early; white-haired age regrets reading too late. — Yan Zhenqing

⚡ AReaL: turning "RL training for large models" into a highway (with clear signposts)

Some projects open with a "don't touch me, I'm complicated" posture. AReaL is different: it behaves more like a well-organized training dispatcher for a complex system. Give it a config file and a script, and it runs reinforcement learning training for reasoning and agentic models, fully asynchronous, built for large scale, and designed to stay easy to modify and extend.

The repository description gets straight to the point: Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.


1. What is AReaL, and what problem does it solve?

In the README, AReaL introduces itself very clearly:

  • It is an open-source, fully asynchronous RL training system
  • It targets reasoning models and agentic models
  • It builds on the open-source project ReaLHF, and emphasizes open-sourcing the training details, data, and infrastructure needed to reproduce its results
  • Its goal: let everyone train their own AI agents easily and at low cost
  • It even writes a "milk tea philosophy" into the README: tasty, customizable, affordable, and it hopes the system feels just as smooth to use
```markdown
<h1 align="center">
<em>AReaL</em>: A Large-Scale Asynchronous Reinforcement Learning System
</h1>

...

AReaL is an open-source **fully asynchronous** reinforcement learning training system
for large **reasoning and agentic models**, developed by members from Tsinghua IIIS and
the AReaL Team at Ant Group. Built upon the open-source project
[ReaLHF](https://github.com/openpsi-project/ReaLHF), we are fully committed to
open-source principles by providing the training details, data, and infrastructure
required to reproduce our results, along with the models themselves. AReaL aims to help
everyone build their own AI agents easily and affordably. ...
```

2. Where does the speed come from? Asynchronous training is not a gimmick; it is an architectural commitment

If you have looked at RLHF / online RL systems, you know that rollout (generation) and train (training) constantly bottleneck each other:

  • Generation is too slow, and training starves for data
  • Training is too slow, and generated samples pile up
  • With multiple nodes and GPUs, scheduling becomes a headache

AReaL chose to go fully asynchronous: decouple the entire pipeline so that both training and generation run at higher throughput, and run stably.

The blog posts in the repository state this direction plainly: AReaL v0.3 highlights an "asynchronous RL training pipeline with system and algorithm co-design" and lists a 2.77x speedup as one of its major milestones.
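The decoupling can be pictured with a toy producer/consumer sketch (this is an illustration of the asynchronous idea, not AReaL's implementation): rollout workers push finished trajectories into a bounded buffer while the trainer consumes them, so neither side blocks waiting for the other's full batch.

```python
import queue
import threading

# Bounded buffer between generation and training (toy example).
buffer: queue.Queue = queue.Queue(maxsize=8)

def rollout_worker(n: int) -> None:
    for i in range(n):
        buffer.put(f"traj-{i}")  # generation proceeds at its own pace

def trainer(n: int, out: list) -> None:
    for _ in range(n):
        out.append(buffer.get())  # train on whatever is ready

collected: list = []
gen = threading.Thread(target=rollout_worker, args=(6,))
trn = threading.Thread(target=trainer, args=(6, collected))
gen.start(); trn.start()
gen.join(); trn.join()
print(len(collected))  # all 6 trajectories consumed
```

The real system adds staleness control and weight synchronization on top of this basic shape, but the core win is the same: no global barrier between the two stages.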

```markdown
<em>AReaL</em> v0.3: SOTA Coding Models with 2.77x Faster Asynchronous RL Training

## Introduction

We now release AReaL v0.3, featuring three major milestones:

- **A fully asynchronous RL training pipeline with system and RL algorithm co-design**,
achieving over 2.77x speedup without any performance drop
```

3. "Flexible" is not a slogan: AReaL designs replaceability into its skeleton

In the README's Highlights, the "Flexibility" item is very concrete:

  • Want agentic RL? There is a tutorial
  • Want online RL training? There is a complete example
  • It even boils the integration down to one thing: connect to the RL service by replacing `base_url` (keeping heavy dependencies and code changes to a minimum)
```markdown
**[2026/03/02]** We provide [a complete example](./examples/openclaw/) to train your
own 🦞 OpenClaw agent by simply replacing the `base_url` and `api_key` with AReaL's RL
service - no complicated dependencies, no code changes, works with any agentic runtime!
```

The subtext of this design: you keep your own agent runtime and your own business toolchain; as long as the RL training side is aligned with AReaL's service and configuration system, the whole setup runs.
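A minimal sketch of the "swap the base_url" idea (the URLs, keys, and helper below are hypothetical; the real endpoint comes from your AReaL deployment): the agent runtime keeps its existing OpenAI-style client code path, and only the endpoint it talks to changes between serving and RL training.

```python
def make_client_config(base_url: str, api_key: str) -> dict:
    """Connection config an agentic runtime would hand to its LLM client."""
    return {"base_url": base_url.rstrip("/"), "api_key": api_key}

# Normal serving endpoint (hypothetical):
serving = make_client_config("https://api.example.com/v1", "sk-serving")

# Training time: the same runtime, pointed at AReaL's RL service instead
# (hypothetical address):
training = make_client_config("http://areal-rl-service:8000/v1", "sk-train")

assert serving.keys() == training.keys()  # identical client code, new endpoint
```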


4. Running a GSM8K training from the Quickstart: AReaL's on-ramp is clearly marked

The AReaL docs include a dedicated Quickstart: run an experiment with GSM8K + GRPO + function-based rewards, and list the minimal pieces you need:

  • Training script: examples/math/gsm8k_rl.py
  • Config file: examples/math/gsm8k_grpo.yaml
```markdown
# Quickstart

Welcome to the **AReaL** Quickstart Guide! This guide demonstrates how to run an AReaL
experiment training an LLM on the GSM8K dataset using the GRPO algorithm with
function-based rewards.
...
python3 examples/math/gsm8k_rl.py --config examples/math/gsm8k_grpo.yaml scheduler.type=local experiment_name=<your experiment name> trial_name=<your trial name>
```
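The command mixes a YAML config with dotted `key.sub=value` overrides such as `scheduler.type=local`. As an illustration (this is not AReaL's actual CLI parser), such overrides can be folded into a nested config dict on top of the file:

```python
def apply_overrides(config: dict, overrides: list) -> dict:
    """Fold 'a.b=c' style CLI overrides into a nested dict, in order."""
    for item in overrides:
        dotted, _, value = item.partition("=")
        node = config
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})  # create intermediate sections
        node[leaf] = value
    return config

cfg = apply_overrides({}, ["scheduler.type=local", "experiment_name=demo"])
assert cfg == {"scheduler": {"type": "local"}, "experiment_name": "demo"}
```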

4.1 What does the training script look like? (very framework-style, very clean)

The style of examples/math/gsm8k_rl.py is refreshingly clean: parse the config, prepare the data, hand training to the Trainer, and hand the process to a Workflow.

```python
from areal import PPOTrainer
from areal.api.cli_args import GRPOConfig, load_expr_config
from areal.dataset import get_custom_dataset
from areal.utils.hf_utils import load_hf_tokenizer

...
with PPOTrainer(
    config,
    train_dataset=train_dataset,
    valid_dataset=valid_dataset,
) as trainer:
    trainer.train(
        workflow="areal.workflow.rlvr.RLVRWorkflow",
        workflow_kwargs=workflow_kwargs,
        eval_workflow="areal.workflow.rlvr.RLVRWorkflow",
        eval_workflow_kwargs=eval_workflow_kwargs,
    )
```
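The trainer-as-context-manager pattern is worth noting: resources are set up on enter and released on exit, even if training raises. A mock sketch of the lifecycle (schematic; the real `PPOTrainer` lives in the `areal` package):

```python
class MockTrainer:
    """Toy stand-in showing the enter/train/exit lifecycle."""

    def __init__(self, config: dict):
        self.config = config
        self.events: list = []

    def __enter__(self):
        self.events.append("setup")  # e.g. spawn workers, load weights
        return self

    def __exit__(self, exc_type, exc, tb):
        self.events.append("teardown")  # always runs, even on error
        return False

    def train(self, workflow: str, **kwargs) -> None:
        self.events.append(f"train:{workflow}")

with MockTrainer(config={}) as t:
    t.train(workflow="areal.workflow.rlvr.RLVRWorkflow")

assert t.events == ["setup", "train:areal.workflow.rlvr.RLVRWorkflow", "teardown"]
```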

4.2 Running it more like a cluster: switch between Ray and Slurm in one flag

The Quickstart explicitly provides the distributed entry points: scale out to multi-node training via scheduler.type=ray or scheduler.type=slurm.

```markdown
## Distributed Experiments with Ray or Slurm

# Launch with Ray scheduler. 4 nodes (4 GPUs each), 3 nodes for generation, 1 node for training.
python3 examples/math/gsm8k_rl.py \
--config examples/math/gsm8k_grpo.yaml \
scheduler.type=ray \
experiment_name=<your experiment name> \
trial_name=<your trial name>
```
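The comment in the excerpt implies a concrete GPU split, and because the pipeline is asynchronous the two pools work concurrently on disjoint GPUs. Back-of-envelope:

```python
# 4 nodes with 4 GPUs each; 3 nodes serve generation, 1 node trains.
nodes, gpus_per_node = 4, 4
gen_nodes, train_nodes = 3, 1

gen_gpus = gen_nodes * gpus_per_node      # GPUs producing rollouts
train_gpus = train_nodes * gpus_per_node  # GPUs updating the policy

assert gen_gpus + train_gpus == nodes * gpus_per_node
print(gen_gpus, train_gpus)  # 12 4
```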

5. The training backend is not one-size-fits-all either: Archon / Megatron / FSDP, combined by scenario

The AReaL docs introduce Archon specifically: a PyTorch-native training engine that emphasizes flexibility, ease of RL research and optimization, and easier debugging of distributed training problems.

```markdown
# Archon: PyTorch-Native Training Engine

Archon is AReaL's PyTorch-native training backend that provides maximum flexibility for
RL researchers without Megatron-Core dependencies. It supports full 5D parallelism ...
```

Enabling it is just as direct: write Archon into `allocation_mode`:

To use Archon as your training backend, specify it in the `allocation_mode`:

```bash
allocation_mode=sglang:d4+archon:d4
```
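The string packs a lot of structure into a little syntax. An illustrative parse (not AReaL's actual code; reading "d4" as data-parallel degree 4 is an assumption for this sketch): "+" separates components, and each component is "&lt;engine&gt;:&lt;layout&gt;".

```python
def parse_allocation_mode(mode: str) -> dict:
    """Split 'engine:layout+engine:layout' into a mapping (illustrative)."""
    return {
        engine: layout
        for engine, _, layout in (part.partition(":") for part in mode.split("+"))
    }

assert parse_allocation_mode("sglang:d4+archon:d4") == {
    "sglang": "d4",  # inference engine and its parallel layout
    "archon": "d4",  # training backend and its parallel layout
}
```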
---

## 6. Not just math: AReaL takes "agents + tool calling" training seriously

If you have built multi-turn agents, you know tool calling is hard to train: the model must initiate tool calls, stop at the right moment, consume the tool's output, and bind the reward to the entire trajectory.

AReaL provides a complete template in `examples/tir`: a **Tool-Integrated Reasoning (TIR) Agent**. It explains the architecture and lays the tool-call format and sample outputs out in the open.

```markdown name=examples/tir/README.md url=https://github.com/inclusionAI/AReaL/blob/412d22411ce6cc48c2f776a0910c6fb2834bd1e0/examples/tir/README.md#L1-L59
# Tool-Integrated Reasoning (TIR) Agent

This project implements a Tool-Integrated Reasoning agent ...
trained end-to-end through reinforcement learning.

... TIRWorkflow (`tir_workflow.py`) ...
- Inherits from AReaL's `RolloutWorkflow` base class
- Supports multi-turn tool calling reasoning processes
...
```

An example of the tool-call format (you can align your own agent runtime to it directly):

Mathematical calculation

```python
1 + 2 * 3
```

```python
output: 6
```
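The multi-turn mechanics can be sketched as a toy episode loop (schematic; the real call format and parsing live in examples/tir, and the "CALL:" marker plus the eval-based executor below are illustrative inventions): the model emits text that may contain a tool call, the runtime executes it, and the result is appended back into the trajectory the reward will later score.

```python
def run_episode(model_steps: list, execute_tool, max_turns: int = 4) -> list:
    """Interleave model turns with tool results into one trajectory."""
    trajectory = []
    for step in model_steps[:max_turns]:
        trajectory.append(("model", step))
        if step.startswith("CALL:"):          # hypothetical call marker
            result = execute_tool(step[len("CALL:"):])
            trajectory.append(("tool", result))
    return trajectory

traj = run_episode(
    ["CALL:2 * 3", "So the answer is 6."],
    execute_tool=lambda expr: f"output: {eval(expr)}",  # toy, unsandboxed
)
assert ("tool", "output: 6") in traj
```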
---

## 7. Installation and environment: AReaL writes "verifiability" into the process

The installation docs give a clear recommended path for dependency management: use **uv**, and install optional capability sets via extras. The Chinese installation doc spells it out:

```markdown name=docs/zh/tutorial/installation.md url=https://github.com/inclusionAI/AReaL/blob/412d22411ce6cc48c2f776a0910c6fb2834bd1e0/docs/zh/tutorial/installation.md#L71-L92
uv sync --extra cuda
# or, without CUDA support
# uv sync
...
```

The English installation doc also shows how to run the official self-check script:

```bash
uv run python3 areal/tools/validate_installation.py
```

The script itself states what it does: verify that the dependencies and CUDA extensions are functional.

```python
"""
Dynamic Installation Validation Script for AReaL

This script validates that all dependencies listed in pyproject.toml are properly
installed with correct versions and that CUDA extensions are functional.
"""
```
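The gist of such a self-check, in miniature (illustrative only, not the real script): confirm each package is importable and meets a minimum version, failing cleanly instead of raising.

```python
from importlib.metadata import PackageNotFoundError, version

def check_min_version(pkg: str, minimum: str) -> bool:
    """True if pkg is installed and at least `minimum` (naive numeric compare)."""
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        return False  # missing package is a clean failure, not a crash
    as_tuple = lambda v: tuple(int(p) for p in v.split(".")[:3] if p.isdigit())
    return as_tuple(installed) >= as_tuple(minimum)

# A missing package fails the check cleanly:
assert check_min_version("definitely-not-installed-xyz", "1.0") is False
```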

8. A minimal path you can follow directly (organized per the docs)

8.1 Install dependencies (uv)

```bash
# Full capabilities with CUDA (Linux + CUDA environment)
uv sync --extra cuda

# Optional: run the installation self-check
uv run python3 areal/tools/validate_installation.py
```

8.2 Run the GSM8K GRPO example on a single machine

```bash
python3 examples/math/gsm8k_rl.py \
--config examples/math/gsm8k_grpo.yaml \
scheduler.type=local \
experiment_name=<your experiment name> \
trial_name=<your trial name>
```

8.3 Scale out to multiple nodes (Ray / Slurm)

scheduler.type 改为 rayslurm,并按 Quickstart 示例补齐集群参数即可。


9. Entry links (start exploring)