
7 posts tagged with "Release"


ABench: An Evolving Open-Source Benchmark

· 2 min read
inclusionAI
Ant Group
GitHub

🌟 Overview

ABench is an evolving open-source benchmark suite designed to rigorously evaluate and enhance Large Language Models (LLMs) on complex cross-domain tasks. By targeting current model weaknesses, ABench provides systematic challenges in high-difficulty specialized domains, including physics, actuarial science, logical reasoning, law, and psychology.

🎯 Core Objectives

  1. Address Evaluation Gaps: Design high-differentiation assessment tasks targeting underperforming question types
  2. Establish Unified Standards: Create reliable, comparable benchmarks for multi-domain LLM evaluation
  3. Expand Capability Boundaries: Drive continuous optimization of knowledge systems and reasoning mechanisms through challenging innovative problems

📊 Dataset Release Status

| Domain | Description | Status |
| --- | --- | --- |
| Physics | 500 university/competition-level physics problems (400 static + 100 dynamic parametric variants) covering 10+ fields from classical mechanics to modern physics | ✅ Released |
| Actuary | Curated actuarial exam problems covering core topics: probability statistics, financial mathematics, life/non-life insurance, actuarial models, and risk management | ✅ Released |
| Logic | High-differentiation logical reasoning problems from authoritative tests (LSAT/GMAT/GRE/SBI/Chinese Civil Service Exam) | 🔄 In Preparation |
| Psychology | Psychological case studies and research questions (objective/subjective) evaluating understanding of human behavior and theories | 🔄 In Preparation |
| Law | Authoritative judicial exam materials covering core legal domains: criminal/civil/administrative/procedural/international law | 🔄 In Preparation |

Ling: A MoE LLM Provided and Open-sourced by InclusionAI

· 11 min read
inclusionAI
Ant Group

🤗 Hugging Face | 🤖 ModelScope

Introduction

Ling is a MoE LLM provided and open-sourced by InclusionAI. We introduce two sizes: Ling-lite and Ling-plus. Ling-lite has 16.8 billion parameters with 2.75 billion activated parameters, while Ling-plus has 290 billion parameters with 28.8 billion activated parameters. Both models demonstrate competitive performance against existing industry models.

The MoE architecture scales up and down easily and adapts to different workloads, so users can apply these models to a wide range of tasks, from natural language processing to complex problem solving. Furthermore, the open-source nature of Ling promotes collaboration and innovation within the AI community, fostering a diverse range of use cases and enhancements.

As more developers and researchers engage with the platform, we can expect rapid advancements and improvements, leading to even more sophisticated applications. This collaborative approach accelerates development and ensures that the models remain at the forefront of technology, addressing emerging challenges in various fields.

Update

  • [2025-05-10] Ling-lite-1.5 has been released! It achieves significant progress in reasoning ability compared with the previous Ling-lite.
  • [2025-04-15] Ling-lite is upgraded to Ling-lite-0415. The new model demonstrates notable improvements over its predecessor, Ling-lite-0220, especially on code and math.

Model Downloads

Refer to the following table for the parameters of each variant and choose the one that fits your use case. If you are located in mainland China, we also provide the models on ModelScope.cn to speed up downloads.

| Model | #Total Params | #Activated Params | Context Length | Download |
| --- | --- | --- | --- | --- |
| Ling-lite-base-1.5 | 16.8B | 2.75B | 128K | 🤗 HuggingFace · 🤖 ModelScope |
| Ling-lite-1.5 | 16.8B | 2.75B | 128K | 🤗 HuggingFace · 🤖 ModelScope |
| Ling-plus-base | 290B | 28.8B | 64K | 🤗 HuggingFace · 🤖 ModelScope |
| Ling-plus | 290B | 28.8B | 64K | 🤗 HuggingFace · 🤖 ModelScope |
| Ling-coder-lite-base | 16.8B | 2.75B | 16K | 🤗 HuggingFace · 🤖 ModelScope |
| Ling-coder-lite | 16.8B | 2.75B | 16K | 🤗 HuggingFace · 🤖 ModelScope |

Note: If you are interested in previous versions, please visit the past model collections on Hugging Face or ModelScope.

Evaluation

Ling-lite

Standard Benchmarks

| Benchmark | #shots | Ling-lite-1.5 | Ling-lite | Qwen3-4B-Instruct | Qwen3-8B-Instruct | Moonlight-16B-A3B-Instruct | LLaMA3.1-8B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU (EM) | 5 | 74.33 | 71.27 | 70.09 | 75.97 | 70.74 | 68.67 |
| GPQA (Pass@1) | 0 | 36.55 | 29.73 | 40.4 | 47.10 | 19.51 | 27.59 |
| HumanEval (Pass@1) | 0 | 87.27 | 84.38 | 81.94 | 85.29 | 72.94 | 67.23 |
| LiveCodeBench 2408-2502 (Pass@1) | 0 | 22.7 | 18.94 | 21.8 | 26.88 | 14.76 | 18.41 |
| LCBench (pass@1) | 0 | 60.37 | 46.57 | 48.61 | 60.03 | 28.39 | 23.13 |
| Math (EM) | 0 | 82.62 | 72.80 | 81.46 | 82.70 | 67.1 | 52.42 |
| AIME2024 (pass@1) | 0 | 21.88 | 10.21 | 20.62 | 26.25 | 6.88 | 7.29 |
| OlympiadBench (pass@1) | 0 | 52.30 | 36.44 | 54.33 | 56.11 | 32.85 | 17.04 |
| BBH (EM) | 0 | 75.75 | 66.38 | 78.21 | 79.33 | 63.45 | 68.05 |
| IFEval (Prompt Strict) | 0 | 77.70 | 77.99 | 81.06 | 83.55 | 49.01 | 73.01 |
| BFCL_live | 0 | 72.15 | 67.93 | 65.35 | 69.83 | 47.14 | 49.98 |

Context Window

Evaluation results on the Needle In A Haystack (NIAH) tests. Ling-lite-1.5 has improved long text generation capability and performs well across most context window lengths up to 128K.

Quickstart

🤗 Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-lite-1.5"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

🤖 ModelScope

If you're in mainland China, we strongly recommend using the model from 🤖 ModelScope.

Deployment

vLLM

vLLM supports offline batched inference as well as launching an OpenAI-compatible API server for online inference.

Environment Preparation

Since our patch has not yet been submitted upstream to the vLLM community, please prepare the environment by following the steps below:

git clone -b v0.7.3 https://github.com/vllm-project/vllm.git
cd vllm
git apply Ling/inference/vllm/bailing_moe.patch
pip install -e .

Offline Inference:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ling-lite-1.5")

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)

llm = LLM(model="inclusionAI/Ling-lite-1.5", dtype='bfloat16')
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)

Online Inference:

vllm serve inclusionAI/Ling-lite \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--use-v2-block-manager \
--gpu-memory-utilization 0.90

To handle long context in vLLM using YaRN, we need to follow these two steps:

  1. Add a rope_scaling field to the model's config.json file, for example:

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}

  2. Use an additional parameter --max-model-len to specify the desired maximum context length when starting the vLLM service.

For detailed guidance, please refer to the vLLM instructions.

MindIE

This section outlines the primary process for running a Ling MoE model on the specified hardware with the MindIE inference framework.

Configuration preparation

Create a model directory on the host for the downloads, for example /root/models; it will be mounted into the Docker container later.

Download the MindIE-related configuration from GitHub:

cd /root/models
git clone git@github.com:inclusionAI/Ling.git

Machine network environment check

# Check the physical link
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Check the links
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check your network health
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# Check whether the detected IP address is correctly configured
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# Check whether the gateway is configured correctly
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# Check that the NPU's low-level TLS verification settings are consistent; all values should be 0
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
# Set the NPU's low-level TLS verification to 0
for i in {0..7}; do hccn_tool -i $i -tls -s enable 0; done

Pull the image

Go to Ascend Community / Development Resources and pull the MindIE image.

Image version: 1.0.0-800I-A2-py311-openeuler24.03-lts

The versions of each component are as follows:

| Component | Version |
| --- | --- |
| MindIE | 1.0.0 |
| CANN | 8.0.0 |
| PTA | 6.0.0.beta1 |
| HDK | 24.1.0 |

Container startup and configuration changes

Start the container

Execute the following startup command (for reference; replace the container name and the image tag to match your setup):

docker run -itd --privileged --name=<container_name> --net=host \
--shm-size 500g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device /dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/models:/home/HwHiAiUser/Ascend \
mindie:1.0.0-XXX-800I-A2-arm64-py3.11 \
bash

Download the model

Here we use ModelScope to download the models; install ModelScope first:

pip install modelscope

Download the model:

# The models take a long time to download, so run the downloads in the background
nohup modelscope download --model inclusionAI/Ling-plus --local_dir /home/HwHiAiUser/Ascend/Ling_plus > /tmp/ling_plus.log 2>&1 &

nohup modelscope download --model inclusionAI/Ling-plus-base --local_dir /home/HwHiAiUser/Ascend/Ling_plus_base > /tmp/ling_plus_base.log 2>&1 &

nohup modelscope download --model inclusionAI/Ling-lite --local_dir /home/HwHiAiUser/Ascend/Ling_lite > /tmp/ling_lite.log 2>&1 &

nohup modelscope download --model inclusionAI/Ling-lite-base --local_dir /home/HwHiAiUser/Ascend/Ling_lite_base > /tmp/ling_lite_base.log 2>&1 &

After the download completes, you need to change the file permissions, otherwise an error will be reported when MindIE-Service starts:

chmod -R 750 *.json *.py

Model weight format conversion

This section applies only to the Ling Lite models; Ling Plus models can skip it.

MindIE supports weights in the safetensors format. If the downloaded weights are not in safetensors format, they need to be converted first. Taking Ling Lite as an example, the conversion commands are as follows:

# Convert Ling lite
python /home/HwHiAiUser/Ascend/Ling/inference/mindie/convert_bin_to_safetensor.py

cd /home/HwHiAiUser/Ascend/Ling_lite
cp README.md configuration.json config.json special_tokens_map.json modeling_bailing_moe.py tokenizer.json tokenizer_config.json ../Ling_lite_safetensor/

# Convert Ling lite base
python /home/HwHiAiUser/Ascend/Ling/inference/mindie/convert_bin_to_safetensor_base.py

cd /home/HwHiAiUser/Ascend/Ling_lite_base
cp README.md configuration.json config.json special_tokens_map.json modeling_bailing_moe.py tokenizer.json tokenizer_config.json ../Ling_lite_base_safetensor/

After conversion, load the Ling Lite model from /home/HwHiAiUser/Ascend/Ling_lite_safetensor and the Ling Lite Base model from /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor.

Change the model configuration

MindIE cannot load the default model configuration file (config.json) directly; it needs to be replaced:

# Adapt to mindie's Ling lite model configuration
cp /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/model_chat_config.json /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_lite_safetensor/config.json

# Adapt to mindie's Ling lite base model configuration
cp /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/model_base_config.json /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_lite_base_safetensor/config.json

# Adapt to mindie's Ling plus model configuration
cp /home/HwHiAiUser/Ascend/Ling_plus/config.json /home/HwHiAiUser/Ascend/Ling_plus/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/plus/model_chat_config.json /home/HwHiAiUser/Ascend/Ling_plus/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_plus/config.json

# Adapt to mindie's Ling plus base model configuration
cp /home/HwHiAiUser/Ascend/Ling_plus_base/config.json /home/HwHiAiUser/Ascend/Ling_plus_base/config.json.bak
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/plus/model_base_config.json /home/HwHiAiUser/Ascend/Ling_plus_base/config.json
chmod 750 /home/HwHiAiUser/Ascend/Ling_plus_base/config.json

Execute the shell script that adapts MindIE to the Ling models:

bash /home/HwHiAiUser/Ascend/Ling/inference/mindie/patch_atb_llm.sh

Single-machine service-based inference (Ling Lite)

Set the underlying environment variables:

source /usr/local/Ascend/atb-models/set_env.sh

Set the MindIE configuration according to the model type:

# Ling Lite
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/config.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

# Ling Lite base
cp /home/HwHiAiUser/Ascend/Ling/inference/mindie/lite/config.base.json /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

Start the mindie service:

chmod 640 /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

cd $MIES_INSTALL_PATH
nohup ./bin/mindieservice_daemon > /tmp/service.log 2>&1 &

Check /tmp/service.log for the line Daemon start success!; if present, MindIE-Service has started successfully.

Test that requests are served correctly:

# Chat model
wget -O- --post-data="{\"messages\":[{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"Who are you?\"}], \"stream\": false, \"max_tokens\":100, \"model\": \"bailing_moe\", \"temperature\":0}" \
--header='Content-Type:application/json' \
'http://127.0.0.1:1025/v1/chat/completions'

# base model

wget -O- --post-data='{"inputs":"My name is Olivier and I","stream":false,"parameters":{"temperature":1,"max_new_tokens":100,"do_sample":false}}' \
--header='Content-Type:application/json' \
'http://127.0.0.1:1025/infer'

Multi-machine service-based inference (Ling plus)

All of the following commands need to be executed simultaneously on all machines.

To enable multi-machine service-based inference, you need to configure a multi-machine ranktable file.

  • Get the

Agentic Learning

· 4 min read
inclusionAI
Ant Group

Introduction

Agents exhibit powerful capabilities by interacting with the external environment and making decisions based on the feedback they receive from it. For complex problems, an agent often needs multiple turns of interaction with the environment to reach a solution. The complexity and dynamism of environments, coupled with the necessity for multi-turn interactions, pose numerous challenges for training agents.

We introduce AgenticLearning, an open-source agent training paradigm designed to empower researchers to train and evaluate autonomous agents effectively. AgenticLearning offers a framework for multi-turn interactions with the environment, enabling models to learn how to interact with the environment and make decisions based on its feedback, thereby enhancing the models' ability to leverage the environment to solve complex problems.
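
The multi-turn loop at the heart of this paradigm can be sketched in a few lines. The environment, policy, and episode driver below are hypothetical toy stand-ins, not AgenticLearning's actual interfaces; in a real system the policy would query an LLM with the dialogue history.

```python
# Minimal sketch of a multi-turn agent-environment loop (hypothetical toy
# interfaces): the agent acts, receives environment feedback, and conditions
# its next decision on that feedback until the episode terminates.

class CountdownEnv:
    """Toy environment: the episode ends when the counter reaches zero."""
    def __init__(self, start: int):
        self.state = start

    def step(self, action: int):
        self.state -= action
        done = self.state <= 0
        feedback = f"remaining={self.state}"
        return feedback, done

def toy_policy(feedback: str) -> int:
    # Placeholder decision rule; a real agent would call an LLM here.
    return 2 if "remaining" in feedback else 1

def run_episode(env: CountdownEnv, max_turns: int = 10):
    trajectory = []
    feedback, done = "start", False
    for _ in range(max_turns):
        action = toy_policy(feedback)
        feedback, done = env.step(action)
        trajectory.append((action, feedback))
        if done:
            break
    return trajectory

trajectory = run_episode(CountdownEnv(start=5))
```

The trajectory of (action, feedback) pairs collected this way is exactly the kind of multi-turn rollout that an agentic RL trainer consumes.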

| Advancements | Models | Tools | Environment | Training Framework |
| --- | --- | --- | --- | --- |
| RAG-R1 | Qwen2.5-7b-instruct | offline retrieval, online search | AWorld | LLaMA-Factory, verl, AReaL |
| FunReason | Qwen2.5-7b-Coder-instruct | BFCL | AWorld | LLaMA-Factory, verl |

News

[2025/07/01] 🔥🔥🔥RAG-R1 We propose RAG-R1, a deepsearch training framework that incentivizes the search and reasoning capabilities of LLMs through multi-query parallelism.

[2025/05/16] 🔥🔥🔥FunReason We propose FunReason, a novel framework that enhances LLMs' function calling capabilities through an automated data refinement strategy and a Self-Refinement Multiscale Loss approach.

Advancements

Deepsearch

RAG-R1

  • Tools: Search Engines (offline or online)
  • LLM: Qwen2.5-7b-instruct

RAG-R1-framework

Overall framework of RAG-R1.

RAG-R1-result

Performance comparisons on QA benchmarks under the EM metric. The best and second best results are bold and underlined, respectively.

FunctionCall

FunReason

  • Tools: Real Human Function calling (BFCLv2 live&non-live)
  • LLM: Qwen2.5-7b-Coder-instruct

FunReason is a framework designed to enhance LLMs' function calling capabilities, achieving GPT-4o-comparable performance on BFCL, surpassing RL-based methods, mitigating catastrophic forgetting on HumanEval and MBPP, and using a data refinement strategy where natural CoT data outperforms artificial ones.

FunReason-Performance

Data refinement pipeline of FunReason.

Overview of FunReason's data refinement pipeline. The pipeline consists of five stages: Function Call Classification, Query and Tool Identification, CoT Identification, Function and Parameter Identification, and Format Identification. Each stage ensures specific aspects of data quality, with failing examples either being discarded or regenerated.
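
The staged refinement described above can be sketched as a chain of validators, where an example survives only if every stage accepts it. The five stage names follow the text; the check logic and example fields below are placeholders, not FunReason's actual implementation.

```python
# Illustrative sketch of a staged data-refinement pipeline: each stage is a
# predicate over an example; failing examples are routed to a discard pile
# (in FunReason these would be discarded or regenerated).

STAGES = [
    ("function_call_classification", lambda ex: ex.get("is_function_call", False)),
    ("query_tool_identification", lambda ex: "query" in ex and "tools" in ex),
    ("cot_identification", lambda ex: bool(ex.get("cot"))),
    ("function_param_identification", lambda ex: bool(ex.get("call", {}).get("name"))),
    ("format_identification", lambda ex: ex.get("format_ok", True)),
]

def refine(examples):
    kept, discarded = [], []
    for ex in examples:
        failed = next((name for name, check in STAGES if not check(ex)), None)
        (discarded if failed else kept).append(ex)
    return kept, discarded

good = {"is_function_call": True, "query": "q", "tools": ["t"],
        "cot": "think...", "call": {"name": "get_weather"}, "format_ok": True}
bad = {"is_function_call": True, "query": "q", "tools": ["t"],
       "cot": "", "call": {"name": "get_weather"}}  # fails CoT identification

kept, discarded = refine([good, bad])
```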

FunReason-Performance

Performance of FunReason.

Citation

Please cite our repository if our work is helpful for your research.

@article{RAG-R1,
  title   = {RAG-R1: Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism},
  author  = {Zhiwen Tan and Jiaming Huang and Qintong Wu and Hongxuan Zhang and Chenyi Zhuang and Jinjie Gu},
  journal = {arXiv preprint arXiv:2507.02962},
  year    = {2025}
}

@article{FunReason,
  title   = {FunReason: Enhancing Large Language Models' Function Calling via Self-Refinement Multiscale Loss and Automated Data Refinement},
  author  = {Bingguang Hao and Maolin Wang and Zengzhuang Xu and Cunyin Peng and Yicheng Chen and Xiangyu Zhao and Jinjie Gu and Chenyi Zhuang},
  journal = {arXiv preprint arXiv:2505.20192},
  year    = {2025}
}

Contact

For any questions or feedback, please reach out to us at ender.tzw@antgroup.com or chenyi.zcy@antgroup.com.

License

This project is licensed under the MIT License - see the LICENSE file for details.

AReaL: Ant Reasoning Reinforcement Learning for LLMs

· 11 min read
inclusionAI
Ant Group

| Paper | Documentation | Ask DeepWiki | 🤗 Models & Data | WeChat Group |

AReaL (Ant Reasoning RL) is an open-source, fully asynchronous reinforcement learning training system for large reasoning models, developed at the RL Lab, Ant Research. Built upon the open-source project ReaLHF, we are fully committed to open source by providing the training details, data, and infrastructure required to reproduce results, along with the models themselves. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea because it's delicious, customizable, and affordable. We hope you enjoy our project just like how you enjoy real-world milk tea (cheers).

AReaL Highlights

  • 🔥 [NEW] Asynchronous RL: With algorithm-system co-design, AReaL supports fully asynchronous RL for the fastest training! Experimental support for multi-turn agentic RL is also provided.
  • 🛠️ Open & Reproducible: We continuously release all code, datasets, and training recipes for RL training of LLMs.
  • 🚀 Scalability: AReaL can seamlessly adapt to different computational resource settings, ranging from a single node to 1K GPUs.
  • 🔪 Cutting-Edge Performance: AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.

News

[2025/06/03] (v0.3, boba²) We release boba² (double-boba) for fully asynchronous RL training, which achieves a 2.77x speedup while obtaining on-par or even better training performance compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out our v0.3 overview blog and the research paper.

[2025/03/31] (v0.2, boba) Here comes our next milestone release - boba! Please call it A-ReaL-boba! This release includes much faster training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our v0.2 technical blog.

[2025/02/24] (v0.1) Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our v0.1 technical blog.

Release Highlights

In our AReaL-boba² (A-ReaL-double-boba) release, we highlight the top 3 most important features:

  • A fully asynchronous RL training pipeline with system and RL algorithm co-design, achieving over 2.77x speedup without any performance drop. Check the benchmark scripts and instructions here.

  • SOTA coding models, i.e., a 14B model with a 69.1 score on LCB-v5. To reproduce, check the configs and instructions.

  • Experimental support for multi-turn agentic RL training. Check our complete example.

For the complete system design and more training details, please check our v0.3 blog and our research paper.

Jump to the quickstart section if you want to quickly run an experiment and get your hands dirty! 😈

Overview of Asynchronous RL Training

During the synchronous RL training process, a generation step must wait until the longest sequence completes within the batch of LLM outputs. Due to the varying output lengths for LRMs, a synchronous RL system suffers from massive GPU idle time, leading to training inefficiency. Some recent works (DeepCoder, Intellect) propose overlapping a single training step with a single generation step to accelerate training. However, the largest bottleneck remains unchanged: the samples within a batch are still from the same model version, leading to waiting and GPU idle time.

Synchronous vs One-step Overlap RL

Fig.1. Left: Execution timeline of synchronous RL training. Right: Execution timeline of one-step overlap RL system.

AReaL adopts a fully asynchronous RL training framework that completely decouples generation from training. In AReaL, LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting. Meanwhile, trainer workers perform parallel model updates upon receiving training batches.
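
In outline, this decoupling is a producer-consumer pattern: rollout workers stream samples tagged with the policy version that generated them, and the trainer consumes batches while rejecting samples whose version lags too far behind. The single-process sketch below only illustrates the staleness-control idea; AReaL's actual implementation is distributed, and the staleness bound and data layout are assumptions here.

```python
# Simplified sketch of asynchronous rollout/training with staleness control.
# Rollout workers would normally run concurrently; here their output is
# simulated as a queue of samples, each tagged with its policy version.

from collections import deque

MAX_STALENESS = 2  # assumed bound on version lag

def train_step(sample_queue: deque, current_version: int, batch_size: int = 4):
    """Consume a training batch, dropping samples that are too stale."""
    batch = []
    while sample_queue and len(batch) < batch_size:
        sample = sample_queue.popleft()
        if current_version - sample["version"] <= MAX_STALENESS:
            batch.append(sample)
    return batch

# Simulated stream of rollout outputs produced under older policy versions.
queue = deque({"version": v, "tokens": f"rollout-{i}"}
              for i, v in enumerate([0, 1, 3, 4, 4, 5]))

batch = train_step(queue, current_version=5)
versions = [s["version"] for s in batch]  # only versions within the bound survive
```

Samples from versions 0 and 1 are discarded because they lag the current policy (version 5) by more than the bound, which is the system-side half of the co-design; the algorithm-side half is the modified PPO objective mentioned below.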

Asynchronous RL Training

Fig 2. Execution timeline of our fully asynchronous RL system.

AReaL follows a system-algorithm co-design principle: on the system side, AReaL efficiently syncs model parameters and carefully controls the staleness of each training sample; on the algorithm side, AReaL improves the objective of PPO to make async-RL stable.

We compare the scalability of asynchronous RL training based on our AReaL-boba² system with classical synchronous RL training (we adopt the fastest open-source system veRL, main branch on 05/07/2025) across different model sizes and different numbers of H800 GPUs. AReaL demonstrates much improved scaling capabilities with respect to training throughput. This is also partially due to AReaL decoupling training and generation, leading to much fewer GPU memory fragments.

Scaling Comparison

Fig.3 The scaling trend of asynchronous RL (based on AReaL-boba2) and classical synchronous RL (based on veRL) with different model sizes. Dotted lines indicate ideal linear scaling.

SOTA Code Generation Model by AReaL-boba²

We use Qwen3 as our base model. After asynchronous RL training, we achieve SOTA results on LiveCodeBench, Codeforces, and CodeContests benchmarks.

| Model (8B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
| 🤗 AReaL-boba²-8B-Open | 62.0 | 1933/97.2% | 41.4 |
| 🤗 AReaL-boba²-8B | 63.0 | 1962/97.5% | 40.8 |

| Model (14B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
| 🤗 AReaL-boba²-14B-Open | 67.3 | 1990/97.8% | 46.2 |
| 🤗 AReaL-boba²-14B | 69.1 | 2044/98.2% | 46.1 |

| Larger Models | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |

Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-source data. AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforces & CodeContests.

We provide tutorials and code walkthroughs covering the key features of asynchronous training.

RL Training for Multi-turn Agent

AReaL-boba² allows you to independently customize the dataset, rollout behavior, and the training algorithm, without needing to modify the heavy system-level code.

In particular, we show a simple example of developing a multi-turn math agent for RL training. Please see the learning curve below and refer to the step-by-step guide if you want to implement your own agentic RL project.

Getting Started

Obtain the training data:

For code training data, a simple preprocessing script is provided in examples/data_preprocess/preprocess_training_data.py:

python3 preprocess_training_data.py --data_path $original_data_path --output_path $training_data_path

Train Qwen3 1.7B locally (Remember to modify dataset.path in the script below):

bash examples/run_async_ppo.sh

Evaluation:

cd evaluation
# Evaluate on the math benchmarks
python eval_and_aggregate.py \
    --model_path ${MODEL_PATH} \
    --output_path ${OUTPUT_PATH} \
    --data_names aime24,aime25 \
    --max_gen_tokens 32768

# Evaluate on the coding benchmarks
python eval_and_aggregate.py \
    --model_path ${MODEL_PATH} \
    --output_path ${OUTPUT_PATH} \
    --data_names codeforces,lcb_v5 \
    --prompt_type qwen3-think-pure \
    --temperature 1.0

Resources

Quickstart

Benchmark and Reproduction

Customization Guide

System Code Walkthrough

Future Plan

AReaL is under active development. We plan to have minor releases weekly and major releases monthly. Community engagement and contributions are extremely welcome. We are also hiring interns and full-time employees with open positions in both the US and China.

For the research and development plan already in place, please see the following list:

System Development

  • Support for SGLang
  • RL training with coding problems
  • Asynchronous generation and RL training
  • Optimizations for distributed training: expert parallel for MOE and zero-bubble pipelining
  • RL for vision-language models (VLM)
  • Multi-turn agentic RL
  • Function calling and tool use

Algorithm Development

  • RL training recipes for 1.5B and 7B models
  • A complete RL training recipe for 32B models
  • Sample-efficient multi-task RL algorithms
  • Agentic capabilities with end-to-end RL
  • Stable RL training for larger MOE models

Acknowledgement

We would like to note that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.

Our team has also received invaluable assistance from the Data Intelligence Lab at Ant Research for data support and from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.

We also appreciate all the pioneering works from the community, particularly the ReaLHF project from OpenPsi Inc. and other projects, including but not limited to DeepScaleR, Open-Reasoner-Zero, OpenRLHF, VeRL, SGLang, QwQ, Light-R1 and DAPO.

Citation

@inproceedings{mei2025real,
  author    = {Mei, Zhiyu and Fu, Wei and Li, Kaiwei and Wang, Guangju and Zhang, Huanchen and Wu, Yi},
  title     = {ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation},
  booktitle = {Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025},
  publisher = {mlsys.org},
  year      = {2025},
}

@misc{fu2025areal,
  title         = {AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
  author        = {Wei Fu and Jiaxuan Gao and Xujie Shen and Chen Zhu and Zhiyu Mei and Chuyi He and Shusheng Xu and Guo Wei and Jun Mei and Jiashu Wang and Tongkai Yang and Binhang Yuan and Yi Wu},
  year          = {2025},
  eprint        = {2505.24298},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2505.24298},
}

In RL we trust — AReaL v0.2 (Boba) Release

· 6 min read
inclusionAI
Ant Group

Originally published on Medium by Ant Open Source.

AReaL v0.2 Boba

We are excited to release AReaL v0.2 (Boba), featuring three major milestones:

  • SGLang Support: With the addition of SGLang support and a series of engineering optimizations, AReaL v0.2 achieves a speed improvement of 1.5x over AReaL v0.1 on 7B models.
  • SOTA 7B Model: AReaL's RL training becomes more stable and sample-efficient. We obtain a SOTA 7B model in mathematical reasoning, achieving pass@1 scores of 61.9 on AIME24 and 48.3 on AIME25.
  • Competitive 32B Model: The highly competitive 32B model was trained with extremely low cost, achieving results comparable to QwQ-32B using only 200 data samples.

Performance comparison table

The table shows the performance of AReaL-boba-RL-7B and AReaL-boba-SFT-32B. Note that we obtained the SOTA 7B model using RL on math reasoning, and also trained a highly competitive 32B model using only 200 data samples, replicating QwQ-32B's inference performance on AIME 2024.

Training Speed Comparison

AReaL-boba throughput comparison with v0.1.0


AReaL v0.2.0 features the following system optimizations:

Upgraded Generation Backend: vLLM 0.6.3 → SGLang v0.4.0

The generation backend has been upgraded from vLLM to SGLang, leveraging SGLang's radix attention mechanism to significantly improve throughput in scenarios where multiple responses are sampled from the same prompt. SGLang automatically flushes the radix cache upon weight updates, ensuring correctness in on-policy RL.

Optimized Training for Variable-Length Sequences & Large Batches

To handle variable sequence lengths efficiently, we eliminate padding and pack sequences into 1D tensors instead. A dynamic allocation algorithm optimally distributes sequences under a maximum token budget, balancing micro-batch sizes while minimizing the number of micro-batches. This approach maximizes GPU memory utilization.
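
The allocation idea can be sketched as a greedy bin-packing pass: place each sequence into the first micro-batch with room under the token budget, opening a new micro-batch only when necessary. This first-fit-decreasing sketch is illustrative only; AReaL's actual allocator also balances micro-batch sizes.

```python
# Greedy first-fit-decreasing sketch of packing variable-length sequences
# into micro-batches under a maximum token budget (illustrative, not
# AReaL's actual allocation algorithm).

def pack_sequences(lengths, token_budget):
    batches, loads = [], []
    for seq_len in sorted(lengths, reverse=True):  # longest-first reduces waste
        for i, load in enumerate(loads):
            if load + seq_len <= token_budget:
                batches[i].append(seq_len)
                loads[i] += seq_len
                break
        else:  # no existing micro-batch has room: open a new one
            batches.append([seq_len])
            loads.append(seq_len)
    return batches

micro_batches = pack_sequences([900, 300, 650, 150, 400], token_budget=1000)
```

Every micro-batch stays under the 1000-token budget while the number of micro-batches is kept small, which is the property the text describes.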

High-Performance Data Transfer for 1K-GPU Scaling

AReaL employs NCCL with GPU-Direct RDMA (GDRDMA) over InfiniBand/RoCE, enabling direct GPU-to-GPU communication that bypasses costly CPU-mediated transfers and PCIe bottlenecks. This keeps generation-to-training data transfer overhead below 3 seconds even in a large 1,000-GPU cluster.

Training Recipe

SOTA 7B model using RL on math reasoning

Base Model

We use R1-Distill-Qwen-7B as our foundation model.

Dataset Curation

Our training dataset (AReaL-boba-106k) combines resources from multiple open-source projects. We enhanced it with challenging problems from NuminaMath (AoPS/Olympiad subsets) and ZebraLogic.

To maintain an appropriate difficulty level, overly simple questions were filtered out. Specifically, we generate 8 solutions per question using DeepSeek-R1-Distill-Qwen-7B and filter out questions where all solutions were correct.

Reward Function

We adopt a sparse sequence-level reward mechanism. The model is instructed to enclose the final answer within \boxed{}, and the boxed answer is then verified. Correct responses receive a reward of +5, while incorrect ones are penalized with -5.
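
The reward described above can be written as a small verifier. This is a sketch: a production verifier would also normalize mathematically equivalent answer forms, which the naive string comparison below does not.

```python
# Sketch of the sparse sequence-level reward: extract the final \boxed{...}
# answer from the response and compare it to the reference answer;
# +5 if correct, -5 otherwise.

import re

def extract_boxed(response: str):
    """Return the contents of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

def reward(response: str, reference: str) -> int:
    answer = extract_boxed(response)
    return 5 if answer is not None and answer.strip() == reference.strip() else -5

r_correct = reward(r"The area is therefore \boxed{42}.", "42")
r_wrong = reward(r"I believe the answer is \boxed{41}.", "42")
```

A response with no boxed answer at all also receives the -5 penalty, so the model is pushed to always produce a final boxed answer.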

Notably, we observe that the KL reward can impair performance, particularly in long chain-of-thought training, so we set it to zero.
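A minimal sketch of such a sparse sequence-level reward (the `extract_boxed` helper and exact string matching are simplifications; real verifiers normalize mathematical expressions before comparing):

```python
import re

def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def sequence_reward(response: str, reference_answer: str) -> float:
    """+5 if the boxed answer matches the reference, -5 otherwise (KL term is 0)."""
    answer = extract_boxed(response)
    if answer is not None and answer.strip() == reference_answer.strip():
        return 5.0
    return -5.0

assert sequence_reward(r"... so the result is \boxed{42}.", "42") == 5.0
assert sequence_reward(r"... so the result is \boxed{41}.", "42") == -5.0
assert sequence_reward("no boxed answer here", "42") == -5.0
```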

RL Algorithm

We employ Proximal Policy Optimization (PPO) as our training algorithm and remove the critic model to save compute. We set both the discount factor γ and the GAE parameter λ to 1. Such practices are also adopted by the Open-Reasoner-Zero project.
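With γ = λ = 1 and the critic removed (values treated as 0), the generalized advantage estimate for every token reduces to the trajectory's total return, as a quick sketch shows (simplified single-trajectory GAE; in practice a baseline such as the group mean would be subtracted):

```python
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation over one trajectory (terminal value 0)."""
    adv, running, next_value = [0.0] * len(rewards), 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

# Sparse +5 reward on the final token, critic removed (values all zero):
adv = gae_advantages([0.0, 0.0, 0.0, 5.0], values=[0.0] * 4)
assert adv == [5.0, 5.0, 5.0, 5.0]  # every token's advantage is the total return
```

This is why the critic can be dropped without changing the learning signal under this γ/λ setting: the advantage no longer depends on learned value estimates.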

Token-Level Loss Normalization

Averaging the loss at the sequence level can underweight the overall contribution of longer texts. To address this, we normalize the loss at the token level, as also highlighted in DAPO.
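A toy comparison makes the difference concrete (plain Python; the per-token losses are made up for illustration):

```python
def sequence_mean_loss(per_token_losses):
    """Average within each sequence first, then across sequences."""
    per_seq = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_seq) / len(per_seq)

def token_mean_loss(per_token_losses):
    """Average over all tokens in the batch, so long sequences weigh more."""
    flat = [x for seq in per_token_losses for x in seq]
    return sum(flat) / len(flat)

# A short sequence (2 tokens) and a long one (8 tokens):
batch = [[1.0, 1.0], [0.0] * 8]
assert sequence_mean_loss(batch) == 0.5  # the 2-token sequence dominates
assert token_mean_loss(batch) == 0.2     # each token contributes equally
```

Under sequence-level averaging, each token of the short sequence carries 4× the gradient weight of a token in the long one; token-level normalization removes that bias, which matters when responses are long chains of thought.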

Rollout Strategy

During the rollout phase, we sample 512 questions per batch, and the LLM generates 16 responses per question — resulting in a total batch size of 8,192. To minimize output truncation, we set the maximum generation length to 27K tokens. In our experiment, the truncation rate remained below 5%.

Key Hyperparameters

Key hyperparameters table

This configuration balances convergence speed with training stability.

Approaching QwQ-32B's performance using only 200 data samples

For the 32B model size, we further refine the training data and release AReaL-boba-SFT-200, a high-quality dataset of only 200 samples. Using this dataset and the accompanying training scripts, we replicate QwQ-32B's inference performance on AIME 2024 via Supervised Fine-Tuning (SFT).

Evaluation Best Practices

During evaluation, we use vLLM v0.6.3 as the generation framework. We recommend manually configuring the following options:

```python
enforce_eager=True
enable_chunked_prefill=False
disable_custom_all_reduce=True
disable_sliding_window=True
```

Following the practice of DeepSeek models, we incorporate a directive in the prompt: "Please reason step by step, and enclose your final answer in \boxed{}." To encourage long context reasoning, we also enforce that the model begins each response with \n.

To ensure reliable pass@1 estimation, we:

  • Sample 32 answers per problem
  • Use temperature=0.6 and top_p=0.95 for SFT models
  • Maintain training temperature (1.0) for RL models
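Estimating pass@1 from the 32 samples per problem is then a simple average of per-sample correctness (a sketch; the correctness flags would come from the boxed-answer check):

```python
def pass_at_1(correct_per_problem):
    """Mean over problems of the fraction of sampled answers that are correct."""
    per_problem = [sum(c) / len(c) for c in correct_per_problem]
    return sum(per_problem) / len(per_problem)

# Two problems, 4 samples each (32 in the actual evaluation):
assert pass_at_1([[1, 1, 0, 0], [1, 1, 1, 1]]) == 0.75
```

Averaging over many samples reduces the variance of the estimate, which is essential when single-sample accuracy on AIME-style benchmarks can swing by several points between runs.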

Conclusion & Future Work

Our results demonstrate that high-quality data is equally critical as algorithmic innovations. When conducting RL training on a powerful base model, we require more challenging problems to facilitate learning. A straightforward strategy for data filtering involves removing problems that the base model consistently solves correctly across multiple sampling attempts.

AReaL delivers stable, fast training with cutting-edge model performance. Since its initial release, we have continuously improved system efficiency, training stability, and accessibility.

Looking ahead, the AReaL team will:

  • Further optimize system performance
  • Introduce new features
  • Continue open-sourcing training data
  • Expand to broader reasoning tasks

We believe these contributions lower the barrier for high-quality RL training while pushing the boundaries of reasoning capabilities. We welcome community feedback and collaboration to drive further progress.

PromptCoT & PromptCoT-Mamba: Advancing the Frontiers of Reasoning

· 5 min read
inclusionAI
Ant Group

News

  • May 30, 2025: PromptCoT-Mamba released! Introducing an attention-free foundation model for reasoning tasks.
  • Apr 11, 2025: PromptCoT-QwQ-32B model and its training data released, achieving new state-of-the-art results.
  • Mar 7, 2025: PromptCoT project launched, including the problem generation model, distilled models (PromptCoT-DS series), and associated datasets.

Overview

This repository unifies two synergistic projects aimed at advancing the frontiers of mathematical and code reasoning in Large Language Models (LLMs): PromptCoT and PromptCoT-Mamba.

PromptCoT (Synthesizing Olympiad-Level Problems for Mathematical Reasoning in Large Language Models) addresses the critical challenge of acquiring high-quality, complex problems for training advanced LLMs. It introduces a novel methodology to systematically generate Olympiad-level mathematical problems by modeling the rationale behind expert problem design. This approach not only enhances problem diversity and difficulty but also ensures logical consistency in problem construction, providing a scalable solution for creating robust training datasets.

PromptCoT-Mamba (Scaling Reasoning without Attention) leverages the problem generation capabilities of the PromptCoT pipeline to train PromptCoT-Mamba-7B, the first attention-free foundation model based on the Mamba-2 architecture. This model demonstrates that structured training curricula can enable attention-free models to surpass strong Transformer baselines on a wide array of competition-level math and code reasoning tasks, all while maintaining constant-memory inference without KV caching.

Together, these projects offer a powerful suite of tools, models, and datasets for researchers and developers working on the cutting edge of AI reasoning.


Highlights & Key Results

1. PromptCoT: Problem Generation & Distilled Models

  • ✨ The Missing Piece for Test-Time Scaling: A lightweight yet powerful problem generation model enabling the construction of prompt sets at any scale with sufficient quality, perfect for SFT or RL post-training.
  • 📖 A Fully Open Project: All models (generation, distilled LLMs) and datasets (generation inputs, SFT data) are open-sourced.
  • 🏆 Superior Performance of Distilled Models:
    • PromptCoT-DS-7B consistently surpasses its base model, DeepSeek-R1-Distill-Qwen-7B, with significant gains:
      • +0.9% on MATH-500 (93.7%)
      • +3.2% on AIME2024 (58.7%)
      • +9.2% on AIME2025 (49.2%)
    • PromptCoT-DS-7B (7B parameters) achieves results comparable to larger 32B models like S1-32B and LIMO-32B.
    • PromptCoT-QwQ-32B sets a new standard, outperforming other 32B models by a significant margin:
      • MATH-500: 96.7% ± 0.5%
      • AIME2024: 83.8% ± 2.8%
      • AIME2025: 75.4% ± 4.7%
    • PromptCoT-DS-1.5B demonstrates competitive performance against RL-based models purely through distillation.
  • ⚡ Efficiency Without Compromise: PromptCoT-DS-1.5B achieves AIME scores above 40% while using over 15× fewer A100 GPU hours than models like DeepScaleR-1.5B-Preview.

2. PromptCoT-Mamba: Attention-Free Reasoning

  • 🚀 First Attention-Free SOTA: PromptCoT-Mamba-7B is the first attention-free model (Mamba-2 architecture) to outperform strong Transformer baselines in math and code reasoning.
  • 🧠 Trained with PromptCoT Pipeline: Utilizes a structured, two-stage curriculum with data generated by PromptCoT.
  • 💪 Strong General Performance: PromptCoT-Mamba-7B consistently outperforms 7B-scale Transformer and hybrid Mamba-Transformer baselines.
    • MATH-500: 84.6%
    • AIME 2024: 35.2%
    • AIME 2025: 24.6%
    • Livecodebench: 29.9%
  • 🎯 Math Specialization: The math-specialized variant, PromptCoT-Mamba-Math-7B, further boosts math performance:
    • MATH-500: 88.0%
    • AIME 2024: 42.9% (+7.7% over generalist)
    • AIME 2025: 30.8% (+6.2% over generalist)
  • Inference Efficiency: Offers substantial speedups (e.g., 3.66× faster on 24GB GPU for long sequences) and constant-memory inference, ideal for cost-sensitive or long-context workloads.

Performance Details

PromptCoT Series Performance

| Model | GSM8K | MATH-500 | AIME2024 | AIME2025 |
| --- | --- | --- | --- | --- |
| **🔹 1.5B Models** | | | | |
| DeepSeek-R1-Distill-Qwen-1.5B | - | 83.9% | 28.9% | 28.1% |
| STILL-3-1.5B-preview | - | 85.5% | 39.3% | - |
| DeepScaleR-1.5B-Preview | - | 🟢 87.8% | 🟢 43.1% | 🟢 37.1% |
| PromptCoT-DS-1.5B (ours) | 🟢 87.6% ± 0.5% | 85.3% ± 1.1% | 41.2% ± 6.9% | 36.7% ± 6.2% |
| **🔹 7B Models** | | | | |
| DeepSeek-R1-Distill-Qwen-7B | - | 92.8% | 55.5% | 40.0% |
| Qwen2.5-7B-SimpleRL | - | 82.4% | 26.7% | - |
| OpenThinker-7B | - | 89.6% | 30.0% | 33.3% |
| OpenR1-Qwen-7B | - | 90.6% | 36.7% | 40.0% |
| PromptCoT-DS-7B (ours) | 🔥 92.8% ± 0.5% | 🔥 93.7% ± 0.7% | 🔥 58.7% ± 3.1% | 🔥 49.2% ± 7.9% |
| **🔹 32B Models** | | | | |
| DeepSeek-R1-Distill-Qwen-32B | - | 94.3% | 72.6% | - |
| S1-32B | - | 93.0% | 56.7% | 26.6% |
| LIMO-32B | - | 94.8% | 57.1% | 46.6% |
| QwQ-32B | - | - | 82.1% | 70.8% |
| PromptCoT-QwQ-32B (ours) | 🔥🔥 96.4% ± 0.2% | 🔥🔥 96.7% ± 0.5% | 🔥🔥 83.8% ± 2.8% | 🔥🔥 75.4% ± 4.7% |

PromptCoT-Mamba Performance

General Performance:

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | Livecodebench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PromptCoT-Mamba-7B | 84.6 | 🔥🔥 35.2 | 🔥🔥 24.6 | 50.7 | 81.7 | 75.0 | 🔥🔥 29.9 |
| Gemma3-27B | 89.0 | 32.6 | 24.0 | 54.2 | 86.0 | 78.0 | 26.9 |
| Gemma3-12B | 83.8 | 22.9 | 19.2 | 49.9 | 81.1 | 73.2 | 22.2 |
| Sky-T1-7B | 85.0 | 19.2 | 19.2 | 49.2 | 41.5 | 37.2 | 18.3 |
| S1.1-7B | 82.0 | 19.2 | 17.5 | 43.1 | 64.0 | 56.7 | 13.3 |
| Bespoke-Stratos-7B | 81.2 | 18.3 | 16.3 | 45.0 | 73.2 | 68.3 | 8.6 |
| Nemotron-H-8B | 77.6 | -- | -- | -- | 79.3 | 74.4 | -- |
| M1-3B | 81.7 | 23.0 | 22.0 | 43.6 | -- | -- | -- |

Math Specialization vs. Generalist:

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | Livecodebench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PromptCoT-Mamba-Math-7B | 🔥🔥 88.0 | 🔥🔥 42.9 | 🔥🔥 30.8 | 🔥🔥 52.1 | 71.3 | 66.5 | 20.3 |
| PromptCoT-Mamba-7B | 84.6 | 35.2 | 24.6 | 50.7 | 81.7 | 75.0 | 29.9 |

Citation

If you find PromptCoT or PromptCoT-Mamba useful in your research, please consider citing the respective papers:

For PromptCoT:

```bibtex
@article{zhao2025promptcot,
  author  = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Kong, Lingpeng},
  title   = {PromptCoT: Synthesizing Olympiad-Level Problems for Mathematical Reasoning in Large Language Models},
  year    = {2025},
  journal = {arXiv preprint arXiv:2503.02324},
  url     = {http://arxiv.org/abs/2503.02324}
}
```

For PromptCoT-Mamba:

```bibtex
@article{zhao2025scaling,
  author  = {Xueliang Zhao and Wei Wu and Lingpeng Kong},
  title   = {Scaling Reasoning without Attention},
  journal = {arXiv preprint arXiv:2505.22425},
  year    = {2025},
  url     = {https://arxiv.org/abs/2505.22425}
}
```

Ring: A Reasoning MoE LLM Provided and Open-sourced by InclusionAI

· 2 min read
inclusionAI
Ant Group

🤗 Hugging Face  |  🤖 ModelScope

News

  • [2025-06]: 🎉 Added the Ring-lite model
  • [2025-04]: 🎉 Added the Ring-lite-linear-preview model

Introduction

Ring is a reasoning MoE LLM provided and open-sourced by InclusionAI, derived from Ling. We introduce Ring-lite-distill-preview, which has 16.8 billion parameters with 2.75 billion activated parameters. This model demonstrates impressive reasoning performance compared to existing models in the industry.

Model Downloads

The table below lists the available models and their key parameters; choose the variant that fits your use case. If you are located in mainland China, we also provide the models on ModelScope.cn to speed up the download process.

| Model | #Total Params | #Activated Params | Context Length | Download |
| --- | --- | --- | --- | --- |
| Ring-lite-distill-preview | 16.8B | 2.75B | 64K | 🤗 HuggingFace · 🤖 ModelScope |
| Ring-lite | 16.8B | 2.75B | 128K | 🤗 HuggingFace · 🤖 ModelScope |

Quickstart

🤗 Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-lite"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ring, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

🤖 ModelScope

If you are in mainland China, we strongly recommend downloading our model from 🤖 ModelScope.

Deployment

Please refer to Ling

Finetuning

Please refer to Ling

License

This code repository is licensed under the MIT License.

Citation

[TBD]