
CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

Source: arXiv
Original Author:Zhen Zhang et al.


Researchers have introduced CM2, a reinforcement learning framework for training AI agents on multi-turn, multi-step tool use. CM2 replaces traditional verifiable rewards with checklist-based criteria, enabling more stable performance assessments. Trained in a simulated environment, CM2 demonstrated significant improvements over existing models, achieving higher scores on benchmarks such as τ²-Bench and ToolSandbox. The approach offers a scalable way to improve agentic tool use without extensive reward engineering. The code is publicly available on GitHub.

CM2: A New Framework for Reinforcement Learning in Multi-Turn Tool Use

Researchers have introduced CM2, a reinforcement learning (RL) framework designed to enhance AI agents' performance in multi-turn interactions and tool use. This approach addresses critical challenges in RL, particularly the complexities of building and maintaining executable tool environments.

Checklist Rewards and Evaluation Criteria

CM2 replaces conventional outcome rewards with checklist rewards, allowing for a systematic assessment of agent performance. It decomposes intended behavior into detailed binary criteria, transforming performance evaluations into stable, classification-style decisions. The framework employs sparse reward assignment while maintaining dense evaluation criteria.
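To make the idea concrete, here is a minimal sketch of a checklist-style reward. The criterion names, the trajectory format, and the averaging scheme are illustrative assumptions, not the paper's actual implementation: each binary criterion is a pass/fail judgment over the finished trajectory, and the dense criteria are aggregated into a single sparse scalar assigned once per episode.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    """One binary checklist item, judged over the full trajectory."""
    description: str
    check: Callable[[List[dict]], bool]  # trajectory -> pass/fail

def checklist_reward(trajectory: List[dict], checklist: List[Criterion]) -> float:
    """Score a finished trajectory against dense binary criteria,
    returning a single sparse scalar reward for the whole episode."""
    passed = sum(c.check(trajectory) for c in checklist)
    return passed / len(checklist)

# Hypothetical checklist for a flight-booking tool-use task.
checklist = [
    Criterion("called the flight-search tool",
              lambda t: any(step.get("tool") == "search_flights" for step in t)),
    Criterion("asked the user to confirm before booking",
              lambda t: any("confirm" in step.get("text", "") for step in t)),
]

trajectory = [
    {"tool": "search_flights", "text": ""},
    {"tool": None, "text": "Please confirm the booking."},
]
print(checklist_reward(trajectory, checklist))  # 1.0 — both criteria pass
```

Because each criterion is a classification-style yes/no decision, the aggregate score is more stable than a free-form scalar judgment, while still providing a single reward signal suitable for standard RL training.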

Performance Outcomes

In testing, CM2 demonstrated substantial improvements over supervised fine-tuning techniques. Using an 8 billion parameter base model and an 8,000-example RL dataset, CM2 achieved:

  • An 8-point increase on the τ²-Bench evaluation.
  • A 10-point improvement on the BFCL-V4 benchmark.
  • A 12-point gain on ToolSandbox.

These results indicate superior performance compared to traditional methods and place CM2 on par with, or ahead of, similarly sized open-source models.

CM2's framework is accessible through the open-source community: CM2-RLCR-Tool-Agent on GitHub.

Related Topics:

Reinforcement Learning, Checklist Rewards, Multi-Turn Interactions, Tool Use, Scalable LLM-simulated Environment

📰 Original Source: https://arxiv.org/abs/2602.12268v1

All rights and credit belong to the original publisher.
