RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation

1Department of Computer Science, University of Virginia
2National Library of Medicine, National Institutes of Health
3Department of Computer Science, University of Illinois Urbana-Champaign
4Medical Oncology, Dana-Farber Cancer Institute
5Surgery, University of Alabama at Birmingham
6Department of Neurology, Yale School of Medicine
*Equal Contribution, †Co-correspondence

Abstract

Retrieval-augmented generation (RAG) has shown great promise for knowledge-intensive tasks and recently advanced with agentic RAG, where language agents engage in multi-round interactions with external knowledge sources for adaptive information retrieval. However, existing agentic RAG methods often depend on ad-hoc prompt engineering and lack a unified optimization framework.

We introduce RAG-Gym, a comprehensive platform that systematically explores three optimization dimensions: (1) prompt engineering, (2) actor tuning, and (3) critic training. For prompt engineering, we propose Re²Search, a novel agent incorporating reasoning reflection that significantly outperforms standard prompts. In actor tuning, we evaluate three popular post-training algorithms with fine-grained process supervision and identify direct preference optimization as the most effective. We further demonstrate that a trained critic can enhance inference by selecting higher-quality intermediate reasoning steps.

Together, these findings lead to the optimized Re²Search++ agent, which surpasses most recent methods like Search-R1 by a relative increase of 3.2% to 11.6% in average F1. Finally, we examine the impact of different reward sources and analyze scaling properties in training and inference, offering practical insights for agentic RAG optimization.

RAG-Gym Framework

Here is an overview of the RAG-Gym framework. RAG-Gym employs a modular design, comprising prompt engineering, actor tuning, and critic training, to systematically optimize agentic RAG performance. By leveraging all three components, RAG-Gym improves the F1 score of the ReAct agent on HotpotQA from 41.09% to 60.19%.

RAG-Gym Overview

🏋️ Dimension 1: Prompt Engineering

Effective prompts are key to guiding LLM behavior. Building on a summary of existing agent functions—such as answer generation, question reasoning, and query generation—RAG-Gym introduces a novel agent architecture called Re²Search (Reasoning, Reflection, and Search). The core innovation of Re²Search lies in its unique "reasoning reflection" mechanism. Before making a final decision, the agent will:

  1. Construct an initial reasoning process and answer based on all currently available information.
  2. Reflect on its reasoning chain to identify which statements lack support from current information or are unverified claims.
  3. Generate highly targeted search queries based on these "uncertainties" to acquire missing key information and improve the answer.

This design tightly integrates the search process with answer construction. The comparison below summarizes the functional components of different agent architectures; Re²Search is the only agent that incorporates all six key components, including reasoning reflection.

Agent Comparison
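
To make the loop above concrete, here is a minimal sketch of the reasoning-reflection cycle in Python. The helpers `llm_generate` and `retrieve` are hypothetical placeholders standing in for an LLM call and a retriever, not the released RAG-Gym API, and the prompts are illustrative only.

```python
# Minimal sketch of the Re2Search loop: reason, reflect on unverified claims,
# then search for the missing evidence. Helper callables are assumed placeholders.

def re2search(question: str, llm_generate, retrieve, max_rounds: int = 5) -> str:
    """Iteratively reason, reflect, and retrieve until all claims are grounded."""
    evidence: list[str] = []  # accumulated retrieved passages
    for _ in range(max_rounds):
        context = "\n".join(evidence)
        # 1. Draft a full reasoning chain and candidate answer from current evidence.
        draft = llm_generate(
            f"Question: {question}\nEvidence:\n{context}\n"
            "Reason step by step and give a candidate answer."
        )
        # 2. Reflect: identify statements the current evidence does not support.
        reflection = llm_generate(
            f"Reasoning:\n{draft}\nEvidence:\n{context}\n"
            "List any claims above that the evidence does not support, or say NONE."
        )
        if reflection.strip().upper() == "NONE":
            break  # every claim is grounded; stop searching
        # 3. Turn the unverified claims into a targeted search query and retrieve.
        query = llm_generate(f"Write one search query that would verify: {reflection}")
        evidence.extend(retrieve(query))
    # Produce the final answer from the (now better-grounded) evidence.
    return llm_generate(
        f"Question: {question}\nEvidence:\n" + "\n".join(evidence) +
        "\nGive the final answer."
    )
```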

Extensive tests across models and datasets show that prompt engineering alone gives Re²Search a clear advantage over standard prompts. For example, on HotpotQA, zero-shot ReAct achieves 41.09% F1, while Re²Search reaches 44.91%.

💪 Dimension 2: Actor Tuning

The "actor" refers to the LLM itself, and optimizing its parameters is crucial for improving decision quality. RAG-Gym enables fine-grained process supervision for actor tuning, meaning we not only consider the correctness of the final answer but also evaluate and reward each intermediate decision—such as generated search queries or reasoning steps. We systematically evaluate three mainstream LLM post-training algorithms: supervised fine-tuning (SFT), direct preference optimization (DPO), and proximal policy optimization (PPO). Our experiments show that for agents like ReAct, Search-o1, and Re²Search—which require multi-step reasoning and interaction with the environment—DPO and PPO generally outperform SFT, delivering more significant performance gains across most tasks. DPO, in particular, leverages preference comparisons between positive and negative actions to more effectively guide the model in generating high-quality intermediate steps. For example, after DPO tuning, Re²Search improves its F1 score on HotpotQA from 44.91% (zero-shot) to 55.22%.

Actor Tuning
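
For intuition, here is a minimal, hedged sketch of the DPO objective applied to step-level preference pairs, where each pair contrasts a preferred intermediate action (e.g., a well-targeted search query) with a dispreferred one for the same state. The function and its toy inputs are illustrative assumptions; in practice, the log-probabilities are summed over action tokens under the tuned and reference models, and training uses standard post-training toolkits.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective over step-level preference pairs.

    Each element compares a preferred intermediate action against a
    dispreferred one sampled for the same agent state.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-likelihood margin of preferred over dispreferred actions.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy example with made-up sequence log-probabilities (two preference pairs).
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.3, -9.8]),
    policy_rejected_logps=torch.tensor([-11.0, -10.5]),
    ref_chosen_logps=torch.tensor([-12.0, -10.0]),
    ref_rejected_logps=torch.tensor([-11.2, -10.1]),
)
print(round(loss.item(), 4))
```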

🧐 Dimension 3: Critic Training

Beyond optimizing the actor, RAG-Gym also introduces a "critic" model—an external evaluator trained to predict the process reward for each state-action pair. The critic assesses the quality of actions (such as generated search queries) produced by the agent at each step. During inference, the actor generates multiple candidate actions for the current state. The critic scores these candidates, and the system selects the action with the highest score for execution. This mechanism offers several key advantages:

  • Improved generalizability: Our experiments show that integrating a trained critic consistently boosts performance across various models—including base LLMs (like Llama-3.1-8B), DPO-tuned LLMs, and even models such as GPT-4o-mini—on multiple datasets.
  • Plug-and-play enhancement: The critic can be used as a standalone module to enhance LLMs that cannot be directly fine-tuned (e.g., closed-source models), providing an effective way to improve their RAG capabilities.

Critic Training
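
As an illustration, the critic-guided selection described above reduces to a simple best-of-N procedure at inference time. The callables `actor_sample` and `critic_score` are assumed placeholders (an actor that samples one candidate action for a state, and a critic that returns a scalar process reward estimate), not RAG-Gym's actual interface.

```python
# Hedged sketch of critic-guided action selection at inference time.

def select_action(state: str, actor_sample, critic_score,
                  num_candidates: int = 8) -> str:
    # Sample several candidate actions (e.g., search queries or answers).
    candidates = [actor_sample(state) for _ in range(num_candidates)]
    # Score each candidate with the trained critic (process reward model).
    scores = [critic_score(state, action) for action in candidates]
    # Execute the highest-scoring action; the actor itself is never updated,
    # which is why this also works for closed-source models.
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]
```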

🏆 Re²Search++: Optimized Agent Across All Dimensions

By integrating the best practices from all three optimization dimensions—adopting the Re²Search agent architecture, tuning the actor with DPO, and leveraging a Critic for action selection during inference—we developed the optimized Re²Search++ agent. Compared to recent reinforcement learning methods that rely on outcome supervision (such as Search-R1 and R1-Searcher, which typically require thousands of training questions), Re²Search++ demonstrates clear advantages. It not only matches or surpasses these methods on their reported training domains (e.g., HotpotQA), but also achieves substantial improvements on out-of-domain datasets (e.g., Bamboogle), with a relative F1 increase of 3.2% to 11.6% on average. This highlights the strong generalization capability enabled by RAG-Gym's fine-grained process supervision, effectively mitigating the overfitting issues that can arise from relying solely on outcome-based rewards.

Main Results

Analysis and Discussion

Comparison of Different Reward Sources

Process rewards can be collected from different sources. We evaluated their effectiveness in guiding agent actions toward correct answers and their alignment with human preferences, comparing GPT-4o annotations, Llama-3.1-8B annotations, and rollout-based annotations using Math-Shepherd, alongside human expert annotations on MedQA. The results in the table below show that the GPT-4o-trained reward model delivers the highest performance across all datasets, providing precise, fine-grained rewards for agent optimization and achieving the strongest agreement with human experts (85.85%). Although Llama-3.1-8B and rollout-based annotations outperform random baselines, they remain less effective than GPT-4o annotations and can even underperform on general-domain questions. These findings highlight the limitations of rollout-based methods—originally designed for math reasoning—in complex reasoning and search tasks, and underscore the need for tailored approaches in agentic RAG.

| Type | Outcome Source | Process Source | HotpotQA (EM / F1) | 2WikiMultihopQA (EM / F1) | Bamboogle (EM / F1) | MedQA (Acc / Agree) |
|---|---|---|---|---|---|---|
| ORM | Truth | -- | 41.10 / 53.35 | 47.70 / 55.59 | 43.20 / 57.46 | 66.77 / -- |
| PRM (Random) | -- | -- | 32.20 / 42.83 | 35.70 / 42.00 | 38.40 / 47.86 | 68.26 / 50.00 |
| PRM (Rollout) | Truth | Rollout | 39.60 / 51.85 | 42.94 / 49.57 | 48.80 / 56.05 | 68.34 / 71.03 |
| PRM (Llama) | Truth | Llama-3.1-8B | 40.30 / 51.74 | 40.70 / 48.22 | 44.80 / 54.36 | 68.50 / 65.99 |
| PRM (GPT) | Truth | GPT-4o | 44.10 / 56.84 | 50.20 / 57.94 | 51.20 / 63.15 | 71.96 / 85.85 |
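
For reference, the rollout-based (Math-Shepherd-style) annotation in the table estimates a step's reward as the fraction of completions from that step that reach the correct final answer. Below is a hedged sketch of this idea; `continue_trajectory` and `is_correct` are hypothetical placeholders, not functions from the released codebase.

```python
# Sketch of rollout-based process reward annotation: a step is scored by how
# often trajectories continued from it end in the correct answer.

def rollout_process_reward(state, action, continue_trajectory, is_correct,
                           gold_answer, num_rollouts: int = 8) -> float:
    """Estimate a step-level reward for a (state, action) pair via rollouts."""
    successes = 0
    for _ in range(num_rollouts):
        final_answer = continue_trajectory(state, action)  # complete the episode
        successes += int(is_correct(final_answer, gold_answer))
    return successes / num_rollouts
```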

Training Time Scaling

The figure below shows how the performance of Re²Search scales with the amount of training data across four datasets. In general, performance improves as the number of training samples increases, but the gains tend to converge as the sample size grows. Notably, on HotpotQA, 2WikiMultihopQA, and Bamboogle, even a small amount of process reward data (250 samples) yields significant performance gains.

Training Time Scaling

Inference Time Scaling

The results below show how agent performance changes as the number of sampled actions per time step increases. We observe a consistent trend across multiple benchmarks: increasing the number of sampled actions generally improves performance. However, the gains gradually diminish, indicating that the agent reaches a point where additional sampled actions contribute little further improvement.

Inference Time Scaling

BibTeX

@article{xiong2025rag,
  title={Rag-gym: Optimizing reasoning and search agents with process supervision},
  author={Xiong, Guangzhi and Jin, Qiao and Wang, Xiao and Fang, Yin and Liu, Haolin and Yang, Yifan and Chen, Fangyuan and Song, Zhixing and Wang, Dengyu and Zhang, Minjia and others},
  journal={arXiv preprint arXiv:2502.13957},
  year={2025}
}