RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision

1Department of Computer Science, University of Virginia
2National Library of Medicine, National Institutes of Health
3Department of Computer Science, University of Illinois Urbana-Champaign
4Medical Oncology, Dana-Farber Cancer Institute
5Surgery, University of Alabama at Birmingham
6Department of Neurology, Yale School of Medicine
*Equal Contribution, †Co-correspondence

Abstract

Retrieval-augmented generation (RAG) has shown great potential for knowledge-intensive tasks, but traditional RAG architectures rely on static retrieval, which limits their effectiveness on complex questions that require sequential information-seeking. Agentic reasoning and search offer a more adaptive approach, yet most existing methods depend heavily on prompt engineering.

In this work, we introduce RAG-Gym, a unified optimization framework that enhances information-seeking agents through fine-grained process supervision at each search step. We also propose ReSearch, a novel agent architecture that synergizes answer reasoning and search query generation within the RAG-Gym framework.

Experiments on four challenging datasets show that RAG-Gym improves performance by up to 25.6% across various agent architectures, with ReSearch consistently outperforming existing baselines. Further analysis highlights the effectiveness of advanced LLMs as process reward judges and the transferability of trained reward models as verifiers for different LLMs. Additionally, we examine the scaling properties of training and inference in agentic RAG.

Retrieval-Augmented Generation Gymnasium (RAG-Gym)

Here is an overview of RAG-Gym: (a) RAG-Gym formulates the knowledge-intensive question-answering (QA) task as a nested Markov Decision Process (MDP), where the outer MDP governs high-level action generation through interactions with the information retrieval (IR) environment, while the inner MDP controls token generation within the LLM. (b) Different process supervision methods are implemented in RAG-Gym, including Supervised Fine-tuning (SFT), Direct Preference Optimization (DPO), and Process Reward Modeling (PRM).

RAG-Gym Overview
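To make the outer loop concrete, below is a minimal sketch of the nested-MDP rollout. The class and method names (State, agent.act, ir_env.search, force_answer) are illustrative assumptions for exposition, not the released RAG-Gym API: the agent's act call corresponds to the inner, token-level MDP, while the surrounding loop is the outer MDP over queries and answers.

from dataclasses import dataclass, field

@dataclass
class State:
    question: str
    history: list = field(default_factory=list)  # list of (query, retrieved_docs) pairs

def run_outer_mdp(agent, ir_env, question, max_steps=10):
    """Roll out one episode of the outer MDP over high-level actions."""
    state = State(question=question)
    for _ in range(max_steps):
        action = agent.act(state)                 # inner MDP: token-level generation by the LLM
        if action["type"] == "answer":
            return action["content"]              # terminal action: the predicted answer
        docs = ir_env.search(action["content"])   # non-terminal action: a search query
        state.history.append((action["content"], docs))
    return agent.force_answer(state)              # answer once the step budget is exhausted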

Reasoning and Search (ReSearch) Agent

We also propose ReSearch, which synergizes Reasoning and Search by integrating history knowledge summarization, answer reasoning, and query generation to iteratively resolve the missing information needed to construct the final answer.

ReSearch Agent Architecture
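As a rough sketch, one ReSearch step can be written as the loop below, following the description above. The llm helper methods (summarize, reason, find_missing, write_query, extract_answer) are hypothetical names we use for exposition, not the prompts released with the paper.

def research_step(llm, state):
    """One ReSearch step: summarize history, reason toward an answer, then answer or search."""
    summary = llm.summarize(state.question, state.history)   # history knowledge summarization
    reasoning = llm.reason(state.question, summary)          # answer reasoning over what is known
    missing = llm.find_missing(reasoning, summary)           # claims not yet supported by retrieval
    if not missing:
        return {"type": "answer", "content": llm.extract_answer(reasoning)}
    return {"type": "search", "content": llm.write_query(missing[0])}  # target the first unresolved claim

Each unverified claim in the intermediate reasoning thus becomes the target of the next search query, which is what aligns answer reasoning with query generation.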

Here is a comparison of different agent architectures in handling a multi-hop question constructed from Wikipedia. ReSearch explicitly aligns reasoning with query generation, leading to more targeted retrieval and improved answer quality.

ReSearch Agent Example

Main Results

The table below shows the performance of various agents and their tuned versions using different process supervision methods in RAG-Gym. Process supervision consistently improves performance across all agents compared to the zero-shot learning (ZSL) baseline, demonstrating its effectiveness in enhancing intermediate reasoning and query generation.

Main Results Table

Comparison of Process Supervision Methods

Among the three process supervision algorithms, PRM achieves the best results overall, outperforming ZSL baselines by up to 25.6% (ReAct; Average F1). While PRM outperforms the other methods, both DPO and SFT show significant improvements over the ZSL baseline. Interestingly, SFT slightly outperforms DPO on the Direct, CoT, and RAG agents, where the tuning focuses exclusively on the answer generation step. In contrast, DPO significantly surpasses SFT on ReAct, Search-o1, and ReSearch, where the tuning process also involves learning to generate high-quality queries by contrasting positive and negative samples.
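As a sketch of how the three methods can consume the same step-level annotations, the snippet below assumes each annotated step stores a prompt and several candidate actions with process rewards; the data layout and field names are our assumptions, not the released format.

def build_training_examples(step):
    """Turn one annotated step into SFT, DPO, and PRM training examples."""
    ranked = sorted(step["candidates"], key=lambda c: c["reward"], reverse=True)
    best, worst = ranked[0], ranked[-1]
    sft_example = {"prompt": step["prompt"], "target": best["action"]}     # SFT: imitate the best-rated action
    dpo_example = {"prompt": step["prompt"],
                   "chosen": best["action"], "rejected": worst["action"]}  # DPO: contrast good and bad actions
    prm_examples = [{"prompt": step["prompt"], "action": c["action"], "label": c["reward"]}
                    for c in ranked]                                       # PRM: learn to score every candidate
    return sft_example, dpo_example, prm_examples

This also reflects the pattern above: SFT only ever sees the selected action, whereas DPO and PRM additionally exploit the contrast between candidates, which matters most at the query generation steps.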

Comparison of ReSearch and other Agents

ReSearch consistently outperforms other agents, both in the ZSL setting and in settings with process supervision. Without tuning, ReSearch achieves strong zero-shot performance, demonstrating the effectiveness of explicitly aligning answer reasoning with query generation. Using process reward models, ReSearch achieves state-of-the-art performance, with an average EM score of 54.31% and an average F1 score of 62.41% across different datasets. Furthermore, ReSearch exhibits superior generalization, achieving top scores on 2WikiMultihopQA and Bamboogle without task-specific fine-tuning.

Reward Model Transferability

This figure highlights the performance improvements of the ReSearch agent with GPT-4o-mini when using Llama-3.1-8B-based process reward models. Action selection with the reward model yields consistent gains across all tasks, demonstrating that the trained PRM transfers across LLMs and effectively selects high-quality actions. This result also highlights the potential of process reward models as a plug-and-play module for enhancing the reasoning and search capabilities of proprietary LLMs, where direct fine-tuning is not feasible due to restrictions on model access.

Transferability
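In this plug-and-play setting, the reward model acts as a verifier over candidate actions sampled from the frozen LLM. A minimal best-of-N selection sketch is shown below; propose and prm.score are assumed interfaces rather than the released code.

def select_action(propose, prm, state, num_candidates=5):
    """Sample candidate actions from a frozen LLM and keep the one the PRM scores highest."""
    candidates = [propose(state) for _ in range(num_candidates)]   # e.g. samples from GPT-4o-mini
    scores = [prm.score(state, action) for action in candidates]   # trained Llama-3.1-8B reward model
    best = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best]                                        # act greedily w.r.t. the process reward

The number of sampled candidates here is also the knob varied in the inference-time scaling analysis below.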

Analysis and Discussion

Comparison of Different Reward Sources

To evaluate the effectiveness of different process reward sources for training reward models, we conducted experiments on MedQA and compared their agreement with domain-expert preferences as well as their impact on downstream accuracy. The results are shown below. The reward model trained with GPT-4o annotations achieved the highest agreement with human preferences (85.85%), significantly outperforming the rollout-based method (71.03%) introduced in Math-Shepherd. It also achieved the highest downstream accuracy (71.96%), highlighting its effectiveness in knowledge-intensive tasks.

Type                   Source    Agreement (%)   Accuracy (%)
Outcome Reward Model   Truth     --              66.77
Process Reward Model   Random    50.00           68.26
Process Reward Model   Rollout   71.03           68.34
Process Reward Model   GPT-4o    85.85           71.96
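One natural reading of the agreement column is pairwise preference accuracy: the fraction of expert-labeled action pairs for which the reward model scores the expert-preferred action higher. A sketch of that computation, with field names of our choosing, is given below.

def preference_agreement(reward_model, pairs):
    """Fraction of expert-labeled pairs where the reward model prefers the expert's choice."""
    correct = 0
    for pair in pairs:
        preferred = reward_model.score(pair["prompt"], pair["preferred"])
        other = reward_model.score(pair["prompt"], pair["other"])
        correct += preferred > other   # agreement when the expert-preferred action scores higher
    return 100.0 * correct / len(pairs)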

Training Time Scaling

This figure shows how the performance of ReSearch agents scales with the amount of training data across four datasets. In general, performance improves with an increasing number of training samples, but the gains plateau as the sample size grows. Notably, on HotpotQA, 2WikiMultihopQA, and Bamboogle, even a small amount of process reward data (250 samples) yields significant performance gains.

Training Time Scaling

Inference Time Scaling

The results below show how agent performance changes as the number of sampled actions at each time step increases. We observe a consistent trend across benchmarks: increasing the number of sampled actions generally improves performance, but the gains gradually diminish, indicating a point beyond which additional sampled actions contribute little further improvement.

Inference Time Scaling

BibTeX

@article{xiong2025raggym,
    title={RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision}, 
    author={Guangzhi Xiong and Qiao Jin and Xiao Wang and Yin Fang and Haolin Liu and Yifan Yang and Fangyuan Chen and Zhixing Song and Dengyu Wang and Minjia Zhang and Zhiyong Lu and Aidong Zhang},
    journal={arXiv preprint arXiv:2502.13957},
    year={2025}
}