Comparison of Process Supervision Methods
Among the three process supervision algorithms, PRM achieves the best overall results, outperforming the ZSL baseline by up to 25.6% (ReAct; average F1). DPO and SFT also show significant improvements over the ZSL baseline. Interestingly, SFT slightly outperforms DPO on the Direct, CoT, and RAG agents, where tuning focuses exclusively on the answer generation step. In contrast, DPO significantly surpasses SFT on ReAct, Search-o1, and ReSearch, where tuning also involves learning to generate high-quality queries by contrasting positive and negative samples.
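To make the contrast between the two tuning objectives concrete, the following is a minimal sketch of per-example SFT and DPO losses computed from sequence log-probabilities. It assumes the standard DPO formulation with a frozen reference model; the function names, the `beta` value, and the example log-probabilities are illustrative assumptions and are not taken from the experiments reported here.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sft_loss(logp_positive: float) -> float:
    """SFT: negative log-likelihood of the positive sample only."""
    return -logp_positive


def dpo_loss(
    logp_positive: float,
    logp_negative: float,
    ref_logp_positive: float,
    ref_logp_negative: float,
    beta: float = 0.1,  # assumed value, not from the paper
) -> float:
    """Standard DPO: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy prefers the positive
    sample over the negative one, relative to the reference model.
    """
    margin = (logp_positive - ref_logp_positive) - (
        logp_negative - ref_logp_negative
    )
    # -log(sigmoid(x)) equals softplus(-x)
    return softplus(-beta * margin)


# Hypothetical log-probabilities for a (positive, negative) query pair.
print(sft_loss(-2.0))
print(dpo_loss(-2.0, -5.0, -3.0, -3.0))
```

SFT only sees the positive sample, whereas the DPO loss decreases as the policy separates the positive sample from the negative one. This is one plausible reading of why a contrastive objective could help more when the agent must also learn query generation.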