
TDRM

Smooth Reward Models with Temporal Difference for LLM RL and Inference

Process Reward Model + Online Temporal Difference Training
to Train Smooth and Strong Reward Models.

Overview of the TDRM framework and its improvement over the baseline method.

Dan Zhang*1, Min Cai*2, Jonathan Li3, Ziniu Hu3,

Yisong Yue3, Yuxiao Dong1, Jie Tang1,4

* Equal Contribution

1 Department of Computer Science and Technology, Tsinghua University    2 University of Alberta   
3 California Institute of Technology    4 School of Electronics and Computer Science, University of Southampton   


Abstract

Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences during training. This temporal-difference (TD) regularization produces smooth rewards and improves alignment with long-term objectives. Incorporating TDRM into the actor-critic style online RL loop yields consistent empirical gains. Notably, TDRM complements verifiable reward methods, and the two can be used in tandem. Experiments show that TD-trained process reward models (PRMs) improve performance across Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with verifiable rewards, TDRM-trained PRMs lead to more data-efficient RL (reaching with just 2.5k examples the performance that baseline methods need 50.1k examples to attain) and yield higher-quality language model policies on 8 model variants (5 series): Qwen2.5-(0.5B, 1.5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B).

TDRM Algorithm

Reward Model Comparison

The TDRM framework aims to train a smooth reward model by leveraging trajectory-level optimization.

Training Smooth Reward Models


Bootstrapping Values from Upcoming States

Given a trajectory, the model bootstraps values from upcoming states to improve its current state estimates. For example, consider the question:

What is the arithmetic mean of the integers from –4 through 5, inclusive? Express your answer as a decimal to the nearest tenth.

In the example below, the value estimate of state 1 is computed by bootstrapping from state 2:

🤖
Step 1: To find the arithmetic mean of a set of numbers, I need to add up all the numbers and divide by the number of numbers.
State Value: 0.4999
Step 2: I can use a formula to do this, or I can just write out the numbers and add them up.
State Value: 0.4997  ↩ this value is used to update V(s₁)
Value of state 1 updated to 0.4999 + 0.8 × 0.4997 ≈ 0.8997
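
To make the bootstrapping step concrete, here is a minimal sketch in PyTorch. It assumes a one-step TD target with discount γ = 0.8 (the value used in the worked numbers above) and treats the step score as the immediate reward; the function name and the choice to bootstrap the final step from zero are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def td_targets(step_values: torch.Tensor, step_rewards: torch.Tensor,
               gamma: float = 0.8) -> torch.Tensor:
    """One-step TD targets along a single reasoning trajectory.

    step_values:  PRM value estimates V(s_1), ..., V(s_T), shape (T,)
    step_rewards: per-step rewards r_1, ..., r_T, shape (T,)
    Returns r_t + gamma * V(s_{t+1}); the final step bootstraps from 0 (assumption).
    """
    next_values = torch.cat([step_values[1:], step_values.new_zeros(1)])
    return step_rewards + gamma * next_values.detach()  # stop-gradient on the bootstrap

# Worked example from above: step score 0.4999, V(s_2) = 0.4997
values = torch.tensor([0.4999, 0.4997])
rewards = torch.tensor([0.4999, 0.4997])   # illustrative: step score treated as reward
targets = td_targets(values, rewards)
# targets[0] = 0.4999 + 0.8 * 0.4997 ≈ 0.8997
td_loss = torch.nn.functional.mse_loss(values, targets)
```

The TD loss then regresses each state value toward its stop-gradient TD target, which is what encourages temporally consistent, smooth value estimates.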

Applying TDRM to RL (Policy) Training

After obtaining a smooth reward model, we apply TDRM to reinforcement learning (RL) policy training. The key idea is to use the learned reward model to guide policy optimization by providing scalar rewards. In addition, we leverage both the rewards from the RM and task-specific rules (verifiable rewards).

For example, when using GRPO as the RL algorithm, the combined reward signal feeds GRPO's group-normalized advantage computation, as sketched below.
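
As a hedged illustration, the sketch below shows one way the PRM's scalar reward and a verifiable, rule-based reward could be mixed and fed into GRPO's group-normalized advantages. The mixing weight `alpha` and the function names are assumptions for illustration, not the exact formulation in the paper.

```python
import torch

def grpo_advantages(prm_rewards: torch.Tensor, rule_rewards: torch.Tensor,
                    alpha: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages for one prompt's group of sampled responses.

    prm_rewards:  scalar rewards from the TD-trained PRM, shape (G,)
    rule_rewards: verifiable rewards from task-specific rules (e.g. answer check), shape (G,)
    alpha:        mixing weight between the two signals (illustrative choice)
    """
    rewards = alpha * prm_rewards + (1.0 - alpha) * rule_rewards
    # GRPO baselines each response against the group mean and normalizes by the std.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled responses to one question
prm = torch.tensor([0.82, 0.41, 0.77, 0.35])   # PRM scores
rule = torch.tensor([1.0, 0.0, 1.0, 0.0])      # does the final answer match the reference?
advantages = grpo_advantages(prm, rule)        # weights the token-level policy-gradient loss
```

In the GRPO objective, these normalized advantages multiply the clipped probability ratios of the sampled responses.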

Results

Result 1: Policy Training with TDRM

Below, we show the results of applying TDRM to policy training. Results span several models, including the Qwen2.5(-Math) series, the GLM4/GLM-Z1 series, and the DeepSeek-R1-Distill-Qwen series, and show their performance improvements with TDRM.

Results Overview

Result 2: Inference-Time Scaling

For inference-time verification, we evaluate under two settings: Best-of-N sampling and tree search (greedy search).
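
As a rough sketch of the Best-of-N setting, a PRM verifier scores each candidate's steps and the highest-scoring candidate is returned. The `score_steps` helper and the min-aggregation over step scores are illustrative assumptions; other aggregations (last step, product, mean) are also common.

```python
from typing import Callable, List

def best_of_n(candidates: List[List[str]],
              score_steps: Callable[[List[str]], List[float]]) -> int:
    """Return the index of the best candidate under a PRM verifier.

    candidates:  N candidate solutions, each a list of reasoning steps
    score_steps: hypothetical helper returning one PRM score per step
    Each candidate is aggregated by its minimum step score (one common choice).
    """
    scores = [min(score_steps(steps)) for steps in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])
```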

Analysis

Smoothness and Training Dynamics

To demonstrate that the RMs trained with TDRM are smoother, we analyze their behavior across various dimensions.

First, we calculate a metric inspired by local Lipschitz continuity to quantify the smoothness of the RMs.

According to this metric, TDRM (0.2741) is significantly smoother than ScalarPRM (0.3331).
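
The exact metric is defined in the paper; as a hedged sketch of what a local-Lipschitz-style smoothness measure can look like, one option is to average the ratio of reward change to state change over consecutive steps of a trajectory, where lower means smoother. The state representation and normalization below are assumptions.

```python
import torch

def local_lipschitz_estimate(values: torch.Tensor, states: torch.Tensor,
                             eps: float = 1e-6) -> float:
    """Rough local-Lipschitz-style smoothness estimate for one trajectory.

    values: PRM values V(s_1), ..., V(s_T), shape (T,)
    states: step representations (e.g. hidden-state embeddings), shape (T, d)
    Averages |V(s_{t+1}) - V(s_t)| / ||s_{t+1} - s_t|| over consecutive steps;
    lower values indicate a smoother reward model.
    """
    dv = (values[1:] - values[:-1]).abs()
    ds = (states[1:] - states[:-1]).norm(dim=-1)
    return (dv / (ds + eps)).mean().item()
```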


Furthermore, we analyze smoothness by comparing TD error across steps and by relating TD error to the absolute value difference between consecutive steps.

Across steps, TDRM exhibits consistently lower TD error than ScalarPRM, indicating its superior smoothness. We also show the training dynamics of RL with TDRM, demonstrating that it improves the sample efficiency of RL training.
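
For reference, such per-step diagnostics could be computed along a trajectory as sketched below; the discount factor and the zero bootstrap at the final step are illustrative assumptions.

```python
import torch

def td_error_diagnostics(values: torch.Tensor, rewards: torch.Tensor,
                         gamma: float = 0.8):
    """Per-step diagnostics used to compare reward models on one trajectory.

    td_errors[t]       = r_t + gamma * V(s_{t+1}) - V(s_t)   (final step bootstraps from 0)
    abs_value_diffs[t] = |V(s_{t+1}) - V(s_t)|
    A smoother, temporally consistent PRM shows smaller TD errors across steps.
    """
    next_values = torch.cat([values[1:], values.new_zeros(1)])
    td_errors = rewards + gamma * next_values - values
    abs_value_diffs = (values[1:] - values[:-1]).abs()
    return td_errors, abs_value_diffs
```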

Citation

If you find TDRM useful in your research, please consider citing our work:

@misc{zhang2025tdrmsmoothrewardmodels,
      title={TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference},
      author={Dan Zhang and Min Cai and Jonathan Li and Ziniu Hu and Yisong Yue and Yuxiao Dong and Jie Tang},
      year={2025},
      eprint={2509.15110},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.15110},
}