Confidence-Based Reward Design.
R-TAP introduces two complementary confidence-driven rewards:
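- Recursively Confidence Increase Reward: To encourage meaningful refinement across recursive steps, this reward measures the fraction of consecutive cycles in which the confidence score strictly increases: $R_{\text{Increase}} = \frac{1}{M-1} \sum_{t=1}^{M-1} \mathbb{1}[\mathrm{Conf}(t+1) > \mathrm{Conf}(t)]$, where $M$ is the effective recursion depth and $\mathrm{Conf}(t)$ is the confidence score at cycle $t$.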
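- Final Answer Confidence Reward: The final answer must be sufficiently confident, so this reward is granted only when the confidence at the last cycle reaches a threshold: $R_{\text{Final}} = \mathbb{1}[\mathrm{Conf}(M) \geq \tau]$, where $\tau$ is a preset threshold.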
The total reward combines these with conventional rewards: $R = R_{\text{Increase}} + R_{\text{Final}} + R_{\text{Format}} + R_{\text{Answer}} + R_{\text{Length}}$.
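
As a rough illustration of how the two confidence-driven terms and the total could be computed, the following is a minimal sketch and not the authors' implementation. The function names, the example confidence values, and the threshold `tau=0.8` are hypothetical. The sketch assumes one scalar confidence per recursion cycle, treats the format, answer, and length rewards as precomputed scalars, and returns zero for the increase reward when only one cycle is available, a case the text does not specify.

```python
from typing import Sequence


def increase_reward(confidences: Sequence[float]) -> float:
    """Fraction of consecutive cycles whose confidence strictly increases."""
    m = len(confidences)  # effective recursion depth M
    if m < 2:
        # Assumption: with fewer than two cycles there is no transition
        # to compare, so the reward is defined as 0 here.
        return 0.0
    increases = sum(
        1 for prev, nxt in zip(confidences, confidences[1:]) if nxt > prev
    )
    return increases / (m - 1)


def final_reward(confidences: Sequence[float], tau: float) -> float:
    """1.0 if the last cycle's confidence reaches the threshold tau, else 0.0."""
    if not confidences:
        raise ValueError("at least one confidence score is required")
    return 1.0 if confidences[-1] >= tau else 0.0


def total_reward(
    confidences: Sequence[float],
    tau: float,
    r_format: float,
    r_answer: float,
    r_length: float,
) -> float:
    """Unweighted sum of the five reward terms, as in the total-reward formula."""
    return (
        increase_reward(confidences)
        + final_reward(confidences, tau)
        + r_format
        + r_answer
        + r_length
    )


# Hypothetical confidence trajectory over M = 4 recursion cycles.
confs = [0.42, 0.55, 0.51, 0.83]
print(increase_reward(confs))       # 2 of the 3 transitions are increases
print(final_reward(confs, tau=0.8))
print(total_reward(confs, tau=0.8, r_format=1.0, r_answer=1.0, r_length=0.0))
```

Because the comparison in the increase term is strict, a trajectory whose confidence stays flat earns nothing from it, whereas the final-answer term uses a non-strict comparison against the threshold.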