Recursive Think-Answer Process
for LLMs and VLMs

Byung-Kwan Lee*†   
KAIST
    Youngchae Chee*   
KAIST
    Yong Man Ro
KAIST

*Equal Contribution    †Currently Research Scientist at NVIDIA
CVPR 2026 Findings
Abstract. Think–Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think–Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a Confidence Generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards—Recursively Confidence Increase Reward and Final Answer Confidence Reward—we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that R-TAP–applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning.

Figure 1: Overall accuracy (%) of numerous large language models (LLMs) on five evaluation benchmarks—AIME25, HMMT Feb 25, OmniMath, GPQA, and LiveCodeBench.
Motivation. Current Think–Answer models almost always rely on a single-pass reasoning trajectory: after generating one Think–Answer pair, the model stops its inference process, even when the reasoning is inaccurate, inconsistent, or clearly uncertain. Models often emit self-reflective cues such as "Oops!" or "Let me try again" that signal uncertainty, yet these signals go unused: the model outputs its final answer without any mechanism for self-evaluation or further refinement. R-TAP addresses this by introducing a confidence-driven iterative reasoning framework that lets the model re-engage its reasoning cycles and refine its own answers internally.

Figure 2: Overall accuracy (%) of numerous vision-language models (VLMs) on five evaluation benchmarks—MMMU, MathVista, OlympiadBench, MathVision, and MMMU-Pro.
Confidence Generator. A central component of R-TAP is the Confidence Generator Cϕ, which estimates the reliability of each response in the recursive Think-Answer trajectory. Given a question q and a Think-Answer response o(t), the Confidence Generator outputs a scalar confidence score in [0, 1]. It shares the reference model's architecture but replaces the language head with a confidence head followed by a sigmoid activation. Importantly, the Confidence Generator is used only during training and removed at inference, so R-TAP introduces no additional inference-time cost.
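Such a confidence head can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the `hidden_size` argument and the choice to pool the final token's hidden state are assumptions.

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Scalar confidence head standing in for the language head.

    Hypothetical sketch: last-token pooling and `hidden_size` are
    assumptions, not details specified by the paper.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the
        # reference-model backbone; pool the final token's state.
        pooled = hidden_states[:, -1, :]
        # Sigmoid maps the logit to a confidence score in [0, 1].
        return torch.sigmoid(self.proj(pooled)).squeeze(-1)
```

Because the head outputs a single sigmoid-activated scalar per sequence, it plugs directly into the confidence-based rewards without any decoding step.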

Figure 3: Qualitative example of recursive think–answer process on a combinatorics question. The model iteratively refines its solution across multiple reasoning cycles, successfully correcting initial misconceptions such as off-by-one errors.
Confidence-Based Reward Design. R-TAP introduces two complementary confidence-driven rewards:
  • Recursively Confidence Increase Reward: To reward meaningful refinement across recursive steps, this reward encourages the confidence score to improve from one cycle to the next: RIncrease = (1/(M-1)) Σ 𝟙[Conf(t+1) > Conf(t)], where M is the effective recursion depth.
  • Final Answer Confidence Reward: The final answer must be sufficiently confident: RFinal = 𝟙[Conf(M) ≥ τ], where τ is a preset threshold.
The total reward combines these with conventional rewards: R = RIncrease + RFinal + RFormat + RAnswer + RLength.
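Given a trajectory of per-cycle confidence scores, the two confidence-based rewards above reduce to a small helper. The function name `rtap_confidence_rewards` and the default threshold value are hypothetical; the paper only states that τ is preset.

```python
def rtap_confidence_rewards(confs, tau=0.9):
    """Compute R-TAP's two confidence-based rewards from a trajectory.

    `confs` is the list [Conf(1), ..., Conf(M)] of per-cycle confidence
    scores in [0, 1]; the default `tau` is an assumed placeholder.
    """
    m = len(confs)
    # Recursively Confidence Increase Reward: fraction of consecutive
    # cycle pairs whose confidence strictly improved.
    if m > 1:
        r_increase = sum(
            1 for t in range(m - 1) if confs[t + 1] > confs[t]
        ) / (m - 1)
    else:
        r_increase = 0.0
    # Final Answer Confidence Reward: indicator that the last cycle's
    # confidence clears the threshold tau.
    r_final = 1.0 if confs[-1] >= tau else 0.0
    return r_increase, r_final
```

For example, a monotonically improving trajectory [0.2, 0.5, 0.95] earns both rewards in full, while a trajectory that dips and recovers earns only a fraction of the increase reward.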

Figure 4: Recursive Think-Answer Process. Given a question q, the base LLM/VLM πθ recursively generates multiple Think-Answer responses o(t) until the answer is correct. A pre-trained Confidence Generator Cϕ assesses each question and Think-Answer pair, then produces a confidence score Conf(t). This score is used to compute the confidence-based rewards—RIncrease and RFinal—which serve as the reinforcement signal that trains the model to recursively generate higher-confidence Think-Answers.
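The recursive rollout described in Figure 4 can be sketched as a simple loop. Here `model`, `conf_gen`, and `max_cycles` are hypothetical names standing in for the trained policy, the Confidence Generator, and the recursion depth; the context-concatenation scheme is an assumption for illustration.

```python
def recursive_think_answer(model, conf_gen, question, max_cycles=4):
    """Sketch of the recursive rollout used during R-TAP training.

    `model(context)` returns one Think-Answer string o(t);
    `conf_gen(question, response)` returns Conf(t) in [0, 1].
    Both callables are hypothetical stand-ins, not the paper's API.
    """
    history, confs = [], []
    context = question
    for t in range(max_cycles):
        response = model(context)            # one Think-Answer cycle o(t)
        conf = conf_gen(question, response)  # confidence score Conf(t)
        history.append(response)
        confs.append(conf)
        # Feed the previous attempt back so the next cycle can refine it.
        context = question + "\n" + response
    return history, confs
```

The returned confidence trajectory is exactly what the confidence-based rewards consume during GRPO training.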
Training Overview. R-TAP proceeds in two stages. Stage 1 performs supervised learning on the Confidence Generator using binary correctness labels for each reasoning trajectory produced by the target model. Stage 2 applies reinforcement learning with GRPO to optimize the model's reasoning behavior under the recursive rewards, enabling it to generate progressively more accurate and confident reasoning across cycles. During this stage, the Confidence Generator is trained simultaneously so that it keeps predicting reliable confidence scores for the updated model's responses in real time.
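Stage 1's supervised objective can be illustrated as binary cross-entropy between the predicted confidence and the trajectory's correctness label. BCE is an assumed instantiation consistent with the binary labels described above; the paper does not name the exact loss.

```python
import math

def confidence_bce(pred_conf, is_correct, eps=1e-7):
    """Stage-1 objective sketch for the Confidence Generator.

    Binary cross-entropy between a sigmoid-activated confidence
    prediction in [0, 1] and the binary correctness label of one
    reasoning trajectory. BCE here is an assumption, not a detail
    confirmed by the paper.
    """
    # Clamp to avoid log(0) at the extremes of the sigmoid output.
    p = min(max(pred_conf, eps), 1.0 - eps)
    y = 1.0 if is_correct else 0.0
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
```

Under this loss, confident-and-correct or unconfident-and-incorrect predictions are cheap, while confidently wrong predictions are penalized heavily, which is the behavior a reliable confidence signal needs.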

Figure 5: Training curves showing the progression of three reward signals—recursively confidence increase reward, last answer's confidence reward, and accuracy reward—over iterations during GRPO. All rewards show consistent upward trends, indicating effective recursive refinement.
Key Results. R-TAP delivers strong and consistent performance improvements across diverse language and vision-language reasoning benchmarks. R-TAP-applied models show dramatic improvements, closing the gap to OpenAI o1 and o3 models. Moreover, recursive refinement leads to a substantial reduction in "Oops!"-style self-corrections during inference, indicating that R-TAP achieves more reliable yet faster inference-time reasoning with fewer failures along the trajectory. These results demonstrate that confidence-guided recursion training is a powerful mechanism for enhancing both the accuracy and inference speed of modern reasoning models.

Figure 6: Impact of R-TAP on reducing the number of "Oops"-style words—which corresponds to the number of erroneous reasoning steps—and its effect on substantially reducing inference time. (Left) Negative correlation between the number of erroneous reasoning steps and R-TAP training iterations. (Center) Significant reduction of erroneous reasoning steps after applying R-TAP. (Right) Substantial reduction of inference time resulting from the reduction of erroneous reasoning steps after applying R-TAP.