License: arXiv.org perpetual non-exclusive license
arXiv:2605.02178v1 [cs.AI] 04 May 2026
T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
Haixin Wang, Hejie Cui, Chenwei Zhang, Xin Liu, Shuowei Jin, Shijie Geng, Xinyang Zhang, Nasser Zalmout, Zhenyu Shi, Yizhou Sun
Abstract
Recent progress in multi-turn reinforcement learning (RL) has significantly improved the performance of reasoning LLMs on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token- and Turn-level Policy Optimization (T²PO), an uncertainty-aware framework that explicitly controls exploration at fine-grained levels. At the token level, T²PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T²PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T²PO in diverse environments, including WebShop, ALFWorld, and Search QA, demonstrating substantial gains in training stability and performance with better exploration efficiency. Code is available at: https://github.com/WillDreamer/T2PO.
Keywords: Agentic AI, Reinforcement Learning
1 Introduction
Recent advances in self-evolving agents are deeply rooted in multi-turn reinforcement learning (RL) (Liu et al., 2024; Team, 2025; Team et al., 2025; Wang et al., 2026), which provides the foundational mechanism for training agents to reason, act, and self-evolve through iterative interaction with their environments. Despite this progress, the community still lacks a stable and scalable training paradigm. Current multi-turn RL pipelines face intertwined challenges in both effectiveness and efficiency. On the one hand, long-horizon interactions combined with sparse reward signals make credit assignment inherently difficult (Zhou et al., 2024; Wang et al., 2025b). On the other hand, rollout collection is computationally expensive, driving the adoption of acceleration techniques such as low-precision inference (Liu et al., 2025a) and asynchronous sampling (Fu et al., 2025a). Yet these efficiency-oriented solutions inevitably introduce off-policy drift and stale-policy effects (Zheng et al., 2025a). Both issues tend to amplify training instability and frequently lead to the notorious training collapse.
Figure 1: Training instability of SOTA baselines under different environment-initialization random seeds. The success rate drops while internal signals such as the KL divergence and gradient norm explode (orange background).
To mitigate training instability, prior work has explored a variety of strategies, including fine-grained credit assignment (Feng et al., 2025), internal or process-based reward modeling (Wang et al., 2025a; Dong et al., 2025), and trajectory-level filtering of failed interactions (Yu et al., 2025; Xue et al., 2025). These approaches aim to provide denser learning signals or remove void rollouts, and have shown partial success in stabilizing optimization. However, most existing solutions operate either at a coarse trajectory level or through implicit control via reward shaping. In the inherently complex multi-turn setting, such coarse or indirect interventions make the training dynamics highly sensitive to hyperparameters and rollout distributions. As a result, they often lead to training collapse, the phenomenon characterized by rapidly degrading performance or complete failure of policy optimization, as illustrated in Figure 1.
Our key insight. To understand the origin of this collapse, we analyze representative training trajectories and identify insufficient exploration as the underlying cause, reflecting a systematic violation of the exploration–exploitation trade-off (Mehlhorn et al., 2015). We refer to this failure mode as hesitation. At the token level, LLM agents frequently exhibit over-thinking, generating long sequences of tokens whose information gain rapidly saturates while their sampling noise continues to accumulate. At the turn level, LLM agents may deviate from the successful action space at an early stage, yet continue executing numerous repetitive and unproductive turns, leaving little chance of recovery within a limited budget. Hesitation is defeat! Such behaviors introduce substantial noise into credit assignment, resulting in unstable gradients and high variance in policy updates.
Training effectiveness and efficiency need not be at odds; they can be jointly optimized once the root cause of instability is properly identified. We aim to overcome hesitation by controlling exploration, capturing intrinsic signals before exploration becomes inefficient. First, we construct a self-calibrated uncertainty signal by fusing entropy and confidence, which serves as a monitoring signal during rollouts. We then observe that continued token generation without a noticeable reduction in uncertainty indicates token-level hesitation, while repeated turns exhibiting similar uncertainty patterns indicate turn-level hesitation.
In this work, we propose T²PO to explicitly and finely control exploration. At the token level, T²PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T²PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. By explicitly reducing inefficient exploration rather than introducing additional reward shaping, T²PO restores a balanced exploration–exploitation regime. Besides, we employ rejection-based fine-tuning (RFT) (Wei et al., 2025) for cold-start, introduce a memory context window to alleviate training pressure, enforce a strict format penalty for structural compliance, and adopt SOTA policy update methods for optimization.
Extensive experiments on challenging multi-turn agentic benchmarks demonstrate the superiority of T²PO, and comprehensive ablations and analyses further verify its effectiveness in improving exploration efficiency.
2 Related Work
2.1 Agentic RL Training
Early work on LLM agents focused on modular infrastructures for interaction and evaluation. RAGEN (Wang et al., 2025b) established a unified framework for training and benchmarking agentic RL systems. Subsequent efforts sought to stabilize multi-turn training through trajectory curation and sampling: SimpleTIR (Xue et al., 2025) filters rollouts containing void turns, while rStar2-Agent (Shang et al., 2025) oversamples rollout groups and retains only high-quality trajectories, improving training stability via heuristic data selection. However, these methods rely on external filtering and do not explicitly regulate reasoning dynamics within trajectories. Meanwhile, group-based critic-free optimization has emerged as an efficient paradigm for long-horizon agent training. GiGPO (Feng et al., 2025) extends group-based advantage estimation to multi-turn settings, achieving strong performance without auxiliary value networks. Yet existing multi-turn group-based methods still lack principled mechanisms to suppress redundant reasoning within and across turns, resulting in inefficient exploration and high rollout cost.
2.2 RL with Internal Rewards
To address sparse rewards in long-horizon agentic RL, recent work leverages model-generated internal feedback to provide denser supervision. Most approaches derive unsupervised rewards from uncertainty, typically measured by policy entropy. However, entropy plays conflicting roles: some methods minimize entropy to encourage confident predictions, while others promote high-entropy exploration by incorporating it into advantage estimation, as in SEED-GRPO (Chen et al., 2025) and related designs. Beyond entropy, DeepConf (Fu et al., 2025b) exploits model-internal confidence to filter low-quality reasoning traces. While these studies show that internal signals can guide exploration, existing methods rely on single-scale heuristics or static reward shaping, lacking principled mechanisms to regulate reasoning across both token and turn levels.
3 Preliminaries
Figure 2: Overview of the proposed Uncertainty-Guided Exploration Control at both token and turn levels.
We introduce an agentic RL framework that enables an LLM-based agent to interact with external environments and perform multi-turn reasoning to solve complex tasks. Each task begins with a user prompt $q$, which specifies the task description, and proceeds over multiple turns $k\in\{1,2,\ldots,K\}$.
At each turn $k$, the agent interacts with the environment to obtain an observation represented as the state $\mathbf{s}^{k}\in\mathcal{S}$, where $\mathcal{S}$ denotes the environment-defined state space. Based on this state, the agent generates an action $\mathbf{a}^{k}\in\mathcal{V}^{n}$, where $\mathcal{V}^{n}$ is the action space formed over the LLM tokenizer vocabulary $\mathcal{V}$.
Typically, base LLMs fine-tuned with chain-of-thought (CoT) post-training produce both thinking tokens $a^{k}_{c}$ and action tokens $a^{k}_{o}$, wrapped in special tags (e.g., <think>…</think> and <action>…</action>, following the chat format). Thus, $\mathbf{a}^{k}$ can be expressed as $\{a^{k}_{1},a^{k}_{2},\ldots,a^{k}_{t},\ldots,a^{k}_{T}\}$, where $T$ is the maximum response length.
The agent's behavior is governed by a policy $\pi_{\theta}(\mathbf{a}^{k}\mid\mathbf{s}^{k},q)$, which specifies a distribution over possible outputs conditioned on the current state and the initial user prompt.
After each action, the environment provides feedback in the form of a scalar reward $r^{k}\in\mathbb{R}$ and the next state $\mathbf{s}^{k+1}$, unless the maximum number of turns $K$ is reached. Once turn $K$ is completed, a full trajectory is obtained as $\tau=\{(\mathbf{s}^{1},\mathbf{a}^{1},r^{1}),(\mathbf{s}^{2},\mathbf{a}^{2},r^{2}),\ldots,(\mathbf{s}^{K},\mathbf{a}^{K},r^{K})\}$.
In many real-world scenarios, rewards are sparse or delayed, which makes credit assignment particularly challenging given the thousands of tokens generated by LLMs.
4 Method
4.1 Self-calibrated Uncertainty Signal for Control
Limitations in typical RL setups. Token entropy and confidence are commonly used to measure the uncertainty of the token generation distribution. At decoding step $t$ in each turn, the policy LLM $\pi_{\theta}$ outputs a categorical probability vector $p_{t}$ over the vocabulary, conditioned on the prompt, the interaction history, and the tokens generated so far. The two signals are computed as:

$$H_{t}=-\sum_{i=1}^{V}p_{t}^{(i)}\log p_{t}^{(i)},\qquad C_{t}=-\frac{1}{j}\sum_{i=1}^{j}\log p_{t}^{(i)}$$
(1)

where $p_{t}^{(i)}$ denotes the $i$-th largest probability, so $C_{t}$ averages over the top-$j$ tokens. Low token entropy indicates a sharply peaked distribution and higher certainty, while high confidence likewise reflects greater model certainty.
However, both exhibit inherent limitations. Entropy reflects the overall smoothness of the token distribution but shows limited discriminability at the two extremes, when the distribution is nearly uniform or highly peaked. This limitation becomes particularly pronounced when the vocabulary size is large, such as the 152K-token vocabulary of Qwen3 (Team, 2025). Since the entropy range scales with $[0,\log V]$, the entropy gap between two very different predictions, for example $(1,0,0,\ldots)$ and $(0.5,0.5,0,\ldots)$, is only $\log 2$. Such a difference is negligible compared with the full entropy scale.
Thus, entropy alone may fail to distinguish genuinely uncertain predictions from extremely sharp ones. Confidence, in contrast, depends only on the probability of the arg-max token and therefore ignores how the remaining probability mass is distributed. Thus, very different token distributions may yield identical confidence despite different levels of uncertainty (Fu et al., 2020). As shown in Figure 3, both have blind regions:
Figure 3: Contours of $H_{t}$ fail to discriminate highly uncertain distributions near uniformity, while $C_{t}$ ignores variations in tail probabilities. The proposed signal $M_{t}$ integrates both measures, producing non-degenerate contour geometry that distinguishes distributions sharing identical top-$k$ probability but differing residual mass.
Figure 4: (a) Uncertainty dynamics of the self-calibrated signal $M_{t}$ over response length. (b) Word cloud of tokens with the highest uncertainty. (c) Colormap of the uncertainty signal aggregated by the sliding window. When the signal variation falls below $\epsilon$ (corresponding to the brightest token 'Then'), the thinking cutoff is triggered.
Self-calibrated uncertainty signal. Based on the above analysis, $C_{t}$ and $H_{t}$ are complementary in covering both smooth and non-smooth distributions. To obtain a scalar indicator of local distributional stability, we first normalize both entropy and confidence across the decoding trajectory:

$$\tilde{H}_{t}=\frac{H_{t}-H_{\min}}{H_{\max}-H_{\min}},\qquad\tilde{C}_{t}=\frac{C_{t}-C_{\min}}{C_{\max}-C_{\min}}$$
(2)

and construct a self-calibrated stability signal:

$$M_{t}=\alpha\tilde{H}_{t}+(1-\alpha)(1-\tilde{C}_{t}),\qquad\alpha\in[0,1]$$
(3)

Compared with $C_{t}$, the contour lines of $M_{t}$ are no longer piecewise-linear and degenerate under the max operator. $M_{t}$ preserves the top-1-driven stratification while introducing curvature within each stratum, enabling it to distinguish distributions with identical $\max(p)$ but different residual mass allocations. Compared with $H_{t}$, whose contours concentrate around the uniform distribution, $M_{t}$ produces high-uncertainty regions that align more closely with the presence or absence of a dominant class. Meanwhile, it retains entropy's sensitivity to tail dispersion, yielding uncertainty patterns that better match practical class-confusion behaviors.
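To make the construction concrete, the snippet below is a minimal sketch (not the authors' released code) of Eqs. (1)–(3): it computes entropy and top-$j$ confidence from one turn's logits, min–max normalizes both across the decoding trajectory, and fuses them into $M_{t}$. The defaults `alpha=0.5` and `top_j=20` are illustrative assumptions.

```python
import torch

def self_calibrated_uncertainty(logits: torch.Tensor, alpha: float = 0.5,
                                top_j: int = 20) -> torch.Tensor:
    """logits: [T, V] pre-softmax scores for the T tokens of one turn."""
    probs = torch.softmax(logits, dim=-1)                     # [T, V]
    # Token entropy H_t over the full vocabulary (Eq. 1, left).
    H = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)  # [T]
    # Confidence C_t: mean negative log-prob of the top-j tokens (Eq. 1, right).
    top_p, _ = probs.topk(top_j, dim=-1)
    C = -torch.log(top_p.clamp_min(1e-12)).mean(-1)           # [T]
    # Min-max normalize both signals across the decoding trajectory (Eq. 2).
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    # Fuse into the self-calibrated signal M_t (Eq. 3); high M_t = uncertain.
    return alpha * norm(H) + (1.0 - alpha) * (1.0 - norm(C))
```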
4.2 Token-Level Thinking Intervention (TTI)
Motivation. Reasoning LLMs, inspired by the aha moment (Liu et al., 2024), tend to generate elaborate CoT before producing an action. While such reasoning improves decision quality, excessively long internal sequences introduce computational overhead and amplify policy-gradient variance during agent training. Therefore, we continually ask a central question:
How can we explicitly and adaptively discard redundant CoT tokens while preserving reasoning performance?
Our first intuition was to monitor token-level uncertainty signals. As shown in Figure 4(a), we aggregate trajectories from the SOTA baseline and observe that confidence first decreases and then increases, while entropy first rises and then falls. Meanwhile, the most uncertain tokens are precisely the ones the model should generate in a shopping scenario, namely tokens related to product information. More importantly, the tokens generated after these peak points only reduce exploration efficiency. Therefore, we propose TTI to finely and adaptively terminate reasoning once the predictive distribution exhibits convergence behavior.
When should we stop? Lower values of $M_{t}$ reflect both higher confidence and lower entropy. Therefore, as token generation progresses, the dynamics of $M_{t}$ serve as a reliable indicator of exploration efficiency. We monitor the temporal variation at token $t$ of turn $k$ as $\Delta_{t}^{k}=|M_{t}^{k}-M_{t-1}^{k}|$; monitoring starts only after a minimum prefix length $L_{\min}$ has been generated, to avoid premature truncation. A non-hesitation event is declared when the average variation over a trailing window of size $N$ falls below a tolerance $\varepsilon$:

$$\frac{1}{N+1}\sum_{i=0}^{N}\Delta^{k}_{t-i}<\varepsilon$$
(4)

We denote the first such time as $t^{\ast}$. Intuitively, this marks the point at which the predictive distribution ceases to change meaningfully, indicating that additional reasoning contributes little new information, as shown in Figure 4(c).
Why not truncate at the peak? In Figure 4(b), the highly exploratory tokens can be broadly categorized into two types. The first consists of connective or discourse tokens, which are closely related to the model's internal reasoning transitions and often coincide with "aha moments" in reasoning-style generation. The second corresponds to task-specific tokens, such as product names or attribute descriptors, which carry essential semantic information required for successful task completion. If we directly followed the trend in Figure 4(a) and truncated each response at the peak point, the truncation would likely fall on task-specific tokens; this would not only fail to improve efficiency but could also hinder effective exploration by prematurely removing critical semantic content.
Since task-relevant tokens are typically distributed across contiguous segments of the response, the sliding-window aggregation smooths local uncertainty fluctuations and prevents spurious threshold triggers at isolated task tokens. As a result, truncation is activated only when the uncertainty signal stagnates over a sustained window, enabling efficiency gains without obstructing meaningful exploration.
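The stopping test of Eq. (4) reduces to a trailing-window average of first differences of $M_{t}$. Below is a minimal sketch under the paper's notation; the window size and tolerance defaults follow Table 5, while `L_min=64` is an illustrative assumption.

```python
from collections import deque

def find_stop_step(M: list[float], N: int = 20, eps: float = 1e-4,
                   L_min: int = 64) -> int | None:
    """Return the first non-hesitation step t*, or None if it never triggers."""
    window: deque[float] = deque(maxlen=N + 1)
    for t in range(1, len(M)):
        window.append(abs(M[t] - M[t - 1]))  # Delta_t = |M_t - M_{t-1}|
        if t >= L_min and len(window) == N + 1 and sum(window) / (N + 1) < eps:
            return t
    return None  # no trigger; fall back to the global thinking budget L_max
```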
How to stop? Once a non-hesitation event occurs, decoding does not terminate immediately. Instead, at step $t^{\ast}+1$, we explicitly intervene in the model output by forcing the reasoning-termination token (suppose its token id is 151668). Let $z_{t}\in\mathbb{R}^{|\mathcal{V}|}$ denote the pre-softmax logits at step $t$. We overwrite the logits as follows:

$$z_{t^{\ast}+1}(v)=\begin{cases}+\infty,&v=\texttt{151668},\\ -\infty,&v\neq\texttt{151668},\end{cases}$$
(5)

which yields $p_{\theta}(y_{t^{\ast}+1}=\texttt{</think>}\mid y_{\leq t^{\ast}})=1$. This operation deterministically terminates the reasoning phase and eliminates stochasticity at the stopping point.
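In practice, the $\pm\infty$ overwrite of Eq. (5) can be realized with finite values: set every logit to $-\infty$ and the terminator's logit to any finite constant, so softmax places all probability mass on it. A minimal sketch (the id 151668 follows the paper's example):

```python
import torch

def force_terminator(logits: torch.Tensor, stop_id: int = 151668) -> torch.Tensor:
    """Overwrite a [V]-shaped logit vector so only `stop_id` can be sampled."""
    out = torch.full_like(logits, float("-inf"))  # suppress every other token
    out[stop_id] = 0.0                            # softmax -> probability 1 here
    return out
```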
How is the action generated after stopping? Following the forced emission of the reasoning terminator, we inject a fixed deterministic token queue $\mathcal{Q}=[\texttt{</think>},\ \texttt{\textbackslash n},\ \texttt{<action>}]$ (the exact tag tokens follow the environment's chat format), starting at step $t^{\ast}+1$. Let $\mathcal{Q}[j]$ denote the $j$-th token in the queue. For $j\in\{1,\dots,|\mathcal{Q}|\}$, we enforce $y_{t^{\ast}+j}=\mathcal{Q}[j]$ without sampling from the model distribution. This explicitly delineates the boundary between the reasoning and execution phases, ensuring structured outputs.
Are there any constraints? (1) One-time activation. To avoid repeated triggering, the stopping mechanism is allowed to activate only once per generation. Let $\mathbb{I}_{\mathrm{stop}}\in\{0,1\}$ be a binary indicator initialized to 0. The stopping rule is applied only if
$$\mathbb{I}_{\mathrm{stop}}=0\quad\land\quad y_{t}\in\mathcal{V}_{\mathrm{reason}},$$
after which we set $\mathbb{I}_{\mathrm{stop}}\leftarrow 1$ and disable further checks.
(2) Global thinking budget. To guarantee termination, we impose a maximum decoding length $L_{\max}$. If $t=L_{\max}$, we again enforce deterministic termination by overwriting the logits.
Definition 4.1 (TTI Rule). TTI is triggered if:

$$\frac{1}{N+1}\sum_{i=0}^{N}\Delta^{k}_{t-i}<\varepsilon\quad\lor\quad t\geq L_{\max}.$$
(6)
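Putting the pieces together, the following sketch shows one decoding loop with the full TTI rule: the windowed test of Eq. (4), the one-time activation (the loop breaks at most once), the global budget $L_{\max}$, and the deterministic queue injection. `sample_token`, `m_signal`, and the tag strings in `queue` are hypothetical stand-ins for the real sampler, the $M_{t}$ computation, and the chat format.

```python
def decode_with_tti(sample_token, m_signal, L_min=64, L_max=450, N=20,
                    eps=1e-4, queue=("</think>", "\n", "<action>")):
    tokens, M = [], []
    while len(tokens) < L_max:
        tokens.append(sample_token(tokens))  # one ordinary sampling step
        M.append(m_signal(tokens))           # self-calibrated M_t for this step
        t = len(M) - 1
        if t >= L_min and t - N >= 1:
            deltas = [abs(M[i] - M[i - 1]) for i in range(t - N, t + 1)]
            if sum(deltas) / (N + 1) < eps:
                break                        # non-hesitation event at t* (Eq. 4)
    # Either the window test fired or the budget was hit (Eq. 6): emit the
    # forced terminator and the deterministic queue Q, bypassing the sampler.
    tokens.extend(queue)
    return tokens
```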
4.3 Turn-Level Dynamical Sampling
Motivation. Agentic interaction unfolds over multiple turns along a trajectory. At the turn level, once the model's perception of the environment stabilizes, it may repeatedly produce semantically similar but failed reasoning traces across turns, leading to redundant interactions and reduced exploration efficiency. A natural inspiration comes from DAPO's dynamic sampling (Yu et al., 2025), which improves sample efficiency by filtering out trivial prompt groups whose accuracy saturates at 0 or 1. However, directly adopting this strategy in multi-turn agentic RL is non-trivial. Unlike single-turn settings where accuracy can be readily computed per prompt group, multi-turn trajectories typically lack dense process rewards and do not admit a well-defined per-turn "accuracy" signal for dynamic filtering. To regularize interaction dynamics at the turn level under this constraint, we introduce a complementary turn-level dynamical sampling (TDS) mechanism, which identifies and down-weights redundant turns based on trajectory-level interaction signals.
Turn-level control signal. To measure whether $\mathbf{a}^{k+1}$ is engaging in meaningless exploration compared with $\mathbf{a}^{k}$, we first aggregate all token-level self-calibrated uncertainty signals $M_{t}^{k}$ within a single turn. Specifically, the turn-level observation signal is the geometric mean $\Phi^{k}=\left(\prod_{t=1}^{T}M_{t}\right)^{1/T}$. We then monitor the temporal variation between consecutive turns as $\Gamma^{k}=|\Phi^{k}-\Phi^{k-1}|$. Intuitively, $\Gamma^{k}$ measures how significantly the model's internal confidence and uncertainty structure have shifted from turn $k-1$ to turn $k$. Large values indicate evolving beliefs or problem-solving states, whereas small values indicate that the agent is repeatedly generating similar, low-information reasoning content.
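A minimal sketch of the turn-level signal: $\Phi^{k}$ as the geometric mean of the per-token $M_{t}$ within turn $k$, and $\Gamma^{k}$ as its shift across consecutive turns.

```python
import math

def turn_signal(M_turn: list[float]) -> float:
    """Phi^k: geometric mean of token-level signals; clamping avoids log(0)."""
    return math.exp(sum(math.log(max(m, 1e-12)) for m in M_turn) / len(M_turn))

def turn_shift(phi_prev: float, phi_curr: float) -> float:
    """Gamma^k = |Phi^k - Phi^{k-1}|."""
    return abs(phi_curr - phi_prev)
```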
When to dynamically sample? Similarly, we introduce a tolerance threshold $\eta>0$ controlling the sensitivity of turn-level adaptation. A regeneration event is triggered at turn $k$ when $\Gamma^{k}<\eta$.
How to dynamically sample? Specifically, when $\Gamma^{k}<\eta$, the action of the current turn is resampled under the same state, as formalized below.
Definition 4.2 (TDS Rule). TDS is defined as follows:

$$\mathbf{a}^{k}_{\texttt{new}}\leftarrow\begin{cases}\textnormal{Re-generate}(\mathbf{a}^{k}),&\text{if }\Gamma^{k}=|\Phi^{k}-\Phi^{k-1}|<\eta,\\ \mathbf{a}^{k},&\text{otherwise},\end{cases}$$
(7)

where $\textnormal{Re-generate}(\cdot)$ denotes a fresh rollout under the same state. This procedure repeats until $\Gamma^{k}\geq\eta$ or the resampling budget $B_{\max}$ is exhausted. The turn-level control signal $\Phi^{k}$ is recomputed after regeneration terminates.
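A minimal sketch of the TDS loop in Eq. (7): resample the turn under the same state until the shift clears $\eta$ or a budget $B_{\max}$ runs out. `rollout_turn` is a hypothetical policy/environment call returning the turn's action and its $\Phi^{k}$; the default `eta` mirrors the turn-level tolerance in Table 5, while `B_max=3` is an assumption.

```python
def tds_resample(rollout_turn, state, phi_prev: float,
                 eta: float = 1e-3, B_max: int = 3):
    action, phi = rollout_turn(state)      # initial attempt at turn k
    budget = B_max
    while abs(phi - phi_prev) < eta and budget > 0:
        action, phi = rollout_turn(state)  # Re-generate(a^k) under the same state
        budget -= 1
    return action, phi                     # Phi^k recomputed on exit
```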
4.4 Policy Update
Memory context window. Since each task requires multiple interactions with the environment, directly concatenating the entire trajectory $\tau$ for optimization would result in excessively long sequences, significantly increasing computational overhead and memory consumption. We therefore adopt a memory context window that includes only the interaction history of the most recent $P$ turns. Concretely, the current state $\mathbf{s}^{K}$ contains information from $\mathbf{s}^{K-P}$ to $\mathbf{s}^{K-1}$ and the corresponding actions $\mathbf{a}^{K-P}$ to $\mathbf{a}^{K-1}$, rather than the full trajectory history.
Credit assignment. In practice, reward signals across turns are extremely sparse. To mitigate this issue, a standard approach in multi-turn RL is to introduce a discounted return over turns. Let $\beta\in(0,1)$ denote the turn-level discount factor. The effective training signal is the discounted return $R(\tau^{k})=\sum_{j=k}^{K}\beta^{\,j-k}r^{j}$. This formulation propagates supervision from terminal outcomes back to earlier decisions, allowing each action $\mathbf{a}^{k}$ to be optimized based on its long-term impact on future rewards rather than relying solely on immediate feedback.
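The discounted return admits the usual right-to-left recursion $R^{k}=r^{k}+\beta R^{k+1}$; a minimal sketch (the value of $\beta$ here is illustrative):

```python
def discounted_returns(rewards: list[float], beta: float = 0.95) -> list[float]:
    """returns[k] = sum_{j >= k} beta^(j-k) * r_j, computed right-to-left."""
    returns, running = [0.0] * len(rewards), 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + beta * running
        returns[k] = running
    return returns
```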
Table 1: Comparison with different policy optimization methods on WebShop and ALFWorld.

| Method | WebShop: Task Score | WebShop: Success Rate | ALFWorld: Success Rate | Pick | Look | Clean | Heat | Cool | Pick2 |
|---|---|---|---|---|---|---|---|---|---|
| *Prompting* | | | | | | | | | |
| GPT-4o (Achiam et al., 2023) | 31.8 | 23.7 | 48.0 | 75.3 | 60.8 | 31.2 | 56.7 | 21.6 | 49.8 |
| Gemini-2.5-Pro (Comanici et al., 2025) | 42.5 | 35.9 | 60.3 | 92.8 | 63.3 | 62.1 | 69.0 | 26.6 | 58.7 |
| Claude Sonnet 4 (Anthropic, 2025) | 45.63 | 39.82 | 63.71 | 90.13 | 65.34 | 66.77 | 70.14 | 29.80 | 61.36 |
| Qwen3-32B (Team, 2025) | 25.17 | 5.89 | 25.63 | 63.53 | 18.33 | 18.70 | 24.31 | 10.08 | 10.11 |
| *Instruction Tuning* | | | | | | | | | |
| Qwen3-4B + SFT | 70.91 | 26.56 | 64.06 | 89.29 | 66.67 | 64.12 | 59.26 | 35.71 | 54.54 |
| *RL Training (based on Qwen3-4B-RFT)* | | | | | | | | | |
| PPO (Schulman et al., 2017) | 70.34 ±8.63 | 61.93 ±5.93 | 75.39 ±3.81 | 83.34 ±9.47 | 75.09 ±6.25 | 74.50 ±7.90 | 62.57 ±1.56 | 84.21 ±0.00 | 58.33 ±7.38 |
| GRPO (Shao et al., 2024) | 80.02 ±7.94 | 68.56 ±4.11 | 77.35 ±0.62 | 85.32 ±6.77 | 64.59 ±4.34 | 91.16 ±0.79 | 90.18 ±7.15 | 73.87 ±9.64 | 60.20 ±5.31 |
| GiGPO (Feng et al., 2025) | 86.03 ±4.18 | 73.83 ±3.04 | 80.47 ±2.43 | 87.94 ±8.91 | 77.31 ±8.36 | 87.95 ±6.87 | 86.88 ±4.26 | 79.09 ±4.68 | 71.41 ±7.08 |
| GiGPO + DAPO (Yu et al., 2025) | 86.54 ±9.81 | 74.02 ±8.18 | 80.86 ±1.37 | 89.94 ±8.06 | 72.08 ±0.08 | 93.05 ±0.43 | 79.05 ±7.45 | 83.08 ±7.75 | 65.55 ±9.12 |
| T²PO (Ours) | 93.84 ±0.22 | 81.64 ±0.39 | 90.23 ±1.38 | 97.36 ±6.94 | 87.77 ±4.89 | 98.33 ±2.77 | 85.11 ±7.64 | 85.84 ±2.57 | 80.35 ±2.86 |
| *RL Training (based on Qwen3-8B-RFT)* | | | | | | | | | |
| GRPO (Shao et al., 2024) | 79.56 ±9.67 | 69.47 ±8.01 | 80.67 ±6.36 | 90.59 ±4.27 | 72.12 ±7.37 | 83.33 ±6.12 | 70.58 ±3.67 | 88.91 ±5.38 | 62.39 ±4.72 |
| GiGPO (Feng et al., 2025) | 88.76 ±5.63 | 77.92 ±4.87 | 85.15 ±4.77 | 92.10 ±9.36 | 84.65 ±2.84 | 89.47 ±8.36 | 81.25 ±7.59 | 80.76 ±4.02 | 75.03 ±6.94 |
| GiGPO + DAPO (Yu et al., 2025) | 87.95 ±4.52 | 78.40 ±5.12 | 89.06 ±4.76 | 94.73 ±3.08 | 75.01 ±6.37 | 98.72 ±1.33 | 93.75 ±3.76 | 79.64 ±8.25 | 75.01 ±6.37 |
| T²PO (Ours) | 91.65 ±0.84 | 82.42 ±0.61 | 92.41 ±1.42 | 99.15 ±2.05 | 90.91 ±4.37 | 96.67 ±3.77 | 80.45 ±7.79 | 90.91 ±4.15 | 85.71 ±1.46 |
Policy loss. T²PO performs hierarchical advantage estimation. Following GRPO, we first group together $G$ full trajectories collected under the same task and identical initial environment states. Then we compute the relative advantage as
$$A(\tau^{k}_{i})=\frac{R(\tau^{k}_{i})-\mathrm{mean}\left(\{R(\tau^{k}_{j})\}_{j=1}^{G}\right)}{F_{\mathrm{norm}}\left(\{R(\tau^{k}_{j})\}_{j=1}^{G}\right)},$$
which captures global performance differences across full interaction trajectories. At the finer scale, we follow GiGPO to compute the turn-relative advantage $A^{\mathrm{turn}}$. Finally, we fuse the two signals into a single group-in-group advantage
$$A^{\prime}(\mathbf{a}^{k}_{i})=A(\tau^{k}_{i})+\omega\cdot A^{\mathrm{turn}}(\mathbf{a}^{k}_{i}),$$
which provides both outcome-level and turn-level process credit. With the importance ratio $\rho_{\theta}(\mathbf{a}^{k}_{i})=\frac{\pi_{\theta}(\mathbf{a}^{k}_{i}\mid\mathbf{s}^{k}_{i})}{\pi_{\theta_{\text{old}}}(\mathbf{a}^{k}_{i}\mid\mathbf{s}^{k}_{i})}$, the clipped policy update objective is:

$$\mathcal{J}(\theta)=\mathbb{E}\left[\min\left(\rho_{\theta}(\mathbf{a}^{k}_{i})A^{\prime}(\mathbf{a}^{k}_{i}),\ \mathrm{clip}\left(\rho_{\theta}(\mathbf{a}^{k}_{i}),1\pm\epsilon\right)A^{\prime}(\mathbf{a}^{k}_{i})\right)\right]-\beta\,\mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right).$$
(8)
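A minimal sketch of the loss implied by Eq. (8), assuming per-action log-probabilities and precomputed trajectory- and turn-level advantages; the defaults (`omega`, `clip_eps`, `beta_kl`) are illustrative, with the KL term supplied externally.

```python
import torch

def t2po_loss(logp, logp_old, adv_traj, adv_turn,
              omega=1.0, clip_eps=0.2, kl=None, beta_kl=0.01):
    adv = adv_traj + omega * adv_turn              # fused advantage A'
    ratio = torch.exp(logp - logp_old)             # importance ratio rho_theta
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    loss = -torch.min(unclipped, clipped).mean()   # maximize J(theta)
    if kl is not None:
        loss = loss + beta_kl * kl.mean()          # KL penalty toward pi_ref
    return loss
```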
5 Experiments
Table 2: Performance on search-augmented QA tasks. Models are trained on NQ and HotpotQA with $F_{\text{norm}}=\text{std}$. † and ⋆ indicate in-domain and out-of-domain datasets, respectively.

| Method | Type | NQ† | TriviaQA⋆ | PopQA⋆ | HotpotQA† | 2Wiki⋆ | MuSiQue⋆ | Bamboogle⋆ | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| *Prompting* | | | | | | | | | |
| GPT-4o (Achiam et al., 2023) | Closed-source | | | | | | | | |
| Qwen3-32B (Team, 2025) | Open-source | 13.56 | 41.32 | 14.28 | 18.24 | 25.77 | 3.98 | 12.32 | 21.58 |
| *RL Training (based on Qwen2.5-7B-Instruct)* | | | | | | | | | |
| R1-Instruct | Open-source | 21.0 | 44.9 | 17.1 | 20.8 | 27.5 | 6.0 | 19.2 | 22.4 |
| Search-R1 (Jin et al., 2025) | Open-source | 39.3 | 61.0 | 39.7 | 37.0 | 40.1 | 14.6 | 36.8 | 38.5 |
| ZeroSearch (Sun et al., 2025) | Open-source | 43.6 | 61.8 | 51.5 | 34.6 | 35.2 | 18.4 | 27.8 | 39.1 |
| StepSearch (Wang et al., 2025c) | Open-source | | | | 38.6 | 36.6 | 22.6 | 40.0 | |
| *RL Training (based on Qwen3-4B)* | | | | | | | | | |
| GiGPO (Feng et al., 2025) | Open-source | 44.36 | 63.67 | 46.26 | 39.28 | 39.86 | 13.40 | 70.97 | 52.97 |
| T²PO (Ours) | Open-source | 46.13 | 64.08 | 47.85 | 39.80 | 42.51 | 16.64 | 72.58 | 54.93 |
5.1 Setup
Tasks. We evaluate LLM agents on three publicly available, challenging interactive benchmarks: (1) WebShop (Yao et al., 2022), (2) ALFWorld (Shridhar et al., 2020), and (3) Search QA.
WebShop is a web-based interactive environment that tests LLM agents in realistic online shopping scenarios. Agents must navigate a simulated HTML-based shopping website to search for products, browse pages, and complete purchases. The environment contains over 1.1M products and 12k user instructions, providing a rich and diverse action space.
ALFWorld is an embodied environment designed to assess multi-step decision-making, where an agent receives a textual goal and must accomplish it through multi-turn interaction. It comprises 3,827 task instances spanning six categories: Pick & Place, Examine in Light, Clean & Place, Heat & Place, Cool & Place, and Pick Two & Place.
In addition, Search QA includes single-hop QA datasets, namely NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), and PopQA (Mallen et al., 2023), and multi-hop QA datasets, namely HotpotQA (Yang et al., 2018), 2Wiki (Ho et al., 2020), MuSiQue (Trivedi et al., 2021), and Bamboogle (Press et al., 2023). All evaluation metrics are described in Appendix A.1.
Implementation Details. We use publicly available Qwen3-4B/8B models for RFT to regulate behavioral patterns, and initialize training from Qwen3-4B/8B-RFT under three different environment random seeds. In addition to the outcome reward, we incorporate a format penalty to enforce structural compliance; details are provided in Appendix B. For ALFWorld and WebShop, all RL training methods share identical hyperparameter configurations, and the rollout group size for group-based RL methods is set to 8. For Search QA, we follow the experimental settings of Search-R1 (Jin et al., 2025): we adopt E5 as the retriever, set the rollout group size to 5, and limit the maximum number of turns to 4. Notably, we decompose each full trajectory into individual turns for optimization. Our experiments are built on the verl (Sheng et al., 2024) RL training framework with the agent loop, and all experiments run on 8× NVIDIA H100 GPUs. To ensure a fair comparison, all baselines are initialized from the RFT-based model and employ an identical format penalty to stabilize training.
Additional task details and evaluation metrics are provided in Appendix A. Complete training configurations and hyperparameter details are provided in Appendix A.2. Implementation specifics of RL training techniques, including RFT, format penalty, trajectory decomposition, and policy updates, are described in Appendix B.
Baselines. We compare T²PO against a diverse set of strong baselines. (1) Closed-source LLMs: GPT-4o, Gemini-2.5-Pro, and Claude Sonnet 4, representing SOTA general-purpose inference models. (2) RL training methods: PPO (Schulman et al., 2017), a standard actor–critic algorithm requiring an auxiliary value model; the group-based critic-free method GRPO (Shao et al., 2024), which performs advantage estimation over trajectory groups; and the SOTA baseline GiGPO (Feng et al., 2025). Additionally, we incorporate effective RL enhancements on top of GiGPO, such as the dynamic sampling proposed in DAPO (Yu et al., 2025). In fact, T²PO is plug-and-play and can be readily integrated with other policy update schemes; we provide additional results based on GSPO (Zheng et al., 2025b) in Appendix D.1.
5.2 Main Results
Table 1 presents performance on WebShop and ALFWorld. Direct prompting yields limited success, even for strong proprietary models, while open-source backbones remain substantially weaker under zero-shot inference. (1) Instruction tuning improves reward modeling but fails to produce reliable task completion, highlighting the limitations of imitation learning. (2) RL substantially enhances performance. Among single-turn baselines, GRPO clearly outperforms PPO, confirming the importance of structured policy optimization for stabilizing training. Multi-turn methods further improve success rates, demonstrating the necessity of explicit long-horizon credit assignment. (3) As shown in Table 1, T²PO achieves the best performance across all metrics, reaching success rates of 81.64 with Qwen3-4B-RFT and 82.42 with Qwen3-8B-RFT on WebShop, and delivering consistent gains of roughly 8–12 points over the prior SOTA on ALFWorld. Moreover, T²PO exhibits substantially reduced variance across runs, indicating improved training stability without introducing additional model parameters or environment-specific heuristics.
Table 2 reports results on single-hop and multi-hop QA benchmarks. Our method consistently achieves top performance across single-hop datasets, indicating improved evidence retrieval and grounding. On multi-hop QA, we observe pronounced gains on challenging out-of-domain datasets, particularly on MuSiQue, where our approach more than doubles prior best performance. Strong results on 2Wiki and Bamboogle further confirm robust multi-step reasoning and generalization.
Figure 5: We evaluate both task performance and exploration efficiency. (a) shows that T²PO enables performance to improve steadily without collapse on three different environment seeds. In (b), the bar chart shows that the token consumption of successful trajectories generated by T²PO is substantially lower than that of the SOTA baseline, while the line plot indicates that T²PO's exploration efficiency on successful trajectories is consistently higher. (c) further demonstrates at the turn level that T²PO achieves task completion with roughly 25% fewer interaction turns during training.
5.3 Ablation on Key Modules
Table 3 presents an ablation analysis quantifying the contribution of each core component of T²PO.
Table 3: Ablation study of key modules on WebShop.

| Key Modules | Task Score | Success Rate |
|---|---|---|
| w/o Rejective Fine-tuning | 79.28 | 61.32 |
| w/o Token-level Thinking Intervention | 81.28 | 73.27 |
| w/o Turn-level Dynamical Sampling | 72.40 | 63.67 |
| T²PO | 93.84 | 81.64 |
No RFT cold-start. RFT on self-distilled data is responsible for filtering malformed or low-quality actions during early policy optimization. Without this module, the model exhibits noticeable degradation in both task score and success rate, indicating that rejective fine-tuning plays a critical role in stabilizing training and preventing error propagation in downstream rollouts.
Eliminating the TTI. This forces the model to rely on unconstrained reasoning lengths, leading to redundant low-information tokens and inflated trajectory variance. The result is a clear drop in success rate, confirming that adaptive termination based on predictive stability effectively improves exploration efficiency and reduces unnecessary computation without sacrificing reasoning quality.
Removing the TDS. TDS is designed to suppress redundant cross-turn reasoning patterns. Without it, the agent frequently repeats semantically similar reasoning traces across dialogue turns, reducing interaction diversity and limiting effective exploration. Consequently, both task score and success rate deteriorate, demonstrating that turn-level regeneration is essential for maintaining trajectory-level diversity in multi-turn environments.
Others. We further investigate the sensitivity of the self-calibration coefficient $\alpha$, the tolerance thresholds $\varepsilon$ and $\eta$, and the window size $N$, and analyze how varying the maximum response length influences output length and training stability, along with additional ablations on other tasks. The corresponding results are reported in Appendix D.
5.4 Ablation on Other Thinking Control Methods
Table 4: Ablation of alternative thinking-control methods on WebShop with Qwen3-4B-RFT.

| Method | Task Score | Success Rate |
|---|---|---|
| Lengthy reward | 77.96 | 65.87 |
| Short CoT cold-start | 82.39 | 71.29 |
| Void turn filtering | 85.17 | 76.20 |
| Hard thinking budget | 84.96 | 79.21 |
| Ours (TTI + TDS) | 93.84 | 81.64 |
Beyond our hierarchical uncertainty-guided control, we compare T²PO with representative thinking-control strategies, with results shown in Table 4: lengthy reward (Liu et al., 2025b), short CoT cold-start (Cai et al., 2025), hard thinking budget (Comanici et al., 2025), and void turn filtering (Xue et al., 2025). Details of each control method are provided in Appendix C.
(1) The lengthy reward explicitly biases the policy toward shorter generations by penalizing long responses among correct outputs and long incorrect ones, but this global heuristic introduces a rigid preference that does not adapt to task difficulty or per-token predictive stability. As a result, it suppresses both redundant and informative reasoning indiscriminately, yielding only limited performance gains. (2) Short CoT cold-start with data distilled from GPT-4 (Achiam et al. , 2023 ) improves early training stability by initializing the policy with concise teacher demonstrations, yet it does not actively regulate reasoning during RL rollouts; consequently, the model gradually drifts toward repetitive or excessively long reasoning patterns as exploration proceeds. (3) Hard thinking budget imposes a fixed cap on reasoning length. Nevertheless, its static constraint cannot adapt to per-turn uncertainty or task complexity, leading to premature truncation of useful reasoning in difficult cases and insufficient suppression of redundant exploration in simpler ones. (4) Void turn filtering removes trajectories containing invalid or empty actions, preventing trivial degenerate behaviors, but fails to address redundancy among semantically similar valid turns and therefore offers only marginal improvement.
5.5 Analysis of Exploration Efficiency
Figure 5(a) shows that the baseline exhibits early performance improvement but later suffers from instability and partial collapse, whereas T²PO achieves steady monotonic improvement throughout training. This indicates that adaptive thinking regulation stabilizes long-horizon multi-turn policy learning by preventing excessive low-information reasoning from dominating rollouts.
To directly measure token-level exploration efficiency, Figure 5(b) reports the proportion of successful trajectories as a function of the generated token budget. T²PO consistently produces a higher fraction of successful reasoning trajectories under the same token budget. In particular, the baseline wastes a substantial portion of tokens on redundant continuation beyond the effective reasoning boundary, while T²PO truncates low-utility reasoning once predictive distributions stabilize. At the turn level, Figure 5(c) reports the average number of interaction turns required to complete a task. The baseline agent frequently enters repetitive reasoning loops across turns, leading to longer trajectories and inefficient exploration. In contrast, T²PO detects redundant turn-level reasoning states and triggers regeneration only when necessary, thereby reducing repeated low-information interactions.
5.6 Case Study
A detailed trajectory of the interaction between the agent and the environment is provided in Appendix G.
6 Conclusion
By explicitly regulating reasoning at both token and turn levels using intrinsic signals, T²PO effectively suppresses low-information actions and mitigates training collapse without relying on additional reward shaping. Extensive experiments demonstrate that T²PO consistently improves training stability, exploration efficiency, and task performance.
Acknowledgment
This work was partially supported by NSF 2211557, NSF 2303037, NSF 2312501, NSF 2531008, SRC JUMP 2.0 Center, UCLA CDSC Center, Amazon Research Awards, Snapchat, and Google Gifts. We also gratefully acknowledge Amazon for its sponsorship and support of this work.
Impact Statement
This work advances the understanding of instability in multi-turn reinforcement learning for reasoning-oriented language models. By identifying inefficient exploration as a fundamental cause of training collapse and proposing principled token- and turn-level uncertainty control, our method provides a general framework for stabilizing agentic RL training. We expect this approach to facilitate scalable and reproducible training of interactive LLM agents, enabling broader deployment in complex decision-making applications.
References
J. Achiam, S. Adler, S. Agarwal, et al. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Anthropic (2025). Claude Sonnet 4. Large language model.
H. J. Cai, J. Wang, X. Chen, and B. Dhingra (2025). How much backtracking is enough? Exploring the interplay of SFT and RL in enhancing LLM reasoning. arXiv preprint arXiv:2505.24273.
M. Chen, G. Chen, W. Wang, and Y. Yang (2025). SEED-GRPO: Semantic entropy enhanced GRPO for uncertainty-aware policy optimization. arXiv preprint arXiv:2505.12346.
G. Comanici, E. Bieber, M. Schaekermann, et al. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
G. Dong, L. Bao, Z. Wang, et al. (2025). Agentic entropy-balanced policy optimization. arXiv preprint arXiv:2510.14545.
L. Feng, Z. Xue, T. Liu, and B. An (2025). Group-in-group policy optimization for LLM agent training. arXiv preprint arXiv:2505.10978.
B. Fu, Z. Cao, M. Long, and J. Wang (2020). Learning to detect open classes for universal domain adaptation. In European Conference on Computer Vision, pp. 567–583.
W. Fu, J. Gao, X. Shen, et al. (2025a). AReaL: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298.
Y. Fu, X. Wang, Y. Tian, and J. Zhao (2025b). Deep think with confidence. arXiv preprint arXiv:2508.15260.
X. Ho, A. D. Nguyen, S. Sugawara, and A. Aizawa (2020). Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060.
B. Jin, H. Zeng, Z. Yue, et al. (2025). Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.
M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
T. Kwiatkowski, J. Palomaki, O. Redfield, et al. (2019). Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 453–466.
W. Kwon, Z. Li, S. Zhuang, et al. (2023). Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
J. Lin (2002). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37(1), pp. 145–151.
A. Liu, B. Feng, B. Xue, et al. (2024). DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
L. Liu, F. Yao, D. Zhang, C. Dong, J. Shang, and J. Gao (2025a). FlashRL: 8-bit rollouts, full power RL.
W. Liu, R. Zhou, Y. Deng, et al. (2025b). Learn to reason efficiently with adaptive length-based reward shaping. arXiv preprint arXiv:2505.15612.
A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi (2023). When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9802–9822.
K. Mehlhorn, B. R. Newell, P. M. Todd, et al. (2015). Unpacking the exploration–exploitation tradeoff: A synthesis of human and animal literatures. Decision 2(3), pp. 191.
O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis (2023). Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
N. Shang, Y. Liu, Y. Zhu, et al. (2025). rStar2-Agent: Agentic reasoning technical report. arXiv preprint arXiv:2508.20722.
Z. Shao, P. Wang, Q. Zhu, et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
G. Sheng, C. Zhang, Z. Ye, et al. (2024). HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.
M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. Hausknecht (2020). ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.
H. Sun, Z. Qiao, J. Guo, et al. (2025). ZeroSearch: Incentivize the search capability of LLMs without searching. arXiv preprint arXiv:2505.04588.
K. Team, A. Du, B. Gao, et al. (2025). Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599.
Q. Team (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal (2021). MuSiQue: Multihop questions via single-hop question composition. arXiv preprint arXiv:2108.00573.
H. Wang, J. Chang, Y. Zhai, et al. (2024). Lion: Implicit vision prompt tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 5372–5380.
H. Wang, X. Yang, J. Chang, et al. (2023). Parameter-efficient tuning of large-scale multimodal foundation model. Advances in Neural Information Processing Systems 36, pp. 15752–15774.
J. Wang, J. Liu, Y. Fu, et al. (2025a). Harnessing uncertainty: Entropy-modulated policy gradients for long-horizon LLM agents. arXiv preprint arXiv:2509.09265.
X. Wang, H. Zhang, H. Wang, et al. (2026). ARLArena: A unified framework for stable agentic reinforcement learning. arXiv preprint arXiv:2602.21534.
Z. Wang, K. Wang, Q. Wang, et al. (2025b). RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073.
Z. Wang, X. Zheng, K. An, et al. (2025c). StepSearch: Igniting LLMs search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107.
Z. Wei, W. Yao, Y. Liu, et al. (2025). WebAgent-R1: Training web agents via end-to-end multi-turn reinforcement learning. arXiv preprint arXiv:2505.16421.
Z. Xue, L. Zheng, Q. Liu, et al. (2025). SimpleTIR: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479.
Z. Yang, P. Qi, S. Zhang, et al. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380.
S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022). WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems 35, pp. 20744–20757.
Q. Yu, Z. Zhang, R. Zhu, et al. (2025). DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
C. Zheng, K. Dang, B. Yu, et al. (2025a). Stabilizing reinforcement learning with LLMs: Formulation and practices. arXiv preprint arXiv:2512.01374.
C. Zheng, S. Liu, M. Li, et al. (2025b). Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
Y. Zhou, A. Zanette, J. Pan, S. Levine, and A. Kumar (2024). ArCHer: Training language model agents via hierarchical multi-turn RL. arXiv preprint arXiv:2402.19446.
APPENDIX
Appendix A More Task Details
A.1 Evaluation Metrics
A.1.1 WebShop
We adopt six complementary evaluation metrics to comprehensively assess task completion quality. (1) Task Score is defined as $10\times$ the average reward, measuring the average accumulated reward per episode. (2) Success Rate is the proportion of episodes with terminal reward $r=1$. Notably, an episode may achieve $r=1$ even if the final selected product does not exactly match the annotated target $y^{*}$, because multiple products may satisfy the same user instruction. For instance, several products can fulfill the request "I want a red shirt", although the instruction was generated from a particular reference item. (3–6) Title Score, reward_type, reward_attribute, and reward_option evaluate fine-grained aspects of decision quality, measuring respectively correct product title matching, category consistency, attribute satisfaction, and option-field matching.
Each natural language instruction $u\in\mathcal{U}$ is constructed by human annotators based on a target product $y^{*}$. It consists of three components: a non-empty attribute set $U_{\text{att}}$, an option field–value set $U_{\text{opt}}$, and a price constraint $u_{\text{price}}$. Formally, $U_{\text{att}}\subseteq Y_{\text{att}}^{*}$ denotes a subset of the target product attributes, $U_{\text{opt}}\subseteq Y_{\text{opt}}^{*}$ denotes a subset of its option field–value pairs, and $u_{\text{price}}$ is set higher than the target product price $y_{\text{price}}^{*}$. This formulation enables lightweight and scalable data collection while preserving realistic user intent. At the end of each episode, the agent receives a terminal reward $r=\mathcal{R}(s_{T},a)$, where $a=\texttt{choose[buy]}$, $y$ is the product selected in the final state $s_{T}$, and $Y_{\text{att}}$, $Y_{\text{opt}}$, and $y_{\text{price}}$ denote the attributes, options, and price of $y$. The reward is defined as:

$$r=r_{\text{type}}\cdot\frac{|U_{\text{att}}\cap Y_{\text{att}}|+|U_{\text{opt}}\cap Y_{\text{opt}}|+\mathbf{1}[y_{\text{price}}\leq u_{\text{price}}]}{|U_{\text{att}}|+|U_{\text{opt}}|+1},$$
(9)

where the type reward $r_{\text{type}}=\texttt{TextMatch}(\bar{y},\bar{y}^{*})$ penalizes category mismatches between the predicted product $y$ and the target product $y^{*}$. Specifically, $r_{\text{type}}$ assigns a low score when $y$ and $y^{*}$ share similar attributes or options but belong to different product categories. For example, "butter" and "plant-based meat" may both exhibit attributes such as "cruelty-free" and "non-GMO", yet represent fundamentally different product types.
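A minimal sketch of Eq. (9), with the instruction and product represented as plain sets and `r_type` supplied by the benchmark's text matcher (the function signature is hypothetical):

```python
def webshop_reward(u_att: set, u_opt: set, u_price: float,
                   y_att: set, y_opt: set, y_price: float,
                   r_type: float) -> float:
    """Terminal WebShop reward: matched attributes/options plus a price check."""
    hits = len(u_att & y_att) + len(u_opt & y_opt) + int(y_price <= u_price)
    return r_type * hits / (len(u_att) + len(u_opt) + 1)
```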
A.1.2 ALFWorld
We follow the standard ALFWorld evaluation protocol (Shridhar et al., 2020). The benchmark contains 3,827 task instances spanning six categories of household activities: Pick & Place, Examine in Light, Clean & Place, Heat & Place, Cool & Place, and Pick Two & Place. Unless otherwise specified, we report overall performance by aggregating results across all six task categories. The primary metric is Success Rate, the fraction of task instances for which the agent successfully completes the goal. A task is considered successful if the final environment state satisfies all goal conditions specified by the instruction. The overall success rate is computed by averaging over all evaluation episodes pooled from the six task categories.
A.1.3 Search QA
We evaluate search-augmented reasoning performance using Exact Match (EM) as the primary metric. After multi-turn reasoning interleaved with search engine interactions, the model outputs a final answer enclosed by answer tags (e.g., <answer> and </answer>). The predicted answer $a_{\mathrm{pred}}$ is extracted and compared with the ground-truth answer $a_{\mathrm{gold}}$ via exact string matching:

$$r_{\phi}(x,y)=\mathrm{EM}(a_{\mathrm{pred}},a_{\mathrm{gold}}).$$
(10)
This EM score serves both as the outcome-based reward for reinforcement learning and as the final evaluation metric during testing. By relying solely on outcome supervision, this metric directly reflects the model’s ability to formulate effective search queries, retrieve relevant external knowledge, and integrate retrieved evidence into multi-step reasoning before producing the correct answer. We report EM across seven benchmark datasets covering both general and multi-hop question answering tasks, following standard evaluation protocols.
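A minimal sketch of the EM computation; the answer-tag names and the lowercase/strip normalization are common conventions assumed here, not necessarily the authors' exact protocol.

```python
import re

def em_reward(response: str, gold: str) -> float:
    """Extract the tagged answer and compare it to the gold string exactly."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    pred = (m.group(1) if m else "").strip().lower()
    return float(pred == gold.strip().lower())
```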
A.2
Hyper-parameter Setting
To ensure a fair and controlled comparison across different agentic RL benchmarks, we adopt a unified hyperparameter design principle. All tasks share the same model family, optimization strategy, and rollout framework, while only task-specific parameters are adjusted to match environment characteristics. This standardized configuration eliminates confounding factors from inconsistent training setups, allowing performance differences to be attributed to algorithmic behaviors rather than implementation variance. Accordingly, the hyperparameters in Table 5 are selected to balance training stability, computational efficiency, and reproducibility across tasks.
Table 5: Key training hyperparameters for Agentic RL experiments.

| Category | WebShop | ALFWorld | Search |
|---|---|---|---|
| **Model and Environment** | | | |
| Base model | Qwen3-RFT | Qwen3-RFT | Qwen3 |
| Max interaction steps | 15 | 50 | 4 |
| Memory context window | 2 | 2 | 4 |
| Group rollout size | 8 | 8 | 5 |
| Similarity threshold | 0.9 | | |
| Max prompt length | 4096 | 2048 | 4096 |
| Max response length | 500 | 500 | 500 |
| Thinking budget | 450 | 450 | 450 |
| **Optimization** | | | |
| Group normalization mode | mean_norm | mean_norm | mean_norm |
| Learning rate | $1\times 10^{-6}$ | $1\times 10^{-6}$ | $1\times 10^{-6}$ |
| Mini-batch size | 128 | 256 | 256 |
| Micro-batch size / GPU | 8 | 32 | 16 |
| KL coefficient | 0.01 | 0.01 | 0.01 |
| Format penalty coefficient | 0.1 | 0.1 | 0.1 |
| **Rollout and Sampling** | | | |
| Rollout engine | vLLM | vLLM | vLLM |
| Rollout mode | synchronous | synchronous | synchronous |
| Tensor parallel size | 1 | 1 | 1 |
| GPU memory utilization | 0.6 | 0.6 | 0.6 |
| Temperature (validation) | 0.6 | 0.6 | 0.6 |
| Top-$p$ (validation) | 0.95 | 0.95 | 0.95 |
| Top-$k$ (validation) | 20 | 20 | 20 |
| Monitoring window size | 20 | 15 | 20 |
| Token-level tolerance $\varepsilon$ | 1e-4 | 1e-4 | 1e-4 |
| Turn-level tolerance $\eta$ | 1e-3 | 1e-3 | 1e-3 |
| **Training Schedule** | | | |
| Train batch size | 128 | 128 | 256 |
| Validation batch size | 128 | 128 | 512 |
| Total epochs | 200 | 150 | 250 |
| **Hardware** | | | |
| GPUs | 8 × NVIDIA H100 | 8 × NVIDIA H100 | 8 × NVIDIA H100 |
| Nodes | 1 | 1 | 1 |
Appendix B
RL Training Techniques
B.1
Rejective Fine-tuning
A critical challenge in early-stage agentic RL training is the presence of malformed or low-quality action outputs, which introduce substantial noise into trajectory collection and destabilize subsequent policy optimization. To mitigate this issue, we employ a rejective fine-tuning (RFT) stage to initialize the policy with high-quality behavioral priors before RL, without introducing any external supervision or additional knowledge beyond environment feedback. Concretely, we first use the base Qwen3 model to perform multi-turn rollouts in the target environment under the same prompting and tool-calling format as in RL training to derive $\mathcal{D}_{\texttt{RFT}} = \{(h^{k}, a^{k})\}$ with $h^{k} = (s^{1}, a^{1}, s^{2}, a^{2}, \cdots, s^{k})$. Each generated trajectory is evaluated by the environment to obtain a scalar task reward. We then retain only trajectories whose final task score exceeds a predefined threshold, discarding low-quality or failed interactions. The remaining high-scoring trajectories are treated as supervised demonstration data, from which we extract state–action pairs and perform one epoch of supervised fine-tuning on the policy model $\pi_{\theta}$ as follows:
$$\mathcal{L}_{\texttt{RFT}} = -\mathbb{E}_{(h^{k}, a^{k}) \sim \mathcal{D}_{\texttt{RFT}}}\left[\log \pi_{\theta}(a^{k} \mid h^{k})\right] \tag{11}$$
This RFT stage equips the agent with a reliable initial policy that produces structurally valid actions and reasonable early reasoning patterns, thereby reducing malformed rollouts and improving training stability in subsequent multi-turn RL.
Notably, we find that RFT provides an effective cold start for agentic RL training. In particular, it significantly strengthens instruction-following ability, leading to more accurate output formatting. Moreover, the success rate of action outputs is substantially improved, as the initial action space is effectively narrowed under RFT initialization. At the same time, we observe that increasing the number of RFT epochs further reduces the RFT training loss. However, excessive RFT begins to degrade the base model’s intrinsic reasoning capability, which in turn hinders subsequent RL training. Based on this trade-off, we limit RFT to no more than five epochs in all experiments.
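The filtering-and-loss logic can be sketched as follows; `rollout_fn` and `logprob_fn` are hypothetical hooks onto the actual rollout and log-probability machinery, and the threshold value is a placeholder.

```python
from typing import Callable, List, Tuple

# Minimal sketch of the RFT data filter and loss (Section B.1, Eq. 11).

Trajectory = List[Tuple[str, str]]  # (history h^k, action a^k) pairs

def build_rft_dataset(rollout_fn: Callable[[], Tuple[Trajectory, float]],
                      n_rollouts: int,
                      score_threshold: float) -> Trajectory:
    dataset: Trajectory = []
    for _ in range(n_rollouts):
        trajectory, score = rollout_fn()   # env-evaluated terminal score
        if score >= score_threshold:       # reject low-quality rollouts
            dataset.extend(trajectory)
    return dataset

def rft_loss(logprob_fn: Callable[[str, str], float],
             batch: Trajectory) -> float:
    # L_RFT = -E_{(h,a)~D_RFT}[ log pi_theta(a | h) ]
    return -sum(logprob_fn(h, a) for h, a in batch) / max(len(batch), 1)
```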
B.2
Format Penalty
The agent is required to produce actions in a structured format: a reasoning segment wrapped in a dedicated pair of reasoning tags, followed by an executable action segment wrapped in a pair of action tags.
However, during early-stage RL training, LLM agents frequently generate malformed outputs, such as missing tags, duplicated tags, or interleaved natural language, which leads to invalid environment interactions and noisy supervision. To ensure consistent environment interfacing and to construct reliable rejection signals for rejective fine-tuning, we apply a format-constrained action projection operator. Given a batch of raw model outputs $\{a^{k}_{i}\}_{i=1}^{B}$ at turn $k$, we define a strict format validator:
$$\mathcal{V}_{\mathrm{strict}}(a^{k}_{i}) = \begin{cases} 1, & a^{k}_{i} \text{ matches the full reasoning-tag/action-tag pattern}, \\ 0, & \text{otherwise}, \end{cases} \tag{12}$$
where the match additionally enforces exactly one opening and one closing tag for each field and excludes non-target-language characters. If $\mathcal{V}_{\mathrm{strict}}(a^{k}_{i}) = 1$, we extract the executable action token from the action field and mark the output as format-valid. If the strict constraint fails, we apply a relaxed parser that searches for the first occurrence of an action field:
$$\mathcal{V}_{\mathrm{relax}}(a^{k}_{i}) = \begin{cases} 1, & \exists \text{ an action field} \subset a^{k}_{i}, \\ 0, & \text{otherwise}. \end{cases} \tag{13}$$
When $\mathcal{V}_{\mathrm{relax}}(a^{k}_{i}) = 1$, we still extract the corresponding action token but assign a format-invalid flag. Otherwise, the output is marked as invalid and a fallback placeholder is stored.
To explicitly discourage malformed generations during RL training, we introduce a format-based penalty into the environment reward. Specifically, if an output fails the strict format constraint, i.e., $\mathcal{V}_{\mathrm{strict}}(a^{k}_{i}) = 0$, we subtract a fixed penalty from the final task reward:
$$r_{i} \leftarrow r_{i} - \lambda_{\mathrm{fmt}}, \quad \text{where } \lambda_{\mathrm{fmt}} = 0.1.$$
This lightweight penalty provides a direct training signal that suppresses malformed thinking–action outputs while avoiding additional external supervision.
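One way to realize this projection operator is sketched below; the literal `<think>`/`<action>` tag names are placeholders for the paper's actual tag tokens, and the regexes only approximate the strict pattern described above.

```python
import re

# Sketch of the strict/relaxed format validation (Eqs. 12-13) and the
# format penalty of Section B.2. Tag names are assumed placeholders.

STRICT = re.compile(r"<think>.*</think>\s*<action>.*</action>\s*$", re.DOTALL)
ACTION = re.compile(r"<action>(.*?)</action>", re.DOTALL)

def strict_valid(output: str) -> bool:
    # Exactly one opening/closing tag per field, in canonical order.
    ok_counts = all(output.count(t) == 1 for t in
                    ("<think>", "</think>", "<action>", "</action>"))
    return ok_counts and STRICT.search(output) is not None

def project_action(output: str, reward: float, lam_fmt: float = 0.1):
    """Return (action, format_valid, adjusted_reward) for one raw output."""
    match = ACTION.search(output)
    if strict_valid(output):
        return match.group(1).strip(), True, reward
    if match is not None:                        # relaxed parse (Eq. 13)
        return match.group(1).strip(), False, reward - lam_fmt
    return "<invalid>", False, reward - lam_fmt  # fallback placeholder
```

In training, `project_action` would be applied to each raw output before the environment step, with the format-invalid flag doubling as a rejection signal for RFT.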
B.3
Trajectory Decomposition
Since we split $\tau = \{(\mathbf{s}^{1}, \mathbf{a}^{1}, r^{1}), (\mathbf{s}^{2}, \mathbf{a}^{2}, r^{2}), \ldots, (\mathbf{s}^{K}, \mathbf{a}^{K}, r^{K})\}$ into $K$ single turns for policy optimization, it inevitably introduces off-policy staleness. Specifically, in our training pipeline built upon verl (Sheng et al., 2024), rollouts and policy updates are executed in a pipelined fashion: while the learner updates the policy parameters $\theta_{\mu}$ at training step $\mu$, environment workers may still be generating new trajectories using a previous policy snapshot $\theta_{\mu-\delta}$. As a result, the collected turn transitions $(\mathbf{s}^{k}, \mathbf{a}^{k}, r^{k})$ are not strictly on-policy with respect to the latest parameters, yielding a non-negligible policy lag.
To quantify this effect, suppose the rollout mini-batch size is $\mathcal{B}_{\texttt{rollout}}$, the update micro-batch size is $\mathcal{B}_{\texttt{update}}$, the prompt group size is $n$, and the average turn consumption per trajectory is $\hat{K}_{\max}$ (noting that each trajectory may contain a different number of turns). During training, environment workers continuously generate turn-level transitions, while the learner performs parameter updates in micro-batches. Consequently, before a newly updated policy is broadcast to rollout workers, a number of turn-level samples may already have been generated using stale policy snapshots.
We approximate the expected policy lag, measured in learner update steps, as:
$$\delta \approx \frac{\mathcal{B}_{\texttt{rollout}} \cdot n \cdot \hat{K}_{\max}}{\mathcal{B}_{\texttt{update}}}, \tag{14}$$
which reflects the number of learner updates that can be executed while a batch of rollouts is being collected. Based on this, we define the staleness ratio as $\rho_{\mathrm{stale}} = \frac{\delta}{1+\delta}$, which characterizes the fraction of samples generated under outdated policy parameters relative to the total effective update volume. A larger rollout batch size or longer average turn horizon increases $\rho_{\mathrm{stale}}$, whereas larger update micro-batches or higher prompt group parallelism reduce staleness. This ratio therefore provides a concise measure of off-policy deviation induced by turn-level trajectory decomposition in pipelined training.
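As a quick sanity check, the snippet below evaluates Equation (14) and the staleness ratio; the average-turn value is illustrative, while the batch sizes mirror the WebShop column of Table 5.

```python
# Minimal sketch of the policy-lag estimate (Eq. 14) and staleness ratio.

def policy_lag(b_rollout: int, n_group: int, avg_turns: float,
               b_update: int) -> float:
    return b_rollout * n_group * avg_turns / b_update

def staleness_ratio(delta: float) -> float:
    return delta / (1.0 + delta)

# Illustrative WebShop-like configuration (avg_turns is an assumption).
delta = policy_lag(b_rollout=8, n_group=8, avg_turns=10.0, b_update=8)
print(delta, staleness_ratio(delta))  # larger rollouts/horizons -> more stale
```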
Advantages. It enables scalable multi-turn RL by decoupling environment interaction from policy optimization, thereby significantly improving hardware utilization and training throughput. The decomposition further allows fine-grained control over generation, facilitating dynamic intervention mechanisms such as turn-wise resampling and uncertainty-based stopping. Together, these design choices make it well-suited for long-horizon interactive tasks with large language models.
Limitations. The pipelined execution inevitably introduces off-policy staleness, as turn-level samples may be generated under outdated policy snapshots before updated parameters are synchronized across workers. As quantified by the staleness ratio $\rho_{\mathrm{stale}}$, longer interaction horizons and larger rollout batches amplify this effect, potentially increasing importance weight variance and destabilizing policy optimization. To examine whether off-policy staleness has a significant impact on performance, we fix $\mathcal{B}_{\texttt{update}}$ and vary $\mathcal{B}_{\texttt{rollout}}$ and $n$. Table 6 shows that training remains largely stable under these configurations, indicating that off-policy staleness does not substantially degrade performance in practice.
Table 6: Effect of Off-Policy Staleness under Different Rollout and Group Settings on WebShop.
| Rollout Batch Size $\mathcal{B}_{\texttt{rollout}}$ | Prompt Group Size $n$ | Task Score | Success Rate (%) |
|---|---|---|---|
| 8 | 8 | 92.85 | 80.97 |
| 8 | 16 | 93.73 | 81.35 |
| 16 | 8 | 93.84 | 81.64 |
B.4
Policy Update Details
Our policy update is mainly based on GiGPO (Feng et al., 2025). We also provide the details of the turn-relative advantage ($A^{\text{turn}}$) computation as follows. Given a rollout group of $G$ trajectories generated for the same task instance, they first enumerate all environment states encountered across all time steps and trajectories, and identify the set of unique states $\mathcal{U}$. Each unique state $\tilde{\mathbf{s}} \in \mathcal{U}$ serves as an anchor state, around which they gather all occurrences of that state from different trajectories and time steps, forming a turn-level group $G^{S}(\tilde{\mathbf{s}})$. Importantly, this grouping is performed entirely offline via key-based state matching, introducing no additional environment interaction or LLM inference overhead.
For each tuple $(\mathbf{a}^{k}_{i}, r^{k}_{i})$ in a turn-level group, they compute the discounted return to capture the long-term consequence of the corresponding action. They then normalize these returns within each group to obtain the turn-relative advantage $A^{\text{turn}}$, which measures how well an action performs compared to other actions taken from the same state. This normalization ensures that positive advantages correspond to above-average decisions, while negative values indicate sub-optimal choices under identical state conditions.
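To make the grouping concrete, here is a minimal Python sketch of this turn-relative advantage computation; the `(traj_id, step, state_key, rewards_to_go)` encoding, the discount factor, and the mean/std normalization are our assumptions rather than GiGPO's exact formulation.

```python
from collections import defaultdict
import statistics

# Sketch of anchor-state grouping and within-group return normalization.

def turn_advantages(turns, gamma: float = 0.95):
    """turns: list of (traj_id, step, state_key, rewards_to_go), where
    rewards_to_go is the list [r^k, r^{k+1}, ...] for that trajectory."""
    returns, groups = {}, defaultdict(list)
    for i, (_, _, key, rew) in enumerate(turns):
        g = sum(r * gamma**j for j, r in enumerate(rew))  # discounted return
        returns[i] = g
        groups[key].append(i)                             # anchor-state group
    adv = {}
    for idxs in groups.values():
        vals = [returns[i] for i in idxs]
        mu = statistics.mean(vals)
        sd = statistics.pstdev(vals) or 1.0               # avoid div-by-zero
        for i in idxs:
            adv[i] = (returns[i] - mu) / sd               # A^turn per tuple
    return adv
```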
Appendix C
More Related Work
Since the core contribution of our method lies in providing explicit and fine-grained thinking control for multi-turn RL, we also consider a range of established techniques originally proposed for single-turn settings that are closely related in spirit. Therefore, in this section, we present a detailed discussion of lengthy reward (Liu et al. , 2025b ) , short-CoT cold start (Cai et al. , 2025 ) , hard thinking budget (Comanici et al. , 2025 ) , and void turn filtering (Xue et al. , 2025 ) .
C.1
Lengthy Reward
Over-thinking is also a long-standing challenge in single-turn reinforcement learning for reasoning models. To explicitly regulate excessive reasoning length, Liu et al. (2025b) summarize the existing lengthy reward that incorporates response length into the reward design. Concretely, given a problem $x$ with ground-truth answer $y^{*}$, suppose a group of responses $\{(y_{i}, z_{i})\}_{i=1}^{k}$ is sampled, where $z_{i}$ denotes the reasoning trace and $\mathrm{len}(i)$ is the length of $(y_{i}, z_{i})$. Let $\mathrm{min\_len} = \min_{i} \mathrm{len}(i)$ and $\mathrm{max\_len} = \max_{i} \mathrm{len}(i)$. If $\mathrm{max\_len} = \mathrm{min\_len}$, the length reward is set to zero for all responses since they share identical lengths. Otherwise, the length reward for the $i$-th response is defined as:
$$\mathrm{len\_reward}(i) = \begin{cases} \lambda, & \text{if } r(x, y_{i}, y^{*}) = 1, \\ \min(0, \lambda), & \text{if } r(x, y_{i}, y^{*}) = 0, \end{cases} \quad \text{where } \lambda = 0.5 - \frac{\mathrm{len}(i) - \mathrm{min\_len}}{\mathrm{max\_len} - \mathrm{min\_len}}. \tag{15}$$
Intuitively, this formulation encourages shorter correct responses while penalizing longer ones among correct outputs, and explicitly penalizes long responses with incorrect answers. The resulting length-based reward is added to the original task reward with a weighting coefficient, providing direct control over the trade-off between reasoning length and task performance.
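A direct transcription of Equation (15) is sketched below; the inputs encode one sampled response group, and the example call is illustrative.

```python
# Minimal sketch of the lengthy reward (Eq. 15) for one response group.

def length_rewards(lengths, correct):
    lo, hi = min(lengths), max(lengths)
    if hi == lo:                        # identical lengths -> zero reward
        return [0.0] * len(lengths)
    rewards = []
    for length, ok in zip(lengths, correct):
        lam = 0.5 - (length - lo) / (hi - lo)
        rewards.append(lam if ok else min(0.0, lam))
    return rewards

# Shortest correct response gets +0.5; long incorrect ones are penalized.
print(length_rewards([120, 300, 540], [True, True, False]))
```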
C.2
Short-CoT Cold Start
Recent evidence (Cai et al. , 2025 ) suggests that RL in reasoning models does not primarily benefit from memorizing correct solution trajectories, but rather from internalizing structured search behaviors embedded in demonstration traces. This is also why parameter-efficient tuning methods, such as Wang et al. ( 2023 , 2024 ) , can work. In particular, backtracking, where the model explicitly revises earlier decisions, has been identified as a crucial structural prior that enables RL to discover effective multi-step exploration strategies. However, constructing high-quality long reasoning traces with appropriate backtracking is costly and task-dependent.
Motivated by this insight, we adopt short-CoT cold start as a lightweight mechanism for agentic RL. Instead of providing full-length expert trajectories, short SFT initializes the model with concise reasoning patterns distilled from a more powerful LLM (e.g., GPT-4o) and with basic instruction-following capability, ensuring valid output formatting and a reduced effective action space. This initialization equips the policy with a minimal but consistent interaction protocol, from which RL can reliably amplify and refine latent search behaviors during multi-turn environment interactions.
C.3
Hard Thinking Budget
Google Gemini 2.5 models (Comanici et al., 2025) provide a dedicated thinking phase designed to improve reasoning and planning in complex tasks. This phase is controlled through a thinking budget parameter, which specifies the maximum number of tokens allocated to internal deliberation before the model produces its final response. According to official Gemini and Vertex AI documentation, the thinking process is architecturally separated from the main response generation stage, and users may set an upper bound on its token budget. A special value of $-1$ allows the model to dynamically determine its own budget, while $0$ disables explicit thinking for lightweight variants. Each model family further enforces valid minimum and maximum ranges (e.g., 128–32k tokens for Gemini 2.5 Pro).
C.4
Void-Turn Filtering
Void turn filtering (Xue et al. , 2025 ) is a stabilization strategy for thinking control designed to improve the robustness of multi-turn policy optimization. In multi-turn reasoning, the accumulation of low-probability tokens and high sampling stochasticity often produces void turns, i.e. , responses that contain neither a valid final answer nor a complete executable structure. Typical void turns manifest as partial code fragments, repetitive text loops, or prematurely terminated outputs caused by early sampling of the end-of-sequence token. Void turn filtering addresses this issue by excluding trajectories containing such invalid turns from policy loss computation.
Appendix D
More Experimental Results
D.1
Ablation Study on other Policy Optimization Algorithm
In fact, our method is plug-and-play and can be readily integrated with other policy update schemes. On WebShop, we further replace the base policy optimization with GSPO (Zheng et al., 2025b). The success rate increases from 85.18 to 91.79 after applying our TTI + TDS, corresponding to a relative improvement of 7.76%.
Figure 6: Average output length during training under different maximum response length settings for (a) GiGPO and (b) T 2 PO, and the proportion of truncated outputs under the same settings for (c) GiGPO and (d) T 2 PO.
D.2
Ablation Study on Token-level Response Length
Figure 6 presents how the policy model's (i.e., Qwen3-4B) output length evolves during training under different pre-specified maximum response lengths (i.e., \texttt{data.max\_response\_length} in verl (Sheng et al., 2024)). We draw the following conclusions from the observation:
(1) Longer output length is not always better in multi-turn RL. From Figures 6(a) and 6(b), when the maximum response length increases from 500 to 700, the final model's average output length remains nearly unchanged. This indicates that for interaction-driven environments, excessive token budgets are often unnecessary. Conversely, when the maximum length is too small (e.g., 300), Figures 6(c) and 6(d) show that a large fraction of trajectories are still clipped, demonstrating that the token budget is insufficient.
(2) T 2 PO achieves higher token efficiency. Comparing Figures 6(a) and 6(b), under the final experimental setting with a 500-token limit, our method produces on average 20% fewer tokens than GiGPO. Meanwhile, Figures 6(c) and 6(d) show that in the last 50 training steps, our method rarely triggers maximum-length clipping, indicating that it avoids generating redundant or uninformative text and substantially mitigates over-thinking.
(3) T 2 PO more effectively stimulates meaningful interaction-driven reasoning. From Figures 6(a) and 6(b), our output length gradually increases during the first 20 training steps, reflecting progressively enhanced reasoning depth. In contrast, GiGPO's output length sharply decreases during the first 50 steps, suggesting that it quickly suppresses exploration by discarding many over-thinking trajectories. This observation is further supported by qualitative trajectory case studies.
D.3
Sensitivity Analysis on α \alpha
In this section, we conduct a sensitivity analysis on the fusion coefficient $\alpha$ in our self-calibrated uncertainty signal on WebShop, which balances entropy and confidence. We vary $\alpha$ over {0.2, 0.4, 0.6, 0.8} and observe that $\alpha = 0.4$ yields the best performance; we therefore set $\alpha$ to 0.4.
Table 7: Sensitivity analysis of the fusion coefficient $\alpha$ in the self-calibrated uncertainty signal.

| Metric | $\alpha=0.2$ | $\alpha=0.4$ | $\alpha=0.6$ | $\alpha=0.8$ |
|---|---|---|---|---|
| Success Rate | 90.73 | 94.15 | 93.76 | 93.55 |
| Task Score | 79.36 | 82.77 | 81.45 | 80.27 |
D.4
Efficiency Analysis on ALFWorld
We observe a similar phenomenon on ALFWorld. In Figure 7, the bar chart shows that the distribution of token consumption for successful trajectories generated by T 2 PO is substantially lower than that of the SOTA baselines, while the line plot indicates that the exploration efficiency of T 2 PO on successful trajectories remains consistently higher throughout training. Furthermore, the rightmost panel demonstrates at the turn level that T 2 PO completes tasks with approximately 16% fewer interaction turns during training.
Figure 7: Additional efficiency analysis on ALFWorld.
D.5
More Results on WebShop
To better understand where performance gains originate, we further report the decomposed reward metrics on WebShop in Table 8. Each reward component evaluates a distinct aspect of task completion, including correct product title identification (Title Score), accurate category matching (reward_type), attribute fulfillment (reward_attribute), and final option selection (reward_option).
We observe that prompting-based and instruction-tuned baselines exhibit limited performance on fine-grained reward components, indicating that pure supervised or in-context alignment is insufficient for robust multi-step decision making. Single-turn RL methods (PPO and GRPO) substantially improve all reward dimensions, confirming the benefit of reinforcement learning in aligning long-horizon behaviors. However, multi-turn RL baselines (GiGPO and GiGPO+DAPO) still present imbalanced reward distributions, particularly on Title Score and reward_option , suggesting that inefficient exploration and unstable credit assignment hinder consistent progress across interaction turns.
T 2 PO achieves the highest scores on all reward components under both backbones. Notably, gains are most pronounced on Title Score and reward_option , which require precise information acquisition and decisive action execution. This indicates that our uncertainty-aware optimization effectively suppresses low-information exploration, enabling the policy to focus on high-yield interaction trajectories and produce more reliable fine-grained decisions throughout multi-turn reasoning.
Table 8: Reward decomposition on WebShop.

| Method | Title Score | reward_type | reward_attribute | reward_option |
|---|---|---|---|---|
| **Prompting** | | | | |
| Claude Sonnet 4 | 0.3396 | 0.4775 | 0.4263 | 0.4309 |
| Qwen3-4B | 0.1437 | 0.2039 | 0.1756 | 0.0493 |
| Qwen3-32B | 0.1749 | 0.2283 | 0.2003 | 0.0684 |
| **Instruction Tuning** | | | | |
| Qwen3-4B + SFT | 64.58 | 0.8789 | 0.7996 | 0.5826 |
| **RL Training (Qwen3-4B-RFT)** | | | | |
| PPO | 31.65 ±11.45 | 33.92 ±9.64 | 29.24 ±7.19 | 28.76 ±6.99 |
| GRPO | 52.92 ±9.13 | 51.87 ±11.71 | 53.24 ±8.66 | 50.49 ±4.89 |
| GiGPO | 20.79 ±7.23 | 36.20 ±3.28 | 27.68 ±6.84 | 23.79 ±10.42 |
| GiGPO + DAPO | 54.07 ±9.76 | 67.20 ±12.45 | 54.71 ±8.64 | 52.09 ±7.69 |
| T 2 PO (Ours) | 65.61 ±9.19 | 67.58 ±13.67 | 60.89 ±14.68 | 57.49 ±14.12 |
| **RL Training (Qwen3-8B-RFT)** | | | | |
| GRPO | 53.87 ±6.97 | 50.95 ±8.45 | 54.67 ±8.53 | 50.08 ±9.13 |
| GiGPO | 54.98 ±7.45 | 54.13 ±4.75 | 58.12 ±9.87 | 56.28 ±5.67 |
| GiGPO + DAPO | 55.88 ±7.01 | 68.19 ±7.39 | 59.62 ±6.38 | 57.44 ±5.37 |
| T 2 PO (Ours) | 67.14 ±3.73 | 68.98 ±4.18 | 62.77 ±5.04 | 58.33 ±5.34 |
Algorithm 1 Token-Level Thinking Intervention (TTI)
Require: Policy model $\pi_{\theta}$, minimum prefix length $L_{\min}$, window size $N$, user prompt $q$, stability threshold $\varepsilon$, maximum thinking length $L_{\max}$.
Ensure: Generated action sequence $\mathbf{a}^{k}$ for turn $k$.
1: Initialize token index $t \leftarrow 1$
2: Initialize stop indicator $\mathbb{I}_{\mathrm{stop}} \leftarrow 0$
3: Initialize empty sequence $\mathbf{y} \leftarrow \emptyset$
4: while $t \leq L_{\max}$ do
5:  Sample next token: $y_{t} \sim \pi_{\theta}(\cdot \mid y_{<t}, q, \mathbf{s}^{k})$
6:  Compute self-calibrated uncertainty $M_{t}$ using Equation 3
7:  if $t > L_{\min}$ and $\mathbb{I}_{\mathrm{stop}} = 0$ then
8:   Monitor the temporal variation at token $t$ of turn $k$: $\Delta_{t}^{k} = |M_{t} - M_{t-1}|$
9:   if $\frac{1}{N+1} \sum_{i=0}^{N} \Delta_{t-i}^{k} < \varepsilon$ then
10:    Force-emit the reasoning terminator by overwriting the logits: $z_{t+1}(v) \leftarrow +\infty$ if $v$ is the reasoning-terminator token, $-\infty$ otherwise
11:    Append the deterministic structural queue $\mathcal{Q}$ (closing reasoning tag, newline, opening action tag)
12:    Set $\mathbb{I}_{\mathrm{stop}} \leftarrow 1$
13:    break
14:   end if
15:  end if
16:  $t \leftarrow t + 1$
17: end while
18: if $t > L_{\max}$ then
19:  Force-emit the reasoning terminator
20: end if
21: Decode $\mathbf{y}$ into action sequence $\mathbf{a}^{k}$
22: return $\mathbf{a}^{k}$
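A minimal Python rendering of this loop is sketched below; `sample_token` and `uncertainty` are hypothetical hooks for one decoding step and the self-calibrated signal $M_t$ of Equation 3, and the literal `</think>`/`<action>` strings stand in for the paper's actual tag tokens.

```python
from collections import deque

# Sketch of Algorithm 1 (Token-Level Thinking Intervention).

def generate_with_tti(sample_token, uncertainty, prompt, state,
                      l_min=32, l_max=450, window=20, eps=1e-4):
    tokens, prev_m = [], None
    deltas = deque(maxlen=window + 1)            # keeps the last N+1 deltas
    for t in range(1, l_max + 1):
        tok = sample_token(prompt, state, tokens)
        tokens.append(tok)
        m = uncertainty(prompt, state, tokens)   # M_t
        if prev_m is not None:
            deltas.append(abs(m - prev_m))       # Delta_t^k = |M_t - M_{t-1}|
        prev_m = m
        stable = (len(deltas) == deltas.maxlen and
                  sum(deltas) / len(deltas) < eps)  # windowed mean < eps
        if t > l_min and stable:
            tokens += ["</think>", "\n", "<action>"]  # forced structural queue
            break
    else:
        tokens.append("</think>")                # thinking budget exhausted
    return tokens
```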
Algorithm 2 Turn-Level Dynamical Sampling (TDS)
Require: Policy model $\pi_{\theta}$, user prompt $q$, environment $\mathcal{E}$, turn threshold $\eta$, resampling budget $B_{\max}$, maximum turns $K_{\max}$, target samples $N_{\text{target}}$.
Ensure: Collected trajectory set $\mathcal{D}$.
1: Initialize $\mathcal{D} \leftarrow \emptyset$
2: while $|\mathcal{D}| < N_{\text{target}}$ do
3:  Reset environment: $\mathbf{s}^{0} \sim \mathcal{E}.\texttt{reset}()$
4:  Initialize the turn-level observation signal $\Phi^{0} \leftarrow \texttt{null}$
5:  Initialize empty trajectory $\tau$
6:  for $k = 1$ to $K_{\max}$ do
7:   Set resampling counter $b \leftarrow 0$
8:   repeat
9:    Generate action under TTI (Definition 6): $\mathbf{a}^{k} \sim \pi_{\theta}(\cdot \mid \mathbf{s}^{k}, q)$
10:    Compute token-level uncertainty $M_{t}^{k}$ (Equation 3) and obtain the turn-level observation signal: $\Phi^{k} = \big(\prod_{t=1}^{T_{k}} M_{t}^{k}\big)^{1/T_{k}}$
11:    if $k > 1$ then
12:     Monitor temporal variation across turns: $\Gamma^{k} = |\Phi^{k} - \Phi^{k-1}|$
13:    else
14:     Set $\Gamma^{k} \leftarrow +\infty$
15:    end if
16:    $b \leftarrow b + 1$
17:   until $\Gamma^{k} \geq \eta$ or $b \geq B_{\max}$
18:   Parse and execute the action in the environment: $(\mathbf{s}^{k+1}, r^{k}) \leftarrow \mathcal{E}.\texttt{step}(\mathbf{a}^{k})$
19:   Store $(\mathbf{s}^{k}, \mathbf{a}^{k}, r^{k})$ into $\tau$
20:   if $\texttt{Is\_all\_done} = \texttt{True}$ then
21:    break
22:   end if
23:  end for
24:  Add trajectory $\tau$ to $\mathcal{D}$
25: end while
26: return $\mathcal{D}$
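The sketch below mirrors this turn-level loop in Python; `generate_turn` is a hypothetical hook that runs one TTI-controlled generation and returns the action plus its per-token uncertainties, and the gym-style `reset`/`step` interface is an assumption.

```python
import math

# Sketch of Algorithm 2 (Turn-Level Dynamical Sampling) for one episode.

def geometric_mean(ms):
    # Turn-level observation signal Phi^k: geometric mean of M_t^k.
    return math.exp(sum(math.log(max(m, 1e-12)) for m in ms) / len(ms))

def collect_with_tds(env, generate_turn, prompt,
                     eta=1e-3, b_max=3, k_max=15):
    state, trajectory, prev_phi = env.reset(), [], None
    for _ in range(k_max):
        for _ in range(b_max):                   # resample low-progress turns
            action, ms = generate_turn(prompt, state)
            phi = geometric_mean(ms)
            gamma = float("inf") if prev_phi is None else abs(phi - prev_phi)
            if gamma >= eta:                     # enough exploration progress
                break
        prev_phi = phi
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```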
Appendix E
Codebase
Building upon the existing codebase verl (Sheng et al. , 2024 ) , our codebase introduces targeted modifications to both the vLLM (Kwon et al. , 2023 ) inference engine and the agent interaction loop, enabling seamless integration with verl while preserving its scalability and modularity. Specifically, we redesign the decoding and rollout pipeline to support fine-grained uncertainty-aware control during generation, while maintaining full compatibility with the step-wise multi-turn training paradigm and memory management mechanisms provided by verl.
In addition, our implementation is framework-agnostic and naturally extends to asynchronous RL training, allowing non-blocking rollout collection and parameter updates under distributed settings. This design ensures that our approach can be directly deployed on top of verl with minimal engineering overhead, while remaining equally applicable to other large-scale async RL infrastructures for LLM agent training.
Appendix F
Algorithm Pseudo Code
Algorithms 1 and 2 summarize the proposed hierarchical exploration control mechanisms. Algorithm 1 presents the Token-Level Thinking Intervention, which dynamically monitors the evolution of the self-calibrated uncertainty signal M t M_{t} during token generation. Once the predictive distribution exhibits sustained stabilization according to Definition 6 , the decoding process is deterministically terminated by overwriting the logits to emit a reasoning terminator token, followed by a fixed structural queue that explicitly separates reasoning and action phases. This design adaptively suppresses low-information continuation while preserving necessary task-relevant reasoning content.
Algorithm 2 describes the Turn-Level Dynamical Sampling procedure, which operates on top of TTI during multi-turn interaction with the environment. For each conversational turn, the turn-level observation signal Φ k \Phi^{k} is computed by aggregating token-level uncertainty across the generated reasoning trajectory. If the variation Γ k \Gamma^{k} between consecutive turns falls below a tolerance threshold, the current turn is deemed insufficiently informative and is regenerated under the same environment state until a sufficiently distinct reasoning trajectory is obtained or a resampling budget is exhausted.
Appendix G
Case Study
Token-level over-thinking case. The following figure presents a representative failure case from vanilla GiGPO illustrating how excessive internal reasoning leads to action truncation in long-horizon interactive environments. The agent's state is composed of three structured components: (i) a task specification describing the target product constraints; (ii) a memory context summarizing recent observations and past actions; and (iii) a current observation listing search results together with a discrete set of admissible actions. At each step, the agent must produce a response consisting of a reasoning trace enclosed in reasoning tags, followed by a single executable command enclosed in action tags.
In this example, the reasoning trace grows disproportionately long as the agent attempts to reconcile contradictory attribute constraints (e.g., men's shirt vs. women's fit, fabric requirements, price thresholds, and an unavailable color). This induces verbose attribute checking and speculative hypothesis formation, even though none of the listed products match the query. As a result, the generated reasoning exceeds the system's output budget before the closing reasoning and action tags are produced. The missing termination tags render the response unparsable by the environment, causing an immediate interaction failure despite the correct next step being a simple pagination action.
This case highlights that the action space itself is compact and unambiguous, while the unconstrained reasoning channel becomes the dominant source of failure. It motivates the need for explicit reasoning-length control or early-exit mechanisms to prevent overthinking-induced truncation in multi-turn agentic decision pipelines.
A Failure Case Caused by Over-Thinking
System Prompt
System : You’re a helpful assistant.
User : You are browsing an online shop. Based on the instruction, buy a product that close to the production description. You need to search, read the search results, pick a product, choose the size and
color and buy. You should only choose action from the available actions list provided later.
Example process: I need a gingko light and 20x20 pillow cover that is hand painted.
First search[gingko light 20x20 pillow cover hand painted],
answer format: search[blanket with fleece throw] .
Valid answer is search[ ] or click[ ].
Turn 6
State :
================= User Prompt =================
"You are an expert autonomous agent operating in the WebShop e‑commerce environment.Your task is to: Find me wash cold, machine wash men’s shirts with polyester heathers, heathers cotton, cotton heather, needle sleeve, classic fit with color: brown, and fit type: women, and size: x-large, and price lower than 40.00 dollars.Prior to this step, you have already taken 5 step(s).
================= Memory Context =================
Below are the most recent 2 observations and the corresponding actions you took: [Observation 4: ’’Back to Search’ [SEP] ’Page 1 (Total results: 50)’ [SEP] ’Next >’ [SEP] ’B09P39QN2W’ [SEP] ’I Tackled 100 Day Of School Football Boy 100th Day School T-Shirt’ [SEP] ’$16.99’ [SEP] ’B09RVF4JP1’ [SEP] ’Once Upon A Time Queen Born In 1982 T Shirts It Was Me Tank Top’ [SEP] ’$19.99’ [SEP] ’B07JVVDJ6L’ [SEP] ’Canadian Tuxedo T Shirt - Novelty Denim Tux T-Shirt’ [SEP] ’$19.99’ [SEP] ’B07T3VMGDG’ [SEP] ’Tuxedo Cat 4th of July Hat Patriotic Gift Adults Kids Raglan Baseball Tee’ [SEP] ’$23.99’ [SEP] ’B07DKGJR74’ [SEP] ’Amazon Brand - Daily Ritual Women’s 100% Cotton Oversized Fit V-Neck Pullover Sweater’ [SEP] ’$20.66 to $29.2’ [SEP] ’B08HF132PN’ [SEP] ’Flamingo Pink waterbird costume Gift Premium T-Shirt’ [SEP] ’$19.99’ [SEP] ’B09P7H5YK7’ [SEP] ’It’s In My DNA Palestinian Shirt Arabic Gifts Palestine Flag Tank Top’ [SEP] ’$19.99’ [SEP] ’B07HRFSNL4’ [SEP] ’Nautica Men’s Solid Crew Neck Short-Sleeve Pocket T-Shirt’ [SEP] ’$16.05 to $40.98’ [SEP] ’B08226NDZW’ [SEP] ’Disney Winnie The Pooh Tigger Upside Down Portrait Tank Top’ [SEP] ’$25.99’ [SEP] ’B09Q8RD8YN’ [SEP] ’Bungo Stray Anime Dogs Anime Character, Long Sleeve, Sweatshirt, Hoodie, T shirt’ [SEP] ’$19.99’’, Action 4: ’on, but maybe that’s’]
[Observation 5: ⋯ \cdots ]
================= Current Observation =================
You are now at step 6 and your current observation is: ’Back to Search’ [SEP] ’Page 1 (Total results: 50)’ [SEP] ’Next >’ [SEP] ’B09P39QN2W’ [SEP] ’I Tackled 100 Day Of School Football Boy 100th Day School T-Shirt’ [SEP] ’$16.99’ [SEP] ’B09RVF4JP1’ [SEP] ’Once Upon A Time Queen Born In 1982 T Shirts It Was Me Tank Top’ [SEP] ’$19.99’ [SEP] ’B07JVVDJ6L’ [SEP] ’Canadian Tuxedo T Shirt - Novelty Denim Tux T-Shirt’ [SEP] ’$19.99’ [SEP] ’B07T3VMGDG’ [SEP] ’Tuxedo Cat 4th of July Hat Patriotic Gift Adults Kids Raglan Baseball Tee’ [SEP] ’$23.99’ [SEP] ’B07DKGJR74’ [SEP] ’Amazon Brand - Daily Ritual Women’s 100% Cotton Oversized Fit V-Neck Pullover Sweater’ [SEP] ’$20.66 to $29.2’ [SEP] ’B08HF132PN’ [SEP] ’Flamingo Pink waterbird costume Gift Premium T-Shirt’ [SEP] ’$19.99’ [SEP] ’B09P7H5YK7’ [SEP] ’It’s In My DNA Palestinian Shirt Arabic Gifts Palestine Flag Tank Top’ [SEP] ’$19.99’ [SEP] ’B07HRFSNL4’ [SEP] ’Nautica Men’s Solid Crew Neck Short-Sleeve Pocket T-Shirt’ [SEP] ’$16.05 to $40.98’ [SEP] ’B08226NDZW’ [SEP] ’Disney Winnie The Pooh Tigger Upside Down Portrait Tank Top’ [SEP] ’$25.99’ [SEP] ’B09Q8RD8YN’ [SEP] ’Bungo Stray Anime Dogs Anime Character, Long Sleeve, Sweatshirt, Hoodie, T shirt’ [SEP] ’$19.99’.Your admissible actions of the current situation are: ’click[back to search]’,’click[next >]’,’click[b09p39qn2w]’,’click[b09rvf4jp1]’,’click[b07jvvdj6l]’,’click[b07t3vmgdg]’,’click[b07dkgjr74]’,
’click[b08hf132pn]’,’click[b09p7h5yk7]’,’click[b07hrfsnl4]’, ’click[b08226ndzw]’,’click[b09q8rd8yn]’.
================= Instruction =================
Now it’s your turn to take one action for the current step.You should first reason step-by-step about the current situation, then think carefully which admissible action best advances the shopping goal. This reasoning process MUST be enclosed within tags. Once you’ve finished your reasoning, you should choose an admissible action for current step and present it within tags.
Action : " Okay, let me try to figure out what to do here. The user is looking for a specific type of men’s shirt, but with some conflicting details. The goal is to find wash cold, machine wash men’s shirts with polyester heathers, heathers cotton, cotton heather, needle sleeve, classic fit, color brown, fit type women, size x-large, and under $40.Looking at the current observation, all the items listed are either t-shirts, tank tops, or other types of clothing. The first thing I notice is that all the items here are priced under $20 except for the Nautica shirt which has a price range up to $40.98. The user specified a price lower than $40, so that’s a bit of a problem because the Nautica shirt’s max price is over $40. But maybe there’s a specific item in the list that fits the criteria. Wait , the user is looking for men’s shirts, but the fit type is women. That seems contradictory. Maybe there’s a mistake here. Also, the description mentions ḧeathersẅhich are a type of cotton blend, but the items listed here don’t seem to match that.
⇑ \Uparrow
Hesitation in Agentic LLMs correlates with the occurrence of an ‘aha moment’
The Nautica shirt is a men’s solid crew neck short-sleeve pocket t-shirt, but the fit type is women. That’s confusing. The color is brown, but I don’t see any brown items here. The price is under $40, but the Nautica shirt is priced up to $40.98, which is over the limit. The other items are all under $20, which is under $40, but none of them seem to match the material or fit type. The user’s query includes specific terms like “polyester heathers”, “heathers cotton”, “cotton heather”, which are types of fabric. The items listed here are either novelty t-shirts, tank tops, or sweaters. None of them mention polyester or cotton heather. The Nautica shirt is cotton, but the fit is women’s, which might not fit the user’s requirement for men’s shirts. Also, the color brown isn’t present in any of the listed items. Since none of the items on page 1 seem to match the user’s criteria, the next logical step is to check the next page. The previous actions included clicking next >, but the"
⇑ \Uparrow
Overthinking caused truncation: missing closing reasoning and action tags
Turn-level repeated failure case. The figure below illustrates a representative failure mode in which the agent becomes trapped in repetitive unsuccessful interaction loops. Specifically, after executing an initial search and clicking a seemingly relevant product, the agent fails to verify whether the item satisfies the required size constraint. Lacking the necessary information to make a correct decision, it returns to the search page and reissues an identical query. This process repeats without meaningful progress, leading to redundant reasoning, repeated action patterns, and unnecessary token consumption. Such behavior reflects a breakdown in effective exploration, where the agent is unable to adapt its strategy based on newly observed information, ultimately resulting in stalled task completion and inefficient multi-turn interaction.
A Failure Case with Repeated Failure Turns
Turn 1
Action : " Okay, let’s see. The user wants to find men’s dress shirts with specific colors (heather grey) and a classic fit, size 3x large, under $30. Current observation is Search, and the admissible actions are to search with a query or click on search. Since the user hasn’t done any search yet, the first step is to perform a search with the right criteria. The query should include all the keywords: heather grey, classic fit, 3x large, and price under 30. The search term would be something like heather grey men classic fit 3x large under 30. That should narrow down the results to the relevant products. The next step is to execute this search. search[heather grey men classic fit 3x large under 30] "
Turn 2
Action : " Need to look at the first results. The first result is a Nautica t-shirt. It’s a heather grey and has a classic fit. The price is $16.05, which is under $30.
click[b07hrfsnl4] "
Turn 3
Action : " Well, let’s see. The user wants men’s dress shirts with heather grey, classic fit, and 3x large, under $30. The current observation shows details of that t-shirt. The color here is g̈rey heather,̈ which matches the heather grey the user wants. The size is not mentioned here. The current observation shows the size options. The t-shirt’s size is not specified here, but the options are x-small to 6x. The user needs 3x large. The current t-shirt’s size is not listed here, but the options are available. The next step is to check the size. I do not have this information and therefore cannot make a correct decision, so we should perform a new retrieval. click [back to search] "
Turn 4
Action : " Okay, let’s see. The user wants to find men’s dress shirts with specific colors (heather grey) and a classic fit, size 3x large, and under $30. The first step is to perform a search with the right criteria. The query should include all the keywords: heather grey, classic fit, 3x large, and price under 30. The search term would be something like ḧeather grey men classic fit 3x large under 30. That should narrow down the results to the relevant products. The next step is to execute this search. search[heather grey men classic fit 3x large under 30] "
⇑ \Uparrow
Falling into repeated failures across multiple interaction turns
T 2 PO Enables Decisive, Non-Redundant Actions . The following figure shows a successful WebShop interaction where the agent generates a valid, executable action under the same structured state representation. Concretely, the state is organized into three layers: (i) a task specification that encodes fine-grained attribute constraints ( e.g. , material, fit, color, size, and price); (ii) a memory context that summarizes the most recent observations and actions, providing short-horizon history for credit assignment and decision continuity; and (iii) a current observation that enumerates the present search-result page together with a closed admissible action set (click targets and navigation operations). This layout aligns the agent’s reasoning with the environment’s interface: decisions must be grounded in what is currently visible and what can be executed.
Importantly, the generated action content is not redundant. It serves as a compact, outcome-oriented control signal distilled from the multi-constraint reasoning process. Rather than repeating state tokens, the agent uses the memory context to infer that the current result page remains mismatched to the requested attributes, then selects a single pagination action that returns to the previous results page, advancing the episode without redundant turns.
A Successful Case with Valid Action from Our T 2 PO
System Prompt
System : You’re a helpful assistant.
User : You are browsing an online shop. Based on the instruction, buy a product that close to the production description. You need to search, read the search results, pick a product, choose the size and
color and buy. You should only choose action from the available actions list provided later.
Example process: I need a gingko light and 20x20 pillow cover that is hand painted.
First search[gingko light 20x20 pillow cover hand painted],
answer format: search[blanket with fleece throw] .
Valid answer is search[ ] or click[ ].
Turn 6
State :
================= User Prompt =================
"You are an expert autonomous agent operating in the WebShop e‑commerce environment. Your task is to: Find me wash cold, machine wash men’s shirts with polyester heathers, heathers cotton, cotton heather, needle sleeve, classic fit with color: brown, and fit type: women, and size: x-large, and price lower than 40.00 dollars. Prior to this step, you have already taken 5 step(s).
================= Memory Context =================
Below are the most recent 2 observations and the corresponding actions you took: [Observation 4: ’’Back to Search’ [SEP] ’Page 1 (Total results: 50)’ [SEP] ’Next >’ [SEP] ’B09P39QN2W’ [SEP] ’I Tackled 100 Day Of School Football Boy 100th Day School T-Shirt’ [SEP] ’$16.99’ [SEP] ’B09RVF4JP1’ [SEP] ’Once Upon A Time Queen Born In 1982 T Shirts It Was Me Tank Top’ [SEP] ’$19.99’ [SEP] ’B07JVVDJ6L’ [SEP] ’Canadian Tuxedo T Shirt - Novelty Denim Tux T-Shirt’ [SEP] ’$19.99’ [SEP] ’B07T3VMGDG’ [SEP] ’Tuxedo Cat 4th of July Hat Patriotic Gift Adults Kids Raglan Baseball Tee’ [SEP] ’$23.99’ [SEP] ’B07DKGJR74’ [SEP] ’Amazon Brand - Daily Ritual Women’s 100% Cotton Oversized Fit V-Neck Pullover Sweater’ [SEP] ’$20.66 to $29.2’ [SEP] ’B08HF132PN’ [SEP] ’Flamingo Pink waterbird costume Gift Premium T-Shirt’ [SEP] ’$19.99’ [SEP] ’B09P7H5YK7’ [SEP] ’It’s In My DNA Palestinian Shirt Arabic Gifts Palestine Flag Tank Top’ [SEP] ’$19.99’ [SEP] ’B07HRFSNL4’ [SEP] ’Nautica Men’s Solid Crew Neck Short-Sleeve Pocket T-Shirt’ [SEP] ’$16.05 to $40.98’ [SEP] ’B08226NDZW’ [SEP] ’Disney Winnie The Pooh Tigger Upside Down Portrait Tank Top’ [SEP] ’$25.99’ [SEP] ’B09Q8RD8YN’ [SEP] ’Bungo Stray Anime Dogs Anime Character, Long Sleeve, Sweatshirt, Hoodie, T shirt’ [SEP] ’$19.99’’, Action 4: ’click n̈ext >’̈]
[Observation 5: ⋯ \cdots ]
================= Current Observation =================
You are now at step 6 and your current observation is: ’Back to Search’ [SEP] ’Page 2 (Total results: 50)’ [SEP] ’ ’ [SEP] ’B07XPR3R7N’ [SEP] ’Marvel Avengers: Endgame Captain America America’s Language T-Shirt’ [SEP] ’$22.99’ [SEP] ’B07Q48JVT8’ [SEP] ’Marvel Infinity War Thanos Streetwear Poster Graphic Hoodie’ [SEP] ’$47.99’ [SEP] ’B09GL561XH’ [SEP] ’Disney Mickey And Friends Happy Birthday Mickey Confetti T-Shirt’ [SEP] ’$22.99’ [SEP] ’B079PH6955’ [SEP] ’Need Some Hank & A Drank Country Music T Shirt for Rednecks’ [SEP] ’$15.49’ [SEP] ’B00O30JLDK’ [SEP] ’Legendary Whitetails Men’s Non-Typical Long Sleeve T-Shirt’ [SEP] ’$10.52 to $40.5’ [SEP] ’B0969G2DH8’ [SEP] ’The cool twins just showed up T-Shirt’ [SEP] ’$17.99’ [SEP] ’B085WQKRRJ’ [SEP] ’CLANMILUMS Men’s Classic Comfort Soft Regular Fit Short Sleeve Henley T-Shirt Tee’ [SEP] ’$17.99 to $19.99’ [SEP] ’B09HX5CD2D’ [SEP] ’CSU Cleveland State University Vikings Property Fleece Drawstring Shorts Heather Charcoal’ [SEP] ’$39.95’ [SEP] ’B01MG1LTMS’ [SEP] ’Got Snow? Funny Snowmobile Snowboard Skiing Cold Weather Winter Sports Unisex Hooded Sweatshirt’ [SEP] ’$24.99 to $29.99’ [SEP] ’B09R9YCM6R’ [SEP] ’XXBR Summer T-shirts for Mens, Soldier Short Sleeve 3D Street Vintage Printed Shirt Slim Fit Muscle Casual Tee Tops’ [SEP] ’$8.98 to $11.99’.
Your admissible actions of the current situation are: [’click[back to search]’,’click[ ]’,’click[b07xpr3r7n]’,’click[b07q48jvt8]’,’click[b09gl561xh]’,’click[b079ph6955]’,’click[b00o30jldk]’,
’click[b0969g2dh8]’,’click[b085wqkrrj]’,’click[b09hx5cd2d]’,’click[b01mg1ltms]’,’click[b09r9ycm6r]’,].
================= Instruction =================
Now it’s your turn to take one action for the current step.You should first reason step-by-step about the current situation, then think carefully which admissible action best advances the shopping goal. This reasoning process MUST be enclosed within tags. Once you’ve finished your reasoning, you should choose an admissible action for current step and present it within tags.assistant
Action : " Okay, let’s tackle this step by step. The user is looking for a specific type of men’s shirt, but the search results seem to be going in the wrong direction. The initial query was for wash cold, machine wash men’s shirts with polyester heathers, heathers cotton, cotton heather, needle sleeve, classic fit, color brown, women’s fit, size x-large, under $40. Looking at the current observation, the search results are on page 2, and the items listed don’t match the criteria. For example, the first item is a Marvel Avengers T-shirt priced at $22.99, which is way too pricey and not the right type. The next items are things like a Thanos hoodie ($47.99), a Disney Mickey T-shirt ($22.99), a country music T-shirt ($15.49), and others. None of these seem to be men’s shirts with the specified materials or fit. The price is under $40, but some items are over that, like the Thanos hoodie. The materials mentioned (polyester heather, cotton heather) aren’t listed here either.Since the current page (page 2) doesn’t have any items matching the criteria, the next logical step is to go back to the previous page (page 1) to check if there’s a better result there. Alternatively, if page 1 also doesn’t have the right items, the user might need to adjust the search terms, but since the user hasn’t done that yet, it’s better to check the previous page first. So the best action here is ’click[ click[ "
⇑ \Uparrow
We can successfully parse the action!