Words & Weights: Streamlining Multi-Turn Interactions via Co-Adaptation
License: CC BY 4.0
arXiv:2603.01375v1 [cs.AI] 02 Mar 2026
Chenxing Wei, Hong Wang, Ying He, Zhongxiang Dai, Bo Jiang, F. Richard Yu, Yao Shu
Abstract
Test-time policy adaptation for multi-turn interactions (T²PAM) is essential for aligning Large Language Models (LLMs) with dynamic user needs at inference time. However, existing paradigms commonly treat test-time adaptation as a single-axis problem, either purely refining instructions (Prompt Engineering) or only adjusting weights (Test-Time Training), ignoring that interaction failures stem from a coupled mix of ambiguity and incapacity. We argue that these two optimization paths are not merely additive but synergistic: semantic clarity acts as a pre-conditioner for effective parameter updates. To this end, we propose ROSA2, a framework that reformulates interaction as a joint optimization problem over the heterogeneous space of Words and Weights. By mathematically decomposing the error signal, ROSA2 utilizes textual gradients to rectify intent ambiguity and parameter updates to bridge capability gaps. Theoretically, we prove that this co-adaptation strictly reduces the parameter shift required for convergence. Empirically, ROSA2 outperforms state-of-the-art baselines by 30% on MATH while reducing interaction turns by 40%, demonstrating that refining the context unlocks the true potential of parameter updates.
Machine Learning, ICML
1 Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities in general tasks (Yang et al., 2025; OpenAI, 2025; Google, 2025), increasingly serving as collaborative partners that engage in complex, multi-turn dialogues with users to solve open-ended problems (Yi et al., 2025). However, a fundamental mismatch persists between static training paradigms (e.g., SFT (Ouyang et al., 2022; Wei et al., 2025a), RLHF (Shao et al., 2024; Wei et al., 2025c)) and dynamic real-world deployments (Li et al., 2025b; Laban et al., 2025). Consequently, pre-trained models often falter in extended dialogues (Wang et al., 2024), exhibiting limited adaptability (Yi et al., 2025) and poor error-correction capabilities (Deshpande et al., 2025), as evidenced by the performance stagnation observed in (Wei et al., 2025b). To bridge this gap without the prohibitive cost of retraining, Test-Time Policy Adaptation for Multi-Turn Interactions (T²PAM) (Wei et al., 2025b) has emerged as a critical paradigm. This approach aims to optimize the model's policy in real time during multi-turn sessions, ensuring alignment with specific user preferences to significantly enhance response accuracy and acceptance rates.
Despite the promise of T²PAM, existing paradigms commonly treat test-time adaptation as a single-axis problem: either purely refining instructions (Prompt Engineering) (Yi et al., 2025) or, as in representative approaches like ROSA (Wei et al., 2025b) and TTRL (Zuo et al., 2025), only adjusting weights (Test-Time Training). In this paper, we challenge this bifurcated view by explicitly modeling the effective policy of an LLM as a coupled function $\pi(x,\theta)$ dependent on both its internal parameters (Weights) and the external context (Words). We argue that such conditional optimization strategies, which update one variable while freezing the other, overlook a fundamental reality: interaction failures stem from a coupled mix of context ambiguity and model incapacity (Keluskar et al., 2024). Addressing these factors in isolation proves insufficient: parameter-centric methods risk overfitting to noisy histories, while prompt-centric methods often hit capability ceilings. This misalignment ultimately harms downstream performance, leading to failures in generating correct responses (low accuracy) and unnecessarily prolonged interaction turns that severely degrade user acceptance (Tang et al., 2025a). Detailed related work is provided in Appendix A.
Figure 1: Overview of the ROSA2 Framework. We formulate T²PAM as a joint optimization problem over the coupled variables $\phi_{t}=\{x_{t+1},\theta_{t}\}$. During the Forward Phase (solid lines), the model generates a response $y_{t}$ conditioned on the history $H_{t-1}$. The Backward Phase (dashed lines) approximates the full gradient $\nabla_{\text{joint}}$ of the interaction loss $\mathcal{L}$ via two synergistic modules: the Textual Optimization (top, green) utilizes textual gradients ($\nabla_{x}$) to refine the user feedback into a clearer instruction ($x_{t+1}\rightarrow x_{t+1}^{*}$), resolving context ambiguity, while the Parameter Optimization (bottom, blue) employs gradient updates ($\nabla_{\theta}$) to adjust the adapter weights ($\theta_{t}\rightarrow\theta_{t+1}$), enhancing the model's intrinsic capability. This co-adaptation ensures the system becomes both "Clearer" in intent and "Stronger" in execution for the next turn.
To overcome this limitation, we argue that effective adaptation requires resolving a fundamental error-attribution question:
When a model fails in a multi-turn context, is it due to a lack of intrinsic capability (parameter misalignment) or a misunderstanding of the task intent (context ambiguity)?
Addressing these factors in isolation proves insufficient (Chen et al., 2025). Pure prompt engineering cannot remedy intrinsic capability deficits (Lee et al., 2025), whereas pure parameter adaptation is prone to learning spurious mappings from noisy inputs (Li et al., 2025a). As visualized in Figure 2(b), the optimization landscape of T²PAM is characterized by coupled semantic and parametric gaps. Approaching this coupled system via independent updates (analogous to following partial derivatives) often leads to convergence at suboptimal local minima: solely optimizing parameters gravitates towards an Overfitting Trap, while solely refining the context stalls in a Deficit Trap. Consequently, we posit that T²PAM must be reformulated as a joint optimization problem. Crucially, we argue that these optimization paths are not merely additive but synergistic, with semantic clarity acting as a pre-conditioner for parametric alignment. By prioritizing the elimination of semantic ambiguity, we cleanse the learning signal, ensuring that the gradient descent for parameters is strictly oriented towards the true task intent rather than fitting accumulated noise. This co-adaptation allows us to approximate the full gradient of the interaction objective, enabling a unified trajectory that effectively bypasses partial-optimization traps and accelerates convergence to the Success Zone of true user intents. This perspective aligns with recent research on model alignment (Liu et al., 2023; Bo et al., 2025).
Driven by this insight, we introduce ROSA2, a unified framework designed to approximate the full gradient of the interaction objective by co-adapting the semantic context and model parameters. Instead of treating error signals as a monolith, our approach effectively disentangles the optimization process: it employs textual gradients to sharpen the user intent (Words) and utilizes closed-form updates to enhance the model's intrinsic execution capabilities (Weights). Theoretically, we prove that this semantic pre-conditioning strictly bounds the magnitude of the parameter shifts required to reach the optimal policy. This theoretical advantage translates directly into empirical gains: ROSA2 establishes a new state of the art on multiple benchmarks with a 30% average accuracy improvement, while simultaneously cutting interaction costs by reducing average turns by 40%. These results validate our core hypothesis: precise context is the catalyst that maximizes the efficacy of parameter adaptation.
Figure 2: Empirical Observations and Theoretical Landscape. (a) Experimental results on MATH (Qwen3-8B) reveal that single-axis methods (Green/Blue solid lines) suffer from premature stagnation. However, the immediate recovery observed in the Switch experiments (Green/Blue dashed lines) suggests this bottleneck is structural. (b) We map these dynamics to the optimization landscape using consistent color and line styling: the Prompt-Only path (Green) stalls in the Deficit Trap (hitting capability ceilings), while the Param-Only path (Blue) gravitates towards the Overfitting Trap (memorizing noise). The dashed arrows in (b) visualize how the Switch Method escapes these local minima by activating the missing axis. Crucially, ROSA2 (Red) approximates the joint gradient $\nabla_{\text{joint}}$, forming an Optimal Trajectory that bypasses these traps and proceeds directly to the Success Zone, corresponding to the superior convergence shown in (a).
Our contributions are summarized as follows:
• We propose ROSA2, to the best of our knowledge the first work to reformulate test-time adaptation as a joint optimization of semantic context and model parameters, effectively resolving the error-attribution dilemma inherent in conditional optimization methods. (Section 3)
• We provide rigorous proofs showing that semantic refinement acts as a pre-conditioner to strictly reduce parameter shift (Theorem 4.1) and guarantee faster convergence to the optimal policy (Theorem 4.2). (Section 4)
• Extensive evaluations demonstrate that ROSA2 achieves state-of-the-art results across diverse domains (e.g., +30.8% on MATH) while reducing interaction turns by nearly 40%, leading to lower total latency with negligible memory overhead. (Section 5)
2 Motivation: The Traps of Conditional Optimization
T²PAM presents a joint optimization challenge involving both context ambiguity and model capability. Formally, we consider a policy $\pi$ parameterized by both the context $x$ (Words) and the model weights $\theta$ (Weights). We hypothesize that conditional optimization strategies, which update either $x$ or $\theta$ in isolation, inevitably converge to suboptimal states characterized by either persistent reasoning deficits (due to frozen parameters) or overfitting to noisy prompts (due to lack of context refinement).
2.1 Experimental Setup.
To empirically validate this hypothesis, we conducted a controlled study using the Qwen3-8B model on the MATH dataset (Hendrycks et al., 2021b), simulating a challenging 10-turn interaction scenario. We compared four distinct optimization settings to isolate the effects of different variables: (1) Standard Inference: the model performs multi-turn reasoning with both the prompt and model parameters frozen; (2) Prompt Optimization: we freeze the model parameters and exclusively update the system prompt using TextGrad; (3) Parameter Optimization: we fix the system prompt and exclusively update the model parameters via ROSA; (4) Switch Method: to test the limitations of conditional optimization, we switch the optimization axis at the observed stagnation point (Turn 5). Specifically, for the model initially optimizing prompts, we freeze the prompt and switch to updating parameters; conversely, for the model initially optimizing parameters, we freeze the weights and switch to updating the prompt.
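For concreteness, the four settings reduce to a per-turn update schedule over the two axes. The helper below is a hypothetical sketch of ours (the function name and setting labels are not from the paper) that returns which of (prompt, parameters) is updated at a given turn:

```python
def update_plan(setting: str, turn: int, switch_turn: int = 5):
    """Return (update_prompt, update_params) for a given turn.

    Hypothetical helper summarizing the four controlled settings;
    the Switch settings change axis at the stagnation point (Turn 5).
    """
    if setting == "standard":
        return (False, False)          # (1) everything frozen
    if setting == "prompt_only":
        return (True, False)           # (2) TextGrad on the system prompt
    if setting == "param_only":
        return (False, True)           # (3) ROSA on the adapter weights
    if setting == "switch_from_prompt":
        # (4a) prompt updates first, then parameters after the switch
        return (True, False) if turn < switch_turn else (False, True)
    if setting == "switch_from_param":
        # (4b) parameter updates first, then the prompt after the switch
        return (False, True) if turn < switch_turn else (True, False)
    raise ValueError(f"unknown setting: {setting}")
```
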
2.2 Observation: Stagnation and Recovery.
The empirical results in Figure 2(a) demonstrate a notable trend. The Baseline (gray dotted) exhibits limited self-correction capability, remaining nearly flat. Conditional optimization methods, despite initial gains, suffer from diminishing returns and eventual premature stagnation. Specifically, the Prompt-Only method is constrained by policy misalignment, where semantic updates fail to bridge the reasoning gap, while the Param-Only method plateaus early due to overfitting. Crucially, a turning point occurs upon intervention: implementing the Switch Method at Turn 5 (dashed curves) triggers a distinct performance improvement. This recovery indicates that the stagnation was driven by the limitations of conditional optimization.
2.3 Theory: Traps of Conditional Optimization.
We map these empirical results to the theoretical optimization landscape in Figure 2(b), identifying two distinct failure modes inherent to conditional updates. The stagnation of the Prompt-Only method corresponds to the Deficit Trap (green zone): when parameters are frozen, purely semantic updates cannot rectify intrinsic reasoning deficits, leaving the model stuck despite having a refined prompt. Conversely, the stagnation of the Param-Only method corresponds to the Overfitting Trap (blue zone): without context refinement, parameter updates risk overfitting to ambiguous prompts. The Switch experiments validate these traps: introducing the missing optimization dimension allows the model to escape the local minima (dashed arrows), confirming that both semantic clarity and parametric capability are required for sustained improvement.
2.4 From Conditional Optimization to Joint Optimization.
Building on the insight that semantic clarity and parametric capability must be co-adapted, we propose ROSA2, which implements a joint optimization strategy. By approximating the full gradient of the interaction objective from the very first turn, ROSA2 leverages the complementary strengths of semantic refinement and parametric adaptation to bypass both the Deficit and Overfitting Traps. As shown in Figure 2(a) (red solid), it follows an Optimal Trajectory, achieving significantly faster convergence and higher accuracy. The following section details the co-adaptation framework of ROSA2.
3 Joint Optimization via Full-Gradient Approximation
Building on the motivation in Section 2, we propose ROSA2, a novel framework that treats T²PAM as a joint optimization problem. By viewing the policy as a coupled function of Words (context) and Weights (parameters), ROSA2 approximates the full gradient of the interaction objective to strictly align the model's policy with the latent optimal user preference.
3.1 Problem Formulation: Joint Optimization in the Current Turn
As shown in Figure 1, for the $t$-th turn of a multi-turn interaction session, let $H_{t-1}=\{(x_{1},y_{1}),\dots,(x^{*}_{t-1},y_{t-1}),x^{*}_{t}\}$ denote the immutable interaction history accumulated prior to generating the current response, containing the completed dialogue pairs from previous turns and the refined query $x^{*}_{t}$ for the current turn. At the current turn $t$, the model operates with the composed parameters $\theta=\theta_{\text{base}}+\theta_{t}$, where $\theta_{t}$ represents the current learnable adapter weights. The response $y_{t}$ is generated according to the current policy $\pi_{\theta}$ conditioned on the history:

$$y_{t}\sim\pi_{t}(\cdot\mid H_{t-1},\theta). \tag{1}$$

Subsequently, the user provides feedback denoted as $x_{t+1}$, which serves as the raw query for the next turn. Distinct from standard paradigms, we treat this feedback $x_{t+1}$ as an optimizable variable (Words) alongside the model parameters $\theta_{t}$ (Weights). Thus, we define $\phi_{t}=\{x_{t+1},\theta_{t}\}$ as the set of joint optimization variables for the current step.
The Joint Optimal Policy Construction.
We postulate the existence of a Joint Optimal Policy $\pi^{*}$ that represents the ideal response distribution for the current turn. Following the principles of reward-weighted regression (Rafailov et al., 2023), we construct this target distribution by re-weighting the policy from the previous turn, denoted as $\pi_{t-1}$. In our setting, $\pi_{t-1}$ serves as the reference policy for the current adaptation step (Wei et al., 2025b). Formally:

$$\pi^{*}_{t}(y\mid H_{t-1})\triangleq\frac{1}{Z_{t}}\,\pi_{t-1}(y\mid H_{t-1})\exp\!\left(\frac{r(y)}{\beta}\right), \tag{2}$$

where $r(y)$ is the reward signal for the generated response derived from user feedback. Crucially, the partition function $Z_{t}$ depends solely on the previous policy $\pi_{t-1}$ and the fixed history:

$$Z_{t}=\mathbb{E}_{y\sim\pi_{t-1}}\!\left[\exp\!\left(\frac{r(y)}{\beta}\right)\right]. \tag{3}$$

Therefore, $Z_{t}$ is a constant scalar with respect to the current optimization variables $\phi_{t}=\{x_{t+1},\theta_{t}\}$.
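In a sampled setting, the target distribution of Eqs. (2)-(3) can be estimated by Monte Carlo over responses drawn from the previous-turn policy. The following is a minimal sketch of ours (the helper name `reweighted_target` is an assumption, not the paper's API), assuming a scalar reward function $r(y)$:

```python
import math

def reweighted_target(samples, reward, beta=1.0):
    """Monte-Carlo sketch of the reward-weighted target policy.

    samples: responses drawn from the previous-turn policy pi_{t-1}.
    reward:  callable r(y) returning a scalar reward.
    Returns the estimated partition function Z_t (Eq. 3) and the
    self-normalized target probabilities over the drawn samples (Eq. 2).
    """
    weights = [math.exp(reward(y) / beta) for y in samples]
    z_t = sum(weights) / len(weights)       # Z_t = E[exp(r(y)/beta)]
    total = sum(weights)
    probs = [w / total for w in weights]    # pi*_t restricted to the samples
    return z_t, probs
```

Because $Z_{t}$ is estimated from previous-turn samples only, it is indeed a constant with respect to the current variables $\phi_{t}$, matching the observation above.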
Optimization Objective.
Our goal is to update the current policy $\pi_{t}$ (parameterized by $x,\theta$) to approximate the target $\pi^{*}_{t}$. We formulate this as minimizing the Forward KL Divergence, denoted as the loss function $\mathcal{L}$:

$$\mathcal{L}(\phi_{t})=D_{\mathrm{KL}}\Big(\pi^{*}_{t}(\cdot\mid\phi_{t})\,\Big\|\,\pi_{t}(\cdot\mid\phi_{t})\Big). \tag{4}$$

Expanding the KL divergence:

$$\mathcal{L}(\phi_{t})=\underbrace{\mathbb{E}_{y\sim\pi^{*}_{t}}[\log\pi^{*}_{t}(y)]}_{-E(\pi^{*}_{t})}-\mathbb{E}_{y\sim\pi^{*}_{t}}[\log\pi_{t}(y\mid\phi_{t})]. \tag{5}$$

Since $\pi^{*}_{t}$ is fixed by the forward pass (determined by $\pi_{t-1}$ and $r$), its entropy $E(\pi^{*}_{t})$ is independent of the optimizable variables $\phi_{t}$. Consequently, minimizing the divergence is equivalent to minimizing the cross-entropy, i.e., maximizing the expected log-likelihood under the optimal policy:

$$\mathcal{L}(\phi_{t})\cong-\mathbb{E}_{y\sim\pi^{*}_{t}}\left[\log\pi_{t}(y\mid\phi_{t})\right]. \tag{6}$$
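The equivalence between Eqs. (4) and (6) is easy to verify numerically on discrete distributions: with $\pi^{*}_{t}$ fixed, KL divergence and cross-entropy differ only by the constant entropy term, so they rank candidate policies identically. A small self-contained check (function names are ours):

```python
import math

def kl(p, q):
    """Forward KL divergence D_KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -E_{y~p}[log q(y)]."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# With p fixed, KL and cross-entropy differ by the constant -H(p),
# so differences between candidate q's coincide exactly.
p = [0.7, 0.2, 0.1]
q1 = [0.5, 0.3, 0.2]
q2 = [0.6, 0.3, 0.1]
gap_kl = kl(p, q1) - kl(p, q2)
gap_ce = cross_entropy(p, q1) - cross_entropy(p, q2)
```
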
Total Derivative and Co-Adaptation.
To perform the update, we examine the total differential $d\mathcal{L}$ with respect to $\phi_{t}$. Using importance sampling to estimate the gradient expectation under the current policy distribution $\pi_{t}$:

$$\nabla_{\phi_{t}}\mathcal{L}=-\mathbb{E}_{y\sim\pi_{t}}\left[\frac{\pi^{*}_{t}(y)}{\pi_{t}(y)}\nabla_{\phi_{t}}\log\pi_{t}(y\mid\phi_{t})\right]=-\mathbb{E}_{y\sim\pi_{t}}\Bigg[\frac{1}{Z_{t}}\exp\!\left(\frac{r(y)}{\beta}\right)\nabla_{\phi_{t}}\log\pi_{t}(y\mid\phi_{t})\Bigg]. \tag{7}$$

Expanding the gradient operator $\nabla_{\phi_{t}}$ reveals the coupled nature of the optimization. To strictly decrease the divergence, the total change in the loss function must follow the full gradient in the joint space:

$$d\mathcal{L}\propto-\frac{1}{Z_{t}}\,\mathbb{E}_{y\sim\pi_{t}}\Bigg[\underbrace{\exp\!\left(\frac{r(y)}{\beta}\right)}_{\text{Reward Weight}}\Bigg(\underbrace{\nabla_{x}\log\pi_{t}\cdot dx}_{\text{Optimizing Prompt}}+\underbrace{\nabla_{\theta}\log\pi_{t}\cdot d\theta}_{\text{Optimizing Params}}\Bigg)\Bigg]. \tag{8}$$

Equation 8 theoretically mandates joint adaptation in T²PAM: since $Z_{t}$ is a constant scaling factor derived from the previous turn, approximating the joint optimal policy requires simultaneously rectifying the query $x_{t+1}$ and updating the parameters $\theta_{t}$ along the direction of the reward-weighted log-likelihood.
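Given samples from the current policy together with per-axis score functions, the reward-weighted gradient of Eq. (7) can be estimated as a plain sample average. In the sketch below, `grad_x_logp` and `grad_th_logp` are hypothetical callables standing in for autograd over the two axes; this is an illustration of ours, not the paper's implementation:

```python
import math

def joint_gradient(samples, reward, grad_x_logp, grad_th_logp, z_t, beta=1.0):
    """Sample-average estimate of the reward-weighted score in Eq. (7).

    Returns per-axis averages (g_x, g_theta); since grad L = -E[...],
    ascending along (+g_x, +g_theta) decreases the loss L.
    """
    n = len(samples)
    g_x = [0.0] * len(grad_x_logp(samples[0]))
    g_th = [0.0] * len(grad_th_logp(samples[0]))
    for y in samples:
        w = math.exp(reward(y) / beta) / z_t        # reward weight / Z_t
        for i, g in enumerate(grad_x_logp(y)):
            g_x[i] += w * g / n                      # "Optimizing Prompt" term
        for i, g in enumerate(grad_th_logp(y)):
            g_th[i] += w * g / n                     # "Optimizing Params" term
    return g_x, g_th
```
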
3.2 The ROSA2 Algorithm
Guided by the total differential derivation in Eq. 8, we propose ROSA2, a co-adaptation framework designed to iteratively approximate the joint optimal policy through multi-turn interactions. The complete protocol is detailed in Algorithm 1. The process begins by initializing the turn counter $t=1$, the learnable adapter parameters $\theta_{1}$ to zero, and the current history $H$ containing the initial user query $x_{1}$ (lines 1-2 in Algorithm 1). At each turn $t$, the workflow proceeds through two distinct phases:
Phase 1: Generation and Evaluation.
To leverage the adapted knowledge, the system first composes the effective model parameters $\theta$ by adding the current adapter weights $\theta_{t}$ to the frozen base model parameters $\theta_{\text{base}}$ (line 5). A response $\hat{y}_{t}$ is then generated using the current policy $\pi_{\theta}$, conditioned on the accumulated history $H$ (line 6). Subsequently, the system receives a binary reward $r_{t}$ and the user's feedback for the next turn, denoted as $x_{t+1}$ (line 7). If the response is accepted ($r_{t}=+1$) or the turn limit $T_{\max}$ is reached, the process terminates and returns $\hat{y}_{t}$ (lines 8-9).
Phase 2: Joint Optimization.
If the response is rejected ($r_{t}=-1$) and the session continues, ROSA2 triggers the co-adaptation process to jointly optimize the state for the next interaction. First, the Semantic Stream addresses context ambiguity. It utilizes the deficiency detected in the current response $\hat{y}_{t}$ to compute a semantic gradient, which is then used to refine the raw incoming feedback $x_{t+1}$ into a more precise and instructive query $x_{t+1}^{*}$ (lines 12-14). Uniquely, even if explicit user feedback is absent (i.e., $x_{t+1}=\emptyset$), this stream autonomously synthesizes a corrective query based on the gradient derived from the failure in $\hat{y}_{t}$. This ensures that the model receives a semantically optimized instruction for the next turn, regardless of whether the user provided specific guidance. By generating such fine-grained feedback in every iteration, we effectively minimize the semantic gap between the user's intent and the model's understanding.
Algorithm 1 ROSA2 Co-Adaptation Protocol
1: Input: Initial query $x_{1}$, base model parameters $\theta_{\text{base}}$, max turns $T_{\max}$.
2: Initialize: Turn counter $t\leftarrow 1$, adaptation parameters $\theta_{1}\leftarrow\mathbf{0}$, current history $H_{0}\leftarrow\{x_{1}\}$.
3: while $t\leq T_{\max}$ do
4:   // Phase 1: Generation and Evaluation
5:   Compose parameters: $\theta\leftarrow\theta_{\text{base}}+\theta_{t}$.
6:   Generate response: $\hat{y}_{t}\sim\pi(\cdot\mid H_{t-1},\theta)$.
7:   Receive reward $r_{t}$ and feedback $x_{t+1}$ (next-turn query) from Environment/User.
8:   if $r_{t}=+1$ or $t=T_{\max}$ then
9:     Return $\hat{y}_{t}$  // Task completed or limit reached
10:  end if
11:  // Phase 2: Joint Optimization
12:  // Step A: Semantic Update (TextGrad)
13:  Compute semantic gradient and refine query:
14:  $x_{t+1}^{*}\leftarrow x_{t+1}-\nabla_{\text{text}}\mathcal{L}(\hat{y}_{t})$
15:  // Step B: Parametric Update (ROSA)
16:  Construct target distribution $\pi^{*}$ using $\pi_{\theta}$ and $r_{t}$.
17:  $\theta_{t+1}\leftarrow\theta_{t}-\nabla_{\theta}\mathcal{L}(\theta\mid r_{t},\pi^{*},\pi_{\theta})$
18:  Update history: $H_{t}\leftarrow H_{t-1}\cup\{\hat{y}_{t},x_{t+1}^{*}\}$
19:  $t\leftarrow t+1$
20: end while
Simultaneously, the Parametric Stream utilizes the binary reward $r_{t}$ and the current policy $\pi_{\theta}$ to estimate the user's latent target policy $\pi^{*}$. It then computes a parameter update $\Delta\theta_{t}$ to force the model's policy $\pi_{t}$ to approximate $\pi^{*}$ (lines 15-17). The computational efficiency of this one-step update makes it highly suitable for real-time multi-turn interactions, allowing rapid iterative updates that eventually align the model's policy with the user's preferences.
Finally, the system prepares for the next iteration by updating the history $H$ to include the current response $\hat{y}_{t}$ and the refined query $x_{t+1}^{*}$, ensuring that subsequent generations are conditioned on the optimized context (lines 18-19).
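The control flow of Algorithm 1 can be sketched as a compact loop. The callables below (`generate`, `get_feedback`, `refine_query`, `update_adapter`) are hypothetical stand-ins for the base model, the user, the TextGrad step, and the ROSA update respectively; this is a structural sketch under those assumptions, not the paper's implementation:

```python
def rosa2_session(x1, generate, get_feedback, refine_query, update_adapter, t_max=10):
    """Skeleton of Algorithm 1: alternate generation with joint
    (semantic + parametric) co-adaptation until acceptance or t_max."""
    theta = 0.0                  # adapter weights theta_1 initialized to zero
    history = [x1]
    for t in range(1, t_max + 1):
        y_hat = generate(history, theta)          # Phase 1: respond from current policy
        r, x_next = get_feedback(y_hat)           # binary reward + raw next-turn query
        if r == +1 or t == t_max:
            return y_hat                          # accepted or turn limit reached
        x_star = refine_query(x_next, y_hat)      # Step A: semantic update (Words)
        theta = update_adapter(theta, r, y_hat)   # Step B: parametric update (Weights)
        history += [y_hat, x_star]                # condition next turn on refined context
    return None
```

With toy callables (a "model" that simply echoes its adapter state, a user who accepts once the output reaches 2), the loop terminates after three turns of co-adaptation.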
Advantages.
The ROSA2 framework provides a solution to T²PAM explicitly derived from the full-gradient approximation. By co-adapting both the semantic context and the model's parameters, it overcomes the limitations of conditional optimization baselines. Specifically, the Semantic Stream guarantees that the feedback provided to the model is consistently clear and correct, effectively addressing scenarios where explicit feedback is absent. Complementarily, the Parametric Stream ensures the model possesses the necessary capability to execute these instructions. This synergistic loop enables ROSA2 to robustly handle ambiguous inputs and recover from errors, significantly improving the success rate in complex multi-turn tasks.
4 Theoretical Results
Building upon the joint optimization formulation defined in Section 3.1, we now establish the convergence properties of the ROSA2 framework. Specifically, we analyze how the joint updates of the query $x$ and parameters $\theta$ (Eq. 8) theoretically drive the model's policy towards the latent optimal user policy $\pi_{\text{user}}^{*}$. The analysis proceeds in two stages. We first examine the mechanistic synergy in Section 4.1, proving that semantic refinement strictly reduces the norm of the required parameter shift (Theorem 4.1). We then extend this local property to a global perspective in Section 4.2, deriving a unified convergence bound (Theorem 4.2) that explicitly quantifies the reduction in divergence from the user's optimal policy while accounting for approximation errors.
4.1 Mechanism: Parametric Error Reduction
We first analyze the impact of optimizing the context $\mathbf{x}$ on the parametric optimization of $\theta$. A central insight is that refining the context $\mathbf{x}$ significantly reduces the magnitude of the parameter shifts required to achieve alignment. We formalize this phenomenon in the following theorem.
Theorem 4.1 (Reduction of Parameter Shift).
Let $\Delta\theta_{t}(\mathbf{x})$ be the solution to the linearized parameter update defined in Eq. (6) of ROSA (Wei et al., 2025b) given a query $\mathbf{x}$. If the query is successfully updated from $\mathbf{x}_{t}$ to $\mathbf{x}_{t}^{*}$ such that the semantic gap to the user intent is reduced, i.e., $D_{\mathrm{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi(\cdot\mid\mathbf{x}_{t}^{*}))<D_{\mathrm{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi(\cdot\mid\mathbf{x}_{t}))$, then

$$\|\Delta\theta_{t}(\mathbf{x}_{t}^{*})\|_{2}<\|\Delta\theta_{t}(\mathbf{x}_{t})\|_{2}. \tag{9}$$

Remark. The detailed proof is provided in Section B.1. Theorem 4.1 underscores the synergistic necessity of simultaneously updating $\mathbf{x}$ and $\theta$. By first aligning the input context with the model's existing knowledge boundary, we minimize the residual error that the parameters must correct.
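The intuition behind Theorem 4.1 can be reproduced in a toy linear model $f(\mathbf{x},\theta)=\langle\theta,\mathbf{x}\rangle$: the minimum-norm shift that moves the prediction to a target is the residual projected onto the input, so a refined query that shrinks the residual also shrinks the required shift. This toy example is ours, assuming a linear model rather than the paper's LLM setting:

```python
def min_norm_shift(x, theta, y_target):
    """Minimum-norm parameter shift d with <theta + d, x> = y_target.

    Closed form: d = r * x / ||x||^2, where r is the prediction residual.
    Toy linear stand-in for Delta-theta(x) in Theorem 4.1.
    """
    pred = sum(t * xi for t, xi in zip(theta, x))
    residual = y_target - pred
    nx2 = sum(xi * xi for xi in x)
    return [residual * xi / nx2 for xi in x]

def l2(v):
    return sum(vi * vi for vi in v) ** 0.5

theta = [1.0, 0.0]
# An ambiguous query leaves a large residual toward the target ...
shift_raw = min_norm_shift([1.0, 0.0], theta, y_target=3.0)
# ... while a refined query closer to the model's knowledge needs a smaller shift.
shift_refined = min_norm_shift([2.0, 1.0], theta, y_target=3.0)
```
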
Empirical Evidence. This mechanism is strongly supported by the experimental results in Figure 3. The parametric error of ROSA2 (blue line, $\|\Delta\theta\|^{2}$) is significantly reduced compared to the ROSA baseline (gray line), confirming that semantic refinement strictly reduces the optimization difficulty for the parametric stream.
4.2 Unified Convergence Bound
Building on Theorem 4.1, we derive a unified bound that quantifies the overall performance of Co-Adaptation. This result extends Theorem 4 of (Wei et al., 2025b) by explicitly accounting for the approximation errors.
Theorem 4.2 (Unified Convergence Bound).
Assume the log-policy function $\log\pi(\mathbf{y}\mid\mathbf{x},\theta)$ is $L$-Lipschitz smooth with respect to the joint state $\phi=\{\mathbf{x},\theta\}$. After $T$ turns of Co-Adaptation, the divergence between the final policy $\pi_{\phi_{T}}$ and the user optimal policy $\pi_{\text{user}}^{*}$ is bounded by:

$$D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{T}})\leq\underbrace{D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{0}})}_{\text{Initial Error}}-\underbrace{\frac{1}{\beta}\sum_{t=1}^{T}\pi_{\text{user}}^{*}(\mathbf{y}_{t}\mid\mathbf{x}^{*}_{t})}_{\text{Improvement}}+\underbrace{\frac{L}{2}\sum_{t=1}^{T}\left(\|\Delta\mathbf{x}_{t}\|^{2}_{2}+\|\Delta\theta_{t}\|^{2}_{2}\right)}_{\text{Approx. Error}}, \tag{10}$$

where $\|\Delta\mathbf{x}_{t}\|^{2}_{2}$ and $\|\Delta\theta_{t}\|^{2}_{2}$ represent the squared update steps of the semantic and parametric variables at turn $t$, respectively.
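When the per-turn quantities are logged, the right-hand side of Eq. (10) is straightforward bookkeeping. The helper below is a sketch of ours for evaluating the bound from hypothetical logs, not part of the paper's code:

```python
def convergence_bound(initial_kl, user_opt_probs, dx_norms, dtheta_norms, beta, lip):
    """Evaluate the right-hand side of the unified convergence bound.

    initial_kl:     D_KL(pi_user* || pi_phi0)           (Initial Error)
    user_opt_probs: pi_user*(y_t | x_t*) per turn        (drives Improvement)
    dx_norms / dtheta_norms: per-turn update norms       (drive Approx. Error)
    beta:           reward temperature; lip: smoothness constant L.
    """
    improvement = sum(user_opt_probs) / beta
    approx_error = (lip / 2) * sum(
        dx * dx + dth * dth for dx, dth in zip(dx_norms, dtheta_norms)
    )
    return initial_kl - improvement + approx_error
```

Note how smaller per-turn update norms (the effect established by Theorem 4.1) directly shrink the Approx. Error term, which is how semantic pre-conditioning tightens the bound.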
Figure 3: Dynamics of approximation error terms. The plot compares the baseline parametric error (gray) against the decomposed errors of ROSA2. The parametric error of ROSA2 (blue) is significantly reduced compared to the baseline, verifying Theorem 4.1. Furthermore, the total error of ROSA2 (red), which decays exponentially, remains lower than the baseline despite the additional semantic cost (green), verifying Theorem 4.2.
Table 1: Main Results on Standard Reasoning Benchmarks. We report accuracy (%) across mathematical (MATH, MATH-500), general (MMLU-R, SuperGPQA), multilingual (MT-AIME24, MT-MATH100), and code generation (HumanEval) tasks. Gains are calculated relative to the Baseline. Best scores are bolded, and second-best scores are underlined.

| Model | Method | MATH | MATH-500 | MMLU-R | SuperGPQA | MT-AIME24 | MT-MATH100 | HumanEval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-0.5B-Instruct | Baseline | 23.0 | 24.0 | 9.4 | 3.8 | 2.6 | 15.4 | 31.1 |
| | TextGrad | 31.2 (+8.2) | 29.6 (+5.6) | 12.4 (+3.0) | 3.8 (+0.0) | 2.2 (-0.4) | 18.4 (+3.0) | 36.6 (+5.5) |
| | ROSA | 29.2 (+6.2) | 30.4 (+6.4) | 11.4 (+2.0) | 4.0 (+0.2) | 3.8 (+1.2) | 19.6 (+4.2) | 38.4 (+7.3) |
| | ROSA2 | 40.8 (+17.8) | 39.6 (+15.6) | 18.4 (+9.0) | 6.4 (+2.6) | 4.4 (+1.8) | 25.2 (+9.8) | 44.5 (+13.4) |
| Qwen3-0.6B-Instruct | Baseline | 19.6 | 22.4 | 24.0 | 3.8 | 3.2 | 26.2 | 41.5 |
| | TextGrad | 65.0 (+45.4) | 62.0 (+39.6) | 46.4 (+22.4) | 3.8 (+0.0) | 7.0 (+3.8) | 62.2 (+36.0) | 65.8 (+24.4) |
| | ROSA | 66.2 (+46.6) | 63.0 (+40.6) | 48.8 (+24.8) | 4.0 (+0.2) | 7.2 (+4.0) | 62.0 (+35.8) | 72.0 (+30.5) |
| | ROSA2 | 70.8 (+51.2) | 71.6 (+49.2) | 50.0 (+26.0) | 6.4 (+2.6) | 9.6 (+6.4) | 73.4 (+47.2) | 81.7 (+40.2) |
| Qwen2.5-7B-Base | Baseline | 47.0 | 49.4 | 39.8 | 17.8 | 17.0 | 60.4 | 57.9 |
| | TextGrad | 54.8 (+7.8) | 54.0 (+4.6) | 60.2 (+20.4) | 46.4 (+28.6) | 37.6 (+20.6) | 75.4 (+15.0) | 72.0 (+14.0) |
| | ROSA | 63.4 (+16.4) | 62.4 (+13.0) | 60.2 (+20.4) | 47.8 (+30.0) | 37.0 (+20.0) | 70.4 (+10.0) | 74.4 (+16.5) |
| | ROSA2 | 68.4 (+21.4) | 67.2 (+17.8) | 63.0 (+23.2) | 48.8 (+31.0) | 37.8 (+20.8) | 78.2 (+17.8) | 79.9 (+22.0) |
| Qwen3-8B | Baseline | 50.0 | 42.8 | 57.0 | 24.2 | 29.4 | 75.2 | 78.0 |
| | TextGrad | 63.4 (+13.4) | 62.4 (+19.6) | 70.6 (+13.6) | 40.0 (+15.8) | 40.0 (+10.6) | 81.2 (+6.0) | 82.3 (+4.3) |
| | ROSA | 62.2 (+12.2) | 60.8 (+18.0) | 75.8 (+18.8) | 38.6 (+14.4) | 38.6 (+9.2) | 88.4 (+13.2) | 83.7 (+5.6) |
| | ROSA2 | 80.8 (+30.8) | 80.6 (+37.8) | 84.4 (+27.4) | 52.4 (+28.2) | 44.4 (+15.0) | 93.6 (+18.4) | 88.4 (+10.4) |
| DeepSeek-R1-Distill-Llama-8B | Baseline | 27.6 | 22.8 | 23.6 | 10.4 | 4.8 | 17.8 | 25.0 |
| | TextGrad | 34.0 (+6.4) | 31.6 (+8.8) | 43.4 (+19.8) | 20.8 (+10.4) | 16.2 (+11.4) | 30.4 (+12.6) | 39.0 (+14.0) |
| | ROSA | 37.8 (+10.2) | 37.6 (+14.8) | 42.8 (+19.2) | 21.4 (+11.0) | 17.2 (+12.4) | 38.6 (+20.8) | 39.3 (+14.3) |
| | ROSA2 | 54.2 (+26.6) | 54.6 (+31.8) | 59.4 (+35.8) | 35.0 (+24.6) | 21.4 (+16.6) | 50.6 (+32.8) | 40.2 (+15.2) |
en en
Figure 4: Performance trajectory on challenging benchmarks. We plot the accuracy on AIME25, GPQA-Diamond, M-IMO, and BigCodeBench-Hard as a function of interaction turns. ROSA2 (red line) demonstrates sustained accuracy improvements, successfully solving complex problems where baselines plateau.
Remark. The detailed proof is provided in Section B.2. Theorem 4.2 formally decomposes the convergence dynamics into three interconnected components. First, the Initial Error serves as the constant baseline divergence at the start of the interaction. Second, the Improvement term quantifies the cumulative error reduction driven by user feedback. Crucially, co-adaptation amplifies this term by refining the query into its correct form $\mathbf{x}_{t}^{*}$, which ensures the model generates responses $\mathbf{y}_{t}$ with significantly higher probability mass under the optimal user policy $\pi_{\text{user}}^{*}$. Finally, the Approx. Error reflects the penalty incurred from inexact updates. Although ROSA2 introduces an additional semantic cost $\|\Delta\mathbf{x}_{t}\|_{2}^{2}$, it mitigates the total error (red line in Figure 3) through the mechanism established in Theorem 4.1.
Empirical Evidence: As illustrated in Figure 3, as the query context $\mathbf{x}_{t}$ progressively approaches the optimal form $\mathbf{x}^{*}$, the squared norm of the semantic update $\|\Delta\mathbf{x}_{t}\|_{2}^{2}$ (green line) exhibits an exponential decay. Consequently, the total approximation error of ROSA2 (red line) is initially high due to the large semantic discrepancy, but rapidly drops and remains significantly lower than the single-stream baseline (gray line). This empirically validates that ROSA2 achieves a lower overall approximation error.
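These decay dynamics can be sketched with a toy simulation. The decay rates and initial magnitudes below are illustrative assumptions, not the measured values behind Figure 3; the point is only the qualitative behavior: the semantic cost decays exponentially, so the co-adapted total error soon falls below a single-stream parametric error.

```python
import math

# Illustrative simulation of the approximation-error terms in Theorem 4.2.
# All constants (decay rates, initial magnitudes) are assumed for illustration.
T = 20
baseline_param_err = [0.8 * 0.95**t for t in range(T)]             # single-stream ||Δθ_t||²
rosa2_param_err = [0.4 * 0.90**t for t in range(T)]                # reduced per Theorem 4.1
rosa2_semantic_err = [0.6 * math.exp(-0.5 * t) for t in range(T)]  # ||Δx_t||², exponential decay

rosa2_total = [p + s for p, s in zip(rosa2_param_err, rosa2_semantic_err)]

# Early on, the extra semantic cost can put the total above the baseline;
# once ||Δx_t||² has decayed, the total stays strictly below it.
crossover = next(t for t in range(T) if rosa2_total[t] < baseline_param_err[t])
print(f"ROSA2 total error drops below baseline at turn {crossover}")
```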
5 Empirical Results
Following the protocol in Section 2.1, we employ an automated pipeline across verifiable benchmarks, where correctness is validated via ground-truth matching (reasoning) or execution-based unit tests (coding/agents). Interactions persist until the model succeeds or reaches a turn limit, allowing simultaneous measurement of efficacy (success rate) and efficiency (turn count). Detailed experimental setups are deferred to Appendix C. Our analysis focuses on: (1) reasoning performance (Section 5.1), (2) adaptability in sparse-reward environments (Section 5.2), and (3) computational cost (Section 5.3).
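The interaction protocol above can be sketched as a simple loop. Here `model_answer` and `is_correct` are hypothetical stand-ins for the model's generation step and the verifier (ground-truth matching or unit tests); this is a minimal sketch of the measurement loop, not the paper's actual pipeline.

```python
# Multi-turn evaluation loop: interact until success or a turn limit,
# recording efficacy (success rate) and efficiency (average solve turn).
def evaluate(problems, model_answer, is_correct, max_turns=10):
    solved, turns_used = 0, []
    for problem in problems:
        feedback = None
        for turn in range(1, max_turns + 1):
            answer = model_answer(problem, feedback)
            if is_correct(problem, answer):
                solved += 1
                turns_used.append(turn)
                break
            # Failed attempt: carry feedback into the next turn.
            feedback = f"Attempt {turn} failed; please revise."
    success_rate = solved / len(problems)
    avg_turns = sum(turns_used) / len(turns_used) if turns_used else float("inf")
    return success_rate, avg_turns
```

A mock model that only succeeds once it has seen two rounds of feedback illustrates how both metrics fall out of the same run.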
5.1 Performance and Efficiency in Diverse Tasks
Overcoming Single-Axis Limitations. As shown in Table 1, ROSA2 consistently outperforms the single-axis baselines, TextGrad (Words-only) and ROSA (Weights-only), across all evaluated model sizes (0.5B to 8B) and domains. This performance gap validates our core hypothesis regarding error attribution: TextGrad, while effective at refining prompts, often hits a capability ceiling on hard tasks where the frozen model simply lacks the intrinsic knowledge to execute the instruction. Conversely, ROSA, by updating parameters on potentially ambiguous inputs, tends to overfit to noise, leading to the stagnation observed in Figure 4. ROSA2 breaks these bottlenecks: its semantic updates unlock the potential for parameter adaptation, enabling rapid improvements on challenging tasks.
Pre-Conditioning Effect. To understand the source of our efficiency, we analyze the interaction dynamics in Table 2. ROSA2 achieves the highest Correction Uplift (e.g., 81.4% on MATH), confirming that the Semantic Stream successfully rectifies initial misunderstandings. More importantly, ROSA2 significantly reduces the Avg Turn required to reach a solution (e.g., -40% compared to ROSA). This reduction provides empirical validation for Theorem 4.2: as the interaction progresses, continuous semantic refinement actively suppresses gradient estimation noise, ensuring that the cumulative Approximation Error remains significantly lower than that of ROSA. Consequently, this minimization leads to a tighter alignment with the latent user policy $\pi_{\text{user}}^{*}$, which directly translates into the observed higher correction rates and lower turn counts.
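The two dynamics metrics can be computed from a per-problem log of solve turns. The log format and the reading of Correction Uplift used here (the fraction of initially failed problems that are eventually solved) are illustrative assumptions, not the paper's exact bookkeeping.

```python
# Compute the Table 2 metrics from a list of solve turns per problem,
# where an entry is the turn at which the problem was solved, or None
# if it was never solved within the turn limit.
def interaction_metrics(solve_turns):
    # Problems that did not succeed on the first turn.
    initially_failed = [t for t in solve_turns if t is None or t > 1]
    # Of those, the ones the interaction eventually corrected.
    corrected = [t for t in initially_failed if t is not None]
    uplift = 100.0 * len(corrected) / len(initially_failed) if initially_failed else 0.0

    solved = [t for t in solve_turns if t is not None]
    avg_turn = sum(solved) / len(solved) if solved else float("inf")
    return uplift, avg_turn
```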
Table 2: Analysis of Interaction Dynamics on Qwen3-8B. Correction Uplift indicates the percentage of eventually solved problems that were corrected after the initial failure. Avg Turn denotes the average interaction turns required to solve a problem.

| Dataset | Method | Correction Uplift (↑) | Avg Turn (↓) |
|---|---|---|---|
| MATH | Baseline | 70.0% | 7.2 |
| | TextGrad | 75.1% (+5.1%) | 6.0 (-1.2) |
| | ROSA | 77.3% (+7.3%) | 6.3 (-0.9) |
| | ROSA2 | 81.4% (+11.4%) | 4.4 (-2.8) |
| MMLU | Baseline | 50.9% | 6.6 |
| | TextGrad | 59.5% (+8.6%) | 5.2 (-1.4) |
| | ROSA | 60.7% (+9.8%) | 5.0 (-1.6) |
| | ROSA2 | 64.9% (+14.0%) | 4.1 (-2.5) |
| MT-AIME24 | Baseline | 66.7% | 9.0 |
| | TextGrad | 74.5% (+7.8%) | 7.9 (-1.1) |
| | ROSA | 73.1% (+6.4%) | 8.2 (-0.8) |
| | ROSA2 | 77.5% (+10.8%) | 7.7 (-1.3) |
5.2 Adaptability in Sparse-Reward Environments
We evaluate ROSA2 on UI agent tasks (OSWorld (Xie et al., 2024), AndroidWorld (Rawles et al., 2025)) characterized by sparse rewards and precise execution demands. Table 3 confirms robust improvements across SFT and DPO backbones, highlighting the necessity of joint optimization. Single-axis methods fail here: TextGrad (Words-only) cannot rectify low-level motor precision errors, while ROSA (Weights-only) struggles to converge given sparse signals.
ROSA2 effectively navigates this dilemma by leveraging Semantic Pre-conditioning to bridge the reward gap. Specifically, the Textual Optimization module retrospectively analyzes the sequence of unrewarded actions, synthesizing fine-grained corrective instructions that pinpoint specific execution failures. This process effectively "densifies" the feedback, transforming a vague, delayed failure signal into a detailed supervision signal for the next attempt. Consequently, the Parameter Optimization step can use this clarified context as a pre-conditioner to fine-tune the execution policy with precision, rather than blindly searching a sparse reward landscape. This synergy, where semantic retrospective feedback guides parametric actuation, is the fundamental reason ROSA2 achieves superior adaptability in agentic tasks.
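The densification loop described above can be sketched structurally. The callables `run_episode`, `critique`, and `update` are hypothetical placeholders for the agent rollout, the Textual Optimization module, and the LoRA-style parameter step; this is a sketch of the control flow, not the paper's implementation.

```python
# Co-adaptation loop for a sparse-reward agent task: a failed (unrewarded)
# trajectory is turned into a denser corrective instruction, which then
# pre-conditions the parameter update for the next attempt.
def adapt_on_sparse_task(prompt, run_episode, critique, update, policy, max_attempts=5):
    instruction = prompt
    for attempt in range(1, max_attempts + 1):
        trajectory, reward = run_episode(policy, instruction)  # sparse 0/1 reward
        if reward > 0:
            return policy, attempt
        # Semantic stream: retrospective critique densifies the failure signal.
        instruction = critique(prompt, trajectory)
        # Parameter stream: fine-tune against the clarified context.
        policy = update(policy, instruction, trajectory)
    return policy, None
```

With stub callables (a policy modeled as an integer "skill level" that each update increments), the loop terminates as soon as an episode is rewarded.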
Table 3: Adaptability in sparse-reward environments (UI Agents).

| Model | OSWorld | AndroidWorld |
|---|---|---|
| UI-TARS-7B-SFT (Qin et al., 2025) | 13.2 | 27.6 |
| + TextGrad | 13.7 (+0.5) | 28.3 (+0.7) |
| + ROSA | 17.8 (+4.6) | 30.9 (+3.3) |
| + ROSA2 | 23.6 (+10.4) | 35.3 (+7.7) |
| UI-TARS-7B-DPO | 14.8 | 28.9 |
| + TextGrad | 14.9 (+0.1) | 28.7 (-0.2) |
| + ROSA | 18.0 (+3.2) | 31.7 (+2.8) |
| + ROSA2 | 24.4 (+10.6) | 36.6 (+7.7) |
5.3 Computational Cost Analysis
Finally, we analyze the practical deployment costs in terms of latency and memory. As shown in Table 4, ROSA2 achieves a remarkable reduction in Avg Time per problem, a gain driven by two synergistic factors: (i) Intra-turn efficiency: the continuous optimization of Words and Weights enables the model to resolve problems using significantly more concise Chain-of-Thought (CoT) trajectories, drastically cutting per-turn inference time; and (ii) Inter-turn efficiency: the reduction in total conversation turns established in Section 5.1. Regarding memory, ROSA2 introduces negligible overhead (at most +3.1 GB, on MATH), demonstrating that its high reasoning performance does not come at the cost of hardware accessibility.
Table 4: Computational Cost Analysis.

| Dataset | Method | Avg Time (s) (↓) | Peak Memory (GB) (↓) |
|---|---|---|---|
| MATH | Baseline | 334.5 | 90.6 |
| | ROSA2 | 297.6 (-36.9) | 93.7 (+3.1) |
| AIME25 | Baseline | 557.4 | 94.9 |
| | ROSA2 | 467.2 (-90.2) | 95.4 (+0.5) |
| HumanEval | Baseline | 543.7 | 94.8 |
| | ROSA2 | 521.3 (-22.4) | 95.2 (+0.4) |
| BigCodeBench-Hard | Baseline | 677.9 | 95.2 |
| | ROSA2 | 590.6 (-87.3) | 95.5 (+0.3) |
6 Conclusions
We introduced ROSA2, a joint optimization framework over context and parameters that effectively resolves the error attribution dilemma. By bypassing the local minima inherent to conditional baselines, ROSA2 achieves state-of-the-art accuracy with reduced latency across diverse benchmarks.
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning, specifically within the domain of test-time adaptation for multi-turn interactions. Our framework demonstrates that co-adapting context and parameters can unlock state-of-the-art performance on reasoning and agentic benchmarks. While this work primarily contributes to the technical efficiency and accuracy of LLMs, it also highlights the potential for more capable UI agents. We believe there are no specific negative societal consequences that must be highlighted here, beyond the general considerations associated with the deployment of increasingly capable generative AI models.
References
AIME (2025). AIME problems and solutions.
Bo, X., Li, R., Sun, Z., Dai, Q., Zhang, Z., Tian, Z., Chen, X., and Dong, Z. (2025). Prompt and parameter co-optimization for large language models. arXiv:2509.24245.
Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374.
Chen, M., Sun, R., Pfister, T., and Arik, S. O. (2025). Learning to clarify: Multi-turn conversations with action-based contrastive self-training. In The Thirteenth International Conference on Learning Representations (ICLR).
DeepSeek-AI, Guo, D., Yang, D., et al. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.
Deshpande, K., Sirdeshmukh, V., Mols, J. B., Jin, L., Hernandez-Cardona, E., Lee, D., Kritz, J., Primack, W. E., Yue, S., and Xing, C. (2025). MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp. 18632–18702.
Google (2025). Gemini 3. https://aistudio.google.com/models/gemini-3. Accessed: 2025.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2021a). Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations (ICLR).
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021b). Measuring mathematical problem solving with the MATH dataset. In NeurIPS.
Keluskar, A., Bhattacharjee, A., and Liu, H. (2024). Do LLMs understand ambiguity in text? A case study in open-world question answering. arXiv:2411.12395.
Laban, P., Hayashi, H., Zhou, Y., and Neville, J. (2025). LLMs get lost in multi-turn conversation. arXiv:2505.06120.
Lee, Y., Boen, J., and Finn, C. (2025). Feedback descent: Open-ended text optimization via pairwise comparison. arXiv:2511.07919.
Li, Y., Hu, X., Qu, X., Li, L., and Cheng, Y. (2025a). Test-time preference optimization: On-the-fly alignment via iterative textual feedback. In Forty-second International Conference on Machine Learning (ICML).
Li, Y., Shen, X., Yao, X., Ding, X., Miao, Y., Krishnan, R., and Padman, R. (2025b). Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv:2504.04717.
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. (2023). Let's verify step by step. arXiv:2305.20050.
Liu, J., Zhang, H., Zhuang, Z., Kang, Y., Wang, D., and Wang, B. (2023). Design from policies: Conservative test-time adaptation for offline policy optimization. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS).
OpenAI (2025). Introducing GPT-5.2. https://openai.com/index/introducing-gpt-5-2/. Accessed: 2025.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155.
Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., et al. (2025). UI-TARS: Pioneering automated GUI interaction with native agents. arXiv:2501.12326.
Qwen, Yang, A., Yang, B., et al. (2025). Qwen2.5 technical report. arXiv:2412.15115.
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS).
Rawles, C., Clinckemaillie, S., Chang, Y., Waltz, J., Lau, G., Fair, M., Li, A., Bishop, W., Li, W., Campbell-Ajala, F., Toyama, D., Berry, R., Tyamagundlu, D., Lillicrap, T., and Riva, O. (2025). AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv:2405.14573.
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. (2024). GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling (COLM).
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
Son, G., Hong, J., Ko, H., and Thorne, J. (2025). Linguistic generalizability of test-time scaling in mathematical reasoning. arXiv:2502.17407.
Tang, A., Soulier, L., and Guigue, V. (2025a). Clarifying ambiguities: On the role of ambiguity types in prompting methods for clarification generation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), New York, NY, USA, pp. 20–30.
Tang, A., Soulier, L., and Guigue, V. (2025b). Clarifying ambiguities: On the role of ambiguity types in prompting methods for clarification generation. arXiv:2504.12113.
Team, M., Du, X., Yao, Y., et al. (2025). SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines. arXiv:2502.14739.
Wang, X., Wang, Z., Liu, J., Chen, Y., Yuan, L., Peng, H., and Ji, H. (2024). MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. In The Twelfth International Conference on Learning Representations (ICLR).
Wei, C., Shu, Y., He, Y. T., and Yu, F. (2025a). Flexora: Flexible low-rank adaptation for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, pp. 14643–14682.
Wei, C., Wang, H., He, Y. T., Yu, F., and Shu, Y. (2025b). Test-time policy adaptation for enhanced multi-turn interactions with LLMs. In First Workshop on Multi-Turn Interactions in Large Language Models.
Wei, C., Yu, J., He, Y. T., Dong, H., Shu, Y., and Yu, F. (2025c). ReDit: Reward dithering for improved LLM policy optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS).
Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., and Yu, T. (2024). OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv:2404.07972.
Yang, A., Li, A., Yang, B., et al. (2025). Qwen3 technical report. arXiv:2505.09388.
Yi, Z., Ouyang, J., Xu, Z., Liu, Y., Liao, T., Luo, H., and Shen, Y. (2025). A survey on recent advances in LLM-based multi-turn dialogue systems. arXiv:2402.18013.
Zuo, Y., Zhang, K., Sheng, L., Qu, S., Cui, G., Zhu, X., Li, H., Zhang, Y., Long, X., Hua, E., Qi, B., Sun, Y., Ma, Z., Yuan, L., Ding, N., and Zhou, B. (2025). TTRL: Test-time reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS).
Appendix A Related Work
Adaptation via Context Refinement (Words). Approaches focusing on the "Words" axis, broadly categorized under Prompt Engineering, aim to optimize the external context $x$ while keeping the model parameters $\theta$ frozen. Yi et al. (2025) review the progression of these methods from manual instruction design to automated strategies that dynamically refine inputs to align with user needs. Recent work has further emphasized the importance of clarifying input intent; for instance, Tang et al. (2025b) investigate the role of ambiguity types in prompting, demonstrating that sharpening semantic clarity can improve generation quality. However, these context-centric optimization methods face a fundamental theoretical ceiling: they cannot induce capabilities that do not exist within the frozen parameters. As argued by Chen et al. (2025) and Lee et al. (2025), semantic refinement alone is insufficient to remedy intrinsic capability deficits. Consequently, such methods often plateau in what Wei et al. (2025b) describe as a "Deficit Trap," where the model understands the task intent but lacks the execution capacity to solve it.
Adaptation via Parameter Updates (Weights). Conversely, the paradigm of Test-Time Training (TTT) or Test-Time Policy Adaptation ($T^{2}$PAM) focuses on the "Weights" axis, allowing real-time updates of the internal parameters $\theta$ during inference. Wei et al. (2025b) introduced ROSA, a method that employs low-rank adaptation (LoRA) to minimize the divergence from a reward-weighted policy, effectively bridging the capability gap observed in frozen models. Similarly, Zuo et al. (2025) proposed Test-Time Reinforcement Learning (TTRL), which treats each interaction turn as a policy optimization step driven by reward signals. While these parameter-centric approaches offer a mechanism to enhance intrinsic model capabilities, they are highly sensitive to the quality of the learning signal. Li et al. (2025a) highlight that performing parameter updates on noisy or ambiguous interaction histories often leads to learning spurious correlations. Without the pre-conditioning of a clear context, these methods are prone to an "Overfitting Trap," resulting in performance degradation over extended interaction turns.
Appendix B Proofs
B.1 Proof of Theorem 4.1
The proof follows from the closed-form solution of the linearized parameter update in ROSA.
Step 1: The Residual-Driven Update. According to Eq. (6) in ROSA (Wei et al., 2025b), the parameter update $\Delta\theta$ is the least-squares solution to fitting the residual between the target distribution $\tilde{\pi}^{*}$ and the current policy $\pi$:
$$(J^{\top}J)\,\Delta\theta=J^{\top}R(x),\quad\text{where }R(x)=\tilde{\pi}^{*}(\cdot\mid x)-\pi(\cdot\mid x,\theta_{t-1}) \quad (11)$$
The magnitude of the update is bounded by the magnitude of this residual vector $R(x)$:
$$\|\Delta\theta_{t}(x)\|_{2}\leq\frac{1}{\sigma_{\min}(J)}\,\|R(x)\|_{2} \quad (12)$$
Step 2: Effect of Semantic Refinement. The Semantic Stream updates $x_{t}\to x_{t}^{*}$ to minimize the semantic discrepancy, effectively bringing the current policy's distribution closer to the user's optimal policy $\pi_{\text{user}}^{*}$. Since the target $\tilde{\pi}^{*}$ is constructed based on $\pi_{\text{user}}^{*}$, reducing the distance to $\pi_{\text{user}}^{*}$ also reduces the distance to $\tilde{\pi}^{*}$. Therefore, the refined query yields a smaller residual vector:
$$\|R(x_{t}^{*})\|_{2}<\|R(x_{t})\|_{2} \quad (13)$$
Step 3: Conclusion. Substituting the reduced residual into the bound from Step 1, with $C=1/\sigma_{\min}(J)$, we obtain:
$$\|\Delta\theta_{t}(x_{t}^{*})\|_{2}\leq C\,\|R(x_{t}^{*})\|_{2}<C\,\|R(x_{t})\|_{2}, \quad (14)$$
where $C\,\|R(x_{t})\|_{2}$ is the corresponding bound on $\|\Delta\theta_{t}(x_{t})\|_{2}$. Thus, optimizing the query reduces the norm of the required parameter update. ∎
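The bound in Step 1 can be checked numerically: for a full-column-rank Jacobian, the least-squares solution of $(J^{\top}J)\Delta\theta=J^{\top}R(x)$ satisfies $\|\Delta\theta\|_{2}\leq\|R(x)\|_{2}/\sigma_{\min}(J)$. The matrices below are random stand-ins, not quantities from the experiments.

```python
import numpy as np

# Numerical sanity check of the residual-driven update bound (Eq. 12).
rng = np.random.default_rng(0)
J = rng.normal(size=(8, 4))   # Jacobian; full column rank with high probability
R = rng.normal(size=8)        # residual between target and current policy

# Solve the normal equations (J^T J) Δθ = J^T R.
delta_theta = np.linalg.solve(J.T @ J, J.T @ R)

# Smallest singular value of J controls the bound.
sigma_min = np.linalg.svd(J, compute_uv=False).min()
bound = np.linalg.norm(R) / sigma_min
assert np.linalg.norm(delta_theta) <= bound + 1e-9
```

This holds because the least-squares solution equals the pseudo-inverse applied to $R$, whose operator norm is $1/\sigma_{\min}(J)$.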
B.2 Proof of Theorem 4.2
The proof relies on decomposing the total error into a "theoretical improvement" component and an "approximation error" component. We analyze the change in KL divergence at step $t$ by introducing the theoretical target $\tilde{\pi}_{t}^{*}$ as an intermediate point. Using the telescoping sum property, the total error after $T$ turns can be written as:
$$D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{T}})-D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{0}})=\sum_{t=1}^{T}\left(D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{t}})-D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{t-1}})\right) \quad (15)$$
$$=\sum_{t=1}^{T}\Bigg(\underbrace{D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\tilde{\pi}_{t}^{*})-D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{t-1}})}_{\text{Term A: Ideal Gain}}+\underbrace{D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{t}})-D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\tilde{\pi}_{t}^{*})}_{\text{Term B: Approximation Cost}}\Bigg)$$
Bounding Term A. Since our target construction follows reward-weighted regression, we invoke Theorem 2 from ROSA (Wei et al., 2025b), which guarantees monotonic error reduction:
$$D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\tilde{\pi}_{t}^{*})-D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{t-1}})\leq-\frac{1}{\beta}\,\mathbb{E}_{y\sim\pi_{\text{user}}^{*}}[r_{t}(y)] \quad (16)$$
Bounding Term B. Under the $L$-Lipschitz smoothness assumption on the joint policy $\pi(y\mid x,\theta)$, the divergence between the target and the actual policy is bounded by the squared Euclidean distance of the joint update $\phi=(x,\theta)$:
$$\text{Term B}\leq D_{\text{KL}}(\tilde{\pi}_{t}^{*}\,\|\,\pi_{\phi_{t}})\leq\frac{L}{2}\,\|\phi_{t}-\phi_{t-1}\|_{2}^{2}=\frac{L}{2}\left(\|\Delta x_{t}\|_{2}^{2}+\|\Delta\theta_{t}\|_{2}^{2}\right) \quad (17)$$
Summing these bounds over $t=1,\dots,T$ yields Theorem 4.2. ∎
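The telescoping identity that opens the decomposition can be verified numerically on arbitrary categorical policies. The distributions below are illustrative values chosen only to demonstrate the identity, not quantities from the paper.

```python
import math

# Verify the telescoping identity: summing per-turn KL changes recovers
# the total change D_KL(π*‖π_T) − D_KL(π*‖π_0).
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p_star = [0.7, 0.2, 0.1]
# A sequence of policies π_{φ_0}, …, π_{φ_3} drifting toward π*.
policies = [
    [0.20, 0.50, 0.30],
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
    [0.65, 0.23, 0.12],
]

total = kl(p_star, policies[-1]) - kl(p_star, policies[0])
telescoped = sum(
    kl(p_star, policies[t]) - kl(p_star, policies[t - 1])
    for t in range(1, len(policies))
)
assert abs(total - telescoped) < 1e-12
```

Since the policies drift toward $\pi^{*}$, the total change is negative, matching the "Improvement" term dominating in Theorem 4.2.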
Appendix C Experimental Setup
To rigorously evaluate the efficacy, efficiency, and generalizability of ROSA2, we conducted comprehensive experiments across a wide spectrum of tasks and model architectures. This section details the datasets, models, evaluation metrics, and reward mechanisms employed in our study.
C.1 Datasets
We assessed ROSA2 on a diverse suite of challenging benchmarks categorized into four distinct domains: Mathematical Reasoning, General Reasoning, Code Generation, and Multilingual Reasoning. Table 5 summarizes the statistics of these datasets.
Table 5: Summary of benchmarks used for evaluation. "N/A" denotes datasets primarily used for testing that lack a standard pre-defined training split.

| Domain | Dataset | Train Size | Test Size |
|---|---|---|---|
| Mathematical Reasoning | MATH | 7,500 | 5,000 |
| | AIME25 | N/A | 30 |
| | MATH-500 | N/A | 500 |
| General Reasoning | GPQA-diamond | N/A | 198 |
| | MMLU-Redux | N/A | 3,000 |
| | SuperGPQA | 26,500 | N/A |
| Code Generation | HumanEval | N/A | 164 |
| Multilingual Reasoning | MCLM | N/A | 156 |
Mathematical Reasoning. This domain targets complex, multi-step problem solving. We employed three standard benchmarks:
• MATH (Hendrycks et al., 2021b): A collection of 12,500 challenging high-school competition problems spanning algebra, geometry, and calculus.
• AIME25 (AIME, 2025): A curated set of 30 extremely difficult problems from the 2025 American Invitational Mathematics Examination, designed to probe advanced reasoning limits.
• MATH-500 (Lightman et al., 2023): A widely recognized evaluation subset of the MATH test set, consisting of 500 problems selected for efficient model assessment.
General Reasoning. To evaluate knowledge application across broad topics, we utilized three expert-level QA datasets:
• GPQA-diamond (Rein et al., 2024): A high-difficulty set of graduate-level questions written by domain experts; the "diamond" subset ensures the highest quality.
• MMLU-Redux (Hendrycks et al., 2021a): A refined version of the Massive Multitask Language Understanding benchmark, covering 57 subjects ranging from elementary math to professional law.
• SuperGPQA (Team et al., 2025): An expansion of GPQA containing over 26,000 expert-validated questions across 285 graduate-level disciplines.
Code Generation. We assessed code synthesis capabilities using HumanEval (Chen et al., 2021). This benchmark comprises 164 hand-written programming problems equipped with function signatures, docstrings, and unit tests to verify functional correctness.
Multilingual Reasoning. Cross-lingual reasoning was evaluated using MCLM (Son et al., 2025), which translates challenging English benchmarks into multiple languages. Our evaluation specifically focuses on the multilingual versions of the IMO, AIME, and MATH problems (M-IMO, MT-AIME24, and MT-MATH100).
Evaluation Protocol. To simulate real-world deployment, our primary evaluation is conducted on official, held-out test sets. In cases where a dedicated test set is unavailable, or for specific ablation studies, we utilized corresponding training or development sets. Specifically, for SuperGPQA, we sampled a portion of the training data for testing purposes; for all other benchmarks, we strictly used the standard test sets.
C.2
Models
We selected a diverse array of open-source Large Language Models (LLMs) to ensure the robustness of our findings irrespective of model architecture or scale. As detailed in Table 6, our selection includes instruction-tuned variants designed for chat and instruction-following tasks. Note that to mitigate potential data contamination concerns with the Qwen2.5 series on specific benchmarks, we also validated results using the more recent Qwen3 and DeepSeek-R1 models.
Table 6: Categorization of language models used in experiments.

Category           Model                          Params  Type
Small-Scale        Qwen2.5-0.5B-Instruct          0.5B    Instruct
Small-Scale        Qwen3-0.6B                     0.6B    Base
Large-Scale        Qwen2.5-7B-Instruct            7B      Instruct
Large-Scale        Qwen3-8B                       8B      Base
Reasoning-Focused  DeepSeek-R1-Distill-Llama-8B   8B      Reasoning
Reasoning-Focused  DeepSeek-R1-Distill-Qwen-7B    7B      Reasoning
Small-Scale Models.
To evaluate ROSA2 in resource-constrained settings, we selected compact models from the Qwen family: Qwen2.5-0.5B-Instruct (Qwen et al., 2025), optimized for instruction following, and Qwen3-0.6B (Yang et al., 2025), representing the newer generation with architectural enhancements.
Large-Scale Models.
We tested scalability using capable base models: Qwen2.5-7B-Instruct (Qwen et al., 2025), a standard 7B-parameter instruction-tuned model, and its successor Qwen3-8B (Yang et al., 2025).
Reasoning-Focused Models.
We specifically included the DeepSeek-R1 series (DeepSeek-AI et al., 2025), which is optimized via reinforcement learning for complex reasoning. We utilized distilled variants based on both the Llama (DeepSeek-R1-Distill-Llama-8B) and Qwen (DeepSeek-R1-Distill-Qwen-7B) architectures to allow for controlled architectural comparisons.
C.3
Evaluation Metrics
Our evaluation framework focuses on two critical dimensions: downstream task performance and computational efficiency.
Performance Metrics.
•
Accuracy: Defined as the proportion of unique problems correctly solved within a maximum of $K$ conversational turns. Let $\mathcal{P}$ be the set of problems and $S_i \in \{0,1\}$ an indicator variable with $S_i = 1$ if problem $i$ is solved at any turn $t \leq K$. Accuracy is calculated as:

$$\text{Accuracy}=\frac{\sum_{i\in\mathcal{P}}S_{i}}{|\mathcal{P}|} \qquad (18)$$
•
Correction Uplift: This metric quantifies the model's capacity to self-correct. It represents the percentage of problems initially answered incorrectly that were subsequently solved in later turns. Let $\mathcal{P}_{\text{fail}} \subset \mathcal{P}$ denote the problems failed at turn $t=1$. The metric is defined as:

$$\text{Correction Uplift}=\frac{\sum_{i\in\mathcal{P}_{\text{fail}}}S_{i}}{|\mathcal{P}_{\text{fail}}|}\times 100\% \qquad (19)$$
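The two metrics above can be computed directly from a per-problem record of turn outcomes. A minimal sketch, assuming a boolean matrix `solved[i][t]` (True iff problem $i$ was solved at turn $t+1$, for up to $K$ turns); names here are illustrative, not the authors' implementation:

```python
def accuracy(solved):
    # S_i = 1 if problem i is solved at any turn t <= K (Eq. 18)
    indicators = [any(turns) for turns in solved]
    return sum(indicators) / len(indicators)


def correction_uplift(solved):
    # Restrict to problems failed at turn 1 (P_fail), then report the
    # share eventually solved, as a percentage (Eq. 19).
    failed_first = [turns for turns in solved if not turns[0]]
    if not failed_first:
        return 0.0
    recovered = sum(any(turns) for turns in failed_first)
    return 100.0 * recovered / len(failed_first)
```

For example, with three problems where one is solved immediately, one is recovered at a later turn, and one is never solved, accuracy is 2/3 and correction uplift is 50%.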
Efficiency Metrics.
To measure computational overhead, we track:
•
Avg Time: The average wall-clock time required to solve a problem.
•
Peak GPU Memory: The maximum VRAM usage observed during the inference and update process.
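A minimal harness for the Avg Time metric (a hypothetical helper, not the authors' code), where `solve` stands in for one full multi-turn interaction; in a CUDA setting, peak GPU memory would be read separately, e.g. via `torch.cuda.max_memory_allocated()`:

```python
import time


def average_solve_time(problems, solve):
    # Average wall-clock seconds per problem across the benchmark.
    start = time.perf_counter()
    for problem in problems:
        solve(problem)
    return (time.perf_counter() - start) / len(problems)
```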
C.4
Reward Models
We employed two distinct reward mechanisms to simulate varying feedback granularities found in real-world applications.
Rule-Based Reward Model (Sparse Feedback).
This model simulates scenarios with definitive, binary judgments. It programmatically extracts the final answer (e.g., from a \boxed{} environment) and matches it against the ground truth. A reward of +1.0 is assigned for an exact match, and -1.0 otherwise. The core implementation logic is provided below.
Core logic for the rule-based reward model
```python
class MathVerifyRewardModel:
    def __init__(self, ground_truth_answer: str):
        self.ground_truth_answer = ground_truth_answer

    def get_reward(self, response_text: str) -> float:
        # Returns +1.0 for an exact match, -1.0 otherwise
        if compute_score(response_text, self.ground_truth_answer) == 1.0:
            return 1.0
        return -1.0


def compute_score(solution_str, ground_truth) -> float:
    # last_boxed_only_string, remove_boxed, and is_equiv are the
    # standard MATH evaluation helpers.
    retval = 0.0
    try:
        string_in_last_boxed = last_boxed_only_string(solution_str)
        if string_in_last_boxed is not None:
            answer = remove_boxed(string_in_last_boxed)
            if is_equiv(answer, ground_truth):
                retval = 1.0
    except Exception:
        pass
    return retval
```
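The listing above calls the MATH evaluation helpers `last_boxed_only_string` and `remove_boxed` without showing them. A minimal, self-contained sketch of their behavior (a simplified reimplementation, not the exact upstream code, which handles additional edge cases):

```python
def last_boxed_only_string(s):
    """Return the last \\boxed{...} substring of s, or None."""
    idx = s.rfind("\\boxed{")
    if idx < 0:
        return None
    depth = 0
    # Scan forward from the opening brace, tracking brace nesting so
    # that expressions like \boxed{\frac{1}{2}} are captured whole.
    for i in range(idx + len("\\boxed"), len(s)):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:
                return s[idx:i + 1]
    return None  # unbalanced braces


def remove_boxed(s):
    """Strip the surrounding \\boxed{...} wrapper."""
    prefix = "\\boxed{"
    if s.startswith(prefix) and s.endswith("}"):
        return s[len(prefix):-1]
    raise ValueError("not a \\boxed{...} string")
```

For instance, `remove_boxed(last_boxed_only_string("so the answer is \\boxed{42}."))` yields the string "42", which is then compared against the ground truth.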