Words & Weights: Streamlining Multi-Turn Interactions via Co-Adaptation
License: CC BY 4.0
arXiv:2603.01375v1 [cs.AI] 02 Mar 2026
Chenxing Wei, Hong Wang, Ying He, Zhongxiang Dai, Bo Jiang, F. Richard Yu, Yao Shu
Abstract
Test-time policy adaptation for multi-turn interactions (T²PAM) is essential for aligning Large Language Models (LLMs) with dynamic user needs at inference time. However, existing paradigms commonly treat test-time adaptation as a single-axis problem, either purely refining instructions (Prompt Engineering) or only adjusting weights (Test-Time Training), ignoring that interaction failures stem from a coupled mix of ambiguity and incapacity. We argue that these two optimization paths are not merely additive but synergistic: semantic clarity acts as a pre-conditioner for effective parameter updates. To this end, we propose ROSA2, a framework that reformulates interaction as a joint optimization problem over the heterogeneous space of Words and Weights. By mathematically decomposing the error signal, ROSA2 utilizes textual gradients to rectify intent ambiguity and parameter updates to bridge capability gaps. Theoretically, we prove that this co-adaptation strictly reduces the parameter shift required for convergence. Empirically, ROSA2 outperforms state-of-the-art baselines by 30% on MATH while reducing interaction turns by 40%, demonstrating that refining the context unlocks the true potential of parameter updates.
Machine Learning, ICML
1 Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities in general tasks (Yang et al., 2025; OpenAI, 2025; Google, 2025), increasingly serving as collaborative partners that engage in complex, multi-turn dialogues with users to solve open-ended problems (Yi et al., 2025). However, a fundamental mismatch persists between static training paradigms (e.g., SFT (Ouyang et al., 2022; Wei et al., 2025a), RLHF (Shao et al., 2024; Wei et al., 2025c)) and dynamic real-world deployments (Li et al., 2025b; Laban et al., 2025). Consequently, pre-trained models often falter in extended dialogues (Wang et al., 2024), exhibiting limited adaptability (Yi et al., 2025) and poor error-correction capabilities (Deshpande et al., 2025), as evidenced by the performance stagnation observed in (Wei et al., 2025b). To bridge this gap without the prohibitive cost of retraining, Test-Time Policy Adaptation for Multi-Turn Interactions (T²PAM) (Wei et al., 2025b) has emerged as a critical paradigm. This approach aims to optimize the model's policy in real time during multi-turn sessions, ensuring alignment with specific user preferences to significantly enhance response accuracy and acceptance rates.
Despite the promise of T²PAM, existing paradigms commonly treat test-time adaptation as a single-axis problem: either purely refining instructions (Prompt Engineering) (Yi et al., 2025) or, as in representative approaches like ROSA (Wei et al., 2025b) and TTRL (Zuo et al., 2025), only adjusting weights (Test-Time Training). In this paper, we challenge this bifurcated view by explicitly modeling the effective policy of an LLM as a coupled function $\pi(x,\theta)$ dependent on both its internal parameters (Weights) and the external context (Words). We argue that such conditional optimization strategies, which update one variable while freezing the other, overlook a fundamental reality: interaction failures stem from a coupled mix of context ambiguity and model incapacity (Keluskar et al., 2024). Addressing these factors in isolation proves insufficient: parameter-centric methods risk overfitting to noisy histories, while prompt-centric methods often hit capability ceilings. This misalignment ultimately harms downstream performance, leading to failures in generating correct responses (low accuracy) and unnecessarily prolonged interaction turns that severely degrade user acceptance (Tang et al., 2025a). Detailed related work is provided in Appendix A.
Figure 1: Overview of the ROSA2 Framework. We formulate T²PAM as a joint optimization problem over the coupled variables $\phi_{t}=\{x_{t+1},\theta_{t}\}$. During the Forward Phase (solid lines), the model generates a response $y_{t}$ conditioned on the history $H_{t-1}$. The Backward Phase (dashed lines) approximates the full gradient $\nabla_{\text{joint}}$ of the interaction loss $\mathcal{L}$ via two synergistic modules: the Textual Optimization (top, green) utilizes textual gradients ($\nabla_{x}$) to refine the user feedback into a clearer instruction ($x_{t+1}\rightarrow x_{t+1}^{*}$), resolving context ambiguity, while the Parameter Optimization (bottom, blue) employs gradient updates ($\nabla_{\theta}$) to adjust the adapter weights ($\theta_{t}\rightarrow\theta_{t+1}$), enhancing the model's intrinsic capability. This co-adaptation ensures the system becomes both "Clearer" in intent and "Stronger" in execution for the next turn.
To overcome this limitation, we argue that effective adaptation requires resolving a fundamental error-attribution question:
When a model fails in a multi-turn context, is it due to a lack of intrinsic capability (parameter misalignment) or a misunderstanding of the task intent (context ambiguity)?
Addressing these factors in isolation proves insufficient (Chen et al., 2025). Pure prompt engineering cannot remedy intrinsic capability deficits (Lee et al., 2025), whereas pure parameter adaptation is prone to learning spurious mappings from noisy inputs (Li et al., 2025a). As visualized in Figure 2(b), the optimization landscape of T²PAM is characterized by coupled semantic and parametric gaps. Approaching this coupled system via independent updates (analogous to following partial derivatives) often leads to convergence at suboptimal local minima: solely optimizing parameters gravitates towards an Overfitting Trap, while solely refining the context stalls in a Deficit Trap. Consequently, we posit that T²PAM must be reformulated as a joint optimization problem. Crucially, we argue that these optimization paths are not merely additive but synergistic, with semantic clarity acting as a pre-conditioner for parametric alignment. By prioritizing the elimination of semantic ambiguity, we cleanse the learning signal, ensuring that the gradient descent for parameters is strictly oriented towards the true task intent rather than fitting accumulated noise. This co-adaptation allows us to approximate the full gradient of the interaction objective, enabling a unified trajectory that effectively bypasses partial-optimization traps and accelerates convergence to the Success Zone of true user intents. This perspective aligns with recent research on model alignment (Liu et al., 2023; Bo et al., 2025).
Driven by this insight, we introduce ROSA2, a unified framework designed to approximate the full gradient of the interaction objective by co-adapting the semantic context and model parameters. Instead of treating error signals as a monolith, our approach effectively disentangles the optimization process: it employs textual gradients to sharpen the user intent (Words) and utilizes closed-form updates to enhance the model's intrinsic execution capabilities (Weights). Theoretically, we prove that this semantic pre-conditioning strictly bounds the magnitude of the parameter shifts required to reach the optimal policy. This theoretical advantage translates directly into empirical gains: ROSA2 establishes a new state of the art on multiple benchmarks with a 30% average accuracy improvement, while simultaneously cutting interaction costs by reducing average turns by 40%. These results validate our core hypothesis: precise context is the catalyst that maximizes the efficacy of parameter adaptation.
Figure 2: Empirical Observations and Theoretical Landscape. (a) Experimental results on MATH (Qwen3-8B) reveal that single-axis methods (Green/Blue solid lines) suffer from premature stagnation. However, the immediate recovery observed in the Switch experiments (Green/Blue dashed lines) suggests this bottleneck is structural. (b) We map these dynamics to the optimization landscape using consistent color and line styling: the Prompt-Only path (Green) stalls in the Deficit Trap (hitting capability ceilings), while the Param-Only path (Blue) gravitates towards the Overfitting Trap (memorizing noise). The dashed arrows in (b) visualize how the Switch Method escapes these local minima by activating the missing axis. Crucially, ROSA2 (Red) approximates the joint gradient $\nabla_{\text{joint}}$, forming an Optimal Trajectory that bypasses these traps and proceeds directly to the Success Zone, corresponding to the superior convergence shown in (a).
Our contributions are summarized as follows:
• We propose ROSA2, to the best of our knowledge the first work to reformulate test-time adaptation as a joint optimization of semantic context and model parameters, effectively resolving the error-attribution dilemma inherent in conditional optimization methods. (Section 3)
• We provide rigorous proofs showing that semantic refinement acts as a pre-conditioner to strictly reduce parameter shift (Theorem 4.1) and guarantee faster convergence to the optimal policy (Theorem 4.2). (Section 4)
• Extensive evaluations demonstrate that ROSA2 achieves state-of-the-art results across diverse domains (e.g., +30.8% on MATH) while reducing interaction turns by nearly 40%, leading to lower total latency with negligible memory overhead. (Section 5)
2 Motivation: The Traps of Conditional Optimization
T²PAM presents a joint optimization challenge involving both context ambiguity and model capability. Formally, we consider a policy $\pi$ parameterized by both the context $x$ (Words) and the model weights $\theta$ (Weights). We hypothesize that conditional optimization strategies, which update either $x$ or $\theta$ in isolation, inevitably converge to suboptimal states characterized by either persistent reasoning deficits (due to frozen parameters) or overfitting to noisy prompts (due to lack of context refinement).
2.1 Experimental Setup.
To empirically validate this hypothesis, we conducted a controlled study using the Qwen3-8B model on the MATH dataset (Hendrycks et al., 2021b), simulating a challenging 10-turn interaction scenario. We compared four distinct optimization settings to isolate the effects of different variables: (1) Standard Inference: the model performs multi-turn reasoning with both the prompt and model parameters frozen; (2) Prompt Optimization: we freeze the model parameters and exclusively update the system prompt using TextGrad; (3) Parameter Optimization: we fix the system prompt and exclusively update the model parameters via ROSA; (4) Switch Method: to test the limitations of conditional optimization, we switch the optimization axis at the observed stagnation point (Turn 5). Specifically, for the model initially optimizing prompts, we freeze the prompt and switch to updating parameters; conversely, for the model initially optimizing parameters, we freeze the weights and switch to updating the prompt.
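For concreteness, the four settings reduce to a per-turn update schedule over the two axes. The helper below is a hypothetical sketch of ours (the function name and setting labels are not from the paper) that returns which of (prompt, parameters) is updated at a given turn:

```python
def update_plan(setting: str, turn: int, switch_turn: int = 5):
    """Return (update_prompt, update_params) for a given turn.

    Hypothetical helper summarizing the four controlled settings;
    the Switch settings change axis at the stagnation point (Turn 5).
    """
    if setting == "standard":
        return (False, False)          # (1) everything frozen
    if setting == "prompt_only":
        return (True, False)           # (2) TextGrad on the system prompt
    if setting == "param_only":
        return (False, True)           # (3) ROSA on the adapter weights
    if setting == "switch_from_prompt":
        # (4a) prompt updates first, then parameters after the switch
        return (True, False) if turn < switch_turn else (False, True)
    if setting == "switch_from_param":
        # (4b) parameter updates first, then the prompt after the switch
        return (False, True) if turn < switch_turn else (True, False)
    raise ValueError(f"unknown setting: {setting}")
```
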
2.2 Observation: Stagnation and Recovery.
The empirical results in Figure 2(a) demonstrate a notable trend. The Baseline (gray dotted) exhibits limited self-correction capability, remaining nearly flat. Conditional optimization methods, despite initial gains, suffer from diminishing returns and eventual premature stagnation. Specifically, the Prompt-Only method is constrained by policy misalignment, where semantic updates fail to bridge the reasoning gap, while the Param-Only method plateaus early due to overfitting. Crucially, a turning point occurs upon intervention: implementing the Switch Method at Turn 5 (dashed curves) triggers a distinct performance improvement. This recovery indicates that the stagnation was driven by the limitations of conditional optimization.
2.3 Theory: Traps of Conditional Optimization.
We map these empirical results to the theoretical optimization landscape in Figure 2(b), identifying two distinct failure modes inherent to conditional updates. The stagnation of the Prompt-Only method corresponds to the Deficit Trap (green zone): when parameters are frozen, purely semantic updates cannot rectify intrinsic reasoning deficits, leaving the model stuck despite having a refined prompt. Conversely, the stagnation of the Param-Only method corresponds to the Overfitting Trap (blue zone): without context refinement, parameter updates risk overfitting to ambiguous prompts. The Switch experiments validate these traps: introducing the missing optimization dimension allows the model to escape the local minima (dashed arrows), confirming that both semantic clarity and parametric capability are required for sustained improvement.
2.4 From Conditional Optimization to Joint Optimization.
Building on the insight that semantic clarity and parametric capability must be co-adapted, we propose ROSA2, which implements a joint optimization strategy. By approximating the full gradient of the interaction objective from the very first turn, ROSA2 leverages the complementary strengths of semantic refinement and parametric adaptation to bypass both the Deficit and Overfitting Traps. As shown in Figure 2(a) (red solid), it follows an Optimal Trajectory, achieving significantly faster convergence and higher accuracy. The following section details the co-adaptation framework of ROSA2.
3 Joint Optimization via Full-Gradient Approximation
Building on the motivation in Section 2, we propose ROSA2, a novel framework that treats T²PAM as a joint optimization problem. By viewing the policy as a coupled function of Words (context) and Weights (parameters), ROSA2 approximates the full gradient of the interaction objective to strictly align the model's policy with the latent optimal user preference.
3.1 Problem Formulation: Joint Optimization in the Current Turn
As shown in Figure 1, for the $t$-th turn of a multi-turn interaction session, let $H_{t-1}=\{(x_{1},y_{1}),\dots,(x^{*}_{t-1},y_{t-1}),x^{*}_{t}\}$ denote the immutable interaction history accumulated prior to generating the current response, containing the completed dialogue pairs from previous turns and the refined query $x^{*}_{t}$ for the current turn. At the current turn $t$, the model operates with the composed parameters $\theta=\theta_{\text{base}}+\theta_{t}$, where $\theta_{t}$ represents the current learnable adapter weights. The response $y_{t}$ is generated according to the current policy $\pi_{\theta}$ conditioned on the history:

$$y_{t}\sim\pi_{t}(\cdot\mid H_{t-1},\theta). \tag{1}$$

Subsequently, the user provides feedback denoted as $x_{t+1}$, which serves as the raw query for the next turn. Distinct from standard paradigms, we treat this feedback $x_{t+1}$ as an optimizable variable (Words) alongside the model parameters $\theta_{t}$ (Weights). Thus, we define $\phi_{t}=\{x_{t+1},\theta_{t}\}$ as the set of joint optimization variables for the current step.
The Joint Optimal Policy Construction.
We postulate the existence of a Joint Optimal Policy $\pi^{*}$ that represents the ideal response distribution for the current turn. Following the principles of reward-weighted regression (Rafailov et al., 2023), we construct this target distribution by re-weighting the policy from the previous turn, denoted as $\pi_{t-1}$. In our setting, $\pi_{t-1}$ serves as the reference policy for the current adaptation step (Wei et al., 2025b). Formally:

$$\pi^{*}_{t}(y\mid H_{t-1})\triangleq\frac{1}{Z_{t}}\,\pi_{t-1}(y\mid H_{t-1})\exp\!\left(\frac{r(y)}{\beta}\right), \tag{2}$$

where $r(y)$ is the reward signal for the generated response derived from user feedback. Crucially, the partition function $Z_{t}$ depends solely on the previous policy $\pi_{t-1}$ and the fixed history:

$$Z_{t}=\mathbb{E}_{y\sim\pi_{t-1}}\!\left[\exp\!\left(\frac{r(y)}{\beta}\right)\right]. \tag{3}$$

Therefore, $Z_{t}$ is a constant scalar with respect to the current optimization variables $\phi_{t}=\{x_{t+1},\theta_{t}\}$.
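In a sampled setting, the target distribution of Eqs. (2)-(3) can be estimated by Monte Carlo over responses drawn from the previous-turn policy. The following is a minimal sketch of ours (the helper name `reweighted_target` is an assumption, not the paper's API), assuming a scalar reward function $r(y)$:

```python
import math

def reweighted_target(samples, reward, beta=1.0):
    """Monte-Carlo sketch of the reward-weighted target policy.

    samples: responses drawn from the previous-turn policy pi_{t-1}.
    reward:  callable r(y) returning a scalar reward.
    Returns the estimated partition function Z_t (Eq. 3) and the
    self-normalized target probabilities over the drawn samples (Eq. 2).
    """
    weights = [math.exp(reward(y) / beta) for y in samples]
    z_t = sum(weights) / len(weights)       # Z_t = E[exp(r(y)/beta)]
    total = sum(weights)
    probs = [w / total for w in weights]    # pi*_t restricted to the samples
    return z_t, probs
```

Because $Z_{t}$ is estimated from previous-turn samples only, it is indeed a constant with respect to the current variables $\phi_{t}$, matching the observation above.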
Optimization Objective.
Our goal is to update the current policy $\pi_{t}$ (parameterized by $x,\theta$) to approximate the target $\pi^{*}_{t}$. We formulate this as minimizing the Forward KL Divergence, denoted as the loss function $\mathcal{L}$:

$$\mathcal{L}(\phi_{t})=D_{\mathrm{KL}}\Big(\pi^{*}_{t}(\cdot\mid\phi_{t})\,\Big\|\,\pi_{t}(\cdot\mid\phi_{t})\Big). \tag{4}$$

Expanding the KL divergence:

$$\mathcal{L}(\phi_{t})=\underbrace{\mathbb{E}_{y\sim\pi^{*}_{t}}[\log\pi^{*}_{t}(y)]}_{-E(\pi^{*}_{t})}-\mathbb{E}_{y\sim\pi^{*}_{t}}[\log\pi_{t}(y\mid\phi_{t})]. \tag{5}$$

Since $\pi^{*}_{t}$ is fixed by the forward pass (determined by $\pi_{t-1}$ and $r$), its entropy $E(\pi^{*}_{t})$ is independent of the optimizable variables $\phi_{t}$. Consequently, minimizing the divergence is equivalent to minimizing the cross-entropy, i.e., maximizing the expected log-likelihood under the optimal policy:

$$\mathcal{L}(\phi_{t})\cong-\mathbb{E}_{y\sim\pi^{*}_{t}}\left[\log\pi_{t}(y\mid\phi_{t})\right]. \tag{6}$$
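The equivalence between Eqs. (4) and (6) is easy to verify numerically on discrete distributions: with $\pi^{*}_{t}$ fixed, KL divergence and cross-entropy differ only by the constant entropy term, so they rank candidate policies identically. A small self-contained check (function names are ours):

```python
import math

def kl(p, q):
    """Forward KL divergence D_KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -E_{y~p}[log q(y)]."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# With p fixed, KL and cross-entropy differ by the constant -H(p),
# so differences between candidate q's coincide exactly.
p = [0.7, 0.2, 0.1]
q1 = [0.5, 0.3, 0.2]
q2 = [0.6, 0.3, 0.1]
gap_kl = kl(p, q1) - kl(p, q2)
gap_ce = cross_entropy(p, q1) - cross_entropy(p, q2)
```
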
Total Derivative and Co-Adaptation.
To perform the update, we examine the total differential $d\mathcal{L}$ with respect to $\phi_{t}$. Using importance sampling to estimate the gradient expectation under the current policy distribution $\pi_{t}$:

$$\nabla_{\phi_{t}}\mathcal{L}=-\mathbb{E}_{y\sim\pi_{t}}\left[\frac{\pi^{*}_{t}(y)}{\pi_{t}(y)}\nabla_{\phi_{t}}\log\pi_{t}(y\mid\phi_{t})\right]=-\mathbb{E}_{y\sim\pi_{t}}\Bigg[\frac{1}{Z_{t}}\exp\!\left(\frac{r(y)}{\beta}\right)\nabla_{\phi_{t}}\log\pi_{t}(y\mid\phi_{t})\Bigg]. \tag{7}$$

Expanding the gradient operator $\nabla_{\phi_{t}}$ reveals the coupled nature of the optimization. To strictly decrease the divergence, the total change in the loss function must follow the full gradient in the joint space:

$$d\mathcal{L}\propto-\frac{1}{Z_{t}}\,\mathbb{E}_{y\sim\pi_{t}}\Bigg[\underbrace{\exp\!\left(\frac{r(y)}{\beta}\right)}_{\text{Reward Weight}}\Bigg(\underbrace{\nabla_{x}\log\pi_{t}\cdot dx}_{\text{Optimizing Prompt}}+\underbrace{\nabla_{\theta}\log\pi_{t}\cdot d\theta}_{\text{Optimizing Params}}\Bigg)\Bigg]. \tag{8}$$

Equation 8 theoretically mandates joint adaptation in T²PAM: since $Z_{t}$ is a constant scaling factor derived from the previous turn, approximating the joint optimal policy requires simultaneously rectifying the query $x_{t+1}$ and updating the parameters $\theta_{t}$ along the direction of the reward-weighted log-likelihood.
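Given samples from the current policy together with per-axis score functions, the reward-weighted gradient of Eq. (7) can be estimated as a plain sample average. In the sketch below, `grad_x_logp` and `grad_th_logp` are hypothetical callables standing in for autograd over the two axes; this is an illustration of ours, not the paper's implementation:

```python
import math

def joint_gradient(samples, reward, grad_x_logp, grad_th_logp, z_t, beta=1.0):
    """Sample-average estimate of the reward-weighted score in Eq. (7).

    Returns per-axis averages (g_x, g_theta); since grad L = -E[...],
    ascending along (+g_x, +g_theta) decreases the loss L.
    """
    n = len(samples)
    g_x = [0.0] * len(grad_x_logp(samples[0]))
    g_th = [0.0] * len(grad_th_logp(samples[0]))
    for y in samples:
        w = math.exp(reward(y) / beta) / z_t        # reward weight / Z_t
        for i, g in enumerate(grad_x_logp(y)):
            g_x[i] += w * g / n                      # "Optimizing Prompt" term
        for i, g in enumerate(grad_th_logp(y)):
            g_th[i] += w * g / n                     # "Optimizing Params" term
    return g_x, g_th
```
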
3.2 The ROSA2 Algorithm
Guided by the total differential derivation in Eq. 8, we propose ROSA2, a co-adaptation framework designed to iteratively approximate the joint optimal policy through multi-turn interactions. The complete protocol is detailed in Algorithm 1. The process begins by initializing the turn counter $t=1$, the learnable adapter parameters $\theta_{1}$ to zero, and the current history $H$ containing the initial user query $x_{1}$ (lines 1-2 in Algorithm 1). At each turn $t$, the workflow proceeds through two distinct phases:
Phase 1: Generation and Evaluation.
To leverage the adapted knowledge, the system first composes the effective model parameters $\theta$ by adding the current adapter weights $\theta_{t}$ to the frozen base model parameters $\theta_{\text{base}}$ (line 5). A response $\hat{y}_{t}$ is then generated using the current policy $\pi_{\theta}$, conditioned on the accumulated history $H$ (line 6). Subsequently, the system receives a binary reward $r_{t}$ and the user's feedback for the next turn, denoted as $x_{t+1}$ (line 7). If the response is accepted ($r_{t}=+1$) or the turn limit $T_{\max}$ is reached, the process terminates and returns $\hat{y}_{t}$ (lines 8-9).
Phase 2: Joint Optimization.
If the response is rejected ($r_{t}=-1$) and the session continues, ROSA2 triggers the co-adaptation process to jointly optimize the state for the next interaction. First, the Semantic Stream addresses context ambiguity. It utilizes the deficiency detected in the current response $\hat{y}_{t}$ to compute a semantic gradient, which is then used to refine the raw incoming feedback $x_{t+1}$ into a more precise and instructive query $x_{t+1}^{*}$ (lines 12-14). Uniquely, even if explicit user feedback is absent (i.e., $x_{t+1}=\emptyset$), this stream autonomously synthesizes a corrective query based on the gradient derived from the failure in $\hat{y}_{t}$. This ensures that the model receives a semantically optimized instruction for the next turn, regardless of whether the user provided specific guidance. By generating such fine-grained feedback in every iteration, we effectively minimize the semantic gap between the user's intent and the model's understanding.
Algorithm 1 ROSA2 Co-Adaptation Protocol
1: Input: Initial query $x_{1}$, base model parameters $\theta_{\text{base}}$, max turns $T_{\max}$.
2: Initialize: Turn counter $t\leftarrow 1$, adaptation parameters $\theta_{1}\leftarrow\mathbf{0}$, current history $H_{0}\leftarrow\{x_{1}\}$.
3: while $t\leq T_{\max}$ do
4:   // Phase 1: Generation and Evaluation
5:   Compose parameters: $\theta\leftarrow\theta_{\text{base}}+\theta_{t}$.
6:   Generate response: $\hat{y}_{t}\sim\pi(\cdot\mid H_{t-1},\theta)$.
7:   Receive reward $r_{t}$ and feedback $x_{t+1}$ (next-turn query) from Environment/User.
8:   if $r_{t}=+1$ or $t=T_{\max}$ then
9:     Return $\hat{y}_{t}$  // Task completed or limit reached
10:  end if
11:  // Phase 2: Joint Optimization
12:  // Step A: Semantic Update (TextGrad)
13:  Compute semantic gradient and refine query:
14:  $x_{t+1}^{*}\leftarrow x_{t+1}-\nabla_{\text{text}}\mathcal{L}(\hat{y}_{t})$
15:  // Step B: Parametric Update (ROSA)
16:  Construct target distribution $\pi^{*}$ using $\pi_{\theta}$ and $r_{t}$.
17:  $\theta_{t+1}\leftarrow\theta_{t}-\nabla_{\theta}\mathcal{L}(\theta\mid r_{t},\pi^{*},\pi_{\theta})$
18:  Update history: $H_{t}\leftarrow H_{t-1}\cup\{\hat{y}_{t},x_{t+1}^{*}\}$
19:  $t\leftarrow t+1$
20: end while
Simultaneously, the Parametric Stream utilizes the binary reward $r_{t}$ and the current policy $\pi_{\theta}$ to estimate the user's latent target policy $\pi^{*}$. It then computes a parameter update $\Delta\theta_{t}$ to force the model's policy $\pi_{t}$ to approximate $\pi^{*}$ (lines 15-17). The computational efficiency of this one-step update makes it highly suitable for real-time multi-turn interactions, allowing rapid iterative updates that eventually align the model's policy with the user's preferences.
Finally, the system prepares for the next iteration by updating the history $H$ to include the current response $\hat{y}_{t}$ and the refined query $x_{t+1}^{*}$, ensuring that subsequent generations are conditioned on the optimized context (lines 18-19).
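The control flow of Algorithm 1 can be sketched as a compact loop. The callables below (`generate`, `get_feedback`, `refine_query`, `update_adapter`) are hypothetical stand-ins for the base model, the user, the TextGrad step, and the ROSA update respectively; this is a structural sketch under those assumptions, not the paper's implementation:

```python
def rosa2_session(x1, generate, get_feedback, refine_query, update_adapter, t_max=10):
    """Skeleton of Algorithm 1: alternate generation with joint
    (semantic + parametric) co-adaptation until acceptance or t_max."""
    theta = 0.0                  # adapter weights theta_1 initialized to zero
    history = [x1]
    for t in range(1, t_max + 1):
        y_hat = generate(history, theta)          # Phase 1: respond from current policy
        r, x_next = get_feedback(y_hat)           # binary reward + raw next-turn query
        if r == +1 or t == t_max:
            return y_hat                          # accepted or turn limit reached
        x_star = refine_query(x_next, y_hat)      # Step A: semantic update (Words)
        theta = update_adapter(theta, r, y_hat)   # Step B: parametric update (Weights)
        history += [y_hat, x_star]                # condition next turn on refined context
    return None
```

With toy callables (a "model" that simply echoes its adapter state, a user who accepts once the output reaches 2), the loop terminates after three turns of co-adaptation.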
Advantages.
The ROSA2 framework provides a solution to T²PAM explicitly derived from the full-gradient approximation. By co-adapting both the semantic context and the model's parameters, it overcomes the limitations of conditional optimization baselines. Specifically, the Semantic Stream guarantees that the feedback provided to the model is consistently clear and correct, effectively addressing scenarios where explicit feedback is absent. Complementarily, the Parametric Stream ensures the model possesses the necessary capability to execute these instructions. This synergistic loop enables ROSA2 to robustly handle ambiguous inputs and recover from errors, significantly improving the success rate in complex multi-turn tasks.
4 Theoretical Results
Building upon the joint optimization formulation defined in Section 3.1, we now establish the convergence properties of the ROSA2 framework. Specifically, we analyze how the joint updates of the query $x$ and parameters $\theta$ (Eq. 8) theoretically drive the model's policy towards the latent optimal user policy $\pi_{\text{user}}^{*}$. The analysis proceeds in two stages. We first examine the mechanistic synergy in Section 4.1, proving that semantic refinement strictly reduces the norm of the required parameter shift (Theorem 4.1). We then extend this local property to a global perspective in Section 4.2, deriving a unified convergence bound (Theorem 4.2) that explicitly quantifies the reduction in divergence from the user's optimal policy while accounting for approximation errors.
4.1 Mechanism: Parametric Error Reduction
We first analyze the impact of optimizing the context $\mathbf{x}$ on the parametric optimization of $\theta$. A central insight is that refining the context $\mathbf{x}$ significantly reduces the magnitude of the parameter shifts required to achieve alignment. We formalize this phenomenon in the following theorem.
Theorem 4.1 (Reduction of Parameter Shift).
Let $\Delta\theta_{t}(\mathbf{x})$ be the solution to the linearized parameter update defined in Eq. (6) of ROSA (Wei et al., 2025b) given a query $\mathbf{x}$. If the query is successfully updated from $\mathbf{x}_{t}$ to $\mathbf{x}_{t}^{*}$ such that the semantic gap to the user intent is reduced, i.e., $D_{\mathrm{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi(\cdot\mid\mathbf{x}_{t}^{*}))<D_{\mathrm{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi(\cdot\mid\mathbf{x}_{t}))$, then

$$\|\Delta\theta_{t}(\mathbf{x}_{t}^{*})\|_{2}<\|\Delta\theta_{t}(\mathbf{x}_{t})\|_{2}. \tag{9}$$

Remark. The detailed proof is provided in Section B.1. Theorem 4.1 underscores the synergistic necessity of simultaneously updating $\mathbf{x}$ and $\theta$. By first aligning the input context with the model's existing knowledge boundary, we minimize the residual error that the parameters must correct.
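The intuition behind Theorem 4.1 can be reproduced in a toy linear model $f(\mathbf{x},\theta)=\langle\theta,\mathbf{x}\rangle$: the minimum-norm shift that moves the prediction to a target is the residual projected onto the input, so a refined query that shrinks the residual also shrinks the required shift. This toy example is ours, assuming a linear model rather than the paper's LLM setting:

```python
def min_norm_shift(x, theta, y_target):
    """Minimum-norm parameter shift d with <theta + d, x> = y_target.

    Closed form: d = r * x / ||x||^2, where r is the prediction residual.
    Toy linear stand-in for Delta-theta(x) in Theorem 4.1.
    """
    pred = sum(t * xi for t, xi in zip(theta, x))
    residual = y_target - pred
    nx2 = sum(xi * xi for xi in x)
    return [residual * xi / nx2 for xi in x]

def l2(v):
    return sum(vi * vi for vi in v) ** 0.5

theta = [1.0, 0.0]
# An ambiguous query leaves a large residual toward the target ...
shift_raw = min_norm_shift([1.0, 0.0], theta, y_target=3.0)
# ... while a refined query closer to the model's knowledge needs a smaller shift.
shift_refined = min_norm_shift([2.0, 1.0], theta, y_target=3.0)
```
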
Empirical Evidence. This mechanism is strongly supported by the experimental results in Figure 3. The parametric error of ROSA2 (blue line, $\|\Delta\theta\|^{2}$) is significantly reduced compared to the ROSA baseline (gray line), confirming that semantic refinement strictly reduces the optimization difficulty for the parametric stream.
4.2 Unified Convergence Bound
Building on Theorem 4.1, we derive a unified bound that quantifies the overall performance of Co-Adaptation. This result extends Theorem 4 of (Wei et al., 2025b) by explicitly accounting for the approximation errors.
Theorem 4.2 (Unified Convergence Bound).
Assume the log-policy function $\log\pi(\mathbf{y}\mid\mathbf{x},\theta)$ is $L$-Lipschitz smooth with respect to the joint state $\phi=\{\mathbf{x},\theta\}$. After $T$ turns of Co-Adaptation, the divergence between the final policy $\pi_{\phi_{T}}$ and the user optimal policy $\pi_{\text{user}}^{*}$ is bounded by:

$$D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{T}})\leq\underbrace{D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{0}})}_{\text{Initial Error}}-\underbrace{\frac{1}{\beta}\sum_{t=1}^{T}\pi_{\text{user}}^{*}(\mathbf{y}_{t}\mid\mathbf{x}^{*}_{t})}_{\text{Improvement}}+\underbrace{\frac{L}{2}\sum_{t=1}^{T}\left(\|\Delta\mathbf{x}_{t}\|^{2}_{2}+\|\Delta\theta_{t}\|^{2}_{2}\right)}_{\text{Approx. Error}}, \tag{10}$$

where $\|\Delta\mathbf{x}_{t}\|^{2}_{2}$ and $\|\Delta\theta_{t}\|^{2}_{2}$ represent the squared update steps of the semantic and parametric variables at turn $t$, respectively.
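When the per-turn quantities are logged, the right-hand side of Eq. (10) is straightforward bookkeeping. The helper below is a sketch of ours for evaluating the bound from hypothetical logs, not part of the paper's code:

```python
def convergence_bound(initial_kl, user_opt_probs, dx_norms, dtheta_norms, beta, lip):
    """Evaluate the right-hand side of the unified convergence bound.

    initial_kl:     D_KL(pi_user* || pi_phi0)           (Initial Error)
    user_opt_probs: pi_user*(y_t | x_t*) per turn        (drives Improvement)
    dx_norms / dtheta_norms: per-turn update norms       (drive Approx. Error)
    beta:           reward temperature; lip: smoothness constant L.
    """
    improvement = sum(user_opt_probs) / beta
    approx_error = (lip / 2) * sum(
        dx * dx + dth * dth for dx, dth in zip(dx_norms, dtheta_norms)
    )
    return initial_kl - improvement + approx_error
```

Note how smaller per-turn update norms (the effect established by Theorem 4.1) directly shrink the Approx. Error term, which is how semantic pre-conditioning tightens the bound.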
Figure 3: Dynamics of approximation error terms. The plot compares the baseline parametric error (gray) against the decomposed errors of ROSA2. The parametric error of ROSA2 (blue) is significantly reduced compared to the baseline, verifying Theorem 4.1. Furthermore, the total error of ROSA2 (red), which decays exponentially, remains lower than the baseline despite the additional semantic cost (green), verifying Theorem 4.2.
Table 1: Main Results on Standard Reasoning Benchmarks. We report accuracy (%) across mathematical (MATH, MATH-500), general (MMLU-R, SuperGPQA), multilingual (MT-AIME24, MT-MATH100), and code generation (HumanEval) tasks. Gains are calculated relative to the Baseline. Best scores are bolded, and second-best scores are underlined.

| Model | Method | MATH | MATH-500 | MMLU-R | SuperGPQA | MT-AIME24 | MT-MATH100 | HumanEval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-0.5B-Instruct | Baseline | 23.0 | 24.0 | 9.4 | 3.8 | 2.6 | 15.4 | 31.1 |
| | TextGrad | 31.2 (+8.2) | 29.6 (+5.6) | 12.4 (+3.0) | 3.8 (+0.0) | 2.2 (-0.4) | 18.4 (+3.0) | 36.6 (+5.5) |
| | ROSA | 29.2 (+6.2) | 30.4 (+6.4) | 11.4 (+2.0) | 4.0 (+0.2) | 3.8 (+1.2) | 19.6 (+4.2) | 38.4 (+7.3) |
| | ROSA2 | 40.8 (+17.8) | 39.6 (+15.6) | 18.4 (+9.0) | 6.4 (+2.6) | 4.4 (+1.8) | 25.2 (+9.8) | 44.5 (+13.4) |
| Qwen3-0.6B-Instruct | Baseline | 19.6 | 22.4 | 24.0 | 3.8 | 3.2 | 26.2 | 41.5 |
| | TextGrad | 65.0 (+45.4) | 62.0 (+39.6) | 46.4 (+22.4) | 3.8 (+0.0) | 7.0 (+3.8) | 62.2 (+36.0) | 65.8 (+24.4) |
| | ROSA | 66.2 (+46.6) | 63.0 (+40.6) | 48.8 (+24.8) | 4.0 (+0.2) | 7.2 (+4.0) | 62.0 (+35.8) | 72.0 (+30.5) |
| | ROSA2 | 70.8 (+51.2) | 71.6 (+49.2) | 50.0 (+26.0) | 6.4 (+2.6) | 9.6 (+6.4) | 73.4 (+47.2) | 81.7 (+40.2) |
| Qwen2.5-7B-Base | Baseline | 47.0 | 49.4 | 39.8 | 17.8 | 17.0 | 60.4 | 57.9 |
| | TextGrad | 54.8 (+7.8) | 54.0 (+4.6) | 60.2 (+20.4) | 46.4 (+28.6) | 37.6 (+20.6) | 75.4 (+15.0) | 72.0 (+14.0) |
| | ROSA | 63.4 (+16.4) | 62.4 (+13.0) | 60.2 (+20.4) | 47.8 (+30.0) | 37.0 (+20.0) | 70.4 (+10.0) | 74.4 (+16.5) |
| | ROSA2 | 68.4 (+21.4) | 67.2 (+17.8) | 63.0 (+23.2) | 48.8 (+31.0) | 37.8 (+20.8) | 78.2 (+17.8) | 79.9 (+22.0) |
| Qwen3-8B | Baseline | 50.0 | 42.8 | 57.0 | 24.2 | 29.4 | 75.2 | 78.0 |
| | TextGrad | 63.4 (+13.4) | 62.4 (+19.6) | 70.6 (+13.6) | 40.0 (+15.8) | 40.0 (+10.6) | 81.2 (+6.0) | 82.3 (+4.3) |
| | ROSA | 62.2 (+12.2) | 60.8 (+18.0) | 75.8 (+18.8) | 38.6 (+14.4) | 38.6 (+9.2) | 88.4 (+13.2) | 83.7 (+5.6) |
| | ROSA2 | 80.8 (+30.8) | 80.6 (+37.8) | 84.4 (+27.4) | 52.4 (+28.2) | 44.4 (+15.0) | 93.6 (+18.4) | 88.4 (+10.4) |
| DeepSeek-R1-Distill-Llama-8B | Baseline | 27.6 | 22.8 | 23.6 | 10.4 | 4.8 | 17.8 | 25.0 |
| | TextGrad | 34.0 (+6.4) | 31.6 (+8.8) | 43.4 (+19.8) | 20.8 (+10.4) | 16.2 (+11.4) | 30.4 (+12.6) | 39.0 (+14.0) |
| | ROSA | 37.8 (+10.2) | 37.6 (+14.8) | 42.8 (+19.2) | 21.4 (+11.0) | 17.2 (+12.4) | 38.6 (+20.8) | 39.3 (+14.3) |
| | ROSA2 | 54.2 (+26.6) | 54.6 (+31.8) | 59.4 (+35.8) | 35.0 (+24.6) | 21.4 (+16.6) | 50.6 (+32.8) | 40.2 (+15.2) |
en en
Figure 4: Performance trajectory on challenging benchmarks. We plot the accuracy on AIME25, GPQA-Diamond, M-IMO, and BigCodeBench-Hard as a function of interaction turns. ROSA2 (red line) demonstrates sustained accuracy improvements, successfully solving complex problems where baselines plateau.
Remark. The detailed proof is provided in Section B.2. Theorem 4.2 formally decomposes the convergence dynamics into three interconnected components. First, the Initial Error serves as the constant baseline divergence at the start of the interaction. Second, the Improvement term quantifies the cumulative error reduction driven by user feedback. Crucially, co-adaptation amplifies this term by refining the query into its correct form $\mathbf{x}_{t}^{*}$, which ensures the model generates responses $\mathbf{y}_{t}$ with significantly higher probability mass under the optimal user policy $\pi_{\text{user}}^{*}$. Finally, the Approx. Error reflects the penalty incurred from inexact updates. Although ROSA2 introduces an additional semantic cost $\|\Delta\mathbf{x}_{t}\|_{2}^{2}$, it mitigates the total error (red line in Figure 3) through the mechanism established in Theorem 4.1.
Empirical Evidence: As illustrated in Figure 3, as the query context $\mathbf{x}_{t}$ progressively approaches the optimal form $\mathbf{x}^{*}$, the squared norm of the semantic update $\|\Delta\mathbf{x}_{t}\|_{2}^{2}$ (green line) exhibits an exponential decay. Consequently, the total approximation error of ROSA2 (red line) is initially high due to the large semantic discrepancy, but rapidly drops and remains significantly lower than the single-stream baseline (gray line). This empirically validates that ROSA2 achieves a lower overall approximation error.
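These decay dynamics can be sketched with a toy simulation. The decay rates and initial magnitudes below are illustrative assumptions, not the measured values behind Figure 3; the point is only the qualitative behavior: the semantic cost decays exponentially, so the co-adapted total error soon falls below a single-stream parametric error.

```python
import math

# Illustrative simulation of the approximation-error terms in Theorem 4.2.
# All constants (decay rates, initial magnitudes) are assumed for illustration.
T = 20
baseline_param_err = [0.8 * 0.95**t for t in range(T)]             # single-stream ||Δθ_t||²
rosa2_param_err = [0.4 * 0.90**t for t in range(T)]                # reduced per Theorem 4.1
rosa2_semantic_err = [0.6 * math.exp(-0.5 * t) for t in range(T)]  # ||Δx_t||², exponential decay

rosa2_total = [p + s for p, s in zip(rosa2_param_err, rosa2_semantic_err)]

# Early on, the extra semantic cost can put the total above the baseline;
# once ||Δx_t||² has decayed, the total stays strictly below it.
crossover = next(t for t in range(T) if rosa2_total[t] < baseline_param_err[t])
print(f"ROSA2 total error drops below baseline at turn {crossover}")
```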
5 Empirical Results
Following the protocol in Section 2.1, we employ an automated pipeline across verifiable benchmarks, where correctness is validated via ground-truth matching (reasoning) or execution-based unit tests (coding/agents). Interactions persist until the model succeeds or reaches a turn limit, allowing simultaneous measurement of efficacy (success rate) and efficiency (turn count). Detailed experimental setups are deferred to Appendix C. Our analysis focuses on: (1) reasoning performance (Section 5.1), (2) adaptability in sparse-reward environments (Section 5.2), and (3) computational cost (Section 5.3).
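The interaction protocol above can be sketched as a simple loop. Here `model_answer` and `is_correct` are hypothetical stand-ins for the model's generation step and the verifier (ground-truth matching or unit tests); this is a minimal sketch of the measurement loop, not the paper's actual pipeline.

```python
# Multi-turn evaluation loop: interact until success or a turn limit,
# recording efficacy (success rate) and efficiency (average solve turn).
def evaluate(problems, model_answer, is_correct, max_turns=10):
    solved, turns_used = 0, []
    for problem in problems:
        feedback = None
        for turn in range(1, max_turns + 1):
            answer = model_answer(problem, feedback)
            if is_correct(problem, answer):
                solved += 1
                turns_used.append(turn)
                break
            # Failed attempt: carry feedback into the next turn.
            feedback = f"Attempt {turn} failed; please revise."
    success_rate = solved / len(problems)
    avg_turns = sum(turns_used) / len(turns_used) if turns_used else float("inf")
    return success_rate, avg_turns
```

A mock model that only succeeds once it has seen two rounds of feedback illustrates how both metrics fall out of the same run.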
5.1 Performance and Efficiency in Diverse Tasks
Overcoming Single-Axis Limitations. As shown in Table 1, ROSA2 consistently outperforms the single-axis baselines, TextGrad (Words-only) and ROSA (Weights-only), across all evaluated model sizes (0.5B to 8B) and domains. This performance gap validates our core hypothesis regarding error attribution: TextGrad, while effective at refining prompts, often hits a capability ceiling on hard tasks where the frozen model simply lacks the intrinsic knowledge to execute the instruction. Conversely, ROSA, by updating parameters on potentially ambiguous inputs, tends to overfit to noise, leading to the stagnation observed in Figure 4. ROSA2 breaks these bottlenecks: its semantic updates unlock the potential for parameter adaptation, enabling rapid improvements on challenging tasks.
Pre-Conditioning Effect. To understand the source of our efficiency, we analyze the interaction dynamics in Table 2. ROSA2 achieves the highest Correction Uplift (e.g., 81.4% on MATH), confirming that the Semantic Stream successfully rectifies initial misunderstandings. More importantly, ROSA2 significantly reduces the Avg Turn required to reach a solution (e.g., -40% compared to ROSA). This reduction provides empirical validation for Theorem 4.2: as the interaction progresses, continuous semantic refinement actively suppresses gradient estimation noise, ensuring that the cumulative Approximation Error remains significantly lower than that of ROSA. Consequently, this minimization leads to a tighter alignment with the latent user policy $\pi_{\text{user}}^{*}$, which directly translates into the observed higher correction rates and lower turn counts.
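The two dynamics metrics can be computed from a per-problem log of solve turns. The log format and the reading of Correction Uplift used here (the fraction of initially failed problems that are eventually solved) are illustrative assumptions, not the paper's exact bookkeeping.

```python
# Compute the Table 2 metrics from a list of solve turns per problem,
# where an entry is the turn at which the problem was solved, or None
# if it was never solved within the turn limit.
def interaction_metrics(solve_turns):
    # Problems that did not succeed on the first turn.
    initially_failed = [t for t in solve_turns if t is None or t > 1]
    # Of those, the ones the interaction eventually corrected.
    corrected = [t for t in initially_failed if t is not None]
    uplift = 100.0 * len(corrected) / len(initially_failed) if initially_failed else 0.0

    solved = [t for t in solve_turns if t is not None]
    avg_turn = sum(solved) / len(solved) if solved else float("inf")
    return uplift, avg_turn
```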
Table 2: Analysis of Interaction Dynamics on Qwen3-8B. Correction Uplift indicates the percentage of eventually solved problems that were corrected after the initial failure. Avg Turn denotes the average interaction turns required to solve a problem.

| Dataset | Method | Correction Uplift (↑) | Avg Turn (↓) |
|---|---|---|---|
| MATH | Baseline | 70.0% | 7.2 |
| | TextGrad | 75.1% (+5.1%) | 6.0 (-1.2) |
| | ROSA | 77.3% (+7.3%) | 6.3 (-0.9) |
| | ROSA2 | 81.4% (+11.4%) | 4.4 (-2.8) |
| MMLU | Baseline | 50.9% | 6.6 |
| | TextGrad | 59.5% (+8.6%) | 5.2 (-1.4) |
| | ROSA | 60.7% (+9.8%) | 5.0 (-1.6) |
| | ROSA2 | 64.9% (+14.0%) | 4.1 (-2.5) |
| MT-AIME24 | Baseline | 66.7% | 9.0 |
| | TextGrad | 74.5% (+7.8%) | 7.9 (-1.1) |
| | ROSA | 73.1% (+6.4%) | 8.2 (-0.8) |
| | ROSA2 | 77.5% (+10.8%) | 7.7 (-1.3) |
5.2 Adaptability in Sparse-Reward Environments
We evaluate ROSA2 on UI agent tasks (OSWorld (Xie et al., 2024), AndroidWorld (Rawles et al., 2025)) characterized by sparse rewards and precise execution demands. Table 3 confirms robust improvements across SFT and DPO backbones, highlighting the necessity of joint optimization. Single-axis methods fail here: TextGrad (Words-only) cannot rectify low-level motor precision errors, while ROSA (Weights-only) struggles to converge given sparse signals.
ROSA2 effectively navigates this dilemma by leveraging Semantic Pre-conditioning to bridge the reward gap. Specifically, the Textual Optimization module retrospectively analyzes the sequence of unrewarded actions, synthesizing fine-grained corrective instructions that pinpoint specific execution failures. This process effectively "densifies" the feedback, transforming a vague, delayed failure signal into a detailed supervision signal for the next attempt. Consequently, the Parameter Optimization step can use this clarified context as a pre-conditioner to fine-tune the execution policy with precision, rather than blindly searching a sparse reward landscape. This synergy, where semantic retrospective feedback guides parametric actuation, is the fundamental reason ROSA2 achieves superior adaptability in agentic tasks.
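The densification loop described above can be sketched structurally. The callables `run_episode`, `critique`, and `update` are hypothetical placeholders for the agent rollout, the Textual Optimization module, and the LoRA-style parameter step; this is a sketch of the control flow, not the paper's implementation.

```python
# Co-adaptation loop for a sparse-reward agent task: a failed (unrewarded)
# trajectory is turned into a denser corrective instruction, which then
# pre-conditions the parameter update for the next attempt.
def adapt_on_sparse_task(prompt, run_episode, critique, update, policy, max_attempts=5):
    instruction = prompt
    for attempt in range(1, max_attempts + 1):
        trajectory, reward = run_episode(policy, instruction)  # sparse 0/1 reward
        if reward > 0:
            return policy, attempt
        # Semantic stream: retrospective critique densifies the failure signal.
        instruction = critique(prompt, trajectory)
        # Parameter stream: fine-tune against the clarified context.
        policy = update(policy, instruction, trajectory)
    return policy, None
```

With stub callables (a policy modeled as an integer "skill level" that each update increments), the loop terminates as soon as an episode is rewarded.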
Table 3: Adaptability in sparse-reward environments (UI Agents).

| Model | OSWorld | AndroidWorld |
|---|---|---|
| UI-TARS-7B-SFT (Qin et al., 2025) | 13.2 | 27.6 |
| + TextGrad | 13.7 (+0.5) | 28.3 (+0.7) |
| + ROSA | 17.8 (+4.6) | 30.9 (+3.3) |
| + ROSA2 | 23.6 (+10.4) | 35.3 (+7.7) |
| UI-TARS-7B-DPO | 14.8 | 28.9 |
| + TextGrad | 14.9 (+0.1) | 28.7 (-0.2) |
| + ROSA | 18.0 (+3.2) | 31.7 (+2.8) |
| + ROSA2 | 24.4 (+10.6) | 36.6 (+7.7) |
5.3 Computational Cost Analysis
Finally, we analyze the practical deployment costs in terms of latency and memory. As shown in Table 4, ROSA2 achieves a remarkable reduction in Avg Time per problem, a gain driven by two synergistic factors: (i) Intra-turn efficiency: the continuous optimization of Words and Weights enables the model to resolve problems using significantly more concise Chain-of-Thought (CoT) trajectories, drastically cutting per-turn inference time; and (ii) Inter-turn efficiency: the reduction in total conversation turns established in Section 5.1. Regarding memory, ROSA2 introduces negligible overhead (at most +3.1 GB, on MATH), demonstrating that its high reasoning performance does not come at the cost of hardware accessibility.
Table 4: Computational Cost Analysis.

| Dataset | Method | Avg Time (s) (↓) | Peak Memory (GB) (↓) |
|---|---|---|---|
| MATH | Baseline | 334.5 | 90.6 |
| | ROSA2 | 297.6 (-36.9) | 93.7 (+3.1) |
| AIME25 | Baseline | 557.4 | 94.9 |
| | ROSA2 | 467.2 (-90.2) | 95.4 (+0.5) |
| HumanEval | Baseline | 543.7 | 94.8 |
| | ROSA2 | 521.3 (-22.4) | 95.2 (+0.4) |
| BigCodeBench-Hard | Baseline | 677.9 | 95.2 |
| | ROSA2 | 590.6 (-87.3) | 95.5 (+0.3) |
6 Conclusions
We introduced ROSA2, a joint optimization framework over context and parameters that effectively resolves the error attribution dilemma. By bypassing the local minima inherent to conditional baselines, ROSA2 achieves state-of-the-art accuracy with reduced latency across diverse benchmarks.
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning, specifically within the domain of test-time adaptation for multi-turn interactions. Our framework demonstrates that co-adapting context and parameters can unlock state-of-the-art performance on reasoning and agentic benchmarks. While this work primarily contributes to the technical efficiency and accuracy of LLMs, it also highlights the potential for more capable UI agents. We believe there are no specific negative societal consequences that must be highlighted here, beyond the general considerations associated with the deployment of increasingly capable generative AI models.
References
AIME (2025). AIME problems and solutions.
Bo, X., Li, R., Sun, Z., Dai, Q., Zhang, Z., Tian, Z., Chen, X., and Dong, Z. (2025). Prompt and parameter co-optimization for large language models. arXiv:2509.24245.
Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374.
Chen, M., Sun, R., Pfister, T., and Arik, S. O. (2025). Learning to clarify: Multi-turn conversations with action-based contrastive self-training. In The Thirteenth International Conference on Learning Representations (ICLR).
DeepSeek-AI, Guo, D., Yang, D., et al. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.
Deshpande, K., Sirdeshmukh, V., Mols, J. B., Jin, L., Hernandez-Cardona, E., Lee, D., Kritz, J., Primack, W. E., Yue, S., and Xing, C. (2025). MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp. 18632–18702.
Google (2025). Gemini 3. https://aistudio.google.com/models/gemini-3. Accessed: 2025.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2021a). Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations (ICLR).
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021b). Measuring mathematical problem solving with the MATH dataset. In NeurIPS.
Keluskar, A., Bhattacharjee, A., and Liu, H. (2024). Do LLMs understand ambiguity in text? A case study in open-world question answering. arXiv:2411.12395.
Laban, P., Hayashi, H., Zhou, Y., and Neville, J. (2025). LLMs get lost in multi-turn conversation. arXiv:2505.06120.
Lee, Y., Boen, J., and Finn, C. (2025). Feedback descent: Open-ended text optimization via pairwise comparison. arXiv:2511.07919.
Li, Y., Hu, X., Qu, X., Li, L., and Cheng, Y. (2025a). Test-time preference optimization: On-the-fly alignment via iterative textual feedback. In Forty-second International Conference on Machine Learning (ICML).
Li, Y., Shen, X., Yao, X., Ding, X., Miao, Y., Krishnan, R., and Padman, R. (2025b). Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv:2504.04717.
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. (2023). Let's verify step by step. arXiv:2305.20050.
Liu, J., Zhang, H., Zhuang, Z., Kang, Y., Wang, D., and Wang, B. (2023). Design from policies: Conservative test-time adaptation for offline policy optimization. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS).
OpenAI (2025). Introducing GPT-5.2. https://openai.com/index/introducing-gpt-5-2/. Accessed: 2025.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155.
Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., et al. (2025). UI-TARS: Pioneering automated GUI interaction with native agents. arXiv:2501.12326.
Qwen, Yang, A., Yang, B., et al. (2025). Qwen2.5 technical report. arXiv:2412.15115.
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS).
Rawles, C., Clinckemaillie, S., Chang, Y., Waltz, J., Lau, G., Fair, M., Li, A., Bishop, W., Li, W., Campbell-Ajala, F., Toyama, D., Berry, R., Tyamagundlu, D., Lillicrap, T., and Riva, O. (2025). AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv:2405.14573.
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. (2024). GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling (COLM).
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
Son, G., Hong, J., Ko, H., and Thorne, J. (2025). Linguistic generalizability of test-time scaling in mathematical reasoning. arXiv:2502.17407.
Tang, A., Soulier, L., and Guigue, V. (2025a). Clarifying ambiguities: On the role of ambiguity types in prompting methods for clarification generation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), New York, NY, USA, pp. 20–30.
Tang, A., Soulier, L., and Guigue, V. (2025b). Clarifying ambiguities: On the role of ambiguity types in prompting methods for clarification generation. arXiv:2504.12113.
Team, M., Du, X., Yao, Y., et al. (2025). SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines. arXiv:2502.14739.
Wang, X., Wang, Z., Liu, J., Chen, Y., Yuan, L., Peng, H., and Ji, H. (2024). MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. In The Twelfth International Conference on Learning Representations (ICLR).
Wei, C., Shu, Y., He, Y. T., and Yu, F. (2025a). Flexora: Flexible low-rank adaptation for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, pp. 14643–14682.
Wei, C., Wang, H., He, Y. T., Yu, F., and Shu, Y. (2025b). Test-time policy adaptation for enhanced multi-turn interactions with LLMs. In First Workshop on Multi-Turn Interactions in Large Language Models.
Wei, C., Yu, J., He, Y. T., Dong, H., Shu, Y., and Yu, F. (2025c). ReDit: Reward dithering for improved LLM policy optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS).
Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., and Yu, T. (2024). OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv:2404.07972.
Yang, A., Li, A., Yang, B., et al. (2025). Qwen3 technical report. arXiv:2505.09388.
Yi, Z., Ouyang, J., Xu, Z., Liu, Y., Liao, T., Luo, H., and Shen, Y. (2025). A survey on recent advances in LLM-based multi-turn dialogue systems. arXiv:2402.18013.
Zuo, Y., Zhang, K., Sheng, L., Qu, S., Cui, G., Zhu, X., Li, H., Zhang, Y., Long, X., Hua, E., Qi, B., Sun, Y., Ma, Z., Yuan, L., Ding, N., and Zhou, B. (2025). TTRL: Test-time reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS).
Appendix A Related Work
Adaptation via Context Refinement (Words). Approaches focusing on the "Words" axis, broadly categorized under Prompt Engineering, aim to optimize the external context $x$ while keeping the model parameters $\theta$ frozen. Yi et al. (2025) review the progression of these methods from manual instruction design to automated strategies that dynamically refine inputs to align with user needs. Recent work has further emphasized the importance of clarifying input intent; for instance, Tang et al. (2025b) investigate the role of ambiguity types in prompting, demonstrating that sharpening semantic clarity can improve generation quality. However, these context-centric optimization methods face a fundamental theoretical ceiling: they cannot induce capabilities that do not exist within the frozen parameters. As argued by Chen et al. (2025) and Lee et al. (2025), semantic refinement alone is insufficient to remedy intrinsic capability deficits. Consequently, such methods often plateau in what Wei et al. (2025b) describe as a "Deficit Trap," where the model understands the task intent but lacks the execution capacity to solve it.
Adaptation via Parameter Updates (Weights). Conversely, the paradigm of Test-Time Training (TTT) or Test-Time Policy Adaptation ($T^{2}$PAM) focuses on the "Weights" axis, allowing real-time updates of the internal parameters $\theta$ during inference. Wei et al. (2025b) introduced ROSA, a method that employs low-rank adaptation (LoRA) to minimize the divergence from a reward-weighted policy, effectively bridging the capability gap observed in frozen models. Similarly, Zuo et al. (2025) proposed Test-Time Reinforcement Learning (TTRL), which treats each interaction turn as a policy optimization step driven by reward signals. While these parameter-centric approaches offer a mechanism to enhance intrinsic model capabilities, they are highly sensitive to the quality of the learning signal. Li et al. (2025a) highlight that performing parameter updates on noisy or ambiguous interaction histories often leads to learning spurious correlations. Without the pre-conditioning of a clear context, these methods are prone to an "Overfitting Trap," resulting in performance degradation over extended interaction turns.
Appendix B Proofs
B.1 Proof of Theorem 4.1
The proof follows from the closed-form solution of the linearized parameter update in ROSA.
Step 1: The Residual-Driven Update. According to Eq. (6) in ROSA (Wei et al., 2025b), the parameter update $\Delta\theta$ is the least-squares solution to fitting the residual between the target distribution $\tilde{\pi}^{*}$ and the current policy $\pi$:
$$(J^{\top}J)\,\Delta\theta=J^{\top}R(x),\quad\text{where }R(x)=\tilde{\pi}^{*}(\cdot\mid x)-\pi(\cdot\mid x,\theta_{t-1}) \quad (11)$$
The magnitude of the update is bounded by the magnitude of this residual vector $R(x)$:
$$\|\Delta\theta_{t}(x)\|_{2}\leq\frac{1}{\sigma_{\min}(J)}\,\|R(x)\|_{2} \quad (12)$$
Step 2: Effect of Semantic Refinement. The Semantic Stream updates $x_{t}\to x_{t}^{*}$ to minimize the semantic discrepancy, effectively bringing the current policy's distribution closer to the user's optimal policy $\pi_{\text{user}}^{*}$. Since the target $\tilde{\pi}^{*}$ is constructed based on $\pi_{\text{user}}^{*}$, reducing the distance to $\pi_{\text{user}}^{*}$ also reduces the distance to $\tilde{\pi}^{*}$. Therefore, the refined query yields a smaller residual vector:
$$\|R(x_{t}^{*})\|_{2}<\|R(x_{t})\|_{2} \quad (13)$$
Step 3: Conclusion. Substituting the reduced residual into the bound from Step 1, with $C=1/\sigma_{\min}(J)$, we obtain:
$$\|\Delta\theta_{t}(x_{t}^{*})\|_{2}\leq C\,\|R(x_{t}^{*})\|_{2}<C\,\|R(x_{t})\|_{2}, \quad (14)$$
where $C\,\|R(x_{t})\|_{2}$ is the corresponding bound on $\|\Delta\theta_{t}(x_{t})\|_{2}$. Thus, optimizing the query reduces the norm of the required parameter update. ∎
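The bound in Step 1 can be checked numerically: for a full-column-rank Jacobian, the least-squares solution of $(J^{\top}J)\Delta\theta=J^{\top}R(x)$ satisfies $\|\Delta\theta\|_{2}\leq\|R(x)\|_{2}/\sigma_{\min}(J)$. The matrices below are random stand-ins, not quantities from the experiments.

```python
import numpy as np

# Numerical sanity check of the residual-driven update bound (Eq. 12).
rng = np.random.default_rng(0)
J = rng.normal(size=(8, 4))   # Jacobian; full column rank with high probability
R = rng.normal(size=8)        # residual between target and current policy

# Solve the normal equations (J^T J) Δθ = J^T R.
delta_theta = np.linalg.solve(J.T @ J, J.T @ R)

# Smallest singular value of J controls the bound.
sigma_min = np.linalg.svd(J, compute_uv=False).min()
bound = np.linalg.norm(R) / sigma_min
assert np.linalg.norm(delta_theta) <= bound + 1e-9
```

This holds because the least-squares solution equals the pseudo-inverse applied to $R$, whose operator norm is $1/\sigma_{\min}(J)$.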
B.2 Proof of Theorem 4.2
The proof relies on decomposing the total error into a "theoretical improvement" component and an "approximation error" component. We analyze the change in KL divergence at step $t$ by introducing the theoretical target $\tilde{\pi}_{t}^{*}$ as an intermediate point. Using the telescoping sum property, the total error after $T$ turns can be written as:
$$D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{T}})-D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{0}})=\sum_{t=1}^{T}\left(D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{t}})-D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{t-1}})\right) \quad (15)$$
$$=\sum_{t=1}^{T}\Bigg(\underbrace{D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\tilde{\pi}_{t}^{*})-D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{t-1}})}_{\text{Term A: Ideal Gain}}+\underbrace{D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{t}})-D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\tilde{\pi}_{t}^{*})}_{\text{Term B: Approximation Cost}}\Bigg)$$
Bounding Term A. Since our target construction follows reward-weighted regression, we invoke Theorem 2 from ROSA (Wei et al., 2025b), which guarantees monotonic error reduction:
$$D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\tilde{\pi}_{t}^{*})-D_{\text{KL}}(\pi_{\text{user}}^{*}\,\|\,\pi_{\phi_{t-1}})\leq-\frac{1}{\beta}\,\mathbb{E}_{y\sim\pi_{\text{user}}^{*}}[r_{t}(y)] \quad (16)$$
Bounding Term B. Under the $L$-Lipschitz smoothness assumption on the joint policy $\pi(y\mid x,\theta)$, the divergence between the target and the actual policy is bounded by the squared Euclidean distance of the joint update $\phi=(x,\theta)$:
$$\text{Term B}\leq D_{\text{KL}}(\tilde{\pi}_{t}^{*}\,\|\,\pi_{\phi_{t}})\leq\frac{L}{2}\,\|\phi_{t}-\phi_{t-1}\|_{2}^{2}=\frac{L}{2}\left(\|\Delta x_{t}\|_{2}^{2}+\|\Delta\theta_{t}\|_{2}^{2}\right) \quad (17)$$
Summing these bounds over $t=1,\dots,T$ yields Theorem 4.2. ∎
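The telescoping identity that opens the decomposition can be verified numerically on arbitrary categorical policies. The distributions below are illustrative values chosen only to demonstrate the identity, not quantities from the paper.

```python
import math

# Verify the telescoping identity: summing per-turn KL changes recovers
# the total change D_KL(π*‖π_T) − D_KL(π*‖π_0).
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p_star = [0.7, 0.2, 0.1]
# A sequence of policies π_{φ_0}, …, π_{φ_3} drifting toward π*.
policies = [
    [0.20, 0.50, 0.30],
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
    [0.65, 0.23, 0.12],
]

total = kl(p_star, policies[-1]) - kl(p_star, policies[0])
telescoped = sum(
    kl(p_star, policies[t]) - kl(p_star, policies[t - 1])
    for t in range(1, len(policies))
)
assert abs(total - telescoped) < 1e-12
```

Since the policies drift toward $\pi^{*}$, the total change is negative, matching the "Improvement" term dominating in Theorem 4.2.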
Appendix C Experimental Setup
To rigorously evaluate the efficacy, efficiency, and generalizability of ROSA2, we conducted comprehensive experiments across a wide spectrum of tasks and model architectures. This section details the datasets, models, evaluation metrics, and reward mechanisms employed in our study.
C.1 Datasets
We assessed ROSA2 on a diverse suite of challenging benchmarks categorized into four distinct domains: Mathematical Reasoning, General Reasoning, Code Generation, and Multilingual Reasoning. Table 5 summarizes the statistics of these datasets.
Table 5: Summary of benchmarks used for evaluation. "N/A" denotes datasets primarily used for testing that lack a standard pre-defined training split.

| Domain | Dataset | Train Size | Test Size |
|---|---|---|---|
| Mathematical Reasoning | MATH | 7,500 | 5,000 |
| | AIME25 | N/A | 30 |
| | MATH-500 | N/A | 500 |
| General Reasoning | GPQA-diamond | N/A | 198 |
| | MMLU-Redux | N/A | 3,000 |
| | SuperGPQA | 26,500 | N/A |
| Code Generation | HumanEval | N/A | 164 |
| Multilingual Reasoning | MCLM | N/A | 156 |
Mathematical Reasoning. This domain targets complex, multi-step problem solving. We employed three standard benchmarks:
• MATH (Hendrycks et al., 2021b): A collection of 12,500 challenging high-school competition problems spanning algebra, geometry, and calculus.
• AIME25 (AIME, 2025): A curated set of 30 extremely difficult problems from the 2025 American Invitational Mathematics Examination, designed to probe advanced reasoning limits.
• MATH-500 (Lightman et al., 2023): A widely recognized evaluation subset of the MATH test set, consisting of 500 problems selected for efficient model assessment.
General Reasoning. To evaluate knowledge application across broad topics, we utilized three expert-level QA datasets:
• GPQA-diamond (Rein et al., 2024): A high-difficulty set of graduate-level questions written by domain experts; the "diamond" subset ensures the highest quality.
• MMLU-Redux (Hendrycks et al., 2021a): A refined version of the Massive Multitask Language Understanding benchmark, covering 57 subjects ranging from elementary math to professional law.
• SuperGPQA (Team et al., 2025): An expansion of GPQA containing over 26,000 expert-validated questions across 285 graduate-level disciplines.
Code Generation. We assessed code synthesis capabilities using HumanEval (Chen et al., 2021). This benchmark comprises 164 hand-written programming problems equipped with function signatures, docstrings, and unit tests to verify functional correctness.
Multilingual Reasoning. Cross-lingual reasoning was evaluated using MCLM (Son et al., 2025), which translates challenging English benchmarks into multiple languages. Our evaluation specifically focuses on the multilingual versions of the IMO, AIME, and MATH problems (M-IMO, MT-AIME24, and MT-MATH100).
Evaluation Protocol. To simulate real-world deployment, our primary evaluation is conducted on official, held-out test sets. In cases where a dedicated test set is unavailable, or for specific ablation studies, we utilized corresponding training or development sets. Specifically, for SuperGPQA, we sampled a portion of the training data for testing purposes; for all other benchmarks, we strictly used the standard test sets.
C.2
Models
We selected a diverse array of open-source Large Language Models (LLMs) to ensure the robustness of our findings irrespective of model architecture or scale. As detailed in Table 6, our selection includes instruction-tuned variants designed for chat and instruction-following tasks. Note that to mitigate potential data contamination concerns with the Qwen2.5 series on specific benchmarks, we also validated results using the more recent Qwen3 and DeepSeek-R1 models.
Table 6: Categorization of language models used in experiments.

Category           Model                          Params  Type
Small-Scale        Qwen2.5-0.5B-Instruct          0.5B    Instruct
Small-Scale        Qwen3-0.6B                     0.6B    Base
Large-Scale        Qwen2.5-7B-Instruct            7B      Instruct
Large-Scale        Qwen3-8B                       8B      Base
Reasoning-Focused  DeepSeek-R1-Distill-Llama-8B   8B      Reasoning
Reasoning-Focused  DeepSeek-R1-Distill-Qwen-7B    7B      Reasoning
Small-Scale Models.
To evaluate ROSA2 in resource-constrained settings, we selected compact models from the Qwen family: Qwen2.5-0.5B-Instruct (Qwen et al., 2025), optimized for instruction following, and Qwen3-0.6B (Yang et al., 2025), representing the newer generation with architectural enhancements.
Large-Scale Models.
We tested scalability using capable base models: Qwen2.5-7B-Instruct (Qwen et al., 2025), a standard 7B-parameter instruction-tuned model, and its successor Qwen3-8B (Yang et al., 2025).
Reasoning-Focused Models.
We specifically included the DeepSeek-R1 series (DeepSeek-AI et al., 2025), which is optimized via reinforcement learning for complex reasoning. We utilized distilled variants based on both the Llama (DeepSeek-R1-Distill-Llama-8B) and Qwen (DeepSeek-R1-Distill-Qwen-7B) architectures to allow for controlled architectural comparisons.
C.3
Evaluation Metrics
Our evaluation framework focuses on two critical dimensions: downstream task performance and computational efficiency.
Performance Metrics.
•
Accuracy: Defined as the proportion of unique problems correctly solved within a maximum of $K$ conversational turns. Let $\mathcal{P}$ be the set of problems and $S_i \in \{0,1\}$ an indicator variable with $S_i = 1$ if problem $i$ is solved at any turn $t \leq K$. Accuracy is calculated as:

$$\text{Accuracy}=\frac{\sum_{i\in\mathcal{P}}S_{i}}{|\mathcal{P}|} \qquad (18)$$
•
Correction Uplift: This metric quantifies the model's capacity to self-correct. It represents the percentage of problems initially answered incorrectly that were subsequently solved in later turns. Let $\mathcal{P}_{\text{fail}} \subset \mathcal{P}$ denote the problems failed at turn $t=1$. The metric is defined as:

$$\text{Correction Uplift}=\frac{\sum_{i\in\mathcal{P}_{\text{fail}}}S_{i}}{|\mathcal{P}_{\text{fail}}|}\times 100\% \qquad (19)$$
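The two metrics above can be computed directly from a per-problem record of turn outcomes. A minimal sketch, assuming a boolean matrix `solved[i][t]` (True iff problem $i$ was solved at turn $t+1$, for up to $K$ turns); names here are illustrative, not the authors' implementation:

```python
def accuracy(solved):
    # S_i = 1 if problem i is solved at any turn t <= K (Eq. 18)
    indicators = [any(turns) for turns in solved]
    return sum(indicators) / len(indicators)


def correction_uplift(solved):
    # Restrict to problems failed at turn 1 (P_fail), then report the
    # share eventually solved, as a percentage (Eq. 19).
    failed_first = [turns for turns in solved if not turns[0]]
    if not failed_first:
        return 0.0
    recovered = sum(any(turns) for turns in failed_first)
    return 100.0 * recovered / len(failed_first)
```

For example, with three problems where one is solved immediately, one is recovered at a later turn, and one is never solved, accuracy is 2/3 and correction uplift is 50%.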
Efficiency Metrics.
To measure computational overhead, we track:
•
Avg Time: The average wall-clock time required to solve a problem.
•
Peak GPU Memory: The maximum VRAM usage observed during the inference and update process.
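A minimal harness for the Avg Time metric (a hypothetical helper, not the authors' code), where `solve` stands in for one full multi-turn interaction; in a CUDA setting, peak GPU memory would be read separately, e.g. via `torch.cuda.max_memory_allocated()`:

```python
import time


def average_solve_time(problems, solve):
    # Average wall-clock seconds per problem across the benchmark.
    start = time.perf_counter()
    for problem in problems:
        solve(problem)
    return (time.perf_counter() - start) / len(problems)
```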
C.4
Reward Models
We employed two distinct reward mechanisms to simulate varying feedback granularities found in real-world applications.
Rule-Based Reward Model (Sparse Feedback).
This model simulates scenarios with definitive, binary judgments. It programmatically extracts the final answer (e.g., from a \boxed{} environment) and matches it against the ground truth. A reward of +1.0 is assigned for an exact match, and -1.0 otherwise. The core implementation logic is provided below.
Core logic for the rule-based reward model
```python
class MathVerifyRewardModel:
    def __init__(self, ground_truth_answer: str):
        self.ground_truth_answer = ground_truth_answer

    def get_reward(self, response_text: str) -> float:
        # Returns +1.0 for an exact match, -1.0 otherwise
        if compute_score(response_text, self.ground_truth_answer) == 1.0:
            return 1.0
        return -1.0


def compute_score(solution_str, ground_truth) -> float:
    # last_boxed_only_string, remove_boxed, and is_equiv are the
    # standard MATH evaluation helpers.
    retval = 0.0
    try:
        string_in_last_boxed = last_boxed_only_string(solution_str)
        if string_in_last_boxed is not None:
            answer = remove_boxed(string_in_last_boxed)
            if is_equiv(answer, ground_truth):
                retval = 1.0
    except Exception:
        pass
    return retval
```
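The listing above calls the MATH evaluation helpers `last_boxed_only_string` and `remove_boxed` without showing them. A minimal, self-contained sketch of their behavior (a simplified reimplementation, not the exact upstream code, which handles additional edge cases):

```python
def last_boxed_only_string(s):
    """Return the last \\boxed{...} substring of s, or None."""
    idx = s.rfind("\\boxed{")
    if idx < 0:
        return None
    depth = 0
    # Scan forward from the opening brace, tracking brace nesting so
    # that expressions like \boxed{\frac{1}{2}} are captured whole.
    for i in range(idx + len("\\boxed"), len(s)):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:
                return s[idx:i + 1]
    return None  # unbalanced braces


def remove_boxed(s):
    """Strip the surrounding \\boxed{...} wrapper."""
    prefix = "\\boxed{"
    if s.startswith(prefix) and s.endswith("}"):
        return s[len(prefix):-1]
    raise ValueError("not a \\boxed{...} string")
```

For instance, `remove_boxed(last_boxed_only_string("so the answer is \\boxed{42}."))` yields the string "42", which is then compared against the ground truth.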