📄 Paper 2511.10648
by Jiahao Wang


Citations: High Impact
Year: 2025
Venue: arXiv
FNI Rank: Top 18%
Paper Information Summary
Entity Passport
Registry ID: arxiv-paper--2511.10648
Provider: arXiv
📜 Cite this paper

Academic & Research Attribution

BibTeX
@misc{arxiv_paper__2511.10648,
  author = {Jiahao Wang},
  title = {Paper 2511.10648},
  year = {2025},
  howpublished = {\url{https://arxiv.org/abs/2511.10648v1}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
Wang, J. (2025). Paper 2511.10648 [Paper]. Free2AITools. https://arxiv.org/abs/2511.10648v1

🔬 Technical Deep Dive

Full Specifications

âš–ī¸ Nexus Index V2.0

0.0
TOP 18% SYSTEM IMPACT
Semantic (S) 50
Authority (A) 0
Popularity (P) 0
Recency (R) 0
Quality (Q) 0

💬 Index Insight

FNI V2.0 for Paper 2511.10648: Semantic (S:50), Authority (A:0), Popularity (P:0), Recency (R:0), Quality (Q:0).


📝 Executive Summary

"Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We..."

❝ Cite Node

@article{Wang2025ArXiv,
  title={ArXiv 2511.10648 Technical Profile},
  author={Jiahao Wang and Weiye Xu and Aijun Yang and Wengang Zhou and Lewei Lu and Houqiang Li and Xiaohua Wang and Jinguo Zhu},
  journal={arXiv preprint arXiv:2511.10648},
  year={2025}
}

👥 Collaborating Minds

Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu

Abstract & Analysis

Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting, a dominant format for multimodal reasoning benchmarks, the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and the REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
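
The abstract specifies SCS only at a high level. The Python sketch below illustrates the core idea under stated assumptions: the Trajectory container, the agreement-fraction consistency score, the multiplicative down-weighting, and the GRPO-style group baseline are illustrative choices, not the paper's implementation; in particular, the paper's consistency score is differentiable, and its visual-perturbation and truncation-resampling procedure is not reproduced here.

from dataclasses import dataclass

@dataclass
class Trajectory:
    text: str      # full chain-of-thought text
    answer: str    # extracted multiple-choice option, e.g. "B"
    reward: float  # outcome reward: 1.0 if the answer is correct, else 0.0

def consistency_score(initial: Trajectory, resamples: list[Trajectory]) -> float:
    """Fraction of perturbed/resampled trajectories whose final answer agrees
    with the initial trajectory's answer. A trace that lands on the right
    option by luck after a faulty chain of thought tends to be unstable under
    perturbation and resampling, so it scores low."""
    if not resamples:
        return 1.0
    agree = sum(t.answer == initial.answer for t in resamples)
    return agree / len(resamples)

def scs_weighted_reward(initial: Trajectory, resamples: list[Trajectory]) -> float:
    """Down-weight unreliable traces. The paper folds a differentiable score
    into the policy update; this multiplicative form is the simplest stand-in."""
    return initial.reward * consistency_score(initial, resamples)

def group_relative_advantages(weighted_rewards: list[float]) -> list[float]:
    """GRPO-style group baseline (an assumption about the integration): each
    trajectory's advantage is its SCS-weighted reward minus the group mean,
    so guessed-but-inconsistent traces stop earning credit."""
    mean = sum(weighted_rewards) / len(weighted_rewards)
    return [r - mean for r in weighted_rewards]

# Example: the initial trace guessed "B" correctly, but only 1 of 4
# perturbed resamples agrees, so its effective reward drops to 0.25.
init = Trajectory(text="...faulty CoT...", answer="B", reward=1.0)
checks = [Trajectory("...", "A", 0.0), Trajectory("...", "C", 0.0),
          Trajectory("...", "B", 1.0), Trajectory("...", "D", 0.0)]
print(scs_weighted_reward(init, checks))  # 0.25

In an RLOO/GRPO/REINFORCE++-style update, these weighted rewards would replace the raw outcome rewards when computing advantages, so only answers that survive visual perturbation and mid-trace resampling steer the gradient.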

🔄 Daily sync (03:00 UTC)

AI Summary: Based on Hugging Face metadata. Not a recommendation.

📊 FNI Methodology · 📚 Knowledge Base · ℹ️ Verify with original source

đŸ›Ąī¸ Paper Transparency Report

Verified data manifest for traceability and transparency.

100% Data Disclosure Active

🆔 Identity & Source

id: arxiv-paper--2511.10648
author: Jiahao Wang
tags: arxiv:cs.CV, llm

âš™ī¸ Technical Specs

architecture
null
params billions
null
context length
null

📊 Engagement & Metrics

likes: 0
downloads: 0

Free2AITools Constitutional Data Pipeline: Curated disclosure mode active. (V15.x Standard)