Avoiding Premature Collapse: Adaptive Annealing for Entropy-Regularized Structural Inference
Yizhi Liu Department of Computer Science, Stony Brook University [email protected]
(January 23, 2026)
Abstract
Differentiable matching layers and residual connection paradigms, often implemented via entropy-regularized Optimal Transport (OT) (Cuturi, 2013; Peyré and Cuturi, 2019), serve as critical mechanisms in structural prediction and architectural scaling. However, recovering discrete permutations or maintaining identity mappings by annealing $\epsilon \to 0$ is notoriously unstable (Peyré and Cuturi, 2019). We identify a fundamental mechanism for this failure: Premature Mode Collapse. By analyzing the non-normal dynamics of the Sinkhorn fixed-point map (Trefethen and Embree, 2005), we reveal a thermodynamic speed limit: standard exponential cooling outpaces the contraction rate of the inference operator, which degrades as $O(1/\epsilon)$ (Cominetti and San Martín, 1994; Weed, 2018). To address this, we propose Efficient Piecewise Hybrid Adaptive Stability Control (EPH-ASC), an adaptive scheduling algorithm that monitors the stability of the inference process. We demonstrate that EPH-ASC is essential for stabilizing Manifold-Constrained Hyper-Connections (mHC) (DeepSeek-AI et al., 2025) during large-scale training on the FineWeb-Edu dataset, preventing late-stage gradient explosions by enforcing a linear stability law.
1 Introduction

Entropy-regularized Optimal Transport (OT) has become a standard surrogate for combinatorial inference (Cuturi, 2013; Altschuler et al., 2017) and modern macro-architecture design (DeepSeek-AI et al., 2025; Zhu et al., 2024). Practitioners often attempt to recover hard assignments or specialized routing by annealing the regularization parameter $\epsilon \to 0$ (Genevay et al., 2018; Peyré and Cuturi, 2019). Recently, studies such as Manifold-Constrained Hyper-Connections (mHC) (DeepSeek-AI et al., 2025) have utilized the Sinkhorn-Knopp algorithm (Sinkhorn and Knopp, 1967) to project residual mappings onto the Birkhoff polytope, thereby restoring the identity mapping property in multi-stream architectures (DeepSeek-AI et al., 2025).
Empirically, however, this cooling process is fragile (Peyré and Cuturi, 2019; Blondel et al., 2018). As $\epsilon$ decreases, the sensitivity of the optimal plan to cost perturbations blows up as $O(1/\epsilon)$ (Cominetti and San Martín, 1994; Weed, 2018), a phenomenon we define as the Thermodynamic Speed Limit. In the presence of high-variance gradients from real-world datasets like FineWeb-Edu, standard exponential annealing (Chizat, 2024) violates this limit, causing the composite mapping in HC architectures to diverge or collapse (DeepSeek-AI et al., 2025). We propose EPH-ASC to reconcile this instability by monitoring the Primal Drift (Weed, 2018; Eisenberger et al., 2022) and triggering a "Thermodynamic Pause" when the distributional shift exceeds the solver's restoring capacity.
2 The Mechanism of Inference Collapse

2.1 Geometric Picture: Basin Shrinkage

As the temperature $\varepsilon \to 0$, the entropic map $\mathcal{S}_{\varepsilon}(C)$ sharpens soft belief states into near-permutations. Geometrically, the transport polytope decomposes into basins of attraction around permutation vertices. The failure mode we study, early locking, occurs when the current inference state $P_t$ is attracted into a wrong basin because the optimal posterior drifted faster than the fixed-point iteration could correct. Figure 1 visualizes this premature collapse.
Figure 1: Premature Mode Collapse. Standard annealing (blue) breaches the stability threshold $R$ (dotted), causing early locking into a spurious mode. Ours (red) detects the stability violation and pauses cooling. (Simulation)
2.2 Thermodynamic Sensitivity ($1/\varepsilon$)

To quantify the "velocity" of the distribution shift, we adopt a localized non-degeneracy assumption that fixes an active support set $S$ for small $\varepsilon$. Under this assumption, the sensitivity of the Sinkhorn map with respect to $\varepsilon$ scales like $1/\varepsilon$. This follows from implicit differentiation of the entropic optimality conditions (Cominetti and San Martín, 1994; Weed, 2018).

Remark (Linear Stability Scaling). Since the restoring force of the inference operator collapses and the sensitivity scales as $O(1/\varepsilon)$, the radius of the effective stability basin shrinks proportionally with $\varepsilon$. This suggests that the permissible distributional shift $\tau_t$ must follow a linear scaling law $\tau_t \propto \varepsilon$, a property we exploit in Section 4 to design an efficient adaptive schedule.
2.3 Transient Inference Error and Pseudospectra

Linearizing the Sinkhorn fixed-point map yields a Jacobian $J_{\varepsilon}$. While its eigenvalues describe asymptotic contraction, its non-normal structure creates a "shear" effect that can amplify transient inference errors. Pseudospectral theory quantifies how contours of the resolvent extend beyond the spectral radius (Trefethen and Embree, 2005). We show (Theorem A.2) that the modal condition number $\kappa(V)$ of the Jacobian effectively compresses the basin of attraction by a factor of roughly $\kappa(V)$.
Proposition 2.1 (Linear Scaling of the Stability Basin).

Let the cost matrix $C$ satisfy the localized non-degeneracy assumption (stable active support $S$). For sufficiently small $\epsilon$, the spectral gap of the Sinkhorn Jacobian $J_{\epsilon}$ satisfies $1-\rho(J_{\epsilon}) \geq \gamma\cdot\epsilon$. Consequently, to ensure the inference error $e_t$ remains within a local linearization region $R(\epsilon)$ (whose size may depend on $\epsilon$ but vanishes no faster than linearly as $\epsilon \to 0$), the permissible drift $\tau_{\max}$ must scale linearly:

$$\tau_{\max}(\epsilon) \leq \frac{R(\epsilon)\cdot\gamma}{\kappa(V)}\cdot\epsilon \tag{1}$$
(a) Macroscopic: The Closing Funnel. As $\epsilon \to 0$, the stability basin (blue) shrinks. Collapse occurs when the distributional shift exceeds the basin radius (red "X").

(b) Microscopic: Non-normal Instability. The $\epsilon$-pseudospectrum contours extend beyond the unit circle, quantifying the transient error amplification.

Figure 2: The Dual View of Inference Collapse. (a) Geometric intuition: the inference fails when the target drifts faster than the shrinking basin allows. (b) Spectral reality: this shrinkage is quantified by the non-normal pseudospectrum.
3 Theoretical Analysis: The Thermodynamic Speed Limit

To understand the mechanism of premature collapse, we model the annealing process as a discrete-time tracking problem. The algorithm must maintain the iterate $P_t$ within the basin of attraction of the moving fixed point $P^*_{\epsilon_t}$.
3.1 Problem Setup and Sinkhorn Dynamics

Let $\epsilon_t$ be the annealing schedule with step size $\delta_t := \epsilon_t - \epsilon_{t+1}$. The inference dynamics are governed by the recurrence $P_{t+1} = \mathcal{S}_{\epsilon_{t+1}}(C, P_t)$. We analyze the tracking error $e_t := P_t - P^*_{\epsilon_t}$ relative to the linearization radius $R$ of the basin.

We rely on the localized non-degeneracy of the cost matrix (Assumption A.1 in the Appendix), which implies the following fundamental properties of the Sinkhorn operator:
Proposition 3.1 (Sinkhorn Dynamics).

Under the localized non-degeneracy assumption, for sufficiently small $\epsilon$:
1. Sensitivity: the fixed-point drift scales as $\|\nabla_{\epsilon}P^*_{\epsilon}\| = \Theta(\epsilon^{-1})$.
2. Restoring force: the spectral gap of the Jacobian $J_{\epsilon}$ vanishes as $1-\rho(J_{\epsilon}) = \Theta(\epsilon)$.
3. Resolvent growth: the resolvent norm scales as $\|(I-J_{\epsilon})^{-1}\| = \Theta(\epsilon^{-1})$.
Proof.
See Appendix A.6 for the sensitivity analysis and Appendix A.2 for the spectral bounds. ∎
3.2 The Adiabatic Tracking Theorem

We now derive the necessary condition for the inference trajectory to remain stable. The error evolves as a competition between the distributional shift (drift) caused by $\delta_t$ and the solver's contraction (restoring force).
Theorem 3.2 (Thermodynamic Speed Limit).

Consider the tracking error dynamics linearized around the fixed-point path. For the error to remain uniformly bounded within the basin radius $R$ (i.e., $\limsup_{t\to\infty}\|e_t\| \leq R$), the annealing step size $\delta_t$ must satisfy:

$$\delta_t \leq O(\epsilon_t^2)\cdot R \tag{2}$$

Specifically, the schedule must be at least quadratic ($\delta_t \propto \epsilon^2$) to counteract the $O(\epsilon^{-2})$ amplification caused by the ratio of sensitivity to contraction.
Corollary 3.3 (Inevitability of Collapse for Exponential Schedules).

Standard exponential annealing $\epsilon_{t+1} = \alpha\epsilon_t$ implies a linear step size $\delta_t = (1-\alpha)\epsilon_t \propto \epsilon_t$. Since the stability condition requires $\delta_t \propto \epsilon_t^2$, exponential annealing violates the speed limit by a factor of $1/\epsilon_t$. As $\epsilon_t \to 0$, the tracking error diverges relative to the basin radius, rendering mode collapse theoretically inevitable.
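The violation is easy to see numerically. The following minimal sketch (illustrative only; the constants $\epsilon_0$, $\alpha$, and $c$ are assumed values, not taken from our experiments) compares exponential cooling against the largest speed-limit-compliant schedule, $\delta_t = c\,\epsilon_t^2$, and prints the violation factor $\delta_t/\epsilon_t^2$:

```python
# Sketch: exponential cooling vs. the quadratic-step schedule allowed by
# Theorem 3.2. All constants here (eps0, alpha, c) are illustrative assumptions.
import numpy as np

def exponential_schedule(eps0, alpha, steps):
    eps = np.empty(steps)
    eps[0] = eps0
    for t in range(steps - 1):
        eps[t + 1] = alpha * eps[t]            # delta_t = (1 - alpha) * eps_t
    return eps

def speed_limited_schedule(eps0, c, steps):
    eps = np.empty(steps)
    eps[0] = eps0
    for t in range(steps - 1):
        eps[t + 1] = eps[t] - c * eps[t] ** 2  # delta_t = c * eps_t^2 (Eq. 2)
    return eps

for schedule in (exponential_schedule(1.0, 0.95, 200),
                 speed_limited_schedule(1.0, 0.05, 200)):
    delta = schedule[:-1] - schedule[1:]
    ratio = delta / schedule[:-1] ** 2          # violation factor delta_t / eps_t^2
    print(ratio[0], ratio[-1])
```

For the exponential schedule the ratio grows like $(1-\alpha)/\epsilon_t$, i.e. without bound, while the quadratic-step schedule keeps it pinned at $c$.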
The proofs of Theorem 3.2 and Corollary 3.3 are provided in Appendix B.
4 Method: Efficient Piecewise Hybrid ASC

We propose Efficient Piecewise Hybrid Adaptive Stability Control (EPH-ASC) to reconcile topological stability with computational efficiency. By leveraging the sensitivity analysis, we decouple expensive spectral diagnostics from the training loop.
4.1 Approximating the Stability Constraint

Precise verification requires computing the spectral radius $\rho(J_{\epsilon})$, incurring $\mathcal{O}(N^3)$ cost. However, since the sensitivity scales deterministically as $\mathcal{O}(1/\epsilon)$, the permissible drift threshold $\tau_t$ must follow a corresponding linear law. We approximate the stability constraint by enforcing a limit on the distributional shift:

$$\|\Delta_t\|_F \leq \tau_{\max}(\epsilon) \approx k_{\mathrm{safe}}\cdot\epsilon_t \tag{3}$$

where $k_{\mathrm{safe}}$ is a dataset-specific safety slope. This approximation captures the essential dynamics: as the system "stiffens" ($\epsilon \to 0$), the tolerable shift must decrease proportionally.
4.2 Two-Phase Protocol

Phase I: Calibration (Offline). We execute a diagnostic oracle (QSA) on a proxy subset using an aggressive schedule to intentionally trigger mode collapse. We record the drift-to-temperature ratio at the moment of topological collapse to estimate $k_{\mathrm{safe}}$.

Phase II: Runtime Control (Adaptive Annealing). During training, the controller monitors the instantaneous shift $\|\Delta_t\|_F$ and enforces Eq. 3, as sketched in the code after this list:

- Stable state ($\|\Delta_t\|_F \leq k_{\mathrm{safe}}\cdot\epsilon_t$): the trajectory is safe. Proceed with standard cooling.
- Unstable state ($\|\Delta_t\|_F > k_{\mathrm{safe}}\cdot\epsilon_t$): the distributional shift exceeds the basin capacity. The controller triggers a "Thermodynamic Pause" (braking), holding $\epsilon_{t+1} \leftarrow \epsilon_t$ constant. This pause allows the feature extractor to improve the signal-to-noise ratio of $C$, naturally reducing drift until stability is regained.
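A minimal sketch of the runtime controller (an illustration of the logic above, not our released implementation; the `project` callable standing in for the Sinkhorn layer and the default constants are assumptions):

```python
# Sketch of the Phase II controller: cool when the drift respects Eq. (3),
# pause ("Thermodynamic Brake") when it does not. `project(eps)` is a
# placeholder for the Sinkhorn layer producing the current plan; k_safe is
# the slope calibrated offline in Phase I.
import numpy as np

def eph_asc_anneal(project, eps0=1.0, eps_min=1e-3, k_safe=0.5,
                   alpha=0.95, max_steps=1000):
    eps = eps0
    P_prev = project(eps)
    for _ in range(max_steps):
        P_curr = project(eps)
        drift = np.linalg.norm(P_curr - P_prev)   # ||Delta_t||_F
        if drift <= k_safe * eps:                 # stable: standard cooling
            eps = max(alpha * eps, eps_min)
        # else: unstable -> hold eps constant (Thermodynamic Pause)
        P_prev = P_curr
        if eps <= eps_min:
            break
    return eps
```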
Figure 3: Mechanism of Adaptive Stability Control. The interplay between the primal drift $\|\Delta_t\|$ (red) and the stability threshold (dashed). The Stability Braking zone (yellow) visualizes the algorithm strictly enforcing the thermodynamic speed limit: the controller detects imminent divergence and pauses cooling, preventing Premature Mode Collapse.
5 Experiments

We validate EPH-ASC on SPair-71k, a benchmark for semantic keypoint matching under view variation.
5.1 Setup and Baselines

We employ a ResNet-50 backbone with a Sinkhorn matching layer. We compare:
- Standard Log-Space: exponential annealing ($\alpha = 0.95$), serving as a baseline.
- Gumbel-Sinkhorn: stochastic exploration via Gumbel noise.
- EPH-ASC (Ours): deterministic controller with $k_{\mathrm{safe}} = 0.5$.
5.2 Results: Entropy and Convergence

Preventing Collapse. Figure 4 (left) demonstrates the trap. Standard annealing (blue) collapses early (epoch $\approx 20$), causing gradients to vanish and accuracy to flatline. The aggressive sharpening forces the plan into a spurious basin.

Preserving Uncertainty. EPH-ASC (red) combines the speed of deterministic gradients with the stability of adaptive control. As shown in Figure 4 (right), the controller detects the drift spike and triggers "Stability Braking". By holding the temperature constant, it preserves entropy (uncertainty) and prevents the hard assignment from forming prematurely. Once the features mature, annealing resumes, achieving target accuracy in 47 epochs, a $1.60\times$ speedup over Gumbel-Sinkhorn.
Figure 4: Training Dynamics on SPair-71k. Left: standard annealing (blue) hits the trap, causing gradient collapse; Gumbel-Sinkhorn (green) is stable but converges slowly due to variance; EPH-ASC (red) achieves the fastest convergence. Right: the adaptive mechanism. Drift spikes (gray) trigger the braking zone (yellow), holding $\epsilon$ constant (red line) to maintain thermodynamic stability.
Table 1: Efficiency on SPair-71k. EPH-ASC achieves a $\mathbf{1.60\times}$ speedup over Gumbel-Sinkhorn with negligible overhead ($0.51\%$), while standard annealing fails.

| Method | Epochs to 90% | Speedup | Layer Overhead | Training Overhead |
| --- | --- | --- | --- | --- |
| Standard (Log-Space) | Failed ($>100$) | N/A | 0.00% | 0.00% |
| Gumbel-Sinkhorn | 75 | 1.0$\times$ | $\approx$0.00% | $\approx$0.00% |
| EPH-ASC (Ours) | 47 | 1.60$\times$ | 0.51% | 0.05% |
5.3 Thermodynamic Robustness in Large-Scale LM Training

To rigorously assess the robustness of EPH-ASC under real-world conditions characterized by high gradient variance, we scale the experiment to a language modeling task using the FineWeb-Edu dataset.

Setup. We employ a lightweight NanoGemma architecture equipped with Manifold-Constrained Hyper-Connections (mHC). Unlike controlled benchmarks, this experiment utilizes a GPT-2 tokenizer and natural language data, introducing heavy-tailed noise into the optimization landscape. We compare a standard exponential schedule (Naive) against EPH-ASC over 1,000 steps.

Late-Stage Collapse. As shown in Fig. 5 (left), the Naive schedule (red) appears successful for 98% of the trajectory but suffers a catastrophic gradient explosion at step 980. This confirms that Sinkhorn instability is a latent risk that manifests when $\epsilon$ breaches the operator's condition number limit.

Adaptive Intervention. In contrast, EPH-ASC (green) demonstrates remarkable sensitivity to gradient noise. Despite the jagged loss landscape, the controller detects critical distributional drift at step 640, long before visible loss degradation. By triggering "Thermodynamic Braking" (orange markers), it locks the temperature at $\epsilon \approx 0.04$, securing a 340-step safety margin. The entropy plot (Fig. 5, right) further reveals that EPH-ASC maintains a stable low-entropy regime, preventing the numerical underflow observed in the baseline.

Figure 5: Verification on FineWeb-Edu. Left: real-world training loss shows deceptive stability followed by sudden Naive collapse. Center: EPH-ASC detects instability at step 640, creating a large safety margin via braking. Right: entropy preservation prevents numerical underflow in the feature maturity phase.
6 Conclusion

We identified Premature Mode Collapse as a fundamental thermodynamic failure in which the distributional shift exceeds the inference operator's contraction rate. EPH-ASC leverages the derived $\mathcal{O}(\epsilon)$ stability law to resolve this via a lightweight adaptive schedule.
Appendix A: Notation, precise matrix bounds, and detailed proofs
A.1 Preliminaries and notation

We work with discrete measures of size $n$ and a finite cost matrix $C \in \mathbb{R}^{n\times n}$. For $\varepsilon > 0$ define

$$P^{\varepsilon} = \mathcal{S}_{\varepsilon}(C) = \operatorname{diag}(u^{\varepsilon})\,K^{\varepsilon}\,\operatorname{diag}(v^{\varepsilon}), \qquad K^{\varepsilon}_{ij} = \exp(-C_{ij}/\varepsilon).$$

Dual potentials are written $f^{\varepsilon} = \varepsilon\log u^{\varepsilon}$, $g^{\varepsilon} = \varepsilon\log v^{\varepsilon}$. Row and column marginals:

$$r^{\varepsilon} := P^{\varepsilon}\mathbf{1}, \qquad c^{\varepsilon} := (P^{\varepsilon})^{\top}\mathbf{1}.$$

We use $\|\cdot\|_2$ for the Euclidean vector norm, $\|\cdot\|_F$ for the Frobenius norm, and $\|\cdot\|_{\mathrm{op}}$ for the spectral/operator norm.
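For concreteness, a minimal log-domain Sinkhorn sketch producing these objects (uniform marginals and a fixed iteration count are simplifying assumptions; this is not the paper's released code):

```python
# Log-domain Sinkhorn: returns P^eps together with the dual potentials
# f = eps*log(u), g = eps*log(v). Log-sum-exp avoids kernel underflow.
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(C, eps, iters=500):
    n = C.shape[0]
    log_marg = -np.log(n) * np.ones(n)   # uniform row/column marginals
    f = np.zeros(n)
    g = np.zeros(n)
    for _ in range(iters):
        # Alternating dual updates enforcing the row and column marginals.
        f = eps * (log_marg - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_marg - logsumexp((f[:, None] - C) / eps, axis=0))
    P = np.exp((f[:, None] + g[None, :] - C) / eps)
    return P, f, g
```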
We assume the localized non-degeneracy/support-stability condition used in the main text, restated here for convenience.
Assumption A.1 (Localized non-degeneracy / support stability).

There exist $\varepsilon_{\min} > 0$, $\eta > 0$ and an index set $S \subseteq \{1,\dots,n\}\times\{1,\dots,n\}$ ("active support") such that for all $0 < \varepsilon \leq \varepsilon_{\min}$,

$$\min_{(i,j)\in S} P^{\varepsilon}_{ij} \geq \eta, \qquad \max_{(i,j)\notin S} P^{\varepsilon}_{ij} \leq \tau(\varepsilon),$$

where $\tau(\varepsilon) \to 0$ as $\varepsilon \downarrow 0$, and the support pattern on $S$ is constant for all $\varepsilon \in (0, \varepsilon_{\min}]$.

Assumption A.1 is a local-in-$\varepsilon$ and local-in-$C$ condition. Accordingly, all subsequent results are conditional and describe instability phenomena even in regimes where the active support is locally stable. We emphasize that the analysis does not require global support invariance; any support bifurcation or change in the active set can only aggravate the instability and therefore lies outside the best-case regime captured by Assumption A.1. Thus our result is a lower bound on failure: even in the best-behaved regime, exponential annealing fails.

Remarks:
- $S$ is the set of index pairs that carry the asymptotically non-vanishing mass; the assumption is compatible with many cost matrices that induce a unique (or well-separated) matching pattern.
- All subsequent constants are expressed in terms of $\eta$, $n$ and (where required) norms of $P^{\varepsilon}$; where a uniform-in-$\varepsilon$ bound is used we explicitly require $\varepsilon \leq \varepsilon_0$ for a (possibly smaller) $\varepsilon_0 \leq \varepsilon_{\min}$.
A.2 Transient Stability Analysis: The Resolvent View
Theorem A.2 (Sensitivity-Stability Duality).

Let $J_{\epsilon} = D_P\mathcal{S}_{\epsilon}(C)$ be the Jacobian of the Sinkhorn fixed-point operator at the optimal plan $P^*$. Let $DS_{\epsilon}(C) = \frac{\partial P^*}{\partial C}$ denote the sensitivity of the optimal plan to cost perturbations. Regardless of whether $J_{\epsilon}$ is diagonalizable, the following inequality connects the inference sensitivity to the spectral properties of the solver:

$$\mathrm{dist}(1, \sigma(J_{\epsilon})) \leq \frac{\|\partial_C\Phi\|_{\mathrm{op}}}{\|DS_{\epsilon}(C)\|_{\mathrm{op}}} \tag{4}$$

where $\mathrm{dist}(1, \sigma(J)) = \min_{\lambda\in\sigma(J)}|1-\lambda|$ is the spectral distance to unit gain, and $\partial_C\Phi$ is the partial derivative of the update rule with respect to $C$. Consequently, if the sensitivity scales as $\|DS_{\epsilon}(C)\| \sim \Omega(1/\epsilon)$ (as shown in Lemma A.7), the spectral gap must vanish linearly: $1-\rho(J_{\epsilon}) \leq O(\epsilon)$.
Proof.

We start from the fixed-point equation $P^* = \mathcal{S}_{\epsilon}(P^*, C)$. Differentiating implicitly with respect to $C$ yields the linear system:

$$dP^* = D_P\mathcal{S}_{\epsilon}\cdot dP^* + \partial_C\mathcal{S}_{\epsilon}\cdot dC \tag{5}$$

$$(I-J_{\epsilon})\,dP^* = \partial_C\mathcal{S}_{\epsilon}\cdot dC \tag{6}$$

Assuming the system is locally stable ($\rho(J_{\epsilon}) < 1$), we can solve for the sensitivity:

$$DS_{\epsilon}(C) = (I-J_{\epsilon})^{-1}\,\partial_C\mathcal{S}_{\epsilon} \tag{7}$$

Taking the operator norm on both sides and using the submultiplicative property $\|AB\| \leq \|A\|\,\|B\|$:

$$\|DS_{\epsilon}(C)\| \leq \|(I-J_{\epsilon})^{-1}\|\cdot\|\partial_C\mathcal{S}_{\epsilon}\| \tag{8}$$

Recall that for any square matrix $A$, the operator norm of the inverse is the reciprocal of the smallest singular value: $\|A^{-1}\| = 1/\sigma_{\min}(A)$. Furthermore, by the variational characterization of singular values, $\sigma_{\min}(I-J_{\epsilon}) \leq |\lambda_{\min}(I-J_{\epsilon})| = \mathrm{dist}(1, \sigma(J_{\epsilon}))$. Therefore, we have the lower bound on the resolvent:

$$\|(I-J_{\epsilon})^{-1}\| = \frac{1}{\sigma_{\min}(I-J_{\epsilon})} \geq \frac{1}{\mathrm{dist}(1, \sigma(J_{\epsilon}))} \tag{9}$$

Substituting this back yields:

$$\|DS_{\epsilon}(C)\| \leq \frac{\|\partial_C\mathcal{S}_{\epsilon}\|}{\mathrm{dist}(1, \sigma(J_{\epsilon}))} \tag{10}$$

Rearranging terms gives the stated duality bound. This implies that high sensitivity necessitates a vanishing spectral gap, purely by operator-theoretic arguments, without assuming normality or diagonalizability. ∎
Corollary A.3 (The Basin Mismatch).

Under an exponential schedule $\epsilon_{t+1} = \alpha\epsilon_t$, the tracking error diverges.

Proof.

The drift magnitude is $\|\Delta_t\| \approx \|\frac{dP^*}{d\epsilon}\|\,\delta\epsilon \sim O(1/\epsilon)\cdot O(\epsilon) = O(1)$. The restoring capacity of the solver is governed by the spectral gap: $\mathrm{gap} \sim O(\epsilon)$. The steady-state tracking error $e_{ss}$ scales as:

$$e_{ss} \approx \frac{\text{Drift}}{\text{Gap}} \sim \frac{O(1)}{O(\epsilon)} = O(1/\epsilon)$$

However, the validity region of the linearization $R(\epsilon)$ scales as $O(\epsilon)$ due to the $1/\epsilon^2$ curvature of the Hessian (Lemma A.9). Collapse occurs when $e_{ss} > R(\epsilon)$, i.e., $O(1/\epsilon) > O(\epsilon)$, which is inevitable as $\epsilon \to 0$. ∎
A.3 Implicit differentiation linear system

Differentiating the marginal constraints, we define the block operator

$$\mathcal{A}(\varepsilon) = \begin{bmatrix}\operatorname{diag}(r^{\varepsilon}) & P^{\varepsilon}\\ (P^{\varepsilon})^{\top} & \operatorname{diag}(c^{\varepsilon})\end{bmatrix} \in \mathbb{R}^{2n\times 2n}. \tag{11}$$

For a generic perturbation (in $C$ or in $\varepsilon$) the derivative potentials $(\partial f, \partial g)$ satisfy a linear system of the form

$$\mathcal{A}(\varepsilon)\begin{bmatrix}\partial f\\ \partial g\end{bmatrix} = \frac{1}{\varepsilon}\,\mathcal{B}(\partial C) + \mathcal{R}(\partial\varepsilon), \tag{12}$$

where:
- $\mathcal{B}(\partial C)$ depends linearly on the entrywise change $\partial C$ and is $O(1)$ in $P^{\varepsilon}$ (i.e., it does not contain extra $1/\varepsilon$ factors);
- $\mathcal{R}(\partial\varepsilon)$ collects the terms from differentiating with respect to $\varepsilon$; again, it is $O(1)$ in $P^{\varepsilon}$.

Solving (12) yields

$$\begin{bmatrix}\partial f\\ \partial g\end{bmatrix} = \frac{1}{\varepsilon}\mathcal{A}(\varepsilon)^{-1}\mathcal{B}(\partial C) + \mathcal{A}(\varepsilon)^{-1}\mathcal{R}(\partial\varepsilon).$$

Substituting this into the entrywise derivative of $P^{\varepsilon}$ gives the decomposition used below.
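A small dense-case sketch of this solve (an illustration under assumptions: full support so the unreduced system applies, the perturbation is in $C$ only, and the gauge is fixed by pinning the last potential; the $(\partial\log u, \partial\log v)$ parametrization is our choice for concreteness):

```python
# Solve Eq. (12) for a cost perturbation dC (the d-eps terms are omitted):
# A(eps) [d log u; d log v] = (1/eps) B(dC), then map back to the primal
# response dP_ij = P_ij * (dlu_i + dlv_j - dC_ij / eps).
import numpy as np

def primal_response(P, dC, eps):
    n = P.shape[0]
    r, c = P.sum(axis=1), P.sum(axis=0)
    A = np.block([[np.diag(r), P], [P.T, np.diag(c)]])   # Eq. (11)
    b = np.concatenate([(P * dC).sum(axis=1),
                        (P * dC).sum(axis=0)]) / eps     # (1/eps) B(dC)
    # A is singular along the gauge direction (1, -1); pin the last potential.
    sol = np.linalg.solve(A[:-1, :-1], b[:-1])
    dlu, dlv = sol[:n], np.append(sol[n:], 0.0)
    return P * (dlu[:, None] + dlv[None, :] - dC / eps)
```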
A.4 Invertibility of $\mathcal{A}(\varepsilon)$ on the Active Support

As $\varepsilon \to 0$, the entropic optimal transport plan $P^{\varepsilon}$ concentrates its mass on the active support $S$ specified in Assumption A.1. Rows and columns not participating in $S$ may have marginals that vanish as $\varepsilon \to 0$, rendering the full $2n\times 2n$ operator $\mathcal{A}(\varepsilon)$ ill-conditioned. However, the sensitivity analysis carried out in Lemma A.9 only involves perturbations supported on the active entries. We therefore restrict attention to the reduced linear system induced by the active support, on which uniform stability can be established.
Definition A.4 (Active reduced system).

Let $S \subseteq \{1,\dots,n\}\times\{1,\dots,n\}$ denote the active support from Assumption A.1. Define the active row and column sets

$$I_S := \{i \mid \exists j,\ (i,j)\in S\}, \qquad J_S := \{j \mid \exists i,\ (i,j)\in S\}.$$

We define $\mathcal{A}_S(\varepsilon)$ as the restriction of $\mathcal{A}(\varepsilon)$ to the variables $(f_i)_{i\in I_S}$ and $(g_j)_{j\in J_S}$. All operators are considered on the gauge-fixed subspace

$$\left\{(f,g) : \sum_{i\in I_S} f_i = 0, \quad \sum_{j\in J_S} g_j = 0\right\}.$$
Lemma A.5 (Uniform invertibility on the active subspace).

Under Assumption A.1, there exist $\varepsilon_0 > 0$ and a finite constant $M(\eta, |S|)$ such that for all $0 < \varepsilon \leq \varepsilon_0$,

$$\|\mathcal{A}_S(\varepsilon)^{-1}\|_{\mathrm{op}} \leq M(\eta, |S|).$$

Proof.

For any active row $i\in I_S$, Assumption A.1 implies

$$r_i^{\varepsilon} = \sum_j P^{\varepsilon}_{ij} \geq \sum_{j:(i,j)\in S} P^{\varepsilon}_{ij} \geq \eta.$$

An analogous bound holds for all active columns $j\in J_S$. Hence all diagonal entries of $\mathcal{A}_S(\varepsilon)$ are uniformly bounded below by $\eta$.

Moreover, $\mathcal{A}_S(\varepsilon)$ coincides with the Hessian of the entropically regularized transport objective restricted to the face of the transport polytope induced by $S$. By Assumption A.1, this face remains fixed for all $\varepsilon \leq \varepsilon_0$, and the objective is strictly convex on the corresponding affine subspace modulo gauge invariance.

It follows that $\mathcal{A}_S(\varepsilon)$ is positive definite on the gauge-fixed subspace. Since the dimension of the reduced system depends only on $|S|$ and all entries are uniformly bounded, compactness yields a uniform inverse bound

$$\|\mathcal{A}_S(\varepsilon)^{-1}\|_{\mathrm{op}} \leq M(\eta, |S|). \qquad \text{∎}$$
Remark A.6.

Entries of $P^{\varepsilon}$ outside the active support $S$ are $O(\tau(\varepsilon))$ and do not enter the reduced system. Their influence on the active dynamics is therefore suppressed and does not affect the stability of $\mathcal{A}_S(\varepsilon)$.
A.5 Directional sensitivity lower bound

In this section we establish a lower bound on the sensitivity of the Sinkhorn map with respect to cost perturbations. Unlike a refined asymptotic expansion, our argument does not rely on a separation between leading and higher-order terms. Instead, we show that there exists a perturbation direction for which the Jacobian response grows at least on the order of $\varepsilon^{-1}$.

Lemma A.7 (Constructive operator-norm lower bound).
Let $S$ denote the active support set at the fixed point $P^{\star} = P^{\star}(C,\varepsilon)$ and assume the gauge-fixed linear system for the implicit derivative is of the form

$$A_S(\varepsilon)\,\Delta z = \frac{1}{\varepsilon}\,B_S[\Delta C], \tag{13}$$

where
- $A_S(\varepsilon) : \mathcal{Z}\to\mathcal{Z}$ is the linear operator (matrix) on the gauge-fixed dual-variable subspace $\mathcal{Z}$ arising from differentiating the Sinkhorn fixed-point equations with respect to the dual potentials;
- $B_S : \mathcal{C}_S \to \mathcal{Z}$ is the linear map that takes a perturbation of the cost restricted to the active support, $\Delta C \in \mathcal{C}_S$, to the right-hand side of (13);
- the perturbation of the primal fixed point on the active support, $\Delta P_S$, is related to $\Delta z$ by a bounded linear map

$$\Delta P_S = R_S[\Delta z], \tag{14}$$

for some linear operator $R_S : \mathcal{Z}\to\mathcal{P}_S$ (here $\mathcal{P}_S$ denotes the space of primal perturbations supported on $S$).

Assume furthermore that the following regularity conditions hold for sufficiently small $\varepsilon > 0$:

(a) (Invertibility) $A_S(\varepsilon)$ is invertible on $\mathcal{Z}$ and there exists $a_{\min} > 0$ such that

$$\sigma_{\min}\big(A_S(\varepsilon)\big) \geq a_{\min} > 0$$

uniformly for the range of $\varepsilon$ under consideration.

(b) (Non-degeneracy of cost-action) The composed operator

$$M(\varepsilon) := R_S\,A_S(\varepsilon)^{-1}\,B_S$$

is not identically zero as a map $\mathcal{C}_S \to \mathcal{P}_S$ (equivalently, there exists some $\Delta C \in \mathcal{C}_S$ with $M(\varepsilon)[\Delta C] \neq 0$).

Define the fixed-point sensitivity operator

$$DS_{\varepsilon}(C) : \mathcal{C}_S \to \mathcal{P}_S, \qquad DS_{\varepsilon}(C)[\Delta C] = \Delta P_S.$$

Then the following constructive lower bound holds:

$$\|DS_{\varepsilon}(C)\|_{\mathrm{op}} = \frac{1}{\varepsilon}\,\|M(\varepsilon)\|_{\mathrm{op}} \geq \frac{c(\varepsilon)}{\varepsilon}, \tag{15}$$

where $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm (induced by a chosen Euclidean norm on the finite-dimensional spaces) and

$$c(\varepsilon) := \|M(\varepsilon)\|_{\mathrm{op}} > 0.$$

Moreover, one may explicitly construct a test perturbation $\Delta C^{\star}$ (a singular vector of $M(\varepsilon)$) that attains (or approximates) the lower bound in (15):

$$\frac{\|DS_{\varepsilon}(C)[\Delta C^{\star}]\|}{\|\Delta C^{\star}\|} \approx \frac{\|M(\varepsilon)\|_{\mathrm{op}}}{\varepsilon}.$$

In particular, under the uniform lower bound $a_{\min} > 0$ on $\sigma_{\min}(A_S(\varepsilon))$, and if the operator norms $\|R_S\|$ and $\|B_S\|$ remain $O(1)$ as $\varepsilon \downarrow 0$, then $c(\varepsilon)$ is bounded away from zero and therefore

$$\|DS_{\varepsilon}(C)\|_{\mathrm{op}} \geq \frac{c_0}{\varepsilon}$$

for some constant $c_0 > 0$ independent of $\varepsilon$ (for all sufficiently small $\varepsilon$).
Proof.

The proof is constructive and algebraic; it follows directly from (13)-(14). From (13) and the invertibility of $A_S(\varepsilon)$ we obtain

$$\Delta z = A_S(\varepsilon)^{-1}\,\frac{1}{\varepsilon}\,B_S[\Delta C].$$

Using (14) we then have the representation

$$\Delta P_S = R_S[\Delta z] = R_S\,A_S(\varepsilon)^{-1}\,\frac{1}{\varepsilon}\,B_S[\Delta C] = \frac{1}{\varepsilon}\,M(\varepsilon)[\Delta C].$$

Since the map $\Delta C \mapsto \Delta P_S$ is exactly $DS_{\varepsilon}(C)$ (restricted to perturbations supported on $S$), the operator identity

$$DS_{\varepsilon}(C) = \frac{1}{\varepsilon}\,M(\varepsilon)$$

holds. Taking operator norms yields

$$\|DS_{\varepsilon}(C)\|_{\mathrm{op}} = \frac{1}{\varepsilon}\,\|M(\varepsilon)\|_{\mathrm{op}}.$$

By assumption (b), $M(\varepsilon) \neq 0$, hence its operator norm $c(\varepsilon) := \|M(\varepsilon)\|_{\mathrm{op}}$ is strictly positive, which proves (15).

To make the bound constructive, let $u \in \mathcal{C}_S$ be a (right) unit-norm singular vector corresponding to the top singular value of $M(\varepsilon)$, i.e.

$$\|M(\varepsilon)u\| = \|M(\varepsilon)\|_{\mathrm{op}}.$$

Choosing $\Delta C^{\star} = u$ yields

$$\frac{\|DS_{\varepsilon}(C)[\Delta C^{\star}]\|}{\|\Delta C^{\star}\|} = \frac{1}{\varepsilon}\,\frac{\|M(\varepsilon)u\|}{\|u\|} = \frac{\|M(\varepsilon)\|_{\mathrm{op}}}{\varepsilon},$$

thus achieving the claimed value.

Finally, to obtain a useful uniform-in-$\varepsilon$ lower bound, note that by the minimal-gain inequality for compositions,

$$\|M(\varepsilon)\|_{\mathrm{op}} \geq \sigma_{\min}(R_S)\,\sigma_{\min}\big(A_S(\varepsilon)^{-1}\big)\,\sigma_{\min}(B_S) = \frac{\sigma_{\min}(R_S)\,\sigma_{\min}(B_S)}{\|A_S(\varepsilon)\|_{\mathrm{op}}}.$$

Under the uniform spectral lower bound $\sigma_{\min}(A_S(\varepsilon)) \geq a_{\min} > 0$ we have $\|A_S(\varepsilon)^{-1}\| \leq 1/a_{\min}$; hence if $\|R_S\|$ and $\|B_S\|$ are $O(1)$ (and bounded below away from zero on the relevant directions), then $\|M(\varepsilon)\|_{\mathrm{op}}$ is bounded below by a positive constant $c_0 > 0$ independent of $\varepsilon$. Consequently $\|DS_{\varepsilon}(C)\|_{\mathrm{op}} \geq c_0/\varepsilon$ for sufficiently small $\varepsilon$.

This completes the proof. ∎
Remark.
- The operator $M(\varepsilon) = R_S A_S(\varepsilon)^{-1} B_S$ is explicitly constructible from the linearization matrices implicit in the fixed-point equations; thus $c(\varepsilon) = \|M(\varepsilon)\|_{\mathrm{op}}$ can be computed numerically in practice (compute $A_S(\varepsilon)$ from the linearized active system, invert it numerically, compose with $B_S$ and $R_S$, and then take the top singular value), as in the sketch after this remark. In the submission we follow this recipe to produce the reported numerical estimates.
- A simple explicit test perturbation that often suffices is to pick $\Delta C$ supported on a single (or a pair of) active entries: for an active index $(i,j)\in S$, choose $\Delta C$ equal to the corresponding canonical basis vector; typically $B_S[\Delta C] \neq 0$, and through the invertible $A_S(\varepsilon)$ and nontrivial $R_S$ this produces a nonzero $\Delta P_S$, certifying $M(\varepsilon) \neq 0$.
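A small-$n$ sketch of this recipe in the dense full-support case (reusing the hypothetical `primal_response` helper from Appendix A.3; materializing the map column by column is only feasible at illustration sizes):

```python
# Estimate ||DS_eps(C)||_op = ||M(eps)||_op / eps by materializing the
# sensitivity map dC -> dP against all canonical basis perturbations
# and taking the top singular value.
import numpy as np

def sensitivity_opnorm(P, eps):
    n = P.shape[0]
    DS = np.zeros((n * n, n * n))
    for k in range(n * n):
        E = np.zeros(n * n)
        E[k] = 1.0                                  # canonical perturbation e_k
        DS[:, k] = primal_response(P, E.reshape(n, n), eps).ravel()
    return np.linalg.svd(DS, compute_uv=False)[0]   # top singular value
```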
Lemma A.8 (Resolvent bound and spectral-distance consequence).

Let $\Phi(P, C)$ be the Sinkhorn fixed-point mapping in the active subspace (so that a fixed point $P^{\star}$ satisfies $P^{\star} = \Phi(P^{\star}, C)$), and denote by

$$J \coloneqq D_P\Phi(P^{\star}, C)$$

the Jacobian of $\Phi$ with respect to $P$ evaluated at the fixed point. Assume $I-J$ is invertible (equivalently $1 \notin \sigma(J)$). Define the linear operator

$$DS_{\varepsilon}(C) = \frac{\partial P^{\star}}{\partial C},$$

i.e. the sensitivity of the fixed point with respect to $C$ (the dependence on $\varepsilon$ is implicit). Let

$$\partial_C\Phi \coloneqq D_C\Phi(P^{\star}, C)$$

be the derivative of $\Phi$ with respect to $C$ at the fixed point, and denote by $\|\cdot\|$ the operator norm induced by the Euclidean norm (or any fixed matrix/operator norm).

Then the following hold.
1. (Identity) The implicit-differentiation identity

$$DS_{\varepsilon}(C) = (I-J)^{-1}\,\partial_C\Phi$$

is valid.

2. (Norm factorization) Consequently,

$$\|DS_{\varepsilon}(C)\| \leq \|(I-J)^{-1}\|\,\|\partial_C\Phi\|.$$

3. (Resolvent lower bound) Let $\sigma(J)$ denote the spectrum of $J$, and define the spectral distance

$$\mathrm{dist}\big(1, \sigma(J)\big) = \min_{\lambda\in\sigma(J)}|1-\lambda|.$$

Then

$$\|(I-J)^{-1}\| \geq \frac{1}{\mathrm{dist}\big(1, \sigma(J)\big)}.$$

Hence a divergence of $\|DS_{\varepsilon}(C)\|$ forces $\mathrm{dist}(1, \sigma(J)) \to 0$.

4. (Upper bound under diagonalizability / conditioning) If, additionally, $J$ is diagonalizable, i.e. $J = V\Lambda V^{-1}$ with $\Lambda = \operatorname{diag}(\lambda_i)$, then

$$\|(I-J)^{-1}\| = \big\|V(I-\Lambda)^{-1}V^{-1}\big\| \leq \kappa(V)\,\max_i\frac{1}{|1-\lambda_i|} = \frac{\kappa(V)}{\mathrm{dist}\big(1, \sigma(J)\big)},$$

where $\kappa(V) = \|V\|\,\|V^{-1}\|$ is the condition number of the modal matrix $V$. In particular, if $J$ is normal then $\kappa(V) = 1$ and the equality $\|(I-J)^{-1}\| = 1/\mathrm{dist}(1, \sigma(J))$ holds.

5. (Consequence for $\varepsilon$-scalings) Suppose there exist constants $M_{\Phi} > 0$ and $c > 0$ (independent of small $\varepsilon$) such that

$$\|\partial_C\Phi\| \leq M_{\Phi} \qquad\text{and}\qquad \|DS_{\varepsilon}(C)\| \geq \frac{c}{\varepsilon}$$

for sufficiently small $\varepsilon > 0$. Then from items (2) and (3) we obtain the quantitative bound

$$\mathrm{dist}\big(1, \sigma(J)\big) \leq \frac{M_{\Phi}}{\|DS_{\varepsilon}(C)\|} \leq \frac{M_{\Phi}}{c}\,\varepsilon.$$

Moreover, if $J$ is diagonalizable with uniformly bounded condition number $\kappa(V) \leq \kappa_{\max}$, then

$$\min_i|1-\lambda_i| = \mathrm{dist}\big(1, \sigma(J)\big) \leq \frac{M_{\Phi}}{c}\,\varepsilon,$$

and in particular the spectral radius satisfies

$$1-\max_i|\lambda_i| = 1-\lambda_{\max} \leq C\,\varepsilon$$

for some constant $C$ depending only on $M_{\Phi}$, $c$ and (if used) $\kappa_{\max}$.
Proof.

We give a self-contained proof of each item.

(Identity). Differentiate the fixed-point equation $F(P, C) \coloneqq P - \Phi(P, C) = 0$ with respect to $C$ at the point $(P^{\star}, C)$ in the direction $\Delta C$. The total derivative yields

$$D_P F\,[\Delta P] + D_C F\,[\Delta C] = 0,$$

i.e.

$$(I - D_P\Phi)\,\Delta P - \partial_C\Phi[\Delta C] = 0.$$

Rearranging gives

$$\Delta P = (I-J)^{-1}\,\partial_C\Phi[\Delta C],$$

and since this holds for arbitrary $\Delta C$, the identity $DS_{\varepsilon}(C) = (I-J)^{-1}\partial_C\Phi$ follows.

(Norm factorization). Taking operator norms of the identity gives

$$\|DS_{\varepsilon}(C)\| = \|(I-J)^{-1}\partial_C\Phi\| \leq \|(I-J)^{-1}\|\,\|\partial_C\Phi\|,$$

which proves item (2).

(Resolvent lower bound). Recall that for any invertible matrix $A$ the operator norm of its inverse equals the reciprocal of the smallest singular value:

$$\|A^{-1}\| = \frac{1}{\sigma_{\min}(A)}.$$

Apply this with $A = I-J$ to obtain $\|(I-J)^{-1}\| = 1/\sigma_{\min}(I-J)$. For any eigenvalue $\lambda\in\sigma(J)$ and any unit eigenvector $v$ for that eigenvalue (i.e. $Jv = \lambda v$ with $\|v\| = 1$), we have

$$\|(I-J)v\| = |1-\lambda|.$$

Therefore, by the variational characterization of the smallest singular value,

$$\sigma_{\min}(I-J) = \min_{\|x\|=1}\|(I-J)x\| \leq \min_{\lambda\in\sigma(J)}|1-\lambda| = \mathrm{dist}\big(1, \sigma(J)\big).$$

Taking reciprocals yields

$$\|(I-J)^{-1}\| = \frac{1}{\sigma_{\min}(I-J)} \geq \frac{1}{\mathrm{dist}\big(1, \sigma(J)\big)},$$

which proves item (3). This inequality is non-asymptotic and requires no normality or diagonalizability hypotheses.

(Upper bound under diagonalizability). If $J = V\Lambda V^{-1}$ with $\Lambda = \operatorname{diag}(\lambda_i)$ then

$$(I-J)^{-1} = V(I-\Lambda)^{-1}V^{-1}.$$

Hence

$$\|(I-J)^{-1}\| \leq \|V\|\,\|V^{-1}\|\,\|(I-\Lambda)^{-1}\| = \kappa(V)\,\max_i\frac{1}{|1-\lambda_i|} = \frac{\kappa(V)}{\mathrm{dist}\big(1, \sigma(J)\big)}.$$

If $J$ is normal then $V$ can be chosen unitary, $\kappa(V) = 1$, and the expression is exact: $\|(I-J)^{-1}\| = 1/\mathrm{dist}(1, \sigma(J))$.

(Consequence for $\varepsilon$-scalings). Combining items (2) and (3) we have

$$\|DS_{\varepsilon}(C)\| \leq \|(I-J)^{-1}\|\,\|\partial_C\Phi\| \leq \frac{\|\partial_C\Phi\|}{\mathrm{dist}(1, \sigma(J))}.$$

Rearranging gives

$$\mathrm{dist}\big(1, \sigma(J)\big) \leq \frac{\|\partial_C\Phi\|}{\|DS_{\varepsilon}(C)\|}.$$

Under the assumptions $\|\partial_C\Phi\| \leq M_{\Phi}$ and $\|DS_{\varepsilon}(C)\| \geq c/\varepsilon$, the displayed inequality becomes

$$\mathrm{dist}\big(1, \sigma(J)\big) \leq \frac{M_{\Phi}}{c}\,\varepsilon,$$

which proves the stated $O(\varepsilon)$ bound. If $J$ is diagonalizable with conditioning $\kappa(V) \leq \kappa_{\max}$, then the upper bound in item (4) implies the same proportional dependence for the individual eigenvalue gaps, and hence an $O(\varepsilon)$ bound on $1-\lambda_{\max}$ up to the multiplicative factor $\kappa_{\max}$.

This completes the proof. ∎
Remark.

The chain of inequalities above makes manifest two facts that are important for interpreting the sensitivity blow-up:
- The inequality $\|(I-J)^{-1}\| \geq 1/\mathrm{dist}(1, \sigma(J))$ always holds, and therefore any divergence of $\|DS_{\varepsilon}\|$ forces the spectrum of $J$ to approach the point $1$ in the complex plane (the spectral distance goes to zero).
- To upgrade the spectral-distance statement into a statement about the largest eigenvalue (for instance, to conclude $1-\lambda_{\max} \sim C\varepsilon$), one needs extra regularity such as bounded diagonalization conditioning $\kappa(V)$ or near-normality of $J$. In non-normal cases pseudospectral effects can make $\|(I-J)^{-1}\|$ much larger than $1/\mathrm{dist}(1, \sigma(J))$, so numerical spectral/pseudospectral diagnostics are recommended to validate any stronger asymptotic claim in practice.
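A minimal pseudospectral diagnostic along these lines (a sketch: `J` is any numerical estimate of the Sinkhorn Jacobian, and the grid bounds are arbitrary illustrative choices):

```python
# Evaluate the resolvent norm ||(zI - J)^{-1}|| = 1 / sigma_min(zI - J) on a
# grid near z = 1; level sets at 1/eps trace the eps-pseudospectrum, and
# large values far from the eigenvalues flag non-normal transient growth.
import numpy as np

def resolvent_grid(J, re=(0.7, 1.3), im=(-0.3, 0.3), m=60):
    I = np.eye(J.shape[0])
    xs, ys = np.linspace(*re, m), np.linspace(*im, m)
    R = np.empty((m, m))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            smin = np.linalg.svd((x + 1j * y) * I - J, compute_uv=False)[-1]
            R[i, j] = 1.0 / smin
    return xs, ys, R
```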
Lemma A.9 (Explicit bound on second-order remainder).

Let $\mathcal{H}(\varepsilon) = D^2\mathcal{S}_{\varepsilon}$ denote the Hessian tensor of the Sinkhorn map with respect to the cost matrix $C$ (and the parameter $\varepsilon$). Under Assumption A.1, for all $0 < \varepsilon \leq \varepsilon_0$,

$$\|D^2\mathcal{S}_{\varepsilon}\|_{\mathrm{op}} \leq \frac{K_{\mathrm{quad}}}{\varepsilon^2},$$

where $K_{\mathrm{quad}}$ depends on the active support size $|S|$, the non-degeneracy bound $\eta$, and the condition number $M$ from Lemma A.5. Consequently, the Taylor remainder for a perturbation $\delta = (\delta C, \delta\varepsilon)$ satisfies $\|\mathcal{R}(\delta)\| \leq \frac{K}{\varepsilon^2}\|\delta\|^2$.
Proof.

We derive the bound by explicitly differentiating the Jacobian operator derived in Appendix A.3. Recall that the first variation $\dot{P} = D\mathcal{S}_{\varepsilon}[\dot{C}]$ is determined by the linear system on the active support:

$$\mathcal{A}_S(\varepsilon)\begin{pmatrix}\dot{f}\\ \dot{g}\end{pmatrix} = \frac{1}{\varepsilon}\mathcal{B}(\dot{C}), \tag{16}$$

where $\mathcal{A}_S(\varepsilon)$ involves blocks of $P^{\varepsilon}$. To find the second derivative (Hessian), we differentiate (16) again with respect to the parameters. Let $\partial$ denote the differentiation operator. Applying the product rule to $\mathcal{A}_S\mathbf{z} = \mathbf{b}$ (where $\mathbf{z}$ are the dual potentials):

$$\mathcal{A}_S(\partial\mathbf{z}) + (\partial\mathcal{A}_S)\mathbf{z} = \partial\mathbf{b}.$$

Rearranging for the second-order variation $\partial\mathbf{z}$:

$$\partial\mathbf{z} = \mathcal{A}_S^{-1}\left(\partial\mathbf{b} - (\partial\mathcal{A}_S)\mathbf{z}\right).$$

We now bound the norms of each term on the right-hand side:
- Invertibility: by Lemma A.5, $\|\mathcal{A}_S^{-1}\|_{\mathrm{op}} \leq M$.
- First-order scaling: from Lemma A.7, the first-order potentials scale as $\|\mathbf{z}\| \sim O(1/\varepsilon)$.
- Derivative of the operator $\partial\mathcal{A}_S$: the matrix $\mathcal{A}_S$ contains the entries $P_{ij}^{\varepsilon}$ and $0$. Since $P_{ij}^{\varepsilon} = \exp((f_i + g_j - C_{ij})/\varepsilon)$, its derivative is:

$$\partial P_{ij}^{\varepsilon} = \frac{1}{\varepsilon}P_{ij}^{\varepsilon}(\partial f_i + \partial g_j - \partial C_{ij}) - \frac{1}{\varepsilon^2}P_{ij}^{\varepsilon}(\dots)\,\delta\varepsilon.$$

Crucially, differentiating the exponential map introduces a factor of $1/\varepsilon$. Thus, the operator derivative scales as:

$$\|\partial\mathcal{A}_S\|_{\mathrm{op}} \leq \frac{C_1}{\varepsilon}\|P^{\varepsilon}\|_{\mathrm{op}} \leq \frac{C_1 p_{\max}}{\varepsilon}.$$

Combining these factors in the expression for $\partial\mathbf{z}$:

$$\|\partial\mathbf{z}\| \leq M\left(O(1/\varepsilon^2) + \underbrace{\|\partial\mathcal{A}_S\|}_{\sim 1/\varepsilon}\cdot\underbrace{\|\mathbf{z}\|}_{\sim 1/\varepsilon}\right) \leq \frac{K'}{\varepsilon^2}.$$

Finally, the Hessian of the primal plan $P$ involves $\partial\mathbf{z}$ (a term of order $1/\varepsilon^2$) and products of first derivatives $(\dot{f}\dot{g})/\varepsilon^2$. Both contributions scale as $O(1/\varepsilon^2)$. The bound relies explicitly on $M$ staying finite (Assumption A.1), ensuring the $1/\varepsilon^2$ explosion is not "cancelled out" by a vanishing inverse. ∎
A.6 Refined Sensitivity Analysis: The $O(\epsilon^{-1})$ Scaling

In this section, we provide a rigorous derivation of the sensitivity of the optimal transport plan $P^*_{\epsilon}$ with respect to the regularization parameter $\epsilon$. We explicitly address why the sensitivity scales as $O(\epsilon^{-1})$ rather than the $O(\epsilon^{-2})$ scaling one might expect from the Gibbs kernel derivative.
Lemma A.10 (Thermodynamic Sensitivity Scaling).

Let $P^*_{\epsilon}$ be the unique solution to the entropy-regularized optimal transport problem with cost matrix $C$. Under the non-degeneracy assumption (Assumption A.1), the Frobenius norm of the derivative of the optimal plan with respect to temperature satisfies:

$$\left\|\frac{\partial P^*_{\epsilon}}{\partial\epsilon}\right\|_F = \Theta(\epsilon^{-1}) \quad\text{as } \epsilon\to 0. \tag{17}$$

Proof.

Recall the primal-dual relationship for the optimal plan entries on the active support:

$$P^*_{ij}(\epsilon) = \exp\left(\frac{f_i(\epsilon) + g_j(\epsilon) - C_{ij}}{\epsilon}\right), \tag{18}$$

where $f(\epsilon), g(\epsilon)$ are the optimal dual potentials. Taking the logarithm of both sides:

$$\log P^*_{ij}(\epsilon) = \frac{f_i(\epsilon) + g_j(\epsilon) - C_{ij}}{\epsilon}. \tag{19}$$

Differentiating both sides with respect to $\epsilon$ using the total derivative:

$$\frac{1}{P^*_{ij}}\frac{\partial P^*_{ij}}{\partial\epsilon} = \frac{(\dot{f}_i + \dot{g}_j)\epsilon - (f_i + g_j - C_{ij})}{\epsilon^2}, \tag{20}$$

where $\dot{f} := \partial f/\partial\epsilon$. Rearranging terms and substituting $\frac{f_i + g_j - C_{ij}}{\epsilon} = \log P^*_{ij}$:

$$\frac{\partial P^*_{ij}}{\partial\epsilon} = \frac{P^*_{ij}}{\epsilon}\left[(\dot{f}_i(\epsilon) + \dot{g}_j(\epsilon)) - \log P^*_{ij}(\epsilon)\right]. \tag{21}$$

We now analyze the asymptotic magnitude of each term in the bracket as $\epsilon\to 0$:

The log-probability term ($\log P^*_{ij}$): on the active support $S$ (Assumption A.1), the mass $P^*_{ij}$ converges to a strictly positive constant (the solution to the unregularized linear program). Thus, $\log P^*_{ij} = O(1)$. (For inactive entries, $P^*_{ij}\to 0$ exponentially fast, making the derivative vanish regardless of the polynomial factor.)

The dual potential derivatives ($\dot{f}, \dot{g}$): according to the asymptotic expansion theory for entropic optimal transport (Cominetti and San Martín, 1994), the dual potentials admit a Taylor expansion of the form:

$$f_i(\epsilon) = \phi_i + \epsilon\cdot h_i + O(\epsilon^2), \tag{22}$$

where $\phi_i$ are the Kantorovich potentials of the unregularized problem. Differentiating this expansion with respect to $\epsilon$ yields:

$$\dot{f}_i(\epsilon) = h_i + O(\epsilon) = O(1). \tag{23}$$

The same holds for $g_j(\epsilon)$. Consequently, the term $(\dot{f}_i + \dot{g}_j)$ is bounded by a constant $O(1)$.

Conclusion: substituting these scalings back into Eq. (21):

$$\frac{\partial P^*_{ij}}{\partial\epsilon} = \underbrace{\frac{P^*_{ij}}{\epsilon}}_{O(\epsilon^{-1})}\cdot\underbrace{\left[O(1) - O(1)\right]}_{O(1)} = O(\epsilon^{-1}). \tag{24}$$

Thus, the sensitivity is dominated by the $1/\epsilon$ factor. ∎
Remark A.11 (Why not $O(\epsilon^{-2})$?).

It is tempting to assume the sensitivity scales as $O(\epsilon^{-2})$ by inspecting the Gibbs kernel $K_{ij} = e^{-C_{ij}/\epsilon}$, whose derivative is $\frac{C_{ij}}{\epsilon^2}K_{ij}$. However, this view ignores the marginal constraints. The dual potentials $f(\epsilon)$ and $g(\epsilon)$ are adaptive; they shift specifically to counteract the mass displacement caused by the changing kernel. Mathematically, this is represented by the subtraction of the $\log P^*_{ij}$ term in Eq. (21), which effectively cancels the higher-order singularity, reducing the divergence from $\epsilon^{-2}$ to $\epsilon^{-1}$.
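A quick numerical sanity check of the $\Theta(\epsilon^{-1})$ law (a sketch assuming the `sinkhorn_log` routine from Appendix A.1; the problem size, seed, and finite-difference step are arbitrary):

```python
# Central finite differences in eps: if Lemma A.10 holds, the product
# eps * ||dP*/d eps||_F should level off at a constant as eps shrinks.
import numpy as np

rng = np.random.default_rng(0)
C = rng.random((16, 16))
for eps in (0.2, 0.1, 0.05, 0.025):
    h = 1e-3 * eps
    P_hi, _, _ = sinkhorn_log(C, eps + h)
    P_lo, _, _ = sinkhorn_log(C, eps - h)
    sens = np.linalg.norm((P_hi - P_lo) / (2 * h))   # ~ ||dP*/d eps||_F
    print(f"eps={eps:.3f}  eps*||dP/deps||_F={eps * sens:.4f}")
```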
A.7 Practical constant estimation recipe

To report instance-specific constants in experiments (see the sketch after this list):
1. For each $\varepsilon$ of interest compute $P^{\varepsilon}$ and evaluate $p(\varepsilon) = \|P^{\varepsilon}\|_{\mathrm{op}}$ (via SVD).
2. Form the block matrix $\mathcal{A}(\varepsilon)$ and evaluate the minimal eigenvalue numerically to get $\lambda_{\min}(\mathcal{A})$ and thus $M_{\mathrm{num}}(\varepsilon) = 1/\lambda_{\min}(\mathcal{A})$.
3. Compute the forcing vector $\mathbf{m}_{kl}$ for representative $(k,l)\in S$ (or take a supremum over $(k,l)\in S$), and evaluate numerically $\|\mathcal{A}(\varepsilon)^{-1}\mathbf{m}_{kl}\|_2$.
4. Plug the numeric quantities into the formulas in Lemma A.9 to obtain a concrete $C_1(\eta, n)$ and check the associated separation condition for $C_1(\eta, n)$.
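An illustrative implementation of steps 1-2 (a sketch; the gauge direction is deflated by a rank-one shift, our choice, so that $\lambda_{\min}$ is taken on the gauge-fixed subspace):

```python
# Steps 1-2 of the recipe: p(eps) = ||P^eps||_op via SVD, and
# M_num(eps) = 1 / lambda_min(A) on the gauge-fixed subspace.
import numpy as np

def instance_constants(P):
    p = np.linalg.svd(P, compute_uv=False)[0]             # step 1: ||P^eps||_op
    r, c = P.sum(axis=1), P.sum(axis=0)
    A = np.block([[np.diag(r), P], [P.T, np.diag(c)]])    # Eq. (11)
    v = np.concatenate([np.ones_like(r), -np.ones_like(c)])
    v /= np.linalg.norm(v)                                # kernel direction of A
    lam_min = np.linalg.eigvalsh(A + np.outer(v, v))[0]   # deflated lambda_min
    return p, lam_min, 1.0 / lam_min                      # step 2: M_num(eps)
```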
A.8
Experimental and numerical notes (reproducibility) We preserve and expand the implementation guidance:
Log-domain and stability.
For very small ε \varepsilon compute Sinkhorn in the log-domain using log-sum-exp to avoid kernel underflow. Keep track of numerical tolerances and the Sinkhorn maximum iteration count.
Implicit differentiation vs finite difference.
For Jacobian estimates, use the implicit-differentiation linear solves described above (cost $\approx O(n^{3})$ per right-hand side) rather than naive finite differences ($O(n^{4})$ total in naive implementations). Use a robust linear solver (LU with pivoting, or an iterative solver with a preconditioner) for the systems involving $\mathcal{A}(\varepsilon)$; a sketch of the factor-once pattern follows.
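The intended pattern, sketched below with a stand-in symmetric positive-definite matrix in place of the actual $\mathcal{A}(\varepsilon)$: factor once, then reuse the factorization across all right-hand sides $\mathbf{m}_{kl}$.

```python
# One LU factorization (O(n^3)) amortized over many right-hand sides
# (O(n^2) each), versus one extra Sinkhorn solve per finite difference.
# The matrix below is a placeholder; substitute the actual A(eps).
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(3)
B = rng.random((20, 20))
A = B @ B.T + np.eye(20)              # placeholder SPD system matrix
rhs = rng.random((5, 20))             # five forcing vectors m_kl

lu, piv = lu_factor(A)                # LU with partial pivoting, done once
X = np.stack([lu_solve((lu, piv), m) for m in rhs])
print(np.allclose(X @ A.T, rhs))      # verify A x = m for each rhs
```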
Repetition and reporting.
Report means and standard deviations across multiple random seeds (we recommend 5–10) when presenting empirical scaling laws for the $\varepsilon$-dependence.
A.9
Raw numeric estimates and CSV
Include the CSV with columns eps, op_norm, C0_est, dS_de_norm, K1_est, K2_est, rho_mean, rho_std. When reproducing the tables in the supplement, make sure the CSV file name used matches the distributed file.
A.10
Reproducibility checklist
• Code and CSV for numeric tables and plots included in the supplement.
• Random seeds and environment (Python/NumPy versions, BLAS/LAPACK backend) recorded in the repository.
• Implementation notes: whether the log-domain or direct-kernel solver was used, the tolerance and maximum-iteration values for Sinkhorn, and whether implicit differentiation or finite differences were used for Jacobian estimates.
Appendix B
Proof of the Thermodynamic Speed Limit
In this section, we provide the formal proofs of Theorem 3.2 and Corollary 3.3, establishing the $O(\epsilon^{2})$ scaling law required to prevent premature mode collapse.
B.1
Error Dynamics Decomposition
Let $P^{*}_{t}$ denote the optimal plan at step $t$ (corresponding to temperature $\epsilon_{t}$). The tracking error $e_{t}=P_{t}-P^{*}_{t}$ evolves according to the discrete dynamics:
$$\begin{aligned}
e_{t+1} &= P_{t+1}-P^{*}_{t+1} \\
&= \mathcal{S}_{\epsilon_{t+1}}(P_{t})-P^{*}_{t+1} \\
&\approx P^{*}_{t+1}+J_{t+1}(P_{t}-P^{*}_{t+1})-P^{*}_{t+1} \quad\text{(linearization)} \\
&= J_{t+1}(P_{t}-P^{*}_{t}+P^{*}_{t}-P^{*}_{t+1}) \\
&= J_{t+1}e_{t}+J_{t+1}\underbrace{(P^{*}_{t}-P^{*}_{t+1})}_{\text{Drift }\Delta_{t}}
\end{aligned}$$
(25)
where $J_{t+1}$ is the Jacobian of the Sinkhorn map at the local optimum. The term $\Delta_{t}$ represents the shift of the target distribution between iterations due to the temperature change $\delta_{t}=\epsilon_{t}-\epsilon_{t+1}$.
B.2
Lemma: Equilibrium Error Bound
Lemma B.1.
Consider the recurrence $e_{t+1}=Je_{t}+u$, where $J$ is a contraction matrix ($\rho(J)<1$) and $u$ is a fixed forcing term. Then the equilibrium error satisfies
$$\|e_{\infty}\|\leq\|(I-J)^{-1}\|\,\|u\|$$
(26)
Proof.
Unrolling the recurrence yields $e_{k}=J^{k}e_{0}+\sum_{i=0}^{k-1}J^{i}u$; since $\rho(J)<1$, the homogeneous term vanishes as $k\to\infty$ and the sum converges to the Neumann series $\sum_{i=0}^{\infty}J^{i}=(I-J)^{-1}$. Applying operator-norm submultiplicativity yields the bound. ∎
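A toy numerical check of the lemma (our own sketch; $J$ here is a random contraction, not the Sinkhorn Jacobian):

```python
# Iterate e_{t+1} = J e_t + u and compare the limit with the
# Neumann-series expression and the operator-norm bound of Lemma B.1.
import numpy as np

rng = np.random.default_rng(4)
n = 10
J = rng.random((n, n))
J *= 0.9 / np.max(np.abs(np.linalg.eigvals(J)))   # rescale so rho(J) = 0.9
u = rng.random(n)

e = np.zeros(n)
for _ in range(500):
    e = J @ e + u                                  # unrolled recurrence

e_star = np.linalg.solve(np.eye(n) - J, u)         # (I - J)^{-1} u
bound = np.linalg.norm(np.linalg.inv(np.eye(n) - J), 2) * np.linalg.norm(u)
print(np.allclose(e, e_star), np.linalg.norm(e) <= bound)
```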
B.3
Proof of Theorem 3.2
We apply Lemma B.1 to the Sinkhorn error dynamics derived above.
Quantifying the Drift ($\Delta_{t}$): Using a first-order Taylor expansion and the sensitivity result from Proposition 3.1 (Item 1), the distributional drift is:
$$\|\Delta_{t}\|=\|P^{*}_{t}-P^{*}_{t+1}\|\approx\|\nabla_{\epsilon}P^{*}\|\cdot|\epsilon_{t}-\epsilon_{t+1}|=\Theta(\epsilon_{t}^{-1})\cdot\delta_{t}$$
(27)
Quantifying the Resolvent ($\|(I-J)^{-1}\|$): From Proposition 3.1 (Item 3), the vanishing spectral gap dictates the resolvent growth:
$$\|(I-J_{\epsilon})^{-1}\|=\Theta(\epsilon_{t}^{-1})$$
(28)
Note: In the non-normal regime, pseudospectral effects may cause this term to grow even faster, but $\Theta(\epsilon^{-1})$ serves as a conservative lower bound for the necessary condition.
Combining Terms: Substituting these into the bound from Lemma B.1:
$$\|e_{ss}\|\leq\Theta(\epsilon_{t}^{-1})\times\left(\Theta(\epsilon_{t}^{-1})\cdot\delta_{t}\right)=\Theta(\epsilon_{t}^{-2})\cdot\delta_{t}$$
(29)
Stability Condition: To prevent the trajectory from escaping the basin of attraction (Mode Collapse), the steady-state error $\|e_{ss}\|$ must remain smaller than the linearization basin radius $R$. Assuming $R$ shrinks slowly or is $O(1)$:
$$\Theta(\epsilon_{t}^{-2})\cdot\delta_{t}\leq R\implies\delta_{t}\leq O(\epsilon_{t}^{2}R)$$
(30)
This confirms that the step size $\delta_{t}$ must scale quadratically with $\epsilon$ to maintain tracking. ∎
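In code, an admissible schedule obeying this law can be sketched as follows (a minimal illustration; the constants $c$, $R$ and the stopping temperature are ours, not the tuned EPH-ASC values):

```python
# Scheduler sketch implied by Theorem 3.2: cap each temperature step at
# delta_t <= c * R * eps_t^2 (quadratic in eps).
def quadratic_anneal(eps0, eps_min, c=1.0, R=1.0):
    """Yield an annealing schedule obeying the speed limit delta_t = O(eps^2 R)."""
    eps = eps0
    while eps > eps_min:
        yield eps
        eps -= c * R * eps ** 2        # admissible step, quadratic in eps
    yield eps_min

schedule = list(quadratic_anneal(eps0=1.0, eps_min=0.01, c=0.5))
print(len(schedule), schedule[:3], schedule[-1])
```

Note the asymptotic behavior: the continuous-time analogue $\dot{\epsilon}=-cR\epsilon^{2}$ gives $\epsilon_{t}\sim 1/(cRt)$, i.e., a harmonic-type decay rather than an exponential one.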
B.4
Proof of Corollary 3.3
Assume a standard exponential annealing schedule $\epsilon_{t+1}=\alpha\epsilon_{t}$ with decay rate $0<\alpha<1$. The corresponding per-step decrement is:
$$\delta_{t}=\epsilon_{t}-\alpha\epsilon_{t}=(1-\alpha)\epsilon_{t}$$
(31)
Substituting this actual step size into the stability ratio derived in the proof of Theorem 3.2:
$$\frac{\|e_{ss}\|}{R}\propto\frac{\epsilon_{t}^{-2}\cdot(1-\alpha)\epsilon_{t}}{R}=\frac{1-\alpha}{R\cdot\epsilon_{t}}$$
(32)
Analyzing the limit as $\epsilon_{t}\to 0$, we see that this ratio diverges to infinity:
$$\lim_{\epsilon_{t}\to 0}\frac{\|e_{ss}\|}{R}=\infty$$
(33)
Thus, for any fixed decay rate $\alpha$ and basin radius $R$, there exists a critical temperature $\epsilon_{\text{crit}}$ below which the tracking error inevitably exceeds the basin capacity $R$, triggering premature collapse. ∎
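For intuition, absorbing all $\Theta$-constants into 1, the ratio exceeds unity once $\epsilon_{t}<(1-\alpha)/R$; the toy scan below (our own illustration with arbitrary constants) locates this critical temperature:

```python
# Corollary 3.3 illustration: under eps_{t+1} = alpha * eps_t the ratio
# ||e_ss|| / R grows like (1 - alpha) / (R * eps_t); find where it
# first exceeds 1 (Theta-factors absorbed into 1 for this toy scan).
alpha, R = 0.95, 0.1
eps = 1.0
for t in range(1000):
    ratio = (1 - alpha) / (R * eps)
    if ratio > 1.0:
        print(f"critical temperature reached at t={t}, eps={eps:.4f}")
        break
    eps *= alpha
```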
| Main-text claim | Appendix reference (proof / technical details) |
| --- | --- |
| Sensitivity of the Sinkhorn fixed point scales as $O(1/\varepsilon)$ | Appendix A.3, Eq. (12); Lemma A.7 (Appendix A.5) |
| Existence of a stable active support set under small $\varepsilon$ | Assumption A.1 (Appendix A.1) |
| Reduction to the active Schur-complement operator $\mathcal{A}_{S}(\varepsilon)$ | Appendix A.4; Lemma A.5 |
| Implicit differentiation identity $DS_{\varepsilon}=(I-J_{\varepsilon})^{-1}\partial_{C}\Phi$ | Lemma A.8 (Appendix A.5) |
| Resolvent norm $\|(I-J_{\varepsilon})^{-1}\|$ must diverge as $\varepsilon\to 0$ | Lemma A.8; quantitative constants in Appendix A.7 |
| Non-normal transient amplification despite spectral contraction | Theorem A.2; full proof in Appendix A.2 |
| Effective basin shrinkage by modal condition number $\kappa(V)$ | Appendix A.2, transient growth bound |
| Discrete-time tracking formulation for annealing | Appendix B, Section "Discrete Tracking Model" |
| Existence of a dynamical annealing speed limit | Theorem 3.2; proof and constants in Appendix B |
| $O(1)$ per-step drift under exponential schedule $\varepsilon_{t+1}=\alpha\varepsilon_{t}$ | Lemma A.7; discussion in Appendix A.5 |
| Second-order remainder controlled by $O(1/\varepsilon^{2})$ Hessian bound | Lemma A.9 (Appendix A.5) |
| Definition and computation of numerical sensitivity constant $M_{\mathrm{num}}(\varepsilon)$ | Appendix A.7, Section "Numerical Constants" |
| Justification of QSA drift estimation via implicit differentiation | Appendix A.3; error control in Appendix B |
| ASC admissible drift threshold $\|\Delta_{t}\|\leq\frac{1-\rho(J_{\varepsilon})}{\kappa(V)}R$ | Appendix B, stability inequality derivation |
| Practical choice of $(\alpha_{0},\beta,R)$ in ASC | Appendix A.7 |
| Failure of linear diagnostics under support bifurcation | Appendix A.1, discussion following Assumption A.1 |

Table 2: Alignment between main-text claims and appendix results. Each nontrivial statement in the main paper is backed by an explicit lemma, theorem, or derivation in the appendix.
References
[1] J. Altschuler, J. Weed, and P. Rigollet (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 30.
[2] M. Blondel, V. Séguy, and A. Rolet (2018). Smooth and sparse optimal transport. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 880–889.
[3] L. Chizat (2024). Annealed Sinkhorn for optimal transport: convergence, regularization path and debiasing. arXiv preprint arXiv:2408.11620.
[4] R. Cominetti and J. San Martín (1994). Asymptotic analysis of the exponential penalty trajectory in linear programming. Mathematical Programming 67(1–3), pp. 169–187.
[5] M. Cuturi (2013). Sinkhorn distances: lightspeed computation of optimal transport. Advances in Neural Information Processing Systems (NeurIPS) 26.
[6] DeepSeek-AI, W. Liang, et al. (2025). mHC: manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880.
[7] M. Eisenberger, A. Toker, L. Leal-Taixé, F. Bernard, and D. Cremers (2022). A unified framework for implicit Sinkhorn differentiation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 509–518.
[8] A. Genevay, G. Peyré, and M. Cuturi (2018). Learning generative models with Sinkhorn divergences. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1608–1617.
[9] G. Peyré and M. Cuturi (2019). Computational Optimal Transport: With Applications to Data Science. Now Publishers.
[10] R. Sinkhorn and P. Knopp (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics 21(2), pp. 343–348.
[11] L. N. Trefethen and M. Embree (2005). Spectra and Pseudospectra: The Behavior of Nonnormal Matrices and Operators. Princeton University Press.
[12] J. Weed (2018). An explicit analysis of the entropic penalty in linear programming. In Proceedings of the 31st Conference on Learning Theory (COLT), pp. 1841–1855.
[13] D. Zhu, H. Huang, Z. Huang, Y. Zeng, Y. Mao, B. Wu, Q. Min, and X. Zhou (2024). Hyper-connections. arXiv preprint arXiv:2409.19606.