Avoiding Premature Collapse: Adaptive Annealing for Entropy-Regularized Structural Inference
Yizhi Liu Department of Computer Science, Stony Brook University [email protected]
(January 23, 2026)
Abstract
Differentiable matching layers and residual connection paradigms, often implemented via entropy-regularized Optimal Transport (OT) (Cuturi, 2013; Peyré and Cuturi, 2019), serve as critical mechanisms in structural prediction and architectural scaling. However, recovering discrete permutations or maintaining identity mappings by annealing $\epsilon \to 0$ is notoriously unstable (Peyré and Cuturi, 2019). We identify a fundamental mechanism for this failure: Premature Mode Collapse. By analyzing the non-normal dynamics of the Sinkhorn fixed-point map (Trefethen and Embree, 2005), we reveal a thermodynamic speed limit: standard exponential cooling outpaces the contraction rate of the inference operator, which degrades as $O(1/\epsilon)$ (Cominetti and San Martín, 1994; Weed, 2018). To address this, we propose Efficient Piecewise Hybrid Adaptive Stability Control (EPH-ASC), an adaptive scheduling algorithm that monitors the stability of the inference process. We demonstrate that EPH-ASC is essential for stabilizing Manifold-Constrained Hyper-Connections (mHC) (DeepSeek-AI et al., 2025) during large-scale training on the FineWeb-Edu dataset, preventing late-stage gradient explosions by enforcing a linear stability law.
1 Introduction

Entropy-regularized Optimal Transport (OT) has become a standard surrogate for combinatorial inference (Cuturi, 2013; Altschuler et al., 2017) and modern macro-architecture design (DeepSeek-AI et al., 2025; Zhu et al., 2024). Practitioners often attempt to recover hard assignments or specialized routing by annealing the regularization parameter $\epsilon \to 0$ (Genevay et al., 2018; Peyré and Cuturi, 2019). Recently, studies such as Manifold-Constrained Hyper-Connections (mHC) (DeepSeek-AI et al., 2025) have utilized the Sinkhorn-Knopp algorithm (Sinkhorn and Knopp, 1967) to project residual mappings onto the Birkhoff polytope, thereby restoring the identity mapping property in multi-stream architectures (DeepSeek-AI et al., 2025).
Empirically, however, this cooling process is fragile (Peyré and Cuturi, 2019; Blondel et al., 2018). As $\epsilon$ decreases, the sensitivity of the optimal plan to cost perturbations blows up as $O(1/\epsilon)$ (Cominetti and San Martín, 1994; Weed, 2018), a phenomenon we define as the Thermodynamic Speed Limit. In the presence of high-variance gradients from real-world datasets like FineWeb-Edu, standard exponential annealing (Chizat, 2024) violates this limit, causing the composite mapping in HC architectures to diverge or collapse (DeepSeek-AI et al., 2025). We propose EPH-ASC to reconcile this instability by monitoring the Primal Drift (Weed, 2018; Eisenberger et al., 2022) and triggering a "Thermodynamic Pause" when the distributional shift exceeds the solver's restoring capacity.
2 The Mechanism of Inference Collapse

2.1 Geometric Picture: Basin Shrinkage

As the temperature $\varepsilon \to 0$, the entropic map $\mathcal{S}_{\varepsilon}(C)$ sharpens soft belief states into near-permutations. Geometrically, the transport polytope decomposes into basins of attraction around permutation vertices. The failure mode we study, early locking, occurs when the current inference state $P_t$ is attracted into a wrong basin because the optimal posterior drifted faster than the fixed-point iteration could correct. Figure 1 visualizes this premature collapse.
Figure 1: Premature Mode Collapse. Standard annealing (blue) breaches the stability threshold $R$ (dotted), causing early locking into a spurious mode. Ours (red) detects the stability violation and pauses cooling. (Simulation)
2.2 Thermodynamic Sensitivity ($1/\varepsilon$)

To quantify the "velocity" of the distribution shift, we adopt a localized non-degeneracy assumption that fixes an active support set $S$ for small $\varepsilon$. Under this assumption, the sensitivity of the Sinkhorn map with respect to $\varepsilon$ scales like $1/\varepsilon$. This follows from implicit differentiation of the entropic optimality conditions (Cominetti and San Martín, 1994; Weed, 2018).

Remark (Linear Stability Scaling). Since the restoring force of the inference operator collapses and the sensitivity scales as $O(1/\varepsilon)$, the radius of the effective stability basin shrinks proportionally with $\varepsilon$. This suggests that the permissible distributional shift $\tau_t$ must follow a linear scaling law $\tau_t \propto \varepsilon$, a property we exploit in Section 4 to design an efficient adaptive schedule.
2.3 Transient Inference Error and Pseudospectra

Linearizing the Sinkhorn fixed-point map yields a Jacobian $J_{\varepsilon}$. While its eigenvalues describe asymptotic contraction, its non-normal structure creates a "shear" effect that can amplify transient inference errors. Pseudospectral theory quantifies how contours of the resolvent extend beyond the spectral radius (Trefethen and Embree, 2005). We show (Theorem A.2) that the modal condition number $\kappa(V)$ of the Jacobian effectively compresses the basin of attraction by a factor of roughly $\kappa(V)$.
Proposition 2.1 (Linear Scaling of the Stability Basin).

Let the cost matrix $C$ satisfy the localized non-degeneracy assumption (stable active support $S$). For sufficiently small $\epsilon$, the spectral gap of the Sinkhorn Jacobian $J_{\epsilon}$ satisfies $1-\rho(J_{\epsilon}) \geq \gamma\cdot\epsilon$. Consequently, to ensure the inference error $e_t$ remains within a local linearization region $R(\epsilon)$ (whose size may depend on $\epsilon$ but vanishes no faster than linearly as $\epsilon \to 0$), the permissible drift $\tau_{\max}$ must scale linearly:

$$\tau_{\max}(\epsilon) \leq \frac{R(\epsilon)\cdot\gamma}{\kappa(V)}\cdot\epsilon \tag{1}$$
(a) Macroscopic: The Closing Funnel. As $\epsilon \to 0$, the stability basin (blue) shrinks. Collapse occurs when the distributional shift exceeds the basin radius (red "X").

(b) Microscopic: Non-normal Instability. The $\epsilon$-pseudospectrum contours extend beyond the unit circle, quantifying the transient error amplification.

Figure 2: The Dual View of Inference Collapse. (a) Geometric intuition: the inference fails when the target drifts faster than the shrinking basin allows. (b) Spectral reality: this shrinkage is quantified by the non-normal pseudospectrum.
3 Theoretical Analysis: The Thermodynamic Speed Limit

To understand the mechanism of premature collapse, we model the annealing process as a discrete-time tracking problem. The algorithm must maintain the iterate $P_t$ within the basin of attraction of the moving fixed point $P^*_{\epsilon_t}$.
3.1 Problem Setup and Sinkhorn Dynamics

Let $\epsilon_t$ be the annealing schedule with step size $\delta_t := \epsilon_t - \epsilon_{t+1}$. The inference dynamics are governed by the recurrence $P_{t+1} = \mathcal{S}_{\epsilon_{t+1}}(C, P_t)$. We analyze the tracking error $e_t := P_t - P^*_{\epsilon_t}$ relative to the linearization radius $R$ of the basin.

We rely on the localized non-degeneracy of the cost matrix (Assumption A.1 in the Appendix), which implies the following fundamental properties of the Sinkhorn operator:
Proposition 3.1 (Sinkhorn Dynamics).

Under the localized non-degeneracy assumption, for sufficiently small $\epsilon$:
1. Sensitivity: the fixed-point drift scales as $\|\nabla_{\epsilon}P^*_{\epsilon}\| = \Theta(\epsilon^{-1})$.
2. Restoring force: the spectral gap of the Jacobian $J_{\epsilon}$ vanishes as $1-\rho(J_{\epsilon}) = \Theta(\epsilon)$.
3. Resolvent growth: the resolvent norm scales as $\|(I-J_{\epsilon})^{-1}\| = \Theta(\epsilon^{-1})$.
Proof.
See Appendix A.6 for the sensitivity analysis and Appendix A.2 for the spectral bounds. ∎
3.2 The Adiabatic Tracking Theorem

We now derive the necessary condition for the inference trajectory to remain stable. The error evolves as a competition between the distributional shift (drift) caused by $\delta_t$ and the solver's contraction (restoring force).
Theorem 3.2 (Thermodynamic Speed Limit).

Consider the tracking error dynamics linearized around the fixed-point path. For the error to remain uniformly bounded within the basin radius $R$ (i.e., $\limsup_{t\to\infty}\|e_t\| \leq R$), the annealing step size $\delta_t$ must satisfy:

$$\delta_t \leq O(\epsilon_t^2)\cdot R \tag{2}$$

Specifically, the schedule must be at least quadratic ($\delta_t \propto \epsilon^2$) to counteract the $O(\epsilon^{-2})$ amplification caused by the ratio of sensitivity to contraction.
Corollary 3.3 (Inevitability of Collapse for Exponential Schedules).

Standard exponential annealing $\epsilon_{t+1} = \alpha\epsilon_t$ implies a linear step size $\delta_t = (1-\alpha)\epsilon_t \propto \epsilon_t$. Since the stability condition requires $\delta_t \propto \epsilon_t^2$, exponential annealing violates the speed limit by a factor of $1/\epsilon_t$. As $\epsilon_t \to 0$, the tracking error diverges relative to the basin radius, rendering mode collapse theoretically inevitable.
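The violation is easy to see numerically. The following minimal sketch (illustrative only; the constants $\epsilon_0$, $\alpha$, and $c$ are assumed values, not taken from our experiments) compares exponential cooling against the largest speed-limit-compliant schedule, $\delta_t = c\,\epsilon_t^2$, and prints the violation factor $\delta_t/\epsilon_t^2$:

```python
# Sketch: exponential cooling vs. the quadratic-step schedule allowed by
# Theorem 3.2. All constants here (eps0, alpha, c) are illustrative assumptions.
import numpy as np

def exponential_schedule(eps0, alpha, steps):
    eps = np.empty(steps)
    eps[0] = eps0
    for t in range(steps - 1):
        eps[t + 1] = alpha * eps[t]            # delta_t = (1 - alpha) * eps_t
    return eps

def speed_limited_schedule(eps0, c, steps):
    eps = np.empty(steps)
    eps[0] = eps0
    for t in range(steps - 1):
        eps[t + 1] = eps[t] - c * eps[t] ** 2  # delta_t = c * eps_t^2 (Eq. 2)
    return eps

for schedule in (exponential_schedule(1.0, 0.95, 200),
                 speed_limited_schedule(1.0, 0.05, 200)):
    delta = schedule[:-1] - schedule[1:]
    ratio = delta / schedule[:-1] ** 2          # violation factor delta_t / eps_t^2
    print(ratio[0], ratio[-1])
```

For the exponential schedule the ratio grows like $(1-\alpha)/\epsilon_t$, i.e. without bound, while the quadratic-step schedule keeps it pinned at $c$.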
The proofs of Theorem 3.2 and Corollary 3.3 are provided in Appendix B.
4 Method: Efficient Piecewise Hybrid ASC

We propose Efficient Piecewise Hybrid Adaptive Stability Control (EPH-ASC) to reconcile topological stability with computational efficiency. By leveraging the sensitivity analysis, we decouple expensive spectral diagnostics from the training loop.
4.1 Approximating the Stability Constraint

Precise verification requires computing the spectral radius $\rho(J_{\epsilon})$, incurring $\mathcal{O}(N^3)$ cost. However, since the sensitivity scales deterministically as $\mathcal{O}(1/\epsilon)$, the permissible drift threshold $\tau_t$ must follow a corresponding linear law. We approximate the stability constraint by enforcing a limit on the distributional shift:

$$\|\Delta_t\|_F \leq \tau_{\max}(\epsilon) \approx k_{\mathrm{safe}}\cdot\epsilon_t \tag{3}$$

where $k_{\mathrm{safe}}$ is a dataset-specific safety slope. This approximation captures the essential dynamics: as the system "stiffens" ($\epsilon \to 0$), the tolerable shift must decrease proportionally.
4.2 Two-Phase Protocol

Phase I: Calibration (Offline). We execute a diagnostic oracle (QSA) on a proxy subset using an aggressive schedule to intentionally trigger mode collapse. We record the drift-to-temperature ratio at the moment of topological collapse to estimate $k_{\mathrm{safe}}$.

Phase II: Runtime Control (Adaptive Annealing). During training, the controller monitors the instantaneous shift $\|\Delta_t\|_F$ and enforces Eq. 3, as sketched in the code after this list:

- Stable state ($\|\Delta_t\|_F \leq k_{\mathrm{safe}}\cdot\epsilon_t$): the trajectory is safe. Proceed with standard cooling.
- Unstable state ($\|\Delta_t\|_F > k_{\mathrm{safe}}\cdot\epsilon_t$): the distributional shift exceeds the basin capacity. The controller triggers a "Thermodynamic Pause" (braking), holding $\epsilon_{t+1} \leftarrow \epsilon_t$ constant. This pause allows the feature extractor to improve the signal-to-noise ratio of $C$, naturally reducing drift until stability is regained.
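A minimal sketch of the runtime controller (an illustration of the logic above, not our released implementation; the `project` callable standing in for the Sinkhorn layer and the default constants are assumptions):

```python
# Sketch of the Phase II controller: cool when the drift respects Eq. (3),
# pause ("Thermodynamic Brake") when it does not. `project(eps)` is a
# placeholder for the Sinkhorn layer producing the current plan; k_safe is
# the slope calibrated offline in Phase I.
import numpy as np

def eph_asc_anneal(project, eps0=1.0, eps_min=1e-3, k_safe=0.5,
                   alpha=0.95, max_steps=1000):
    eps = eps0
    P_prev = project(eps)
    for _ in range(max_steps):
        P_curr = project(eps)
        drift = np.linalg.norm(P_curr - P_prev)   # ||Delta_t||_F
        if drift <= k_safe * eps:                 # stable: standard cooling
            eps = max(alpha * eps, eps_min)
        # else: unstable -> hold eps constant (Thermodynamic Pause)
        P_prev = P_curr
        if eps <= eps_min:
            break
    return eps
```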
Figure 3: Mechanism of Adaptive Stability Control. The interplay between the primal drift $\|\Delta_t\|$ (red) and the stability threshold (dashed). The Stability Braking zone (yellow) visualizes the algorithm strictly enforcing the thermodynamic speed limit: the controller detects imminent divergence and pauses cooling, preventing Premature Mode Collapse.
5 Experiments

We validate EPH-ASC on SPair-71k, a benchmark for semantic keypoint matching under view variation.
5.1 Setup and Baselines

We employ a ResNet-50 backbone with a Sinkhorn matching layer. We compare:
- Standard Log-Space: exponential annealing ($\alpha = 0.95$), serving as a baseline.
- Gumbel-Sinkhorn: stochastic exploration via Gumbel noise.
- EPH-ASC (Ours): deterministic controller with $k_{\mathrm{safe}} = 0.5$.
5.2 Results: Entropy and Convergence

Preventing Collapse. Figure 4 (left) demonstrates the trap. Standard annealing (blue) collapses early (epoch $\approx 20$), causing gradients to vanish and accuracy to flatline. The aggressive sharpening forces the plan into a spurious basin.

Preserving Uncertainty. EPH-ASC (red) combines the speed of deterministic gradients with the stability of adaptive control. As shown in Figure 4 (right), the controller detects the drift spike and triggers "Stability Braking". By holding the temperature constant, it preserves entropy (uncertainty) and prevents the hard assignment from forming prematurely. Once the features mature, annealing resumes, achieving target accuracy in 47 epochs, a $1.60\times$ speedup over Gumbel-Sinkhorn.
Figure 4: Training Dynamics on SPair-71k. Left: standard annealing (blue) hits the trap, causing gradient collapse; Gumbel-Sinkhorn (green) is stable but converges slowly due to variance; EPH-ASC (red) achieves the fastest convergence. Right: the adaptive mechanism. Drift spikes (gray) trigger the braking zone (yellow), holding $\epsilon$ constant (red line) to maintain thermodynamic stability.
Table 1: Efficiency on SPair-71k. EPH-ASC achieves a $\mathbf{1.60\times}$ speedup over Gumbel-Sinkhorn with negligible overhead ($0.51\%$), while standard annealing fails.

| Method | Epochs to 90% | Speedup | Layer Overhead | Training Overhead |
| --- | --- | --- | --- | --- |
| Standard (Log-Space) | Failed ($>100$) | N/A | 0.00% | 0.00% |
| Gumbel-Sinkhorn | 75 | 1.0$\times$ | $\approx$0.00% | $\approx$0.00% |
| EPH-ASC (Ours) | 47 | 1.60$\times$ | 0.51% | 0.05% |
5.3 Thermodynamic Robustness in Large-Scale LM Training

To rigorously assess the robustness of EPH-ASC under real-world conditions characterized by high gradient variance, we scale the experiment to a language modeling task using the FineWeb-Edu dataset.

Setup. We employ a lightweight NanoGemma architecture equipped with Manifold-Constrained Hyper-Connections (mHC). Unlike controlled benchmarks, this experiment utilizes a GPT-2 tokenizer and natural language data, introducing heavy-tailed noise into the optimization landscape. We compare a standard exponential schedule (Naive) against EPH-ASC over 1,000 steps.

Late-Stage Collapse. As shown in Fig. 5 (left), the Naive schedule (red) appears successful for 98% of the trajectory but suffers a catastrophic gradient explosion at step 980. This confirms that Sinkhorn instability is a latent risk that manifests when $\epsilon$ breaches the operator's condition number limit.

Adaptive Intervention. In contrast, EPH-ASC (green) demonstrates remarkable sensitivity to gradient noise. Despite the jagged loss landscape, the controller detects critical distributional drift at step 640, long before visible loss degradation. By triggering "Thermodynamic Braking" (orange markers), it locks the temperature at $\epsilon \approx 0.04$, securing a 340-step safety margin. The entropy plot (Fig. 5, right) further reveals that EPH-ASC maintains a stable low-entropy regime, preventing the numerical underflow observed in the baseline.

Figure 5: Verification on FineWeb-Edu. Left: real-world training loss shows deceptive stability followed by sudden Naive collapse. Center: EPH-ASC detects instability at step 640, creating a large safety margin via braking. Right: entropy preservation prevents numerical underflow in the feature maturity phase.
6 Conclusion

We identified Premature Mode Collapse as a fundamental thermodynamic failure in which the distributional shift exceeds the inference operator's contraction rate. EPH-ASC leverages the derived $\mathcal{O}(\epsilon)$ stability law to resolve this via a lightweight adaptive schedule.
Appendix A: Notation, precise matrix bounds, and detailed proofs
A.1 Preliminaries and notation

We work with discrete measures of size $n$ and a finite cost matrix $C \in \mathbb{R}^{n\times n}$. For $\varepsilon > 0$ define

$$P^{\varepsilon} = \mathcal{S}_{\varepsilon}(C) = \operatorname{diag}(u^{\varepsilon})\,K^{\varepsilon}\,\operatorname{diag}(v^{\varepsilon}), \qquad K^{\varepsilon}_{ij} = \exp(-C_{ij}/\varepsilon).$$

Dual potentials are written $f^{\varepsilon} = \varepsilon\log u^{\varepsilon}$, $g^{\varepsilon} = \varepsilon\log v^{\varepsilon}$. Row and column marginals:

$$r^{\varepsilon} := P^{\varepsilon}\mathbf{1}, \qquad c^{\varepsilon} := (P^{\varepsilon})^{\top}\mathbf{1}.$$

We use $\|\cdot\|_2$ for the Euclidean vector norm, $\|\cdot\|_F$ for the Frobenius norm, and $\|\cdot\|_{\mathrm{op}}$ for the spectral/operator norm.
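For concreteness, a minimal log-domain Sinkhorn sketch producing these objects (uniform marginals and a fixed iteration count are simplifying assumptions; this is not the paper's released code):

```python
# Log-domain Sinkhorn: returns P^eps together with the dual potentials
# f = eps*log(u), g = eps*log(v). Log-sum-exp avoids kernel underflow.
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(C, eps, iters=500):
    n = C.shape[0]
    log_marg = -np.log(n) * np.ones(n)   # uniform row/column marginals
    f = np.zeros(n)
    g = np.zeros(n)
    for _ in range(iters):
        # Alternating dual updates enforcing the row and column marginals.
        f = eps * (log_marg - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_marg - logsumexp((f[:, None] - C) / eps, axis=0))
    P = np.exp((f[:, None] + g[None, :] - C) / eps)
    return P, f, g
```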
We assume the localized non-degeneracy/support-stability condition used in the main text, restated here for convenience.
Assumption A.1 (Localized non-degeneracy / support stability).

There exist $\varepsilon_{\min} > 0$, $\eta > 0$ and an index set $S \subseteq \{1,\dots,n\}\times\{1,\dots,n\}$ ("active support") such that for all $0 < \varepsilon \leq \varepsilon_{\min}$,

$$\min_{(i,j)\in S} P^{\varepsilon}_{ij} \geq \eta, \qquad \max_{(i,j)\notin S} P^{\varepsilon}_{ij} \leq \tau(\varepsilon),$$

where $\tau(\varepsilon) \to 0$ as $\varepsilon \downarrow 0$, and the support pattern on $S$ is constant for all $\varepsilon \in (0, \varepsilon_{\min}]$.

Assumption A.1 is a local-in-$\varepsilon$ and local-in-$C$ condition. Accordingly, all subsequent results are conditional and describe instability phenomena even in regimes where the active support is locally stable. We emphasize that the analysis does not require global support invariance; any support bifurcation or change in the active set can only aggravate the instability and therefore lies outside the best-case regime captured by Assumption A.1. Thus our result is a lower bound on failure: even in the best-behaved regime, exponential annealing fails.

Remarks:
- $S$ is the set of index pairs that carry the asymptotically non-vanishing mass; the assumption is compatible with many cost matrices that induce a unique (or well-separated) matching pattern.
- All subsequent constants are expressed in terms of $\eta$, $n$ and (where required) norms of $P^{\varepsilon}$; where a uniform-in-$\varepsilon$ bound is used we explicitly require $\varepsilon \leq \varepsilon_0$ for a (possibly smaller) $\varepsilon_0 \leq \varepsilon_{\min}$.
A.2 Transient Stability Analysis: The Resolvent View
Theorem A.2 (Sensitivity-Stability Duality).

Let $J_{\epsilon} = D_P\mathcal{S}_{\epsilon}(C)$ be the Jacobian of the Sinkhorn fixed-point operator at the optimal plan $P^*$. Let $DS_{\epsilon}(C) = \frac{\partial P^*}{\partial C}$ denote the sensitivity of the optimal plan to cost perturbations. Regardless of whether $J_{\epsilon}$ is diagonalizable, the following inequality connects the inference sensitivity to the spectral properties of the solver:

$$\mathrm{dist}(1, \sigma(J_{\epsilon})) \leq \frac{\|\partial_C\Phi\|_{\mathrm{op}}}{\|DS_{\epsilon}(C)\|_{\mathrm{op}}} \tag{4}$$

where $\mathrm{dist}(1, \sigma(J)) = \min_{\lambda\in\sigma(J)}|1-\lambda|$ is the spectral distance to unit gain, and $\partial_C\Phi$ is the partial derivative of the update rule with respect to $C$. Consequently, if the sensitivity scales as $\|DS_{\epsilon}(C)\| \sim \Omega(1/\epsilon)$ (as shown in Lemma A.7), the spectral gap must vanish linearly: $1-\rho(J_{\epsilon}) \leq O(\epsilon)$.
Proof.

We start from the fixed-point equation $P^* = \mathcal{S}_{\epsilon}(P^*, C)$. Differentiating implicitly with respect to $C$ yields the linear system:

$$dP^* = D_P\mathcal{S}_{\epsilon}\cdot dP^* + \partial_C\mathcal{S}_{\epsilon}\cdot dC \tag{5}$$

$$(I-J_{\epsilon})\,dP^* = \partial_C\mathcal{S}_{\epsilon}\cdot dC \tag{6}$$

Assuming the system is locally stable ($\rho(J_{\epsilon}) < 1$), we can solve for the sensitivity:

$$DS_{\epsilon}(C) = (I-J_{\epsilon})^{-1}\,\partial_C\mathcal{S}_{\epsilon} \tag{7}$$

Taking the operator norm on both sides and using the submultiplicative property $\|AB\| \leq \|A\|\,\|B\|$:

$$\|DS_{\epsilon}(C)\| \leq \|(I-J_{\epsilon})^{-1}\|\cdot\|\partial_C\mathcal{S}_{\epsilon}\| \tag{8}$$

Recall that for any square matrix $A$, the operator norm of the inverse is the reciprocal of the smallest singular value: $\|A^{-1}\| = 1/\sigma_{\min}(A)$. Furthermore, by the variational characterization of singular values, $\sigma_{\min}(I-J_{\epsilon}) \leq |\lambda_{\min}(I-J_{\epsilon})| = \mathrm{dist}(1, \sigma(J_{\epsilon}))$. Therefore, we have the lower bound on the resolvent:

$$\|(I-J_{\epsilon})^{-1}\| = \frac{1}{\sigma_{\min}(I-J_{\epsilon})} \geq \frac{1}{\mathrm{dist}(1, \sigma(J_{\epsilon}))} \tag{9}$$

Substituting this back yields:

$$\|DS_{\epsilon}(C)\| \leq \frac{\|\partial_C\mathcal{S}_{\epsilon}\|}{\mathrm{dist}(1, \sigma(J_{\epsilon}))} \tag{10}$$

Rearranging terms gives the stated duality bound. This implies that high sensitivity necessitates a vanishing spectral gap, purely by operator-theoretic arguments, without assuming normality or diagonalizability. ∎
Corollary A.3 (The Basin Mismatch).

Under an exponential schedule $\epsilon_{t+1} = \alpha\epsilon_t$, the tracking error diverges.

Proof.

The drift magnitude is $\|\Delta_t\| \approx \|\frac{dP^*}{d\epsilon}\|\,\delta\epsilon \sim O(1/\epsilon)\cdot O(\epsilon) = O(1)$. The restoring capacity of the solver is governed by the spectral gap: $\mathrm{gap} \sim O(\epsilon)$. The steady-state tracking error $e_{ss}$ scales as:

$$e_{ss} \approx \frac{\text{Drift}}{\text{Gap}} \sim \frac{O(1)}{O(\epsilon)} = O(1/\epsilon)$$

However, the validity region of the linearization $R(\epsilon)$ scales as $O(\epsilon)$ due to the $1/\epsilon^2$ curvature of the Hessian (Lemma A.9). Collapse occurs when $e_{ss} > R(\epsilon)$, i.e., $O(1/\epsilon) > O(\epsilon)$, which is inevitable as $\epsilon \to 0$. ∎
A.3 Implicit differentiation linear system

Differentiating the marginal constraints, we define the block operator

$$\mathcal{A}(\varepsilon) = \begin{bmatrix}\operatorname{diag}(r^{\varepsilon}) & P^{\varepsilon}\\ (P^{\varepsilon})^{\top} & \operatorname{diag}(c^{\varepsilon})\end{bmatrix} \in \mathbb{R}^{2n\times 2n}. \tag{11}$$

For a generic perturbation (in $C$ or in $\varepsilon$) the derivative potentials $(\partial f, \partial g)$ satisfy a linear system of the form

$$\mathcal{A}(\varepsilon)\begin{bmatrix}\partial f\\ \partial g\end{bmatrix} = \frac{1}{\varepsilon}\,\mathcal{B}(\partial C) + \mathcal{R}(\partial\varepsilon), \tag{12}$$

where:
- $\mathcal{B}(\partial C)$ depends linearly on the entrywise change $\partial C$ and is $O(1)$ in $P^{\varepsilon}$ (i.e., it does not contain extra $1/\varepsilon$ factors);
- $\mathcal{R}(\partial\varepsilon)$ collects the terms from differentiating with respect to $\varepsilon$; again, it is $O(1)$ in $P^{\varepsilon}$.

Solving (12) yields

$$\begin{bmatrix}\partial f\\ \partial g\end{bmatrix} = \frac{1}{\varepsilon}\mathcal{A}(\varepsilon)^{-1}\mathcal{B}(\partial C) + \mathcal{A}(\varepsilon)^{-1}\mathcal{R}(\partial\varepsilon).$$

Substituting this into the entrywise derivative of $P^{\varepsilon}$ gives the decomposition used below.
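A small dense-case sketch of this solve (an illustration under assumptions: full support so the unreduced system applies, the perturbation is in $C$ only, and the gauge is fixed by pinning the last potential; the $(\partial\log u, \partial\log v)$ parametrization is our choice for concreteness):

```python
# Solve Eq. (12) for a cost perturbation dC (the d-eps terms are omitted):
# A(eps) [d log u; d log v] = (1/eps) B(dC), then map back to the primal
# response dP_ij = P_ij * (dlu_i + dlv_j - dC_ij / eps).
import numpy as np

def primal_response(P, dC, eps):
    n = P.shape[0]
    r, c = P.sum(axis=1), P.sum(axis=0)
    A = np.block([[np.diag(r), P], [P.T, np.diag(c)]])   # Eq. (11)
    b = np.concatenate([(P * dC).sum(axis=1),
                        (P * dC).sum(axis=0)]) / eps     # (1/eps) B(dC)
    # A is singular along the gauge direction (1, -1); pin the last potential.
    sol = np.linalg.solve(A[:-1, :-1], b[:-1])
    dlu, dlv = sol[:n], np.append(sol[n:], 0.0)
    return P * (dlu[:, None] + dlv[None, :] - dC / eps)
```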
A.4 Invertibility of $\mathcal{A}(\varepsilon)$ on the Active Support

As $\varepsilon \to 0$, the entropic optimal transport plan $P^{\varepsilon}$ concentrates its mass on the active support $S$ specified in Assumption A.1. Rows and columns not participating in $S$ may have marginals that vanish as $\varepsilon \to 0$, rendering the full $2n\times 2n$ operator $\mathcal{A}(\varepsilon)$ ill-conditioned. However, the sensitivity analysis carried out in Lemma A.9 only involves perturbations supported on the active entries. We therefore restrict attention to the reduced linear system induced by the active support, on which uniform stability can be established.
Definition A.4 (Active reduced system).

Let $S \subseteq \{1,\dots,n\}\times\{1,\dots,n\}$ denote the active support from Assumption A.1. Define the active row and column sets

$$I_S := \{i \mid \exists j,\ (i,j)\in S\}, \qquad J_S := \{j \mid \exists i,\ (i,j)\in S\}.$$

We define $\mathcal{A}_S(\varepsilon)$ as the restriction of $\mathcal{A}(\varepsilon)$ to the variables $(f_i)_{i\in I_S}$ and $(g_j)_{j\in J_S}$. All operators are considered on the gauge-fixed subspace

$$\left\{(f,g) : \sum_{i\in I_S} f_i = 0, \quad \sum_{j\in J_S} g_j = 0\right\}.$$
Lemma A.5 (Uniform invertibility on the active subspace).

Under Assumption A.1, there exist $\varepsilon_0 > 0$ and a finite constant $M(\eta, |S|)$ such that for all $0 < \varepsilon \leq \varepsilon_0$,

$$\|\mathcal{A}_S(\varepsilon)^{-1}\|_{\mathrm{op}} \leq M(\eta, |S|).$$

Proof.

For any active row $i\in I_S$, Assumption A.1 implies

$$r_i^{\varepsilon} = \sum_j P^{\varepsilon}_{ij} \geq \sum_{j:(i,j)\in S} P^{\varepsilon}_{ij} \geq \eta.$$

An analogous bound holds for all active columns $j\in J_S$. Hence all diagonal entries of $\mathcal{A}_S(\varepsilon)$ are uniformly bounded below by $\eta$.

Moreover, $\mathcal{A}_S(\varepsilon)$ coincides with the Hessian of the entropically regularized transport objective restricted to the face of the transport polytope induced by $S$. By Assumption A.1, this face remains fixed for all $\varepsilon \leq \varepsilon_0$, and the objective is strictly convex on the corresponding affine subspace modulo gauge invariance.

It follows that $\mathcal{A}_S(\varepsilon)$ is positive definite on the gauge-fixed subspace. Since the dimension of the reduced system depends only on $|S|$ and all entries are uniformly bounded, compactness yields a uniform inverse bound

$$\|\mathcal{A}_S(\varepsilon)^{-1}\|_{\mathrm{op}} \leq M(\eta, |S|). \qquad \text{∎}$$
Remark A.6.

Entries of $P^{\varepsilon}$ outside the active support $S$ are $O(\tau(\varepsilon))$ and do not enter the reduced system. Their influence on the active dynamics is therefore suppressed and does not affect the stability of $\mathcal{A}_S(\varepsilon)$.
A.5 Directional sensitivity lower bound

In this section we establish a lower bound on the sensitivity of the Sinkhorn map with respect to cost perturbations. Unlike a refined asymptotic expansion, our argument does not rely on a separation between leading and higher-order terms. Instead, we show that there exists a perturbation direction for which the Jacobian response grows at least on the order of $\varepsilon^{-1}$.

Lemma A.7 (Constructive operator-norm lower bound).
Let $S$ denote the active support set at the fixed point $P^{\star} = P^{\star}(C,\varepsilon)$ and assume the gauge-fixed linear system for the implicit derivative is of the form

$$A_S(\varepsilon)\,\Delta z = \frac{1}{\varepsilon}\,B_S[\Delta C], \tag{13}$$

where
- $A_S(\varepsilon) : \mathcal{Z}\to\mathcal{Z}$ is the linear operator (matrix) on the gauge-fixed dual-variable subspace $\mathcal{Z}$ arising from differentiating the Sinkhorn fixed-point equations with respect to the dual potentials;
- $B_S : \mathcal{C}_S \to \mathcal{Z}$ is the linear map that takes a perturbation of the cost restricted to the active support, $\Delta C \in \mathcal{C}_S$, to the right-hand side of (13);
- the perturbation of the primal fixed point on the active support, $\Delta P_S$, is related to $\Delta z$ by a bounded linear map

$$\Delta P_S = R_S[\Delta z], \tag{14}$$

for some linear operator $R_S : \mathcal{Z}\to\mathcal{P}_S$ (here $\mathcal{P}_S$ denotes the space of primal perturbations supported on $S$).

Assume furthermore that the following regularity conditions hold for sufficiently small $\varepsilon > 0$:

(a) (Invertibility) $A_S(\varepsilon)$ is invertible on $\mathcal{Z}$ and there exists $a_{\min} > 0$ such that

$$\sigma_{\min}\big(A_S(\varepsilon)\big) \geq a_{\min} > 0$$

uniformly for the range of $\varepsilon$ under consideration.

(b) (Non-degeneracy of cost-action) The composed operator

$$M(\varepsilon) := R_S\,A_S(\varepsilon)^{-1}\,B_S$$

is not identically zero as a map $\mathcal{C}_S \to \mathcal{P}_S$ (equivalently, there exists some $\Delta C \in \mathcal{C}_S$ with $M(\varepsilon)[\Delta C] \neq 0$).

Define the fixed-point sensitivity operator

$$DS_{\varepsilon}(C) : \mathcal{C}_S \to \mathcal{P}_S, \qquad DS_{\varepsilon}(C)[\Delta C] = \Delta P_S.$$

Then the following constructive lower bound holds:

$$\|DS_{\varepsilon}(C)\|_{\mathrm{op}} = \frac{1}{\varepsilon}\,\|M(\varepsilon)\|_{\mathrm{op}} \geq \frac{c(\varepsilon)}{\varepsilon}, \tag{15}$$

where $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm (induced by a chosen Euclidean norm on the finite-dimensional spaces) and

$$c(\varepsilon) := \|M(\varepsilon)\|_{\mathrm{op}} > 0.$$

Moreover, one may explicitly construct a test perturbation $\Delta C^{\star}$ (a singular vector of $M(\varepsilon)$) that attains (or approximates) the lower bound in (15):

$$\frac{\|DS_{\varepsilon}(C)[\Delta C^{\star}]\|}{\|\Delta C^{\star}\|} \approx \frac{\|M(\varepsilon)\|_{\mathrm{op}}}{\varepsilon}.$$

In particular, under the uniform lower bound $a_{\min} > 0$ on $\sigma_{\min}(A_S(\varepsilon))$, and if the operator norms $\|R_S\|$ and $\|B_S\|$ remain $O(1)$ as $\varepsilon \downarrow 0$, then $c(\varepsilon)$ is bounded away from zero and therefore

$$\|DS_{\varepsilon}(C)\|_{\mathrm{op}} \geq \frac{c_0}{\varepsilon}$$

for some constant $c_0 > 0$ independent of $\varepsilon$ (for all sufficiently small $\varepsilon$).
Proof.

The proof is constructive and algebraic; it follows directly from (13)-(14). From (13) and the invertibility of $A_S(\varepsilon)$ we obtain

$$\Delta z = A_S(\varepsilon)^{-1}\,\frac{1}{\varepsilon}\,B_S[\Delta C].$$

Using (14) we then have the representation

$$\Delta P_S = R_S[\Delta z] = R_S\,A_S(\varepsilon)^{-1}\,\frac{1}{\varepsilon}\,B_S[\Delta C] = \frac{1}{\varepsilon}\,M(\varepsilon)[\Delta C].$$

Since the map $\Delta C \mapsto \Delta P_S$ is exactly $DS_{\varepsilon}(C)$ (restricted to perturbations supported on $S$), the operator identity

$$DS_{\varepsilon}(C) = \frac{1}{\varepsilon}\,M(\varepsilon)$$

holds. Taking operator norms yields

$$\|DS_{\varepsilon}(C)\|_{\mathrm{op}} = \frac{1}{\varepsilon}\,\|M(\varepsilon)\|_{\mathrm{op}}.$$

By assumption (b), $M(\varepsilon) \neq 0$, hence its operator norm $c(\varepsilon) := \|M(\varepsilon)\|_{\mathrm{op}}$ is strictly positive, which proves (15).

To make the bound constructive, let $u \in \mathcal{C}_S$ be a (right) unit-norm singular vector corresponding to the top singular value of $M(\varepsilon)$, i.e.

$$\|M(\varepsilon)u\| = \|M(\varepsilon)\|_{\mathrm{op}}.$$

Choosing $\Delta C^{\star} = u$ yields

$$\frac{\|DS_{\varepsilon}(C)[\Delta C^{\star}]\|}{\|\Delta C^{\star}\|} = \frac{1}{\varepsilon}\,\frac{\|M(\varepsilon)u\|}{\|u\|} = \frac{\|M(\varepsilon)\|_{\mathrm{op}}}{\varepsilon},$$

thus achieving the claimed value.

Finally, to obtain a useful uniform-in-$\varepsilon$ lower bound, note that by the minimal-gain inequality for compositions,

$$\|M(\varepsilon)\|_{\mathrm{op}} \geq \sigma_{\min}(R_S)\,\sigma_{\min}\big(A_S(\varepsilon)^{-1}\big)\,\sigma_{\min}(B_S) = \frac{\sigma_{\min}(R_S)\,\sigma_{\min}(B_S)}{\|A_S(\varepsilon)\|_{\mathrm{op}}}.$$

Under the uniform spectral lower bound $\sigma_{\min}(A_S(\varepsilon)) \geq a_{\min} > 0$ we have $\|A_S(\varepsilon)^{-1}\| \leq 1/a_{\min}$; hence if $\|R_S\|$ and $\|B_S\|$ are $O(1)$ (and bounded below away from zero on the relevant directions), then $\|M(\varepsilon)\|_{\mathrm{op}}$ is bounded below by a positive constant $c_0 > 0$ independent of $\varepsilon$. Consequently $\|DS_{\varepsilon}(C)\|_{\mathrm{op}} \geq c_0/\varepsilon$ for sufficiently small $\varepsilon$.

This completes the proof. ∎
Remark.
- The operator $M(\varepsilon) = R_S A_S(\varepsilon)^{-1} B_S$ is explicitly constructible from the linearization matrices implicit in the fixed-point equations; thus $c(\varepsilon) = \|M(\varepsilon)\|_{\mathrm{op}}$ can be computed numerically in practice (compute $A_S(\varepsilon)$ from the linearized active system, invert it numerically, compose with $B_S$ and $R_S$, and then take the top singular value), as in the sketch after this remark. In the submission we follow this recipe to produce the reported numerical estimates.
- A simple explicit test perturbation that often suffices is to pick $\Delta C$ supported on a single (or a pair of) active entries: for an active index $(i,j)\in S$, choose $\Delta C$ equal to the corresponding canonical basis vector; typically $B_S[\Delta C] \neq 0$, and through the invertible $A_S(\varepsilon)$ and nontrivial $R_S$ this produces a nonzero $\Delta P_S$, certifying $M(\varepsilon) \neq 0$.
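A small-$n$ sketch of this recipe in the dense full-support case (reusing the hypothetical `primal_response` helper from Appendix A.3; materializing the map column by column is only feasible at illustration sizes):

```python
# Estimate ||DS_eps(C)||_op = ||M(eps)||_op / eps by materializing the
# sensitivity map dC -> dP against all canonical basis perturbations
# and taking the top singular value.
import numpy as np

def sensitivity_opnorm(P, eps):
    n = P.shape[0]
    DS = np.zeros((n * n, n * n))
    for k in range(n * n):
        E = np.zeros(n * n)
        E[k] = 1.0                                  # canonical perturbation e_k
        DS[:, k] = primal_response(P, E.reshape(n, n), eps).ravel()
    return np.linalg.svd(DS, compute_uv=False)[0]   # top singular value
```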
Lemma A.8 (Resolvent bound and spectral-distance consequence).

Let $\Phi(P, C)$ be the Sinkhorn fixed-point mapping in the active subspace (so that a fixed point $P^{\star}$ satisfies $P^{\star} = \Phi(P^{\star}, C)$), and denote by

$$J \coloneqq D_P\Phi(P^{\star}, C)$$

the Jacobian of $\Phi$ with respect to $P$ evaluated at the fixed point. Assume $I-J$ is invertible (equivalently $1 \notin \sigma(J)$). Define the linear operator

$$DS_{\varepsilon}(C) = \frac{\partial P^{\star}}{\partial C},$$

i.e. the sensitivity of the fixed point with respect to $C$ (the dependence on $\varepsilon$ is implicit). Let

$$\partial_C\Phi \coloneqq D_C\Phi(P^{\star}, C)$$

be the derivative of $\Phi$ with respect to $C$ at the fixed point, and denote by $\|\cdot\|$ the operator norm induced by the Euclidean norm (or any fixed matrix/operator norm).

Then the following hold.
1. (Identity) The implicit-differentiation identity

$$DS_{\varepsilon}(C) = (I-J)^{-1}\,\partial_C\Phi$$

is valid.

2. (Norm factorization) Consequently,

$$\|DS_{\varepsilon}(C)\| \leq \|(I-J)^{-1}\|\,\|\partial_C\Phi\|.$$

3. (Resolvent lower bound) Let $\sigma(J)$ denote the spectrum of $J$, and define the spectral distance

$$\mathrm{dist}\big(1, \sigma(J)\big) = \min_{\lambda\in\sigma(J)}|1-\lambda|.$$

Then

$$\|(I-J)^{-1}\| \geq \frac{1}{\mathrm{dist}\big(1, \sigma(J)\big)}.$$

Hence a divergence of $\|DS_{\varepsilon}(C)\|$ forces $\mathrm{dist}(1, \sigma(J)) \to 0$.

4. (Upper bound under diagonalizability / conditioning) If, additionally, $J$ is diagonalizable, i.e. $J = V\Lambda V^{-1}$ with $\Lambda = \operatorname{diag}(\lambda_i)$, then

$$\|(I-J)^{-1}\| = \big\|V(I-\Lambda)^{-1}V^{-1}\big\| \leq \kappa(V)\,\max_i\frac{1}{|1-\lambda_i|} = \frac{\kappa(V)}{\mathrm{dist}\big(1, \sigma(J)\big)},$$

where $\kappa(V) = \|V\|\,\|V^{-1}\|$ is the condition number of the modal matrix $V$. In particular, if $J$ is normal then $\kappa(V) = 1$ and the equality $\|(I-J)^{-1}\| = 1/\mathrm{dist}(1, \sigma(J))$ holds.

5. (Consequence for $\varepsilon$-scalings) Suppose there exist constants $M_{\Phi} > 0$ and $c > 0$ (independent of small $\varepsilon$) such that

$$\|\partial_C\Phi\| \leq M_{\Phi} \qquad\text{and}\qquad \|DS_{\varepsilon}(C)\| \geq \frac{c}{\varepsilon}$$

for sufficiently small $\varepsilon > 0$. Then from items (2) and (3) we obtain the quantitative bound

$$\mathrm{dist}\big(1, \sigma(J)\big) \leq \frac{M_{\Phi}}{\|DS_{\varepsilon}(C)\|} \leq \frac{M_{\Phi}}{c}\,\varepsilon.$$

Moreover, if $J$ is diagonalizable with uniformly bounded condition number $\kappa(V) \leq \kappa_{\max}$, then

$$\min_i|1-\lambda_i| = \mathrm{dist}\big(1, \sigma(J)\big) \leq \frac{M_{\Phi}}{c}\,\varepsilon,$$

and in particular the spectral radius satisfies

$$1-\max_i|\lambda_i| = 1-\lambda_{\max} \leq C\,\varepsilon$$

for some constant $C$ depending only on $M_{\Phi}$, $c$ and (if used) $\kappa_{\max}$.
Proof.

We give a self-contained proof of each item.

(Identity). Differentiate the fixed-point equation $F(P, C) \coloneqq P - \Phi(P, C) = 0$ with respect to $C$ at the point $(P^{\star}, C)$ in the direction $\Delta C$. The total derivative yields

$$D_P F\,[\Delta P] + D_C F\,[\Delta C] = 0,$$

i.e.

$$(I - D_P\Phi)\,\Delta P - \partial_C\Phi[\Delta C] = 0.$$

Rearranging gives

$$\Delta P = (I-J)^{-1}\,\partial_C\Phi[\Delta C],$$

and since this holds for arbitrary $\Delta C$, the identity $DS_{\varepsilon}(C) = (I-J)^{-1}\partial_C\Phi$ follows.

(Norm factorization). Taking operator norms of the identity gives

$$\|DS_{\varepsilon}(C)\| = \|(I-J)^{-1}\partial_C\Phi\| \leq \|(I-J)^{-1}\|\,\|\partial_C\Phi\|,$$

which proves item (2).

(Resolvent lower bound). Recall that for any invertible matrix $A$ the operator norm of its inverse equals the reciprocal of the smallest singular value:

$$\|A^{-1}\| = \frac{1}{\sigma_{\min}(A)}.$$

Apply this with $A = I-J$ to obtain $\|(I-J)^{-1}\| = 1/\sigma_{\min}(I-J)$. For any eigenvalue $\lambda\in\sigma(J)$ and any unit eigenvector $v$ for that eigenvalue (i.e. $Jv = \lambda v$ with $\|v\| = 1$), we have

$$\|(I-J)v\| = |1-\lambda|.$$

Therefore, by the variational characterization of the smallest singular value,

$$\sigma_{\min}(I-J) = \min_{\|x\|=1}\|(I-J)x\| \leq \min_{\lambda\in\sigma(J)}|1-\lambda| = \mathrm{dist}\big(1, \sigma(J)\big).$$

Taking reciprocals yields

$$\|(I-J)^{-1}\| = \frac{1}{\sigma_{\min}(I-J)} \geq \frac{1}{\mathrm{dist}\big(1, \sigma(J)\big)},$$

which proves item (3). This inequality is non-asymptotic and requires no normality or diagonalizability hypotheses.

(Upper bound under diagonalizability). If $J = V\Lambda V^{-1}$ with $\Lambda = \operatorname{diag}(\lambda_i)$ then

$$(I-J)^{-1} = V(I-\Lambda)^{-1}V^{-1}.$$

Hence

$$\|(I-J)^{-1}\| \leq \|V\|\,\|V^{-1}\|\,\|(I-\Lambda)^{-1}\| = \kappa(V)\,\max_i\frac{1}{|1-\lambda_i|} = \frac{\kappa(V)}{\mathrm{dist}\big(1, \sigma(J)\big)}.$$

If $J$ is normal then $V$ can be chosen unitary, $\kappa(V) = 1$, and the expression is exact: $\|(I-J)^{-1}\| = 1/\mathrm{dist}(1, \sigma(J))$.

(Consequence for $\varepsilon$-scalings). Combining items (2) and (3) we have

$$\|DS_{\varepsilon}(C)\| \leq \|(I-J)^{-1}\|\,\|\partial_C\Phi\| \leq \frac{\|\partial_C\Phi\|}{\mathrm{dist}(1, \sigma(J))}.$$

Rearranging gives

$$\mathrm{dist}\big(1, \sigma(J)\big) \leq \frac{\|\partial_C\Phi\|}{\|DS_{\varepsilon}(C)\|}.$$

Under the assumptions $\|\partial_C\Phi\| \leq M_{\Phi}$ and $\|DS_{\varepsilon}(C)\| \geq c/\varepsilon$, the displayed inequality becomes

$$\mathrm{dist}\big(1, \sigma(J)\big) \leq \frac{M_{\Phi}}{c}\,\varepsilon,$$

which proves the stated $O(\varepsilon)$ bound. If $J$ is diagonalizable with conditioning $\kappa(V) \leq \kappa_{\max}$, then the upper bound in item (4) implies the same proportional dependence for the individual eigenvalue gaps, and hence an $O(\varepsilon)$ bound on $1-\lambda_{\max}$ up to the multiplicative factor $\kappa_{\max}$.

This completes the proof. ∎
Remark.

The chain of inequalities above makes manifest two facts that are important for interpreting the sensitivity blow-up:
- The inequality $\|(I-J)^{-1}\| \geq 1/\mathrm{dist}(1, \sigma(J))$ always holds, and therefore any divergence of $\|DS_{\varepsilon}\|$ forces the spectrum of $J$ to approach the point $1$ in the complex plane (the spectral distance goes to zero).
- To upgrade the spectral-distance statement into a statement about the largest eigenvalue (for instance, to conclude $1-\lambda_{\max} \sim C\varepsilon$), one needs extra regularity such as bounded diagonalization conditioning $\kappa(V)$ or near-normality of $J$. In non-normal cases pseudospectral effects can make $\|(I-J)^{-1}\|$ much larger than $1/\mathrm{dist}(1, \sigma(J))$, so numerical spectral/pseudospectral diagnostics are recommended to validate any stronger asymptotic claim in practice.
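A minimal pseudospectral diagnostic along these lines (a sketch: `J` is any numerical estimate of the Sinkhorn Jacobian, and the grid bounds are arbitrary illustrative choices):

```python
# Evaluate the resolvent norm ||(zI - J)^{-1}|| = 1 / sigma_min(zI - J) on a
# grid near z = 1; level sets at 1/eps trace the eps-pseudospectrum, and
# large values far from the eigenvalues flag non-normal transient growth.
import numpy as np

def resolvent_grid(J, re=(0.7, 1.3), im=(-0.3, 0.3), m=60):
    I = np.eye(J.shape[0])
    xs, ys = np.linspace(*re, m), np.linspace(*im, m)
    R = np.empty((m, m))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            smin = np.linalg.svd((x + 1j * y) * I - J, compute_uv=False)[-1]
            R[i, j] = 1.0 / smin
    return xs, ys, R
```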
Lemma A.9 (Explicit bound on second-order remainder).

Let $\mathcal{H}(\varepsilon) = D^2\mathcal{S}_{\varepsilon}$ denote the Hessian tensor of the Sinkhorn map with respect to the cost matrix $C$ (and the parameter $\varepsilon$). Under Assumption A.1, for all $0 < \varepsilon \leq \varepsilon_0$,

$$\|D^2\mathcal{S}_{\varepsilon}\|_{\mathrm{op}} \leq \frac{K_{\mathrm{quad}}}{\varepsilon^2},$$

where $K_{\mathrm{quad}}$ depends on the active support size $|S|$, the non-degeneracy bound $\eta$, and the condition number $M$ from Lemma A.5. Consequently, the Taylor remainder for a perturbation $\delta = (\delta C, \delta\varepsilon)$ satisfies $\|\mathcal{R}(\delta)\| \leq \frac{K}{\varepsilon^2}\|\delta\|^2$.
Proof.

We derive the bound by explicitly differentiating the Jacobian operator derived in Appendix A.3. Recall that the first variation $\dot{P} = D\mathcal{S}_{\varepsilon}[\dot{C}]$ is determined by the linear system on the active support:

$$\mathcal{A}_S(\varepsilon)\begin{pmatrix}\dot{f}\\ \dot{g}\end{pmatrix} = \frac{1}{\varepsilon}\mathcal{B}(\dot{C}), \tag{16}$$

where $\mathcal{A}_S(\varepsilon)$ involves blocks of $P^{\varepsilon}$. To find the second derivative (Hessian), we differentiate (16) again with respect to the parameters. Let $\partial$ denote the differentiation operator. Applying the product rule to $\mathcal{A}_S\mathbf{z} = \mathbf{b}$ (where $\mathbf{z}$ are the dual potentials):

$$\mathcal{A}_S(\partial\mathbf{z}) + (\partial\mathcal{A}_S)\mathbf{z} = \partial\mathbf{b}.$$

Rearranging for the second-order variation $\partial\mathbf{z}$:

$$\partial\mathbf{z} = \mathcal{A}_S^{-1}\left(\partial\mathbf{b} - (\partial\mathcal{A}_S)\mathbf{z}\right).$$

We now bound the norms of each term on the right-hand side:
- Invertibility: by Lemma A.5, $\|\mathcal{A}_S^{-1}\|_{\mathrm{op}} \leq M$.
- First-order scaling: from Lemma A.7, the first-order potentials scale as $\|\mathbf{z}\| \sim O(1/\varepsilon)$.
- Derivative of the operator $\partial\mathcal{A}_S$: the matrix $\mathcal{A}_S$ contains the entries $P_{ij}^{\varepsilon}$ and $0$. Since $P_{ij}^{\varepsilon} = \exp((f_i + g_j - C_{ij})/\varepsilon)$, its derivative is:

$$\partial P_{ij}^{\varepsilon} = \frac{1}{\varepsilon}P_{ij}^{\varepsilon}(\partial f_i + \partial g_j - \partial C_{ij}) - \frac{1}{\varepsilon^2}P_{ij}^{\varepsilon}(\dots)\,\delta\varepsilon.$$

Crucially, differentiating the exponential map introduces a factor of $1/\varepsilon$. Thus, the operator derivative scales as:

$$\|\partial\mathcal{A}_S\|_{\mathrm{op}} \leq \frac{C_1}{\varepsilon}\|P^{\varepsilon}\|_{\mathrm{op}} \leq \frac{C_1 p_{\max}}{\varepsilon}.$$

Combining these factors in the expression for $\partial\mathbf{z}$:

$$\|\partial\mathbf{z}\| \leq M\left(O(1/\varepsilon^2) + \underbrace{\|\partial\mathcal{A}_S\|}_{\sim 1/\varepsilon}\cdot\underbrace{\|\mathbf{z}\|}_{\sim 1/\varepsilon}\right) \leq \frac{K'}{\varepsilon^2}.$$

Finally, the Hessian of the primal plan $P$ involves $\partial\mathbf{z}$ (a term of order $1/\varepsilon^2$) and products of first derivatives $(\dot{f}\dot{g})/\varepsilon^2$. Both contributions scale as $O(1/\varepsilon^2)$. The bound relies explicitly on $M$ staying finite (Assumption A.1), ensuring the $1/\varepsilon^2$ explosion is not "cancelled out" by a vanishing inverse. ∎
A.6 Refined Sensitivity Analysis: The $O(\epsilon^{-1})$ Scaling

In this section, we provide a rigorous derivation of the sensitivity of the optimal transport plan $P^*_{\epsilon}$ with respect to the regularization parameter $\epsilon$. We explicitly address why the sensitivity scales as $O(\epsilon^{-1})$ rather than the $O(\epsilon^{-2})$ scaling one might expect from the Gibbs kernel derivative.
Lemma A.10 (Thermodynamic Sensitivity Scaling).

Let $P^*_{\epsilon}$ be the unique solution to the entropy-regularized optimal transport problem with cost matrix $C$. Under the non-degeneracy assumption (Assumption A.1), the Frobenius norm of the derivative of the optimal plan with respect to temperature satisfies:

$$\left\|\frac{\partial P^*_{\epsilon}}{\partial\epsilon}\right\|_F = \Theta(\epsilon^{-1}) \quad\text{as } \epsilon\to 0. \tag{17}$$

Proof.

Recall the primal-dual relationship for the optimal plan entries on the active support:

$$P^*_{ij}(\epsilon) = \exp\left(\frac{f_i(\epsilon) + g_j(\epsilon) - C_{ij}}{\epsilon}\right), \tag{18}$$

where $f(\epsilon), g(\epsilon)$ are the optimal dual potentials. Taking the logarithm of both sides:

$$\log P^*_{ij}(\epsilon) = \frac{f_i(\epsilon) + g_j(\epsilon) - C_{ij}}{\epsilon}. \tag{19}$$

Differentiating both sides with respect to $\epsilon$ using the total derivative:

$$\frac{1}{P^*_{ij}}\frac{\partial P^*_{ij}}{\partial\epsilon} = \frac{(\dot{f}_i + \dot{g}_j)\epsilon - (f_i + g_j - C_{ij})}{\epsilon^2}, \tag{20}$$

where $\dot{f} := \partial f/\partial\epsilon$. Rearranging terms and substituting $\frac{f_i + g_j - C_{ij}}{\epsilon} = \log P^*_{ij}$:

$$\frac{\partial P^*_{ij}}{\partial\epsilon} = \frac{P^*_{ij}}{\epsilon}\left[(\dot{f}_i(\epsilon) + \dot{g}_j(\epsilon)) - \log P^*_{ij}(\epsilon)\right]. \tag{21}$$

We now analyze the asymptotic magnitude of each term in the bracket as $\epsilon\to 0$:

The log-probability term ($\log P^*_{ij}$): on the active support $S$ (Assumption A.1), the mass $P^*_{ij}$ converges to a strictly positive constant (the solution to the unregularized linear program). Thus, $\log P^*_{ij} = O(1)$. (For inactive entries, $P^*_{ij}\to 0$ exponentially fast, making the derivative vanish regardless of the polynomial factor.)

The dual potential derivatives ($\dot{f}, \dot{g}$): according to the asymptotic expansion theory for entropic optimal transport (Cominetti and San Martín, 1994), the dual potentials admit a Taylor expansion of the form:

$$f_i(\epsilon) = \phi_i + \epsilon\cdot h_i + O(\epsilon^2), \tag{22}$$

where $\phi_i$ are the Kantorovich potentials of the unregularized problem. Differentiating this expansion with respect to $\epsilon$ yields:

$$\dot{f}_i(\epsilon) = h_i + O(\epsilon) = O(1). \tag{23}$$

The same holds for $g_j(\epsilon)$. Consequently, the term $(\dot{f}_i + \dot{g}_j)$ is bounded by a constant $O(1)$.

Conclusion: substituting these scalings back into Eq. (21):

$$\frac{\partial P^*_{ij}}{\partial\epsilon} = \underbrace{\frac{P^*_{ij}}{\epsilon}}_{O(\epsilon^{-1})}\cdot\underbrace{\left[O(1) - O(1)\right]}_{O(1)} = O(\epsilon^{-1}). \tag{24}$$

Thus, the sensitivity is dominated by the $1/\epsilon$ factor. ∎
Remark A.11 (Why not $O(\epsilon^{-2})$?).

It is tempting to assume the sensitivity scales as $O(\epsilon^{-2})$ by inspecting the Gibbs kernel $K_{ij} = e^{-C_{ij}/\epsilon}$, whose derivative is $\frac{C_{ij}}{\epsilon^2}K_{ij}$. However, this view ignores the marginal constraints. The dual potentials $f(\epsilon)$ and $g(\epsilon)$ are adaptive; they shift specifically to counteract the mass displacement caused by the changing kernel. Mathematically, this is represented by the subtraction of the $\log P^*_{ij}$ term in Eq. (21), which effectively cancels the higher-order singularity, reducing the divergence from $\epsilon^{-2}$ to $\epsilon^{-1}$.
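A quick numerical sanity check of the $\Theta(\epsilon^{-1})$ law (a sketch assuming the `sinkhorn_log` routine from Appendix A.1; the problem size, seed, and finite-difference step are arbitrary):

```python
# Central finite differences in eps: if Lemma A.10 holds, the product
# eps * ||dP*/d eps||_F should level off at a constant as eps shrinks.
import numpy as np

rng = np.random.default_rng(0)
C = rng.random((16, 16))
for eps in (0.2, 0.1, 0.05, 0.025):
    h = 1e-3 * eps
    P_hi, _, _ = sinkhorn_log(C, eps + h)
    P_lo, _, _ = sinkhorn_log(C, eps - h)
    sens = np.linalg.norm((P_hi - P_lo) / (2 * h))   # ~ ||dP*/d eps||_F
    print(f"eps={eps:.3f}  eps*||dP/deps||_F={eps * sens:.4f}")
```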
A.7 Practical constant estimation recipe

To report instance-specific constants in experiments (see the sketch after this list):
1. For each $\varepsilon$ of interest compute $P^{\varepsilon}$ and evaluate $p(\varepsilon) = \|P^{\varepsilon}\|_{\mathrm{op}}$ (via SVD).
2. Form the block matrix $\mathcal{A}(\varepsilon)$ and evaluate the minimal eigenvalue numerically to get $\lambda_{\min}(\mathcal{A})$ and thus $M_{\mathrm{num}}(\varepsilon) = 1/\lambda_{\min}(\mathcal{A})$.
3. Compute the forcing vector $\mathbf{m}_{kl}$ for representative $(k,l)\in S$ (or take a supremum over $(k,l)\in S$), and evaluate numerically $\|\mathcal{A}(\varepsilon)^{-1}\mathbf{m}_{kl}\|_2$.
4. Plug the numeric quantities into the formulas in Lemma A.9 to obtain a concrete $C_1(\eta, n)$ and check the associated separation condition for $C_1(\eta, n)$.
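An illustrative implementation of steps 1-2 (a sketch; the gauge direction is deflated by a rank-one shift, our choice, so that $\lambda_{\min}$ is taken on the gauge-fixed subspace):

```python
# Steps 1-2 of the recipe: p(eps) = ||P^eps||_op via SVD, and
# M_num(eps) = 1 / lambda_min(A) on the gauge-fixed subspace.
import numpy as np

def instance_constants(P):
    p = np.linalg.svd(P, compute_uv=False)[0]             # step 1: ||P^eps||_op
    r, c = P.sum(axis=1), P.sum(axis=0)
    A = np.block([[np.diag(r), P], [P.T, np.diag(c)]])    # Eq. (11)
    v = np.concatenate([np.ones_like(r), -np.ones_like(c)])
    v /= np.linalg.norm(v)                                # kernel direction of A
    lam_min = np.linalg.eigvalsh(A + np.outer(v, v))[0]   # deflated lambda_min
    return p, lam_min, 1.0 / lam_min                      # step 2: M_num(eps)
```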
A.8
Experimental and numerical notes (reproducibility) We preserve and expand the implementation guidance:
Log-domain and stability.
For very small ε \varepsilon compute Sinkhorn in the log-domain using log-sum-exp to avoid kernel underflow. Keep track of numerical tolerances and the Sinkhorn maximum iteration count.
Implicit differentiation vs finite difference.
For Jacobian estimates, use the implicit-differentiation linear solves described above (cost $\approx O(n^{3})$ per right-hand side) rather than naive finite differences ($O(n^{4})$ total in naive implementations). Use a robust linear solver (LU with pivoting, or an iterative solver with a preconditioner) for the systems involving $\mathcal{A}(\varepsilon)$; a sketch of the factor-once pattern follows.
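The intended pattern, sketched below with a stand-in symmetric positive-definite matrix in place of the actual $\mathcal{A}(\varepsilon)$: factor once, then reuse the factorization across all right-hand sides $\mathbf{m}_{kl}$.

```python
# One LU factorization (O(n^3)) amortized over many right-hand sides
# (O(n^2) each), versus one extra Sinkhorn solve per finite difference.
# The matrix below is a placeholder; substitute the actual A(eps).
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(3)
B = rng.random((20, 20))
A = B @ B.T + np.eye(20)              # placeholder SPD system matrix
rhs = rng.random((5, 20))             # five forcing vectors m_kl

lu, piv = lu_factor(A)                # LU with partial pivoting, done once
X = np.stack([lu_solve((lu, piv), m) for m in rhs])
print(np.allclose(X @ A.T, rhs))      # verify A x = m for each rhs
```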
Repetition and reporting.
Report means and standard deviations across multiple random seeds (we recommend 5–10) when presenting empirical scaling laws for the $\varepsilon$-dependence.
A.9
Raw numeric estimates and CSV
Include the CSV with columns eps, op_norm, C0_est, dS_de_norm, K1_est, K2_est, rho_mean, rho_std. When reproducing the tables in the supplement, make sure the CSV file name used matches the distributed file.
A.10
Reproducibility checklist
• Code and CSV for numeric tables and plots included in the supplement.
• Random seeds and environment (Python/NumPy versions, BLAS/LAPACK backend) recorded in the repository.
• Implementation notes: whether the log-domain or direct-kernel solver was used, the tolerance and maximum-iteration values for Sinkhorn, and whether implicit differentiation or finite differences were used for Jacobian estimates.
Appendix B
Proof of the Thermodynamic Speed Limit
In this section, we provide the formal proofs of Theorem 3.2 and Corollary 3.3, establishing the $O(\epsilon^{2})$ scaling law required to prevent premature mode collapse.
B.1
Error Dynamics Decomposition
Let $P^{*}_{t}$ denote the optimal plan at step $t$ (corresponding to temperature $\epsilon_{t}$). The tracking error $e_{t}=P_{t}-P^{*}_{t}$ evolves according to the discrete dynamics:
$$\begin{aligned}
e_{t+1} &= P_{t+1}-P^{*}_{t+1} \\
&= \mathcal{S}_{\epsilon_{t+1}}(P_{t})-P^{*}_{t+1} \\
&\approx P^{*}_{t+1}+J_{t+1}(P_{t}-P^{*}_{t+1})-P^{*}_{t+1} \quad\text{(linearization)} \\
&= J_{t+1}(P_{t}-P^{*}_{t}+P^{*}_{t}-P^{*}_{t+1}) \\
&= J_{t+1}e_{t}+J_{t+1}\underbrace{(P^{*}_{t}-P^{*}_{t+1})}_{\text{Drift }\Delta_{t}}
\end{aligned}$$
(25)
where $J_{t+1}$ is the Jacobian of the Sinkhorn map at the local optimum. The term $\Delta_{t}$ represents the shift of the target distribution between iterations due to the temperature change $\delta_{t}=\epsilon_{t}-\epsilon_{t+1}$.
B.2
Lemma: Equilibrium Error Bound
Lemma B.1.
Consider the recurrence $e_{t+1}=Je_{t}+u$, where $J$ is a contraction matrix ($\rho(J)<1$) and $u$ is a fixed forcing term. Then the equilibrium error satisfies
$$\|e_{\infty}\|\leq\|(I-J)^{-1}\|\,\|u\|$$
(26)
Proof.
Unrolling the recurrence yields $e_{k}=J^{k}e_{0}+\sum_{i=0}^{k-1}J^{i}u$; since $\rho(J)<1$, the homogeneous term vanishes as $k\to\infty$ and the sum converges to the Neumann series $\sum_{i=0}^{\infty}J^{i}=(I-J)^{-1}$. Applying operator-norm submultiplicativity yields the bound. ∎
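A toy numerical check of the lemma (our own sketch; $J$ here is a random contraction, not the Sinkhorn Jacobian):

```python
# Iterate e_{t+1} = J e_t + u and compare the limit with the
# Neumann-series expression and the operator-norm bound of Lemma B.1.
import numpy as np

rng = np.random.default_rng(4)
n = 10
J = rng.random((n, n))
J *= 0.9 / np.max(np.abs(np.linalg.eigvals(J)))   # rescale so rho(J) = 0.9
u = rng.random(n)

e = np.zeros(n)
for _ in range(500):
    e = J @ e + u                                  # unrolled recurrence

e_star = np.linalg.solve(np.eye(n) - J, u)         # (I - J)^{-1} u
bound = np.linalg.norm(np.linalg.inv(np.eye(n) - J), 2) * np.linalg.norm(u)
print(np.allclose(e, e_star), np.linalg.norm(e) <= bound)
```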
B.3
Proof of Theorem 3.2
We apply Lemma B.1 to the Sinkhorn error dynamics derived above.
Quantifying the Drift ($\Delta_{t}$): Using a first-order Taylor expansion and the sensitivity result from Proposition 3.1 (Item 1), the distributional drift is:
$$\|\Delta_{t}\|=\|P^{*}_{t}-P^{*}_{t+1}\|\approx\|\nabla_{\epsilon}P^{*}\|\cdot|\epsilon_{t}-\epsilon_{t+1}|=\Theta(\epsilon_{t}^{-1})\cdot\delta_{t}$$
(27)
Quantifying the Resolvent ($\|(I-J)^{-1}\|$): From Proposition 3.1 (Item 3), the vanishing spectral gap dictates the resolvent growth:
$$\|(I-J_{\epsilon})^{-1}\|=\Theta(\epsilon_{t}^{-1})$$
(28)
Note: In the non-normal regime, pseudospectral effects may cause this term to grow even faster, but $\Theta(\epsilon^{-1})$ serves as a conservative lower bound for the necessary condition.
Combining Terms: Substituting these into the bound from Lemma B.1:
$$\|e_{ss}\|\leq\Theta(\epsilon_{t}^{-1})\times\left(\Theta(\epsilon_{t}^{-1})\cdot\delta_{t}\right)=\Theta(\epsilon_{t}^{-2})\cdot\delta_{t}$$
(29)
Stability Condition: To prevent the trajectory from escaping the basin of attraction (Mode Collapse), the steady-state error $\|e_{ss}\|$ must remain smaller than the linearization basin radius $R$. Assuming $R$ shrinks slowly or is $O(1)$:
$$\Theta(\epsilon_{t}^{-2})\cdot\delta_{t}\leq R\implies\delta_{t}\leq O(\epsilon_{t}^{2}R)$$
(30)
This confirms that the step size $\delta_{t}$ must scale quadratically with $\epsilon$ to maintain tracking. ∎
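In code, an admissible schedule obeying this law can be sketched as follows (a minimal illustration; the constants $c$, $R$ and the stopping temperature are ours, not the tuned EPH-ASC values):

```python
# Scheduler sketch implied by Theorem 3.2: cap each temperature step at
# delta_t <= c * R * eps_t^2 (quadratic in eps).
def quadratic_anneal(eps0, eps_min, c=1.0, R=1.0):
    """Yield an annealing schedule obeying the speed limit delta_t = O(eps^2 R)."""
    eps = eps0
    while eps > eps_min:
        yield eps
        eps -= c * R * eps ** 2        # admissible step, quadratic in eps
    yield eps_min

schedule = list(quadratic_anneal(eps0=1.0, eps_min=0.01, c=0.5))
print(len(schedule), schedule[:3], schedule[-1])
```

Note the asymptotic behavior: the continuous-time analogue $\dot{\epsilon}=-cR\epsilon^{2}$ gives $\epsilon_{t}\sim 1/(cRt)$, i.e., a harmonic-type decay rather than an exponential one.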
B.4
Proof of Corollary 3.3
Assume a standard exponential annealing schedule $\epsilon_{t+1}=\alpha\epsilon_{t}$ with decay rate $0<\alpha<1$. The corresponding per-step decrement is:
$$\delta_{t}=\epsilon_{t}-\alpha\epsilon_{t}=(1-\alpha)\epsilon_{t}$$
(31)
Substituting this actual step size into the stability ratio derived in the proof of Theorem 3.2:
$$\frac{\|e_{ss}\|}{R}\propto\frac{\epsilon_{t}^{-2}\cdot(1-\alpha)\epsilon_{t}}{R}=\frac{1-\alpha}{R\cdot\epsilon_{t}}$$
(32)
Analyzing the limit as $\epsilon_{t}\to 0$, we see that this ratio diverges to infinity:
$$\lim_{\epsilon_{t}\to 0}\frac{\|e_{ss}\|}{R}=\infty$$
(33)
Thus, for any fixed decay rate $\alpha$ and basin radius $R$, there exists a critical temperature $\epsilon_{\text{crit}}$ below which the tracking error inevitably exceeds the basin capacity $R$, triggering premature collapse. ∎
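For intuition, absorbing all $\Theta$-constants into 1, the ratio exceeds unity once $\epsilon_{t}<(1-\alpha)/R$; the toy scan below (our own illustration with arbitrary constants) locates this critical temperature:

```python
# Corollary 3.3 illustration: under eps_{t+1} = alpha * eps_t the ratio
# ||e_ss|| / R grows like (1 - alpha) / (R * eps_t); find where it
# first exceeds 1 (Theta-factors absorbed into 1 for this toy scan).
alpha, R = 0.95, 0.1
eps = 1.0
for t in range(1000):
    ratio = (1 - alpha) / (R * eps)
    if ratio > 1.0:
        print(f"critical temperature reached at t={t}, eps={eps:.4f}")
        break
    eps *= alpha
```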
| Main-text claim | Appendix reference (proof / technical details) |
| --- | --- |
| Sensitivity of the Sinkhorn fixed point scales as $O(1/\varepsilon)$ | Appendix A.3, Eq. (12); Lemma A.7 (Appendix A.5) |
| Existence of a stable active support set under small $\varepsilon$ | Assumption A.1 (Appendix A.1) |
| Reduction to the active Schur-complement operator $\mathcal{A}_{S}(\varepsilon)$ | Appendix A.4; Lemma A.5 |
| Implicit differentiation identity $DS_{\varepsilon}=(I-J_{\varepsilon})^{-1}\partial_{C}\Phi$ | Lemma A.8 (Appendix A.5) |
| Resolvent norm $\|(I-J_{\varepsilon})^{-1}\|$ must diverge as $\varepsilon\to 0$ | Lemma A.8; quantitative constants in Appendix A.7 |
| Non-normal transient amplification despite spectral contraction | Theorem A.2; full proof in Appendix A.2 |
| Effective basin shrinkage by modal condition number $\kappa(V)$ | Appendix A.2, transient growth bound |
| Discrete-time tracking formulation for annealing | Appendix B, Section "Discrete Tracking Model" |
| Existence of a dynamical annealing speed limit | Theorem 3.2; proof and constants in Appendix B |
| $O(1)$ per-step drift under exponential schedule $\varepsilon_{t+1}=\alpha\varepsilon_{t}$ | Lemma A.7; discussion in Appendix A.5 |
| Second-order remainder controlled by $O(1/\varepsilon^{2})$ Hessian bound | Lemma A.9 (Appendix A.5) |
| Definition and computation of numerical sensitivity constant $M_{\mathrm{num}}(\varepsilon)$ | Appendix A.7, Section "Numerical Constants" |
| Justification of QSA drift estimation via implicit differentiation | Appendix A.3; error control in Appendix B |
| ASC admissible drift threshold $\|\Delta_{t}\|\leq\frac{1-\rho(J_{\varepsilon})}{\kappa(V)}R$ | Appendix B, stability inequality derivation |
| Practical choice of $(\alpha_{0},\beta,R)$ in ASC | Appendix A.7 |
| Failure of linear diagnostics under support bifurcation | Appendix A.1, discussion following Assumption A.1 |

Table 2: Alignment between main-text claims and appendix results. Each nontrivial statement in the main paper is backed by an explicit lemma, theorem, or derivation in the appendix.
References
[1] J. Altschuler, J. Weed, and P. Rigollet (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 30.
[2] M. Blondel, V. Séguy, and A. Rolet (2018). Smooth and sparse optimal transport. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 880–889.
[3] L. Chizat (2024). Annealed Sinkhorn for optimal transport: convergence, regularization path and debiasing. arXiv preprint arXiv:2408.11620.
[4] R. Cominetti and J. San Martín (1994). Asymptotic analysis of the exponential penalty trajectory in linear programming. Mathematical Programming 67(1–3), pp. 169–187.
[5] M. Cuturi (2013). Sinkhorn distances: lightspeed computation of optimal transport. Advances in Neural Information Processing Systems (NeurIPS) 26.
[6] DeepSeek-AI, W. Liang, et al. (2025). mHC: manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880.
[7] M. Eisenberger, A. Toker, L. Leal-Taixé, F. Bernard, and D. Cremers (2022). A unified framework for implicit Sinkhorn differentiation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 509–518.
[8] A. Genevay, G. Peyré, and M. Cuturi (2018). Learning generative models with Sinkhorn divergences. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1608–1617.
[9] G. Peyré and M. Cuturi (2019). Computational Optimal Transport: With Applications to Data Science. Now Publishers.
[10] R. Sinkhorn and P. Knopp (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics 21(2), pp. 343–348.
[11] L. N. Trefethen and M. Embree (2005). Spectra and Pseudospectra: The Behavior of Nonnormal Matrices and Operators. Princeton University Press.
[12] J. Weed (2018). An explicit analysis of the entropic penalty in linear programming. In Proceedings of the 31st Conference on Learning Theory (COLT), pp. 1841–1855.
[13] D. Zhu, H. Huang, Z. Huang, Y. Zeng, Y. Mao, B. Wu, Q. Min, and X. Zhou (2024). Hyper-connections. arXiv preprint arXiv:2409.19606.