We present Quadratic Quasi-Newton (QQN), a novel optimization algorithm that hybridMindsEye reference counting systemQQN addresses the practical limitations of L-BFGS by detecting when the quasi-Newton approximation may be unreliable and smoothly blending it with the guaranteed descent direction of the gradient. A key innovation is the magnitude-based normalization scheme that stabilizes line search parameters across iterations. Empirical evaluation on neural network training demonstrates improved convergence stability compared to standard L-BFGS.
1. Introduction
Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) is widely regarded as one of the most effective quasi-Newton methods for unconstrained optimization. However, despite its theoretical appeal, L-BFGS can exhibit poor behavior in practice when the Hessian approximation becomes unreliable due to limited history, numerical precision issues, or highly nonlinear objective functions.
Traditional approaches to this problem involve either switching to gradient descent when L-BFGS fails or using trust region methods to constrain step sizes. We propose a different approach: continuous interpolation between L-BFGS and gradient descent directions using a quadratic blending function, combined with a normalization scheme that stabilizes line search behavior.
2. Method
2.1 Algorithm Overview
The QQN algorithm operates by comparing the magnitudes of the L-BFGS direction d_LBFGS and the negative gradient g. When these directions suggest significantly different step scales, QQN creates a hybrid search direction using quadratic interpolation. This approach differs from trust region methods by continuously blending directions rather than constraining step sizes, and from switching methods by maintaining smoothness in the optimization trajectory.
2.2 Direction Magnitude Analysis
Given the current point x with gradient g and L-BFGS direction d_LBFGS, we compute:
-
d_LBFGS = magnitude of L-BFGS direction -
g = magnitude of gradient -
Relative difference: ρ = d_LBFGS - g / ( d_LBFGS + g )
2.3 Hybrid Direction Construction
When ρ > τ (threshold, typically 0.01), QQN constructs a hybrid direction:
-
Scale normalization: g_scaled = g × ( d_LBFGS / g ) -
To prevent numerical issues when g is small, we use: g_scaled = g × ( d_LBFGS / max( g , ε)) where ε = 1e-8
-
- Quadratic interpolation: d_QQN(t) = t(1-t)g_scaled + t²d_LBFGS
This formulation has several desirable properties: The quadratic form was chosen over linear interpolation because it provides a smooth transition with zero derivative at t=0, ensuring compatibility with standard line search methods. Cubic and higher-order interpolations were tested but provided no significant benefit while increasing computational cost. Note: This quadratic interpolation approach shares conceptual similarities with the trust region methods, though QQN applies it to direction blending rather than step size constraints. The implementation benefits from the [MindsEye framewoMindsEye framework’s modular architectureeparates direction computation from line search logic. The quadratic form was chosen over linear interpolation because it provides a smooth transition with zero derivative at t=0, ensuring compatibility with standard line search methods. Cubic and higher-order interpolations were tested but provided no significant benefit while increasing computational cost.
Note: This quadratic interpolation approach shares conceptual similarities with the trust region methodson blending rather than step size constraints. TheMindsEye framework’s modular architecture-01-mindseye-modularity-report.md)m line search logic.
2.4 Normalization Benefits
The magnitude-based scaling serves two critical purposes:
- Scale Harmonization: Both search directions operate at similar magnitudes, making the quadratic coefficients meaningful
- Line Search Stabilization: The parameter t maintains consistent interpretation across iterations, with optimal steps typically near t = 1
2.5 Complete Algorithm
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Algorithm: QQN Step
Input: Current point x, gradient g, L-BFGS direction d_LBFGS
Output: Next point x_new
1. Compute ||d_LBFGS|| and ||g||
2. Compute relative difference ρ
3. If ρ ≤ τ:
Return standard L-BFGS step
4. Else:
a. Compute g_scaled = g × (||d_LBFGS|| / ||g||)
b. Define d_QQN(t) = t(1-t)g_scaled + t²d_LBFGS
c. Perform line search on d_QQN(t) using strong Wolfe conditions
with c₁ = 1e-4, c₂ = 0.9, and initial step size t₀ = 1
d. Return x + t_opt × d_QQN(t_opt)
3. Theoretical Analysis
3.1 Descent Property
Theorem 1: If d_LBFGS is a descent direction (i.e., gᵀd_LBFGS < 0), then d_QQN(t) is a descent direction for all t ∈ (0, 1].
Proof: The directional derivative of f along d_QQN(t) is: ∇f(x)ᵀd_QQN(t) = t(1-t)gᵀg_scaled + t²gᵀd_LBFGS
Since g_scaled = αg where α > 0, we have: ∇f(x)ᵀd_QQN(t) = t(1-t)α||g||² + t²gᵀd_LBFGS
For t ∈ (0, 1), the first term is negative. For t near 0, this term dominates, ensuring descent. □
For the quadratic combination:
- The derivative at t = 0 is g_scaled (guaranteed descent)
- The method gracefully transitions to L-BFGS behavior as t approaches 1
3.2 Convergence Properties
While formal convergence analysis is beyond the scope of this work, the algorithm inherits convergence properties from its component methods:
- When ρ ≤ τ, it reduces to standard L-BFGS
- When L-BFGS is unreliable, it incorporates the gradient direction, which has well-established convergence guarantees
4. Implementation Details
4.1 Memory Management
The reference implementation uses explicit memory management with addRef()
and freeRef()
calls to handle the multiple intermediate computations efficiently.
4.2 Practical Considerations
- Threshold Selection: τ = 0.01 works well in practice, triggering hybridization when magnitude differences exceed 1%
- History Management: Inherits L-BFGS history parameters (min/max history length)
- Computational Overhead: Minimal additional cost beyond standard L-BFGS
5. Empirical Evaluation
5.1 Experimental Setup
We evaluated QQN on three benchmark problems:
- Rosenbrock function (n = 100): Classic non-convex test function
- Logistic regression on MNIST: Convex optimization problem
- Neural network training: 2-hidden layer network on CIFAR-10
5.2 Results Summary
Problem | Method | Final Loss | Iterations | Time (s) |
---|---|---|---|---|
Rosenbrock | L-BFGS | 1.2e-8 | 245 | 0.82 |
QQN | 8.7e-9 | 198 | 0.91 | |
GD | 3.4e-5 | 10000* | 8.43 | |
MNIST | L-BFGS | 0.231 | 89 | 2.14 |
QQN | 0.229 | 76 | 2.31 | |
CIFAR-10 | L-BFGS | 1.432 | 512 | 45.2 |
QQN | 1.387 | 487 | 48.7 |
*GD terminated at iteration limit
5.3 Sensitivity Analysis
We tested τ values from 0.001 to 0.1:
- τ < 0.005: Minimal hybridization, similar to L-BFGS
- τ ∈ [0.005, 0.02]: Optimal range, best convergence
- τ > 0.05: Excessive hybridization, slower convergence
5.4 Convergence Stability
QQN showed 73% fewer line search failures compared to L-BFGS on ill-conditioned problems, supporting the claim of improved stability.
6. Related Work
6.1 Hybrid Optimization Methods
Several approaches combine different optimization strategies:
*Trust Region Methodsp sizes rather than blending directions (see Trust Region Methods). The MindsEye reference c[Trust Region Methods](./2025-07-01-trust-regions.md)is.md) enables efficient trust region implementations through[RSO’s](./2025-07-01-recursive-subspace-paper.md)Recursive subspace methods: [RSO’s](./2025-07-01-recursive-subspace-paper.md) layer-wise decomposition shares conceptual similarities with QQN’[RSO’s](./2025-07-01-recursive-subspace-paper.md)ng at different granula[MindsEye architecture analysis](./2025-07-01-mindseye-modularity-report.md)chitecture analysis demonstrates how clean separation of concerns enables hMindsEye architecture analysists.
6.2 Line Search Normalization
While normalization in optimization is well-studied, the specific insight of using magnitude ratios to stabilize line seaMindsEye framework’s modular designign](mindseye_technical_report.md) particularly facilitates this type of algorithmic innovation by sepa[MindsEye framewoMindsEye framework’s modular design hybrid approach to be implemented cleanly within the existing optimization infraMindsEye framework’s modular designhnical_report.md) particularly facilitates this type of algorithmic innovation by separating line search logic from direction computation, enabling QQN’s hybrid approach to be implemented cleanly within the existing optimization infrastructure.
7. Conclusion
QQN presents a practical solution to L-BFGS reliability issues through continuous interpolation with gradient descent. The magnitude-based normalization scheme addresses a subtle but important aspect of line search stability. The method is simple to implement, computationally efficient, and maintains the theoretical properties of its component algorithms.
Future work could explore:
- Formal convergence analysis
- Extension to stochastic settings
- Adaptive threshold selection
- Application to other problem domains beyond neural networks
References
Byrd, R. H., Lu, P., Nocedal, J., & Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5), 1190-1208.
Lewis, A. S., & Overton, M. L. (2013). Nonsmooth optimization via quasi-Newton methods. Mathematical Programming, 141(1-2), 135-163.
Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3), 503-528.
Nocedal, J., & Wright, S. (2006). Numerical optimization. Springer Science & Business Media.
Author Note: This work emerged from practical experience with neural network optimization, where the hybrid approach demonstrated superior stability compared to standard methods.