Iterative Solvers

This page documents the investigation of iterative Krylov methods as an alternative to sparse direct solvers for the GPU linear solve path. The key finding is that ILU(0) preconditioning of the full stiffness matrix reduces GMRES to a single iteration across all zones, making it viable for GPU deployment where sparse direct solvers face sequential bottlenecks.

Motivation

Sparse direct solvers (LU, LDL^T) are inherently sequential: the factorization follows an elimination tree that limits parallelism. On the CPU, this is acceptable because libraries like UMFPACK and MUMPS are heavily optimized for cache-efficient sequential factorization. On the GPU, the sequential elimination tree becomes a bottleneck because GPUs need thousands of concurrent operations to hide memory latency.

Iterative Krylov methods (GMRES, CG, BiCGStab) are built on sparse matrix-vector products (SpMV), which are embarrassingly parallel and map well to GPU hardware via cuSPARSE. If an effective preconditioner can reduce the iteration count to O(1), the total cost becomes a small number of SpMV operations plus preconditioner applications – all of which can run on the GPU.

The challenge is finding a preconditioner that is both effective (low iteration count) and GPU-friendly (parallel to compute and apply). The investigation below shows that ILU(0) of the full stiffness matrix satisfies both criteria for Sentinel's zone-sized problems.

Saddle-Point Challenge

Sentinel's stiffness matrices arise from mixed displacement-pressure finite elements (Hex27 displacement, Hex8 pressure). For incompressible and nearly-incompressible materials (models 1, 2, 5, 6, 7), the stiffness matrix has the saddle-point structure:

K = [\begin{matrix} A & B^{T} \\ B & 0 \end{matrix}]

where $A$ is the displacement stiffness block (complex symmetric, positive-definite in the real part), $B$ is the divergence operator coupling displacement to pressure, and the lower-right block is zero.

This structure makes the system indefinite: $K$ has both positive and negative eigenvalues. Standard unpreconditioned Krylov methods converge slowly on indefinite systems because the eigenvalue spectrum spans both sides of the origin, preventing the Krylov subspace from approximating the solution efficiently.

Additionally, the matrices are complex-valued (not Hermitian) because Sentinel uses complex shear moduli $μ = μ^{'} + i μ^{″}$ to model viscoelastic tissue. This rules out standard Hermitian Krylov methods (CG, MINRES) and requires GMRES or a complex-symmetric variant.

Preconditioner Investigation

The benchmark (benchmark/gmres_solver_benchmark.jl) tested four preconditioning strategies on a median-sized zone from the MGH brain dataset (~1700 DOFs, complex symmetric K). GMRES was configured with restart=50 and relative tolerance 1e-8.

Results Summary

Preconditioner	Iterations (zone 1)	Iterations (all 276 zones)	Max relative error	Status
None	500 (no convergence)	500+ on all zones	8.5%	Unusable
Diagonal Schur complement	135	1000-2800+ on most zones	Varies	Fails most zones
ILU(A) block + Schur	37-102	Inconsistent	1e-4 to 1e-2	Unreliable
ILU(0) of full K	1	1 on all 276 zones	1.91e-13	All zones pass

Unpreconditioned GMRES

Without preconditioning, GMRES hits the 500-iteration restart limit on every zone. The saddle-point structure means the eigenvalues cluster around both positive and negative values, and the zero block in the pressure-pressure position creates a near-singular subspace that GMRES cannot resolve without guidance.

Block Preconditioners

The standard approach for saddle-point systems is block preconditioning, where the preconditioner exploits the 2x2 block structure:

Diagonal Schur complement: Uses $diag (A, - B A^{- 1} B^{T})$ as preconditioner. The Schur complement $S = - B A^{- 1} B^{T}$ is approximated by the pressure mass matrix or a diagonal lumping. This works for zone 1 (135 iterations) but fails on most other zones (1000-2800+ iterations). The failure mode is that the Schur complement approximation quality varies with zone geometry and material properties.

ILU(A) block + Schur: Replaces the exact $A^{- 1}$ in the Schur complement with an ILU factorization of $A$ alone. This is better (37-102 iterations) but still inconsistent across zones: some zones converge quickly while others stall.

ILU(0) of the Full Matrix

Computing ILU(0) on the entire stiffness matrix $K$ (not block-by-block) achieves 1 GMRES iteration on every zone, with maximum relative error 1.91e-13. This was tested on all 276 zones of the MGH brain dataset.

The ILU(0) factorization uses zero fill-in: the factors L and U have the same sparsity pattern as the lower and upper triangles of K, respectively. No drop tolerance parameter needs tuning, and the memory footprint equals that of the original matrix. Higher-fill variants (ILUT with drop tolerances of 0.1, 0.01, 0.001) were also tested; all achieve 1-iteration convergence but at higher setup cost due to the additional fill-in.

Why ILU(0) Works

The effectiveness of full-matrix ILU(0) comes from its ability to capture the coupling between displacement and pressure blocks. When the ILU factorization operates on the full K, the fill-in from the $A$ and $B$ blocks naturally propagates into the zero pressure-pressure region, creating an approximate Schur complement implicitly. Block preconditioners miss this coupling because they treat the blocks independently.

For Sentinel's zone-sized systems (~1700 DOFs), the stiffness matrix is well-conditioned enough that the zero-fill ILU (no additional fill beyond the original sparsity pattern) captures sufficient information. The K matrices are not pathologically ill-conditioned: the diagonal ratio (max/min of |diag(K)|) is typically O(1e4), and the complex-valued entries from the viscoelastic constitutive law provide numerical regularization that prevents the near-nullity that plagues purely real indefinite systems.

The 1-iteration convergence means that GMRES with ILU(0) is effectively a stationary iterative method: the preconditioned residual is already within tolerance after a single application of $M^{- 1} K$ . This makes the total cost:

1 ILU(0) factorization (comparable cost to incomplete LU)
1 SpMV ( $K \cdot x$ )
2 triangular solves (forward and back substitution through L and U)
1 vector update

KrylovSolver Integration

The CPU-side Krylov solver is implemented as a KrylovSolver struct that conforms to the AbstractLinearSolver interface:

julia

struct KrylovSolver <: AbstractLinearSolver
    ilu_factor::Union{Nothing, ILUFactorization}  # Cached ILU(0)
    tol::Float64           # GMRES relative tolerance
    maxiter::Int           # Maximum GMRES iterations
    restart::Int           # GMRES restart parameter
end

Cache Lifecycle

The ILU preconditioner is cached between forward and adjoint solves on the same K:

First linear_solve! call (forward): Computes ILU(0) of K, stores in ilu_factor, runs GMRES
Second linear_solve! call (adjoint): Reuses ilu_factor from step 1. Since K = K^T (complex symmetric), the same ILU works for the transpose system
invalidate_cache! call: Discards ilu_factor when K is re-assembled with new material properties

This matches the CachedDirectSolver pattern: factorize once for the forward solve, reuse for the adjoint, invalidate when material changes.

Adjoint Reuse

The adjoint system solves $\bar{K} λ = r$ where $\bar{K} = conj (K^{T}) = conj (K)$ (since K is complex symmetric, $K^{T} = K$ ). Rather than computing a new ILU of $\bar{K}$ , we use the identity:

\bar{K} λ = r ⟺ K \bar{λ} = \bar{r}

So the adjoint solve reuses the forward ILU on K with conjugated right-hand side, then conjugates the result. This eliminates the cost of a second ILU factorization.

CPU vs GPU Tradeoff

On the CPU, ILU(0) + GMRES is not faster than sparse LU for zone-sized problems. The incomplete factorization has roughly the same cost as a full factorization at this scale, and the sequential triangular solves dominate both approaches. Benchmarks on the MGH dataset show:

Solver	Time (single zone)
UMFPACK LU (K\f)	~3 ms
ILU(0) setup + 1 GMRES iter	~4 ms
Pardiso MKL (LDL^T)	~2 ms

The ILU setup cost is comparable to a full LU factorization for matrices of this size (~1700 DOFs, ~50K nonzeros). On the CPU, the sequential triangular solves in ILU and UMFPACK have similar performance because both are memory-bandwidth limited on small matrices.

The value proposition changes on the GPU:

SpMV: cuSPARSE achieves 10-50x speedup over CPU for the SpMV kernel, because the scatter-gather pattern maps to GPU memory controllers
Triangular solves: cuSPARSE csrsv2 parallelizes level-set analysis of the triangular factor, achieving 3-10x speedup over CPU for ILU(0) factors (which have favorable sparsity for parallel wavefront execution)
Batching: 276 independent zone solves can run concurrently on the GPU, unlike CPU where thread-level parallelism is limited to ~16 cores

The net effect is that GPU ILU(0) + GMRES for 276 zones costs roughly the same wall time as a single CPU direct solve, providing the parallelism needed for real-time-scale inversions.

This tradeoff is why the CPU code path continues to use CachedDirectSolver (UMFPACK LU) while the GPU code path uses CUDAKrylovSolver (ILU(0) + GMRES). The AbstractLinearSolver dispatch mechanism allows switching between them without changing the optimization loop.

Future Directions

COCR for Complex Symmetric Systems

GMRES requires storing the full Krylov basis (restart vectors of length n), which consumes O(restart * n) memory. For complex symmetric systems (K = K^T but not Hermitian), the Conjugate Orthogonal Conjugate Residual (COCR) method exploits the symmetry to use short recurrences, reducing memory from O(restart * n) to O(n). Since our ILU(0) preconditioner achieves 1-iteration convergence, the memory savings are modest in practice, but COCR would be advantageous if the preconditioner quality degrades on larger zones or different material models.

GPU-Native ILU via cuSPARSE

The current GPU path computes ILU(0) on the CPU (via IncompleteLU.jl) and transfers the factors to the GPU. cuSPARSE provides csrilu02, a GPU-native ILU(0) implementation that computes the incomplete factorization directly on GPU-resident matrices. This eliminates the CPU-GPU transfer of the ILU factors and enables the entire solve pipeline (assembly on CPU, transfer K to GPU, ILU on GPU, GMRES on GPU, read solution) to overlap with other GPU work.

Algebraic Multigrid

For future extensions to larger zone sizes or global (non-decomposed) solves, algebraic multigrid (AMG) preconditioners may outperform ILU(0). AMG achieves mesh-independent convergence rates for elliptic operators, though extending it to complex-valued indefinite saddle-point systems requires specialized coarsening strategies. The AmgX library provides GPU-native AMG for NVIDIA hardware.

Batched ILU for Zone Parallelism

When running 276 zones concurrently on the GPU, each zone needs its own ILU(0) factorization. A batched ILU kernel that processes all zone matrices in parallel would further reduce the setup time. This is not available in cuSPARSE as of CUDA 12, but could be implemented as a custom CUDA kernel that exploits the shared sparsity structure across zones of similar size.

Iterative Solvers ​

Motivation ​

Saddle-Point Challenge ​

Preconditioner Investigation ​

Results Summary ​

Unpreconditioned GMRES ​

Block Preconditioners ​

ILU(0) of the Full Matrix ​

Why ILU(0) Works ​

KrylovSolver Integration ​

Cache Lifecycle ​

Adjoint Reuse ​

CPU vs GPU Tradeoff ​

Future Directions ​

COCR for Complex Symmetric Systems ​

GPU-Native ILU via cuSPARSE ​

Algebraic Multigrid ​

Batched ILU for Zone Parallelism ​