Iterative Solvers
This page documents the investigation of iterative Krylov methods as an alternative to sparse direct solvers for the GPU linear solve path. The key finding is that ILU(0) preconditioning of the full stiffness matrix reduces GMRES to a single iteration across all zones, making it viable for GPU deployment where sparse direct solvers face sequential bottlenecks.
Motivation
Sparse direct solvers (LU, LDL^T) are inherently sequential: the factorization follows an elimination tree that limits parallelism. On the CPU, this is acceptable because libraries like UMFPACK and MUMPS are heavily optimized for cache-efficient sequential factorization. On the GPU, the sequential elimination tree becomes a bottleneck because GPUs need thousands of concurrent operations to hide memory latency.
Iterative Krylov methods (GMRES, CG, BiCGStab) are built on sparse matrix-vector products (SpMV), which are embarrassingly parallel and map well to GPU hardware via cuSPARSE. If an effective preconditioner can reduce the iteration count to O(1), the total cost becomes a small number of SpMV operations plus preconditioner applications – all of which can run on the GPU.
The challenge is finding a preconditioner that is both effective (low iteration count) and GPU-friendly (parallel to compute and apply). The investigation below shows that ILU(0) of the full stiffness matrix satisfies both criteria for Sentinel's zone-sized problems.
Saddle-Point Challenge
Sentinel's stiffness matrices arise from mixed displacement-pressure finite elements (Hex27 displacement, Hex8 pressure). For incompressible and nearly-incompressible materials (models 1, 2, 5, 6, 7), the stiffness matrix has the saddle-point structure:
where
This structure makes the system indefinite:
Additionally, the matrices are complex-valued (not Hermitian) because Sentinel uses complex shear moduli
Preconditioner Investigation
The benchmark (benchmark/gmres_solver_benchmark.jl) tested four preconditioning strategies on a median-sized zone from the MGH brain dataset (~1700 DOFs, complex symmetric K). GMRES was configured with restart=50 and relative tolerance 1e-8.
Results Summary
| Preconditioner | Iterations (zone 1) | Iterations (all 276 zones) | Max relative error | Status |
|---|---|---|---|---|
| None | 500 (no convergence) | 500+ on all zones | 8.5% | Unusable |
| Diagonal Schur complement | 135 | 1000-2800+ on most zones | Varies | Fails most zones |
| ILU(A) block + Schur | 37-102 | Inconsistent | 1e-4 to 1e-2 | Unreliable |
| ILU(0) of full K | 1 | 1 on all 276 zones | 1.91e-13 | All zones pass |
Unpreconditioned GMRES
Without preconditioning, GMRES hits the 500-iteration restart limit on every zone. The saddle-point structure means the eigenvalues cluster around both positive and negative values, and the zero block in the pressure-pressure position creates a near-singular subspace that GMRES cannot resolve without guidance.
Block Preconditioners
The standard approach for saddle-point systems is block preconditioning, where the preconditioner exploits the 2x2 block structure:
Diagonal Schur complement: Uses
ILU(A) block + Schur: Replaces the exact
ILU(0) of the Full Matrix
Computing ILU(0) on the entire stiffness matrix
The ILU(0) factorization uses zero fill-in: the factors L and U have the same sparsity pattern as the lower and upper triangles of K, respectively. No drop tolerance parameter needs tuning, and the memory footprint equals that of the original matrix. Higher-fill variants (ILUT with drop tolerances of 0.1, 0.01, 0.001) were also tested; all achieve 1-iteration convergence but at higher setup cost due to the additional fill-in.
Why ILU(0) Works
The effectiveness of full-matrix ILU(0) comes from its ability to capture the coupling between displacement and pressure blocks. When the ILU factorization operates on the full K, the fill-in from the
For Sentinel's zone-sized systems (~1700 DOFs), the stiffness matrix is well-conditioned enough that the zero-fill ILU (no additional fill beyond the original sparsity pattern) captures sufficient information. The K matrices are not pathologically ill-conditioned: the diagonal ratio (max/min of |diag(K)|) is typically O(1e4), and the complex-valued entries from the viscoelastic constitutive law provide numerical regularization that prevents the near-nullity that plagues purely real indefinite systems.
The 1-iteration convergence means that GMRES with ILU(0) is effectively a stationary iterative method: the preconditioned residual is already within tolerance after a single application of
1 ILU(0) factorization (comparable cost to incomplete LU)
1 SpMV (
) 2 triangular solves (forward and back substitution through L and U)
1 vector update
KrylovSolver Integration
The CPU-side Krylov solver is implemented as a KrylovSolver struct that conforms to the AbstractLinearSolver interface:
struct KrylovSolver <: AbstractLinearSolver
ilu_factor::Union{Nothing, ILUFactorization} # Cached ILU(0)
tol::Float64 # GMRES relative tolerance
maxiter::Int # Maximum GMRES iterations
restart::Int # GMRES restart parameter
endCache Lifecycle
The ILU preconditioner is cached between forward and adjoint solves on the same K:
First
linear_solve!call (forward): Computes ILU(0) of K, stores inilu_factor, runs GMRESSecond
linear_solve!call (adjoint): Reusesilu_factorfrom step 1. Since K = K^T (complex symmetric), the same ILU works for the transpose systeminvalidate_cache!call: Discardsilu_factorwhen K is re-assembled with new material properties
This matches the CachedDirectSolver pattern: factorize once for the forward solve, reuse for the adjoint, invalidate when material changes.
Adjoint Reuse
The adjoint system solves
So the adjoint solve reuses the forward ILU on K with conjugated right-hand side, then conjugates the result. This eliminates the cost of a second ILU factorization.
CPU vs GPU Tradeoff
On the CPU, ILU(0) + GMRES is not faster than sparse LU for zone-sized problems. The incomplete factorization has roughly the same cost as a full factorization at this scale, and the sequential triangular solves dominate both approaches. Benchmarks on the MGH dataset show:
| Solver | Time (single zone) |
|---|---|
| UMFPACK LU (K\f) | ~3 ms |
| ILU(0) setup + 1 GMRES iter | ~4 ms |
| Pardiso MKL (LDL^T) | ~2 ms |
The ILU setup cost is comparable to a full LU factorization for matrices of this size (~1700 DOFs, ~50K nonzeros). On the CPU, the sequential triangular solves in ILU and UMFPACK have similar performance because both are memory-bandwidth limited on small matrices.
The value proposition changes on the GPU:
SpMV: cuSPARSE achieves 10-50x speedup over CPU for the SpMV kernel, because the scatter-gather pattern maps to GPU memory controllers
Triangular solves: cuSPARSE
csrsv2parallelizes level-set analysis of the triangular factor, achieving 3-10x speedup over CPU for ILU(0) factors (which have favorable sparsity for parallel wavefront execution)Batching: 276 independent zone solves can run concurrently on the GPU, unlike CPU where thread-level parallelism is limited to ~16 cores
The net effect is that GPU ILU(0) + GMRES for 276 zones costs roughly the same wall time as a single CPU direct solve, providing the parallelism needed for real-time-scale inversions.
This tradeoff is why the CPU code path continues to use CachedDirectSolver (UMFPACK LU) while the GPU code path uses CUDAKrylovSolver (ILU(0) + GMRES). The AbstractLinearSolver dispatch mechanism allows switching between them without changing the optimization loop.
Future Directions
COCR for Complex Symmetric Systems
GMRES requires storing the full Krylov basis (restart vectors of length n), which consumes O(restart * n) memory. For complex symmetric systems (K = K^T but not Hermitian), the Conjugate Orthogonal Conjugate Residual (COCR) method exploits the symmetry to use short recurrences, reducing memory from O(restart * n) to O(n). Since our ILU(0) preconditioner achieves 1-iteration convergence, the memory savings are modest in practice, but COCR would be advantageous if the preconditioner quality degrades on larger zones or different material models.
GPU-Native ILU via cuSPARSE
The current GPU path computes ILU(0) on the CPU (via IncompleteLU.jl) and transfers the factors to the GPU. cuSPARSE provides csrilu02, a GPU-native ILU(0) implementation that computes the incomplete factorization directly on GPU-resident matrices. This eliminates the CPU-GPU transfer of the ILU factors and enables the entire solve pipeline (assembly on CPU, transfer K to GPU, ILU on GPU, GMRES on GPU, read solution) to overlap with other GPU work.
Algebraic Multigrid
For future extensions to larger zone sizes or global (non-decomposed) solves, algebraic multigrid (AMG) preconditioners may outperform ILU(0). AMG achieves mesh-independent convergence rates for elliptic operators, though extending it to complex-valued indefinite saddle-point systems requires specialized coarsening strategies. The AmgX library provides GPU-native AMG for NVIDIA hardware.
Batched ILU for Zone Parallelism
When running 276 zones concurrently on the GPU, each zone needs its own ILU(0) factorization. A batched ILU kernel that processes all zone matrices in parallel would further reduce the setup time. This is not available in cuSPARSE as of CUDA 12, but could be implemented as a custom CUDA kernel that exploits the shared sparsity structure across zones of similar size.