CUDA Backend

Sentinel's CUDA backend provides GPU-accelerated linear solves and gradient computation for NVIDIA GPUs. Unlike the Metal backend, CUDA hardware supports IEEE 754 Float64 arithmetic natively, enabling full-precision GPU direct solves via the cuDSS library.

Requirements

Component	Version	Purpose
NVIDIA GPU	Compute capability 7.0+	Hardware
CUDA toolkit	12.0+	Runtime/drivers
CUDA.jl	5.0+	Julia GPU arrays, kernel launch
CUDSS.jl	0.6+	GPU sparse direct solver (cuDSS)
Krylov.jl	0.9+	GPU Krylov methods

Load the extension by importing CUDA and CUDSS before Sentinel:

julia

using CUDA, CUDSS
using Sentinel

This triggers the SentinelCUDAExt package extension, which registers GPU solver implementations and array types.

Float64 Advantage

Sentinel's stiffness matrices are ComplexF64 (128 bits per entry). NVIDIA GPUs execute Float64 arithmetic in hardware at a well-defined throughput ratio (typically 1:2 or 1:32 relative to Float32, depending on the GPU generation). This is a critical advantage over Apple Metal, which does not support Float64 in compute shaders.

Because the linear solver is the dominant cost in the inverse pipeline, GPU-accelerated Float64 direct solves unlock the most significant speedup opportunity. The cuDSS library provides sparse LDL^T and LU factorization directly on GPU memory, avoiding the CPU-GPU transfer bottleneck that would arise from factorizing on the CPU and copying factors back.

Benchmark Results

Full 100-iteration MGH brain MRE inverse solve on NVIDIA DGX Spark (GB10 GPU, 10 CPU threads):

Configuration	Total Time	Per-iter (CG=1)	Per-iter (CG=2)	Speedup
CPU baseline (CachedDirectSolver, 10 threads)	~330 min	~133s	~200s	1.0x
Fortran MPI (14 workers)	~240 min	—	~144s	1.4x
cuDSS batched + CPU line search	222 min	~46s	~140s	1.5x
cuDSS batched + GPU line search	158 min	~35s	~100s	2.1x

Material output validated: mean shear modulus agrees within 1 Pa between CPU and GPU paths (2579.9 Pa vs 2578.9 Pa after 100 iterations).

Per-Iteration Profiling Breakdown (CG=2, 290 zones)

Phase	Time	%	Location
Line search (CUDSSDirectSolver)	~30s	~65%	GPU, CPU threaded
cuDSS batched factorization	4.9s	~11%	GPU
GPU gradient kernel	3.9s	~8%	GPU
CPU assembly	1.5s	~3%	CPU threaded
Zone building	~1s	~2%	CPU threaded
cuDSS solve (forward + adjoint)	95ms	<1%	GPU
Other (consolidation, GC)	~5s	~11%	CPU

CUDSSBatchedSolver

The CUDSSBatchedSolver factors and solves all zone stiffness matrices simultaneously on the GPU. This is the primary solver for Phase 1 (forward + adjoint solves) of the zone decomposition loop.

Design

Zone decomposition produces 200-300 independent linear systems per global iteration, each of size ~1700 DOFs. These are too small individually to saturate GPU parallelism, but as a batch they expose substantial data parallelism. The batched solver:

Transfers all zone K matrices and right-hand sides to GPU memory as CuSparseMatrixCSR arrays
Creates one CudssSolver per zone, configured for general complex factorization
Executes symbolic analysis, numerical factorization, and triangular solve across all zones
Reads solutions back to CPU for consolidation

Key Functions

create_batched_solver! — Allocates GPU memory and creates cuDSS solver instances for a set of zone matrices. Handles variable-size zones by creating independent solver objects.
batched_solve! — Solves all zones using existing factorizations. Each right-hand side is transferred to GPU, solved, and read back.
batched_update_and_refactor! — Updates numerical values of already-analyzed matrices (reuses symbolic factorization) when material properties change between CG iterations.

Variable-Size Zones

Zone matrices vary in size because zones near the domain boundary contain fewer elements. The batched solver handles this by maintaining a vector of independent CudssSolver instances rather than requiring uniform matrix dimensions. Each solver operates on its own sparse data, so no padding or masking is needed.

Memory Management

For the MGH brain dataset (~290 zones, ~1700 DOFs per zone), the total GPU memory for all zone K matrices, factors, and work arrays is approximately 2-4 GB. The solver pre-allocates all buffers during create_batched_solver! and reuses them across iterations.

Pressure Condensation

For the isotropic incompressible model (Model 1), run_inverse_solve! statically eliminates the four element-local pressure DOFs from every zone matrix before it reaches the batched solver, so cuDSS factors a ~12% smaller, well-conditioned displacement-only system. This is exact (the condensed solution is per-voxel identical to the full saddle-point solve to round-off) and is on by default for the standard complex- $μ^{*}$ reconstruction with the bulk modulus held fixed (condense_pressure defaults to automatic; pass false to disable). See Static Pressure Condensation in the performance guide for the theory, accuracy validation, and benchmarks.

CUDAKrylovSolver

The CUDAKrylovSolver provides GPU-accelerated GMRES with cuSPARSE sparse matrix-vector products and an ILU(0) preconditioner. This solver is validated to converge in 1 GMRES iteration on all 276 zones of the MGH brain dataset (see Iterative Solvers for the investigation).

Architecture

julia

struct CUDAKrylovSolver <: AbstractLinearSolver
    K_gpu::CuSparseMatrixCSC{ComplexF64}   # GPU copy of K
    ilu_factor::CuSparseMatrixCSC{ComplexF64}  # ILU(0) on GPU via csrilu02
    krylov_workspace::GmresWorkspace       # Preallocated Krylov vectors
end

The solver:

Transfers K to the GPU as a CuSparseMatrixCSC
Computes ILU(0) using cuSPARSE's csrilu02 routine (GPU-native incomplete factorization)
Runs GMRES via Krylov.jl with GPU-resident vectors and the ILU preconditioner
All SpMV operations use cuSPARSE, and triangular preconditioner solves use cuSPARSE csrsv2

Because ILU(0) achieves convergence in 1 iteration for these systems, the effective cost is one SpMV plus two triangular solves — comparable to a direct solve but fully parallel on the GPU.

Note

Despite 1-iteration convergence, CUDAKrylovSolver is slower than CUDSSBatchedSolver for the zone sizes in Sentinel (~1700 DOFs). GPU kernel launch overhead and ILU setup cost dominate at this scale. CUDSSBatchedSolver is recommended for production use.

Preconditioner Reuse

The ILU(0) preconditioner is cached between forward and adjoint solves on the same K. Since the adjoint system solves K^T * lambda = rhs and K = K^T (complex symmetric), the same factored matrix is reused without recomputation. Call invalidate_cache! when K changes after material property updates.

CUDSSDirectSolver

The CUDSSDirectSolver performs per-zone GPU direct solves via cuDSS. Unlike the batched solver, it operates on one zone at a time and is used during the line search phase (Phase 3) where individual zones need trial forward solves with updated material properties.

Architecture

Each CUDSSDirectSolver instance owns a dedicated CUDA stream for thread safety:

julia

mutable struct CUDSSDirectSolver <: AbstractLinearSolver
    _solver::Union{CudssSolver, Nothing}
    _A_gpu::Union{CuSparseMatrixCSR, Nothing}
    _x_gpu::Union{CuVector, Nothing}
    _b_gpu::Union{CuVector, Nothing}
    _stream::CuStream              # dedicated CUDA stream
    _analyzed::Bool
    _needs_refactor::Bool
end

Per-Solver CUDA Stream Isolation

During line search, multiple CUDSSDirectSolver instances run concurrently from different Julia threads (one per zone). Without stream isolation, concurrent cuDSS calls on the default stream cause data races and ERROR_ILLEGAL_ADDRESS crashes.

Each solver creates its own CuStream at construction time and binds all GPU operations (memory transfers, factorization, solve) to that stream via CUDA.stream!. The cuDSS handle is also rebound to the solver's stream before each operation via CUDSS.cudssSetStream. This ensures thread-safe concurrent operation without global locks.

Refactorization Reuse

The solver exploits cuDSS's ability to separate symbolic analysis (performed once per sparsity pattern) from numerical factorization (repeated each time K values change). On the first call, linear_solve! performs full analysis + factorization. On subsequent calls with _needs_refactor=true, only numerical refactorization runs. When solving with the same K but a different RHS (e.g., adjoint reuse), only the RHS is updated.

Call invalidate_cache! after material updates to flag the solver for refactorization.

CSC vs CSR Symmetry Note

cuDSS operates on CSR format. For complex symmetric matrices (K = K^T, not Hermitian), CSC and CSR representations are identical because nzval entries are the same under transposition. However, apply_dirichlet! can break this symmetry at constrained DOFs by zeroing rows but not columns. The current implementation works correctly because line search BCs are applied before the solver sees the matrix, and the boundary structure doesn't change between refactorizations.

GPU Memory Management

Inter-Iteration Garbage Collection

CUDSSDirectSolver instances are created per-zone per global iteration (one solver per zone during line search, ~290 solvers per iteration). Each solver allocates GPU memory for sparse matrices, solution vectors, and cuDSS internal buffers. Without explicit cleanup, these allocations accumulate across iterations and eventually exhaust GPU memory, causing CUDA ERROR_ILLEGAL_ADDRESS crashes.

The batched zone driver calls GC.gc() between global iterations to reclaim GPU memory from expired solver instances:

julia

# In zone_driver_batched.jl, after each global iteration
GC.gc()  # Reclaim CuArrays, CuStreams from expired CUDSSDirectSolvers

A non-blocking GC.gc(false) also runs after each CG iteration within the line search loop to incrementally reclaim temporaries.

Warning

Without inter-iteration GC, the 100-iteration MGH benchmark crashes between iterations 35-68 depending on zone count and GPU memory capacity. This is not a memory leak — Julia's GC simply doesn't run frequently enough to keep up with the GPU allocation rate.

Memory Budget

On the DGX Spark GB10 (120 GiB GPU memory):

CUDSSBatchedSolver (all zones): ~2-4 GiB
CUDSSDirectSolver (per-zone, transient): ~10-50 MiB each, ~290 concurrent during line search
KA gradient workspace: ~500 MiB
Headroom for cuDSS internal buffers: ~2 GiB

Total peak: ~10-15 GiB of 120 GiB available.

Batched Driver Integration

The GPU-accelerated zone decomposition loop is driven by zone_decomposition_solve_batched_gpu!, which replaces the CPU zone_decomposition_solve! function. It accepts a cudss_batched_solver keyword argument to inject the pre-allocated GPU solver.

Three-Phase Execution

Each CG iteration within a global iteration proceeds in three phases:

Phase 0: Zone building. Zone subproblems are constructed from the global mesh and material. This is CPU-threaded and runs once per global iteration.

Phase 1: Batched forward + adjoint solve. All zone K matrices are assembled on the CPU (threaded), then factored and solved in batch on the GPU via CUDSSBatchedSolver. Each zone gets a forward solve (u = K\f) and an adjoint solve (λ = K\r) using the cached factorization.

Phase 2: GPU gradient kernel. The μ gradient ∂Φ/∂μ is computed on the GPU using a KernelAbstractions.jl kernel across all zones simultaneously (see Overview for kernel design). Other property gradients (κ, ρ) are computed on the CPU since they are cheap.

Phase 3: Per-zone line search. Each zone runs secant line search (with Armijo fallback) with trial material properties. Trial forward solves use CUDSSDirectSolver for GPU-accelerated evaluation. Zones are processed concurrently via Threads.@threads, each solver on its own CUDA stream.

High-Level Usage

The simplest way to run a GPU-accelerated inverse solve is through run_inverse_solve!:

julia

using CUDA, CUDSS, Sentinel, KernelAbstractions

prob = load_inversion("runfile.dat")
setup = setup_inverse_problem(prob, data_dir)

# Create batched GPU solver
bsolver = CUDSSBatchedSolver(structure="G")

# Run with GPU acceleration
material, disp, state = run_inverse_solve!(setup;
    solver=CachedDirectSolver(),
    gradient_backend=KAGradientBackend(CUDABackend()),
    cudss_batched_solver=bsolver,
    verbose=true)

Low-Level Usage

For direct control over the zone driver:

julia

material, disp, state = zone_decomposition_solve_batched_gpu!(
    material, grid, dh, gp2mtrs, meshes, meas, bcs, model, omega, rho, config;
    cudss_batched_solver=CUDSSBatchedSolver(structure="G"),
    gradient_backend=KAGradientBackend(CUDABackend()),
    verbose=true)

Extension Architecture

The CUDA backend is implemented as the SentinelCUDAExt package extension, activated when both CUDA.jl and CUDSS.jl are loaded. The extension:

Registers CUDSSBatchedSolver, CUDAKrylovSolver, and CUDSSDirectSolver as subtypes of AbstractLinearSolver
Provides linear_solve! method dispatches for GPU solver types
Implements GPU array conversions (CPU SparseMatrixCSC → CuSparseMatrixCSR)
Hooks into Float64 gradient dispatch for CUDA-backed KAGradientBackend

No GPU code is loaded unless using CUDA appears before using Sentinel. All CPU code paths remain unchanged, and tests run without GPU hardware.

Target Hardware

The primary development and validation target is the NVIDIA DGX Spark:

Specification	Value
GPU	NVIDIA GB10 (Blackwell)
GPU memory	120 GiB unified (Grace-Blackwell)
FP64 throughput	~15 TFLOPS
Memory bandwidth	273 GB/s (GPU) + coherent CPU access
CPU	ARM Grace, 10 cores used for benchmarks
NVLink-C2C	CPU-GPU coherent memory fabric

The 120 GiB unified memory pool is large enough to hold all zone matrices, factors, and work arrays simultaneously. The NVLink-C2C interconnect provides coherent CPU-GPU memory access.

For smaller NVIDIA GPUs (e.g., RTX 4090 with 24 GB VRAM), the solver should work but has not been validated. Monitor GPU memory usage and reduce thread count if needed.

Troubleshooting

CUDA Out of Memory / ERROR_ILLEGAL_ADDRESS

If the solver crashes with ERROR_ILLEGAL_ADDRESS after many iterations, GPU memory is likely exhausted. The inter-iteration GC should handle this automatically, but if the problem persists:

Reduce thread count (fewer concurrent CUDSSDirectSolver instances during line search)
Verify GC.gc() is called between global iterations
Monitor GPU memory with CUDA.memory_status()

cuDSS Thread Safety

All concurrent cuDSS operations must use separate CUDA streams. If you see intermittent wrong results or crashes during multi-threaded line search, ensure each CUDSSDirectSolver has its own stream (this is the default when constructed via CUDSSDirectSolver()).

CUDA Backend ​

Requirements ​

Float64 Advantage ​

Benchmark Results ​

Per-Iteration Profiling Breakdown (CG=2, 290 zones) ​

CUDSSBatchedSolver ​

Design ​

Key Functions ​

Variable-Size Zones ​

Memory Management ​

Pressure Condensation ​

CUDAKrylovSolver ​

Architecture ​

Preconditioner Reuse ​

CUDSSDirectSolver ​

Architecture ​

Per-Solver CUDA Stream Isolation ​

Refactorization Reuse ​

CSC vs CSR Symmetry Note ​

GPU Memory Management ​

Inter-Iteration Garbage Collection ​

Memory Budget ​

Batched Driver Integration ​

Three-Phase Execution ​

High-Level Usage ​

Low-Level Usage ​

Extension Architecture ​

Target Hardware ​

Troubleshooting ​

CUDA Out of Memory / ERROR_ILLEGAL_ADDRESS ​

cuDSS Thread Safety ​

CUDA Backend

Requirements

Float64 Advantage

Benchmark Results

Per-Iteration Profiling Breakdown (CG=2, 290 zones)

CUDSSBatchedSolver

Design

Key Functions

Variable-Size Zones

Memory Management

Pressure Condensation

CUDAKrylovSolver

Architecture

Preconditioner Reuse

CUDSSDirectSolver

Architecture

Per-Solver CUDA Stream Isolation

Refactorization Reuse

CSC vs CSR Symmetry Note

GPU Memory Management

Inter-Iteration Garbage Collection

Memory Budget

Batched Driver Integration

Three-Phase Execution

High-Level Usage

Low-Level Usage

Extension Architecture

Target Hardware

Troubleshooting

CUDA Out of Memory / ERROR_ILLEGAL_ADDRESS

cuDSS Thread Safety