CUDA Backend
Sentinel's CUDA backend provides GPU-accelerated linear solves and gradient computation for NVIDIA GPUs. Unlike the Metal backend, CUDA hardware supports IEEE 754 Float64 arithmetic natively, enabling full-precision GPU direct solves via the cuDSS library.
Requirements
| Component | Version | Purpose |
|---|---|---|
| NVIDIA GPU | Compute capability 7.0+ | Hardware |
| CUDA toolkit | 12.0+ | Runtime/drivers |
| CUDA.jl | 5.0+ | Julia GPU arrays, kernel launch |
| CUDSS.jl | 0.6+ | GPU sparse direct solver (cuDSS) |
| Krylov.jl | 0.9+ | GPU Krylov methods |
Load the extension by importing CUDA and CUDSS before Sentinel:
using CUDA, CUDSS
using SentinelThis triggers the SentinelCUDAExt package extension, which registers GPU solver implementations and array types.
Float64 Advantage
Sentinel's stiffness matrices are ComplexF64 (128 bits per entry). NVIDIA GPUs execute Float64 arithmetic in hardware at a well-defined throughput ratio (typically 1:2 or 1:32 relative to Float32, depending on the GPU generation). This is a critical advantage over Apple Metal, which does not support Float64 in compute shaders.
Because the linear solver is the dominant cost in the inverse pipeline, GPU-accelerated Float64 direct solves unlock the most significant speedup opportunity. The cuDSS library provides sparse LDL^T and LU factorization directly on GPU memory, avoiding the CPU-GPU transfer bottleneck that would arise from factorizing on the CPU and copying factors back.
Benchmark Results
Full 100-iteration MGH brain MRE inverse solve on NVIDIA DGX Spark (GB10 GPU, 10 CPU threads):
| Configuration | Total Time | Per-iter (CG=1) | Per-iter (CG=2) | Speedup |
|---|---|---|---|---|
| CPU baseline (CachedDirectSolver, 10 threads) | ~330 min | ~133s | ~200s | 1.0x |
| Fortran MPI (14 workers) | ~240 min | — | ~144s | 1.4x |
| cuDSS batched + CPU line search | 222 min | ~46s | ~140s | 1.5x |
| cuDSS batched + GPU line search | 158 min | ~35s | ~100s | 2.1x |
Material output validated: mean shear modulus agrees within 1 Pa between CPU and GPU paths (2579.9 Pa vs 2578.9 Pa after 100 iterations).
Per-Iteration Profiling Breakdown (CG=2, 290 zones)
| Phase | Time | % | Location |
|---|---|---|---|
| Line search (CUDSSDirectSolver) | ~30s | ~65% | GPU, CPU threaded |
| cuDSS batched factorization | 4.9s | ~11% | GPU |
| GPU gradient kernel | 3.9s | ~8% | GPU |
| CPU assembly | 1.5s | ~3% | CPU threaded |
| Zone building | ~1s | ~2% | CPU threaded |
| cuDSS solve (forward + adjoint) | 95ms | <1% | GPU |
| Other (consolidation, GC) | ~5s | ~11% | CPU |
CUDSSBatchedSolver
The CUDSSBatchedSolver factors and solves all zone stiffness matrices simultaneously on the GPU. This is the primary solver for Phase 1 (forward + adjoint solves) of the zone decomposition loop.
Design
Zone decomposition produces 200-300 independent linear systems per global iteration, each of size ~1700 DOFs. These are too small individually to saturate GPU parallelism, but as a batch they expose substantial data parallelism. The batched solver:
Transfers all zone K matrices and right-hand sides to GPU memory as
CuSparseMatrixCSRarraysCreates one
CudssSolverper zone, configured for general complex factorizationExecutes symbolic analysis, numerical factorization, and triangular solve across all zones
Reads solutions back to CPU for consolidation
Key Functions
create_batched_solver!— Allocates GPU memory and creates cuDSS solver instances for a set of zone matrices. Handles variable-size zones by creating independent solver objects.batched_solve!— Solves all zones using existing factorizations. Each right-hand side is transferred to GPU, solved, and read back.batched_update_and_refactor!— Updates numerical values of already-analyzed matrices (reuses symbolic factorization) when material properties change between CG iterations.
Variable-Size Zones
Zone matrices vary in size because zones near the domain boundary contain fewer elements. The batched solver handles this by maintaining a vector of independent CudssSolver instances rather than requiring uniform matrix dimensions. Each solver operates on its own sparse data, so no padding or masking is needed.
Memory Management
For the MGH brain dataset (~290 zones, ~1700 DOFs per zone), the total GPU memory for all zone K matrices, factors, and work arrays is approximately 2-4 GB. The solver pre-allocates all buffers during create_batched_solver! and reuses them across iterations.
CUDAKrylovSolver
The CUDAKrylovSolver provides GPU-accelerated GMRES with cuSPARSE sparse matrix-vector products and an ILU(0) preconditioner. This solver is validated to converge in 1 GMRES iteration on all 276 zones of the MGH brain dataset (see Iterative Solvers for the investigation).
Architecture
struct CUDAKrylovSolver <: AbstractLinearSolver
K_gpu::CuSparseMatrixCSC{ComplexF64} # GPU copy of K
ilu_factor::CuSparseMatrixCSC{ComplexF64} # ILU(0) on GPU via csrilu02
krylov_workspace::GmresWorkspace # Preallocated Krylov vectors
endThe solver:
Transfers K to the GPU as a
CuSparseMatrixCSCComputes ILU(0) using cuSPARSE's
csrilu02routine (GPU-native incomplete factorization)Runs GMRES via Krylov.jl with GPU-resident vectors and the ILU preconditioner
All SpMV operations use cuSPARSE, and triangular preconditioner solves use cuSPARSE
csrsv2
Because ILU(0) achieves convergence in 1 iteration for these systems, the effective cost is one SpMV plus two triangular solves — comparable to a direct solve but fully parallel on the GPU.
Note
Despite 1-iteration convergence, CUDAKrylovSolver is slower than CUDSSBatchedSolver for the zone sizes in Sentinel (~1700 DOFs). GPU kernel launch overhead and ILU setup cost dominate at this scale. CUDSSBatchedSolver is recommended for production use.
Preconditioner Reuse
The ILU(0) preconditioner is cached between forward and adjoint solves on the same K. Since the adjoint system solves K^T * lambda = rhs and K = K^T (complex symmetric), the same factored matrix is reused without recomputation. Call invalidate_cache! when K changes after material property updates.
CUDSSDirectSolver
The CUDSSDirectSolver performs per-zone GPU direct solves via cuDSS. Unlike the batched solver, it operates on one zone at a time and is used during the line search phase (Phase 3) where individual zones need trial forward solves with updated material properties.
Architecture
Each CUDSSDirectSolver instance owns a dedicated CUDA stream for thread safety:
mutable struct CUDSSDirectSolver <: AbstractLinearSolver
_solver::Union{CudssSolver, Nothing}
_A_gpu::Union{CuSparseMatrixCSR, Nothing}
_x_gpu::Union{CuVector, Nothing}
_b_gpu::Union{CuVector, Nothing}
_stream::CuStream # dedicated CUDA stream
_analyzed::Bool
_needs_refactor::Bool
endPer-Solver CUDA Stream Isolation
During line search, multiple CUDSSDirectSolver instances run concurrently from different Julia threads (one per zone). Without stream isolation, concurrent cuDSS calls on the default stream cause data races and ERROR_ILLEGAL_ADDRESS crashes.
Each solver creates its own CuStream at construction time and binds all GPU operations (memory transfers, factorization, solve) to that stream via CUDA.stream!. The cuDSS handle is also rebound to the solver's stream before each operation via CUDSS.cudssSetStream. This ensures thread-safe concurrent operation without global locks.
Refactorization Reuse
The solver exploits cuDSS's ability to separate symbolic analysis (performed once per sparsity pattern) from numerical factorization (repeated each time K values change). On the first call, linear_solve! performs full analysis + factorization. On subsequent calls with _needs_refactor=true, only numerical refactorization runs. When solving with the same K but a different RHS (e.g., adjoint reuse), only the RHS is updated.
Call invalidate_cache! after material updates to flag the solver for refactorization.
CSC vs CSR Symmetry Note
cuDSS operates on CSR format. For complex symmetric matrices (K = K^T, not Hermitian), CSC and CSR representations are identical because nzval entries are the same under transposition. However, apply_dirichlet! can break this symmetry at constrained DOFs by zeroing rows but not columns. The current implementation works correctly because line search BCs are applied before the solver sees the matrix, and the boundary structure doesn't change between refactorizations.
GPU Memory Management
Inter-Iteration Garbage Collection
CUDSSDirectSolver instances are created per-zone per global iteration (one solver per zone during line search, ~290 solvers per iteration). Each solver allocates GPU memory for sparse matrices, solution vectors, and cuDSS internal buffers. Without explicit cleanup, these allocations accumulate across iterations and eventually exhaust GPU memory, causing CUDA ERROR_ILLEGAL_ADDRESS crashes.
The batched zone driver calls GC.gc() between global iterations to reclaim GPU memory from expired solver instances:
# In zone_driver_batched.jl, after each global iteration
GC.gc() # Reclaim CuArrays, CuStreams from expired CUDSSDirectSolversA non-blocking GC.gc(false) also runs after each CG iteration within the line search loop to incrementally reclaim temporaries.
Warning
Without inter-iteration GC, the 100-iteration MGH benchmark crashes between iterations 35-68 depending on zone count and GPU memory capacity. This is not a memory leak — Julia's GC simply doesn't run frequently enough to keep up with the GPU allocation rate.
Memory Budget
On the DGX Spark GB10 (120 GiB GPU memory):
CUDSSBatchedSolver (all zones): ~2-4 GiB
CUDSSDirectSolver (per-zone, transient): ~10-50 MiB each, ~290 concurrent during line search
KA gradient workspace: ~500 MiB
Headroom for cuDSS internal buffers: ~2 GiB
Total peak: ~10-15 GiB of 120 GiB available.
Batched Driver Integration
The GPU-accelerated zone decomposition loop is driven by zone_decomposition_solve_batched_gpu!, which replaces the CPU zone_decomposition_solve! function. It accepts a cudss_batched_solver keyword argument to inject the pre-allocated GPU solver.
Three-Phase Execution
Each CG iteration within a global iteration proceeds in three phases:
Phase 0: Zone building. Zone subproblems are constructed from the global mesh and material. This is CPU-threaded and runs once per global iteration.
Phase 1: Batched forward + adjoint solve. All zone K matrices are assembled on the CPU (threaded), then factored and solved in batch on the GPU via CUDSSBatchedSolver. Each zone gets a forward solve (u = K\f) and an adjoint solve (λ = K\r) using the cached factorization.
Phase 2: GPU gradient kernel. The μ gradient ∂Φ/∂μ is computed on the GPU using a KernelAbstractions.jl kernel across all zones simultaneously (see Overview for kernel design). Other property gradients (κ, ρ) are computed on the CPU since they are cheap.
Phase 3: Per-zone line search. Each zone runs secant line search (with Armijo fallback) with trial material properties. Trial forward solves use CUDSSDirectSolver for GPU-accelerated evaluation. Zones are processed concurrently via Threads.@threads, each solver on its own CUDA stream.
High-Level Usage
The simplest way to run a GPU-accelerated inverse solve is through run_inverse_solve!:
using CUDA, CUDSS, Sentinel, KernelAbstractions
config = parse_inverse_runfile("runfile.dat")
setup = setup_inverse_problem(config, data_dir)
# Create batched GPU solver
bsolver = CUDSSBatchedSolver(structure="G")
# Run with GPU acceleration
material, disp, state = run_inverse_solve!(setup;
solver=CachedDirectSolver(),
gradient_backend=KAGradientBackend(CUDABackend()),
cudss_batched_solver=bsolver,
verbose=true)Low-Level Usage
For direct control over the zone driver:
material, disp, state = zone_decomposition_solve_batched_gpu!(
material, grid, dh, gp2mtrs, meshes, meas, bcs, model, omega, rho, config;
cudss_batched_solver=CUDSSBatchedSolver(structure="G"),
gradient_backend=KAGradientBackend(CUDABackend()),
verbose=true)Extension Architecture
The CUDA backend is implemented as the SentinelCUDAExt package extension, activated when both CUDA.jl and CUDSS.jl are loaded. The extension:
Registers
CUDSSBatchedSolver,CUDAKrylovSolver, andCUDSSDirectSolveras subtypes ofAbstractLinearSolverProvides
linear_solve!method dispatches for GPU solver typesImplements GPU array conversions (CPU
SparseMatrixCSC→CuSparseMatrixCSR)Hooks into
Float64gradient dispatch for CUDA-backedKAGradientBackend
No GPU code is loaded unless using CUDA appears before using Sentinel. All CPU code paths remain unchanged, and tests run without GPU hardware.
Target Hardware
The primary development and validation target is the NVIDIA DGX Spark:
| Specification | Value |
|---|---|
| GPU | NVIDIA GB10 (Blackwell) |
| GPU memory | 120 GiB unified (Grace-Blackwell) |
| FP64 throughput | ~15 TFLOPS |
| Memory bandwidth | 273 GB/s (GPU) + coherent CPU access |
| CPU | ARM Grace, 10 cores used for benchmarks |
| NVLink-C2C | CPU-GPU coherent memory fabric |
The 120 GiB unified memory pool is large enough to hold all zone matrices, factors, and work arrays simultaneously. The NVLink-C2C interconnect provides coherent CPU-GPU memory access.
For smaller NVIDIA GPUs (e.g., RTX 4090 with 24 GB VRAM), the solver should work but has not been validated. Monitor GPU memory usage and reduce thread count if needed.
Troubleshooting
CUDA Out of Memory / ERROR_ILLEGAL_ADDRESS
If the solver crashes with ERROR_ILLEGAL_ADDRESS after many iterations, GPU memory is likely exhausted. The inter-iteration GC should handle this automatically, but if the problem persists:
Reduce thread count (fewer concurrent
CUDSSDirectSolverinstances during line search)Verify
GC.gc()is called between global iterationsMonitor GPU memory with
CUDA.memory_status()
cuDSS Thread Safety
All concurrent cuDSS operations must use separate CUDA streams. If you see intermittent wrong results or crashes during multi-threaded line search, ensure each CUDSSDirectSolver has its own stream (this is the default when constructed via CUDSSDirectSolver()).