GPU Acceleration Overview
Sentinel provides GPU-accelerated pathways for the zone decomposition inverse solver. GPU acceleration is additive –- it does not replace CPU code paths, and all CPU-only workflows continue to work unchanged. GPU features are opt-in via keyword arguments and loaded through Julia package extensions.
Execution Modes
Sentinel offers three execution modes for zone decomposition. All three produce equivalent numerical results; they differ only in how zones are scheduled and where compute-intensive kernels run.
| Mode | Entry Point | Parallelism | Hardware |
|---|---|---|---|
| Serial | zone_decomposition_solve! | None –- zones solved sequentially | Any CPU |
| Distributed | zone_decomposition_solve_parallel! | pmap_func (e.g., Distributed.pmap) | Multi-core CPU |
| Batched GPU | zone_decomposition_solve_batched_gpu! | Phased BSP: threaded CPU + GPU kernels | CPU + Metal or CUDA |
Serial is the baseline for correctness testing and debugging. Distributed scales across CPU cores via Distributed.pmap or any compatible parallel map function. Batched GPU restructures the iteration loop into synchronized phases so that GPU kernels can operate across all zones simultaneously.
Per-Zone vs Batched Architecture
Per-Zone Execution (Serial / Distributed)
In per-zone mode, each zone runs a complete CG optimization loop independently. The zones are embarrassingly parallel –- no communication occurs between zones within a single global iteration.
Each zone solve is self-contained: it assembles its own stiffness matrix, runs forward and adjoint solves, computes the gradient, and performs line search iterations. This is simple and robust but cannot exploit GPU parallelism across zones.
Batched GPU Execution
The batched driver restructures the per-zone CG iteration into three synchronized phases using a Bulk Synchronous Parallel (BSP) model. All zones advance through each phase together before proceeding to the next.
Phase 1 runs forward and adjoint solves on CPU threads. These require sparse direct factorization (complex-valued LU), which is inherently sequential per zone but can be threaded across zones.
Phase 2 launches a single GPU kernel that computes the material gradient for all zones simultaneously. This is the primary GPU acceleration target –- the gradient kernel is compute-bound with an O(27^2) inner loop per Gauss point and maps naturally to GPU parallelism.
Phase 3 performs secant line search on CPU threads, updating material properties per zone.
KernelAbstractions.jl
Sentinel uses KernelAbstractions.jl to write portable GPU kernels. The same mu_gradient_kernel! source code runs on CPU, Metal, and CUDA –- the backend is selected at call time.
# The kernel is defined once using KernelAbstractions syntax
@kernel function mu_gradient_kernel!(grad, elem_data, disp_fwd, disp_adj, @Const(lookup))
i = @index(Global)
# ... gradient accumulation over 27x27 shape function pairs per GP
endBackend selection is handled by KAGradientBackend:
using Metal
backend = KAGradientBackend(MetalBackend()) # Metal GPU
using CUDA
backend = KAGradientBackend(CUDABackend()) # NVIDIA GPU
backend = KAGradientBackend(CPU()) # CPU fallbackThe backend is passed via the gradient_backend keyword argument to the batched driver. When no GPU backend is specified, the gradient runs on CPU.
Gradient Kernel Design
The gradient kernel is the most compute-intensive part of the inverse solver (~27% of total runtime in profiling). It computes
Data Structures
Two data layouts support per-zone and batched execution:
ElementGradientData (per-zone): Holds shape function derivatives, Jacobian determinants, and material indices for a single zone. Used in the CPU gradient path and as the building block for batched data.
BatchedElementData (cross-zone): Packs data from all zones into flat arrays with O(1) lookup tables:
work_zone[i]–- which zone does work itemibelong towork_el[i]–- which element within that zonework_q[i]–- which Gauss point within that element
This flat mapping allows a single GPU kernel launch to cover all Gauss points across all zones without nested indexing or zone-boundary logic.
Inner Loop
At each Gauss point, the kernel performs an O(27^2) double loop over the 27 nodes of the Hex27 element, accumulating the product of shape function derivatives, forward displacements, and adjoint displacements. The result is the gradient contribution at that Gauss point, which is then scattered to the material mesh via the GP2MTR mapping.
Design Principles
The GPU acceleration in Sentinel follows these principles:
Additive, not replacement. GPU code paths are added alongside CPU paths. No existing CPU functionality is modified or removed.
Opt-in via keyword arguments. GPU acceleration is activated by passing a
gradient_backendorcudss_batched_solverkeyword. Default behavior is always CPU-only.CPU paths unchanged. All tests pass with and without GPU packages loaded. The test suite runs entirely on CPU.
GPU code lives in extensions.
SentinelMetalExtandSentinelCUDAExtare Julia package extensions loaded only whenMetalorCUDAis imported. The coreSentinelmodule has no GPU dependencies.Dispatch-based backend selection. The
_gpu_float_typeand_to_gpu_arrayfunctions are extended by each GPU backend to handle type conversion and array transfer.
Backend Comparison
| Feature | Metal | CUDA |
|---|---|---|
| Float precision | Float32 only | Float64 |
| Gradient kernel | Yes (KernelAbstractions) | Yes (KernelAbstractions) |
| Linear solves | CPU only (Float32 unusable for FEM) | GPU via cuDSS |
| Batched factorization | No | Yes (CUDSSBatchedSolver) |
| Iterative solver | No | CUDAKrylovSolver (GMRES + cuSPARSE) |
| Target hardware | Apple Silicon (M1/M2/M3/M4) | NVIDIA datacenter GPUs |
| Extension module | SentinelMetalExt | SentinelCUDAExt |
Metal is best suited for gradient-only acceleration on Apple Silicon development machines. CUDA provides full GPU offload including linear solves and is the target for production-scale inversions on NVIDIA hardware.
See Metal Backend and CUDA Backend for backend-specific details.