GPU Acceleration Overview

Sentinel provides GPU-accelerated pathways for the zone decomposition inverse solver. GPU acceleration is additive –- it does not replace CPU code paths, and all CPU-only workflows continue to work unchanged. GPU features are opt-in via keyword arguments and loaded through Julia package extensions.

Execution Modes

Sentinel offers three execution modes for zone decomposition. All three produce equivalent numerical results; they differ only in how zones are scheduled and where compute-intensive kernels run.

Mode	Entry Point	Parallelism	Hardware
Serial	`zone_decomposition_solve!`	None –- zones solved sequentially	Any CPU
Distributed	`zone_decomposition_solve_parallel!`	`pmap_func` (e.g., `Distributed.pmap`)	Multi-core CPU
Batched GPU	`zone_decomposition_solve_batched_gpu!`	Phased BSP: threaded CPU + GPU kernels	CPU + Metal or CUDA

Serial is the baseline for correctness testing and debugging. Distributed scales across CPU cores via Distributed.pmap or any compatible parallel map function. Batched GPU restructures the iteration loop into synchronized phases so that GPU kernels can operate across all zones simultaneously.

Per-Zone vs Batched Architecture

Per-Zone Execution (Serial / Distributed)

In per-zone mode, each zone runs a complete CG optimization loop independently. The zones are embarrassingly parallel –- no communication occurs between zones within a single global iteration.

Each zone solve is self-contained: it assembles its own stiffness matrix, runs forward and adjoint solves, computes the gradient, and performs line search iterations. This is simple and robust but cannot exploit GPU parallelism across zones.

Batched GPU Execution

The batched driver restructures the per-zone CG iteration into three synchronized phases using a Bulk Synchronous Parallel (BSP) model. All zones advance through each phase together before proceeding to the next.

Phase 1 runs forward and adjoint solves on CPU threads. These require sparse direct factorization (complex-valued LU), which is inherently sequential per zone but can be threaded across zones.

Phase 2 launches a single GPU kernel that computes the material gradient for all zones simultaneously. This is the primary GPU acceleration target –- the gradient kernel is compute-bound with an O(27^2) inner loop per Gauss point and maps naturally to GPU parallelism.

Phase 3 performs secant line search on CPU threads, updating material properties per zone.

KernelAbstractions.jl

Sentinel uses KernelAbstractions.jl to write portable GPU kernels. The same mu_gradient_kernel! source code runs on CPU, Metal, and CUDA –- the backend is selected at call time.

julia

# The kernel is defined once using KernelAbstractions syntax
@kernel function mu_gradient_kernel!(grad, elem_data, disp_fwd, disp_adj, @Const(lookup))
    i = @index(Global)
    # ... gradient accumulation over 27x27 shape function pairs per GP
end

Backend selection is handled by KAGradientBackend:

julia

using Metal
backend = KAGradientBackend(MetalBackend())  # Metal GPU

using CUDA
backend = KAGradientBackend(CUDABackend())   # NVIDIA GPU

backend = KAGradientBackend(CPU())            # CPU fallback

The backend is passed via the gradient_backend keyword argument to the batched driver. When no GPU backend is specified, the gradient runs on CPU.

Gradient Kernel Design

The gradient kernel is the most compute-intensive part of the inverse solver (~27% of total runtime in profiling). It computes $\partial Φ / \partial θ$ at each Gauss point by contracting forward and adjoint displacement fields through the element stiffness sensitivity.

Data Structures

Two data layouts support per-zone and batched execution:

ElementGradientData (per-zone): Holds shape function derivatives, Jacobian determinants, and material indices for a single zone. Used in the CPU gradient path and as the building block for batched data.

BatchedElementData (cross-zone): Packs data from all zones into flat arrays with O(1) lookup tables:

work_zone[i] –- which zone does work item i belong to
work_el[i] –- which element within that zone
work_q[i] –- which Gauss point within that element

This flat mapping allows a single GPU kernel launch to cover all Gauss points across all zones without nested indexing or zone-boundary logic.

Inner Loop

At each Gauss point, the kernel performs an O(27^2) double loop over the 27 nodes of the Hex27 element, accumulating the product of shape function derivatives, forward displacements, and adjoint displacements. The result is the gradient contribution at that Gauss point, which is then scattered to the material mesh via the GP2MTR mapping.

Design Principles

The GPU acceleration in Sentinel follows these principles:

Additive, not replacement. GPU code paths are added alongside CPU paths. No existing CPU functionality is modified or removed.
Opt-in via keyword arguments. GPU acceleration is activated by passing a gradient_backend or cudss_batched_solver keyword. Default behavior is always CPU-only.
CPU paths unchanged. All tests pass with and without GPU packages loaded. The test suite runs entirely on CPU.
GPU code lives in extensions. SentinelMetalExt and SentinelCUDAExt are Julia package extensions loaded only when Metal or CUDA is imported. The core Sentinel module has no GPU dependencies.
Dispatch-based backend selection. The _gpu_float_type and _to_gpu_array functions are extended by each GPU backend to handle type conversion and array transfer.

Backend Comparison

Feature	Metal	CUDA
Float precision	Float32 only	Float64
Gradient kernel	Yes (KernelAbstractions)	Yes (KernelAbstractions)
Linear solves	CPU only (Float32 unusable for FEM)	GPU via cuDSS
Batched factorization	No	Yes (CUDSSBatchedSolver)
Iterative solver	No	CUDAKrylovSolver (GMRES + cuSPARSE)
Target hardware	Apple Silicon (M1/M2/M3/M4)	NVIDIA datacenter GPUs
Extension module	`SentinelMetalExt`	`SentinelCUDAExt`

Metal is best suited for gradient-only acceleration on Apple Silicon development machines. CUDA provides full GPU offload including linear solves and is the target for production-scale inversions on NVIDIA hardware.

See Metal Backend and CUDA Backend for backend-specific details.

GPU Acceleration Overview ​

Execution Modes ​

Per-Zone vs Batched Architecture ​

Per-Zone Execution (Serial / Distributed) ​

Batched GPU Execution ​

KernelAbstractions.jl ​

Gradient Kernel Design ​

Data Structures ​

Inner Loop ​

Design Principles ​

Backend Comparison ​