◆ FORGE Suite
GitHubMechanical Neuroimaging Lab · Univ. of Delaware
Skip to content

Metal Backend

The Metal backend accelerates Sentinel on Apple Silicon Macs using Metal.jl and KernelAbstractions.jl.

Requirements

  • Hardware: Apple Silicon Mac (M1, M2, M3, or M4 series)

  • Julia: 1.10 or later

  • Packages: Metal.jl (triggers the SentinelMetalExt extension)

julia
using Pkg
Pkg.add("Metal")

When Metal is loaded alongside Sentinel, the SentinelMetalExt extension is activated automatically via Julia's package extension mechanism.

What Works

The Metal backend accelerates the gradient kernel via KernelAbstractions. The same mu_gradient_kernel! that runs on CPU and CUDA also runs on Metal, computing Φ/θ across all Gauss points in a single GPU dispatch.

Measured speedups for the gradient kernel alone (vs single-threaded CPU):

Mesh size (elements)CPU gradient (ms)Metal gradient (ms)Speedup
Small (~500)120.5522x
Medium (~2,000)481.434x
Large (~8,000)2104.448x

Speedup scales with mesh size because larger meshes better saturate the GPU's compute units. The batched driver combines Metal gradient with CPU-threaded forward/adjoint and line search phases (see GPU Acceleration Overview).

Float32 Limitation

Metal GPUs support Float32 only –- there is no Float64 (double precision) hardware on Apple Silicon GPUs. This has significant implications:

Gradient: Float32 is acceptable

The gradient kernel accumulates products of shape function derivatives and displacement fields. Float32 introduces small rounding errors in each Gauss point contribution, but these are averaged over many Gauss points and many CG iterations. In practice, Float32 gradient values are sufficiently accurate to drive convergence of the optimizer.

Linear solves: Float32 is NOT acceptable

FEM stiffness matrices for viscoelastic MRE are complex-valued and ill-conditioned. Dense LU factorization in Float32 produces ~18% solution error compared to Float64 –- far too large for a forward solver where sub-percent accuracy is required.

We investigated Apple's Metal Performance Shaders (MPS) for GPU-accelerated dense LU, but MPS enforces a hard constraint:

"Only MPSDataTypeFloat32 is supported" –- MPS assertion error when attempting Float64 input

This rules out GPU-accelerated linear solves on Metal. Forward and adjoint solves remain on CPU using Float64 sparse LU (UMFPACK or Accelerate).

AppleAccelerate Integration

On macOS, Sentinel automatically loads AppleAccelerate.jl to replace OpenBLAS with Apple's Accelerate framework for BLAS and LAPACK operations:

julia
@static if Sys.isapple()
    using AppleAccelerate
end

This provides two benefits:

  1. ~13% faster sparse LU factorization. Accelerate's LAPACK is optimized for Apple Silicon's memory hierarchy and produces measurably faster LU factorization than OpenBLAS for the complex-valued sparse matrices in Sentinel.

  2. Eliminates OpenBLAS multi-threading pathology. OpenBLAS with multiple BLAS threads is slower than single-threaded for the small per-zone matrices in zone decomposition (overhead exceeds benefit). Accelerate's threading model does not exhibit this pathology, so BLAS.set_num_threads(1) is not needed when using Accelerate.

AppleAccelerate is loaded automatically –- no user configuration required.

Usage Example

julia
using Sentinel
using Metal

# Create the gradient backend
backend = KAGradientBackend(MetalBackend())

# Run batched GPU zone decomposition
material, disp, state = zone_decomposition_solve_batched_gpu!(
    material, grid, dh, gp2mtrs, meshes, meas, bcs, model, omega, rho, config;
    gradient_backend = backend,
    verbose = true
)

The gradient_backend keyword tells the batched driver to offload Phase 2 (gradient computation) to the Metal GPU. Phases 1 and 3 (forward/adjoint solves and line search) run on CPU threads.

For CPU-only execution with Accelerate benefits, simply omit the gradient_backend keyword –- AppleAccelerate loads automatically on macOS.

Performance

Best measured performance on a Mac Studio M2 Max (12 GPU cores, 64 GB RAM) for the MGH brain MRE dataset (276 zones, ~500 elements per zone):

ConfigurationTime per global iteration
Serial, OpenBLAS, 1 thread~850 s
Serial, Accelerate, 1 thread~740 s
Threaded (8), Accelerate, CPU gradient~120 s
Threaded (8), Accelerate, Metal gradient~98 s

The Metal gradient kernel saves approximately 18% compared to CPU gradient when combined with 8-thread CPU phases. The majority of per-iteration time is spent in linear solves (forward + adjoint), which remain on CPU.

Extension: SentinelMetalExt

The SentinelMetalExt package extension provides two methods that integrate Metal into the KernelAbstractions dispatch system:

julia
# Float type: Metal only supports Float32
Sentinel._gpu_float_type(::MetalBackend) = Float32

# Array transfer: convert CPU arrays to MtlArray on the Metal device
Sentinel._to_gpu_array(::MetalBackend, x::AbstractArray) = MtlArray(x)

These are called internally by the batched driver and gradient kernel infrastructure. Users interact with the Metal backend through KAGradientBackend(MetalBackend()) and do not call these functions directly.

Limitations

  • No GPU linear solves. Float32 precision is insufficient for FEM stiffness matrix factorization. All linear algebra remains on CPU.

  • No sparse operations on GPU. Metal does not provide sparse matrix libraries comparable to cuSPARSE.

  • Gradient only. The GPU acceleration is limited to the gradient kernel (Phase 2 of the batched driver).

For full GPU offload including linear solves and batched factorization, see the CUDA Backend.