Parallel Execution
Sentinel supports several execution modes for the zone decomposition inverse solver, from single-core serial to multi-worker distributed and GPU-accelerated. This page describes each mode and when to use it.
Execution Modes
| Mode | Function | Workers | Best For |
|---|---|---|---|
| Serial | run_inverse_solve!(setup) | 1 | Debugging, development, small problems |
| Distributed.jl | run_inverse_solve!(setup; pmap_func=pmap) | N CPU | Production on multi-core servers |
| Batched GPU | zone_decomposition_solve_batched_gpu! | GPU + CPU threads | Apple Silicon or CUDA systems |
All modes produce identical results (up to floating-point ordering). The zone decomposition algorithm is embarrassingly parallel at the zone level –- each zone is an independent inverse sub-problem –- so parallelism scales nearly linearly with the number of workers.
Serial
The default mode runs all zones sequentially on a single core:
using Sentinel
config = parse_inverse_runfile("runfile.dat")
setup = setup_inverse_problem(config, datadir)
material, disp, state = run_inverse_solve!(setup)Serial mode is useful for debugging, profiling, and validating results against the Fortran reference. It requires no additional setup beyond the base Sentinel package.
Distributed.jl
For multi-core CPU servers, Distributed.jl provides the best performance. Each worker process handles a subset of zones in parallel via pmap:
using Distributed
addprocs(14) # add 14 worker processes
@everywhere using Sentinel
@everywhere using LinearAlgebra: BLAS
@everywhere BLAS.set_num_threads(1) # critical for performance
using Sentinel
config = parse_inverse_runfile("runfile.dat")
setup = setup_inverse_problem(config, datadir)
material, disp, state = run_inverse_solve!(setup; pmap_func=pmap)Key points:
@everywhere using Sentinelmust load Sentinel on all workers before callingrun_inverse_solve!. Each worker needs the full module available.BLAS.set_num_threads(1)is critical –- see the BLAS Threading Warning below.Worker count should match available cores minus one (the master process coordinates but does not solve zones). On a 16-core machine, 14–15 workers is typical.
run_inverse_solve!automatically sets BLAS threads to 1 during zone solves and restores the original count afterward. The@everywherecall above ensures workers start with the correct setting.
Batched GPU
The batched GPU driver partitions zone solves into phased execution: GPU-accelerated linear solves and gradient computation interleaved with CPU-threaded assembly and line search.
CUDA (Recommended for NVIDIA GPUs)
using CUDA, CUDSS, Sentinel, KernelAbstractions
config = parse_inverse_runfile("runfile.dat")
setup = setup_inverse_problem(config, data_dir)
material, disp, state = run_inverse_solve!(setup;
solver=CachedDirectSolver(),
gradient_backend=KAGradientBackend(CUDABackend()),
cudss_batched_solver=CUDSSBatchedSolver(structure="G"),
verbose=true)Metal (Apple Silicon)
using Metal, Sentinel, KernelAbstractions
material, disp, state = zone_decomposition_solve_batched_gpu!(
material, grid, dh, gp2mtrs, meshes,
meas, bcs, model, omega, rho, zone_config;
solver=CachedDirectSolver(),
gradient_backend=KAGradientBackend(MetalBackend()))Phased Execution
The batched driver runs in four phases per CG iteration:
Phase 0 (CPU, threaded): Build zone subproblems and concatenate element data for GPU. Runs once per global iteration.
Phase 1 (GPU or CPU): Forward and adjoint solves. With
cudss_batched_solver, all zones are factored and solved on the GPU simultaneously. Without it, zones are solved on the CPU withThreads.@threads.Phase 2 (GPU): Batched gradient computation using a KernelAbstractions.jl kernel across all zones in a single dispatch.
Phase 3 (CPU, threaded): Per-zone CG direction and line search. With CUDA, trial forward solves use
CUDSSDirectSolver(one per zone, each on its own CUDA stream). With Metal or CPU, usesCachedDirectSolver.
After all CG iterations complete, zone results are consolidated on the master thread.
The batched driver automatically sets BLAS.set_num_threads(1) for the CPU phases, matching the behavior of the serial and distributed drivers.
BLAS Threading Warning
BLAS.set_num_threads(1) is mandatory for zone solves
OpenBLAS multi-threaded mode is 24x slower for zone-sized linear systems (typically 500–5000 DOFs) due to thread-spawn overhead exceeding the computation time. This is the single most impactful performance setting.
The issue arises because each zone's direct solve involves small-to-moderate sparse matrices. OpenBLAS spawns threads for each BLAS call, but the matrix operations complete faster than the thread management overhead.
# Bad: ~614 seconds/iteration with 8 BLAS threads
BLAS.set_num_threads(8)
run_inverse_solve!(setup)
# Good: ~233 seconds/iteration with 1 BLAS thread
BLAS.set_num_threads(1)
run_inverse_solve!(setup)Sentinel handles this automatically: run_inverse_solve! saves the current BLAS thread count, sets it to 1 before zone solves, and restores it afterward. However, if you are calling zone-level functions directly, you must set BLAS.set_num_threads(1) yourself.
On macOS, AppleAccelerate handles threading correctly and does not suffer from this issue. See Performance Tuning for details.
Choosing a Mode
| Hardware | Recommended Mode | Expected Performance |
|---|---|---|
| Laptop / single core | Serial | Baseline |
| Multi-core server | Distributed.jl (N-1 workers) | Near-linear speedup |
| Apple Silicon Mac | Batched GPU (Metal) | 2–3x over serial |
| NVIDIA GPU server | Batched GPU (CUDA) | Gradient on GPU, solves on CPU |
For most users, Distributed.jl on a multi-core machine provides the best balance of performance and simplicity. The GPU modes are beneficial when gradient computation dominates (many zones, large material meshes) and a capable GPU is available. See Performance Tuning for benchmark numbers.