Parallel Execution

Sentinel supports several execution modes for the zone decomposition inverse solver, from single-core serial to multi-worker distributed and GPU-accelerated. This page describes each mode and when to use it.

Execution Modes

Mode	Function	Workers	Best For
Serial	`run_inverse_solve!(setup)`	1	Debugging, development, small problems
Distributed.jl	`run_inverse_solve!(setup; pmap_func=pmap)`	N CPU	Production on multi-core servers
Batched GPU	`zone_decomposition_solve_batched_gpu!`	GPU + CPU threads	Apple Silicon or CUDA systems

All modes produce identical results (up to floating-point ordering). The zone decomposition algorithm is embarrassingly parallel at the zone level: each zone is an independent inverse sub-problem, so parallelism scales nearly linearly with the number of workers.
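As a rough illustration of why zone-level parallelism is straightforward, the sketch below maps a stand-in per-zone function over a list of toy zones, once with `map` and once with `pmap`. `solve_zone` and the zone data are hypothetical placeholders, not Sentinel APIs; the only point is that each result depends on its own zone alone, so either mapping function gives the same answer.

```julia
using Distributed   # standard library; with no workers added, pmap runs on the master process

# Hypothetical stand-in for one zone's inverse sub-problem: it reads only its own zone data.
@everywhere solve_zone(zone) = sum(abs2, zone)

zones = [collect(1.0:n) for n in 1:4]        # four toy "zones" of different sizes

serial_results   = map(solve_zone, zones)    # one zone after another
parallel_results = pmap(solve_zone, zones)   # zones handed out to whichever workers exist

serial_results == parallel_results           # the two agree because zones are independent
```

This is the same substitution that the `pmap_func` keyword described below makes possible: the per-zone work is unchanged and only the mapping function differs.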

Serial

The default mode runs all zones sequentially on a single core:

julia

using Sentinel

prob = load_inversion("runfile.dat")
setup = setup_inverse_problem(prob, datadir)
material, disp, state = run_inverse_solve!(setup)

Serial mode is useful for debugging, profiling, and validating results against the Fortran reference. It requires no additional setup beyond the base Sentinel package.
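When validating one run against another (serial against distributed, or against a reference solution), exact equality is too strict because results agree only up to floating-point ordering. A minimal sketch of a tolerance-based comparison follows. `results_match` and the sample arrays are illustrative and not part of Sentinel, and the tolerance is an assumption to adjust for your problem.

```julia
using LinearAlgebra   # provides the norm-based isapprox method for arrays

# Hypothetical helper: compare two result arrays with a relative tolerance.
function results_match(a::AbstractArray, b::AbstractArray; rtol=1e-10)
    size(a) == size(b) || return false
    return isapprox(a, b; rtol=rtol)
end

reference = [1.0, 2.0, 3.0]
reordered = [1.0, 2.0, 3.0 + 1e-14]            # differs only at round-off level

results_match(reference, reordered)            # true
results_match(reference, [1.0, 2.0, 3.1])      # false
```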

Distributed.jl

For multi-core CPU servers, Distributed.jl provides the best performance. Each worker process handles a subset of zones in parallel via pmap:

julia

using Distributed
addprocs(14)  # add 14 worker processes

@everywhere using Sentinel
@everywhere using LinearAlgebra: BLAS
@everywhere BLAS.set_num_threads(1)  # critical for performance

# Sentinel is already loaded on the master process by the @everywhere call above.
prob = load_inversion("runfile.dat")
setup = setup_inverse_problem(prob, datadir)
material, disp, state = run_inverse_solve!(setup; pmap_func=pmap)

Key points:

@everywhere using Sentinel must run before run_inverse_solve! is called, so that the full module is loaded on every worker.
BLAS.set_num_threads(1) is critical; see the BLAS Threading Warning below.
Worker count should match the number of available cores minus one (the master process coordinates but does not solve zones). On a 16-core machine, 14–15 workers are typical.
run_inverse_solve! automatically sets BLAS threads to 1 during zone solves and restores the original count afterward. The @everywhere call above ensures workers start with the correct setting.
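The worker-count guidance above can be sketched as a small helper. `suggested_workers` is a hypothetical name, not a Sentinel function. Note also that `Sys.CPU_THREADS` counts logical threads, which may exceed the number of physical cores on machines with simultaneous multithreading.

```julia
# Hypothetical helper: leave one core for the coordinating master process.
suggested_workers(ncores::Integer) = max(ncores - 1, 1)

suggested_workers(16)               # 15
suggested_workers(1)                # 1 (never ask for zero workers)

n = suggested_workers(Sys.CPU_THREADS)
# using Distributed; addprocs(n)    # uncomment to actually start the workers
```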

Batched GPU

The batched GPU driver partitions zone solves into phased execution: GPU-accelerated linear solves and gradient computation interleaved with CPU-threaded assembly and line search.

CUDA (Recommended for NVIDIA GPUs)

julia

using CUDA, CUDSS, Sentinel, KernelAbstractions

prob = load_inversion("runfile.dat")
setup = setup_inverse_problem(prob, datadir)

material, disp, state = run_inverse_solve!(setup;
    solver=CachedDirectSolver(),
    gradient_backend=KAGradientBackend(CUDABackend()),
    cudss_batched_solver=CUDSSBatchedSolver(structure="G"),
    verbose=true)

Metal (Apple Silicon)

julia

using Metal, Sentinel, KernelAbstractions

material, disp, state = zone_decomposition_solve_batched_gpu!(
    material, grid, dh, gp2mtrs, meshes,
    meas, bcs, model, omega, rho, zone_config;
    solver=CachedDirectSolver(),
    gradient_backend=KAGradientBackend(MetalBackend()),
    cudss_batched_solver=CPUBatchedSolver())   # batched line search (the Mac speedup)

Pass a CPUBatchedSolver on Apple Silicon

cuDSS is CUDA-only, so on a Mac the linear solves run on the CPU either way, but how they are driven matters. Passing a CPUBatchedSolver (the cudss_batched_solver keyword accepts any batched solver; the name is historical) switches the line search from the per-zone secant, which re-runs a full lu and a per-zone gradient on every trial, to the lockstep batched secant: one batched Metal gradient dispatch per tick plus a threaded CPU factor/solve that reuses the symbolic analysis. On the 2-iteration MGH brain reconstruction (M2 Max, 8 threads) this is about 2.5× faster (171 s → 67 s per run), with bit-identical results. Without it, the driver falls back to the slower per-zone secant. Pressure condensation (see Performance Tuning) also engages automatically for Model-1 μ-only reconstructions on this path.

Phased Execution

The batched driver runs in four phases. Phase 0 runs once per global iteration; Phases 1–3 repeat for every CG iteration:

Phase 0 (CPU, threaded): Build zone subproblems and concatenate element data for GPU. Runs once per global iteration.
Phase 1 (GPU or CPU): Forward and adjoint solves. With a batched solver (CUDSSBatchedSolver on CUDA, CPUBatchedSolver on Apple Silicon / CPU), all zones are factored and solved as a batch — on the GPU for cuDSS, threaded on the CPU for CPUBatchedSolver. With no batched solver, zones are solved per-zone on the CPU with Threads.@threads.
Phase 2 (GPU): Batched gradient computation using a KernelAbstractions.jl kernel across all zones in a single dispatch.
Phase 3 (CPU, threaded): CG direction and line search. With a batched solver, the whole batch advances in lockstep through one batched secant line search (one batched gradient dispatch per tick) — on CUDA each trial reuses the cuDSS analysis; with CPUBatchedSolver the threaded CPU factor reuses the symbolic analysis (lu!). With no batched solver, each zone runs its own per-zone secant (a full lu per trial) via CachedDirectSolver.

After all CG iterations complete, zone results are consolidated on the master thread.

The batched driver automatically sets BLAS.set_num_threads(1) for the CPU phases, matching the behavior of the serial and distributed drivers.
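To make the lockstep idea in Phase 3 more concrete, here is a toy sketch of a batched secant line search. It is not Sentinel's driver: small dense quadratics stand in for zone problems, and `ToyZone`, `batched_gradient`, and `lockstep_secant` are hypothetical names. What it shows is the control flow, in which every tick issues one gradient evaluation for the whole batch and each zone then takes its own secant step on the directional derivative.

```julia
using LinearAlgebra

# Toy per-zone objective f(x) = 0.5*x'A*x - b'x, whose gradient is A*x - b.
struct ToyZone
    A::Matrix{Float64}
    b::Vector{Float64}
end

# Stand-in for the single batched gradient dispatch: one call evaluates every zone.
batched_gradient(zones, xs) = [z.A * x - z.b for (z, x) in zip(zones, xs)]

# Lockstep secant line search: on each tick all zones advance together.
function lockstep_secant(zones, xs, ds; α0=1.0, ticks=5, tol=1e-12)
    n = length(zones)
    αprev = zeros(n)
    g0 = batched_gradient(zones, xs)
    φprev = [dot(g0[i], ds[i]) for i in 1:n]     # directional derivative at α = 0
    α = fill(α0, n)
    for _ in 1:ticks
        trial = [xs[i] + α[i] * ds[i] for i in 1:n]
        g = batched_gradient(zones, trial)       # one batched evaluation per tick
        φ = [dot(g[i], ds[i]) for i in 1:n]
        all(abs.(φ) .< tol) && break
        for i in 1:n
            denom = φ[i] - φprev[i]
            abs(denom) < eps() && continue       # this zone is converged or flat
            αnew = α[i] - φ[i] * (α[i] - αprev[i]) / denom
            αprev[i], φprev[i], α[i] = α[i], φ[i], αnew
        end
    end
    return α
end

zones = [ToyZone([2.0 0.0; 0.0 4.0], [2.0, 4.0]),
         ToyZone([1.0 0.0; 0.0 1.0], [1.0, 1.0])]
xs = [zeros(2), zeros(2)]
ds = [-g for g in batched_gradient(zones, xs)]   # steepest-descent directions
αs = lockstep_secant(zones, xs, ds)              # one step length per zone
```

For quadratics the directional derivative is linear in α, so a single secant step lands on the exact minimizer; real zone objectives generally need several ticks.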

BLAS Threading Warning

BLAS.set_num_threads(1) is mandatory for zone solves

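OpenBLAS multi-threaded mode is 24x slower for zone-sized linear systems (typically 500–5000 DOFs) because thread-management overhead exceeds the computation time. The end-to-end effect is smaller but still large: in the example below, a full iteration takes about 2.6x longer with 8 BLAS threads than with 1. This is the single most impactful performance setting.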

The issue arises because each zone's direct solve involves small-to-moderate sparse matrices. OpenBLAS spawns threads for each BLAS call, but the matrix operations complete faster than the thread management overhead.

julia

using LinearAlgebra: BLAS

# Bad: ~614 seconds/iteration with 8 BLAS threads
BLAS.set_num_threads(8)
run_inverse_solve!(setup)

# Good: ~233 seconds/iteration with 1 BLAS thread
BLAS.set_num_threads(1)
run_inverse_solve!(setup)

Sentinel handles this automatically: run_inverse_solve! saves the current BLAS thread count, sets it to 1 before zone solves, and restores it afterward. However, if you call zone-level functions directly, you must call BLAS.set_num_threads(1) yourself.
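For direct zone-level calls, the save/set/restore pattern described above can be reproduced with a small wrapper. This is a sketch of the pattern rather than Sentinel's internal code, and `with_single_blas_thread` is a hypothetical name. `BLAS.get_num_threads` and `BLAS.set_num_threads` are the standard LinearAlgebra functions.

```julia
using LinearAlgebra: BLAS

# Hypothetical helper: run f() with one BLAS thread, then restore the previous setting.
function with_single_blas_thread(f)
    previous = BLAS.get_num_threads()
    BLAS.set_num_threads(1)
    try
        return f()
    finally
        BLAS.set_num_threads(previous)   # restored even if f() throws
    end
end

# Usage: wrap any direct zone-level call (a placeholder computation here).
threads_inside = with_single_blas_thread() do
    BLAS.get_num_threads()
end
```

The `try`/`finally` block matters because a failed zone solve would otherwise leave the process pinned to one BLAS thread for everything that runs afterward.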

On macOS, AppleAccelerate handles threading correctly and does not suffer from this issue. See Performance Tuning for details.

Choosing a Mode

Hardware	Recommended Mode	Expected Performance
Laptop / single core	Serial	Baseline
Multi-core server	Distributed.jl (N-1 workers)	Near-linear speedup
Apple Silicon Mac	Batched GPU (Metal)	2–3x over serial
NVIDIA GPU server	Batched GPU (CUDA)	Gradient and batched linear solves on GPU (with CUDSSBatchedSolver)

For most users, Distributed.jl on a multi-core machine provides the best balance of performance and simplicity. The GPU modes are beneficial when gradient computation dominates (many zones, large material meshes) and a capable GPU is available. See Performance Tuning for benchmark numbers.
