◆ FORGE Suite
GitHubMechanical Neuroimaging Lab · Univ. of Delaware
Skip to content

Parallel Execution

Sentinel supports several execution modes for the zone decomposition inverse solver, from single-core serial to multi-worker distributed and GPU-accelerated. This page describes each mode and when to use it.

Execution Modes

ModeFunctionWorkersBest For
Serialrun_inverse_solve!(setup)1Debugging, development, small problems
Distributed.jlrun_inverse_solve!(setup; pmap_func=pmap)N CPUProduction on multi-core servers
Batched GPUzone_decomposition_solve_batched_gpu!GPU + CPU threadsApple Silicon or CUDA systems

All modes produce identical results (up to floating-point ordering). The zone decomposition algorithm is embarrassingly parallel at the zone level –- each zone is an independent inverse sub-problem –- so parallelism scales nearly linearly with the number of workers.

Serial

The default mode runs all zones sequentially on a single core:

julia
using Sentinel

config = parse_inverse_runfile("runfile.dat")
setup = setup_inverse_problem(config, datadir)
material, disp, state = run_inverse_solve!(setup)

Serial mode is useful for debugging, profiling, and validating results against the Fortran reference. It requires no additional setup beyond the base Sentinel package.

Distributed.jl

For multi-core CPU servers, Distributed.jl provides the best performance. Each worker process handles a subset of zones in parallel via pmap:

julia
using Distributed
addprocs(14)  # add 14 worker processes

@everywhere using Sentinel
@everywhere using LinearAlgebra: BLAS
@everywhere BLAS.set_num_threads(1)  # critical for performance

using Sentinel
config = parse_inverse_runfile("runfile.dat")
setup = setup_inverse_problem(config, datadir)
material, disp, state = run_inverse_solve!(setup; pmap_func=pmap)

Key points:

  • @everywhere using Sentinel must load Sentinel on all workers before calling run_inverse_solve!. Each worker needs the full module available.

  • BLAS.set_num_threads(1) is critical –- see the BLAS Threading Warning below.

  • Worker count should match available cores minus one (the master process coordinates but does not solve zones). On a 16-core machine, 14–15 workers is typical.

  • run_inverse_solve! automatically sets BLAS threads to 1 during zone solves and restores the original count afterward. The @everywhere call above ensures workers start with the correct setting.

Batched GPU

The batched GPU driver partitions zone solves into phased execution: GPU-accelerated linear solves and gradient computation interleaved with CPU-threaded assembly and line search.

julia
using CUDA, CUDSS, Sentinel, KernelAbstractions

config = parse_inverse_runfile("runfile.dat")
setup = setup_inverse_problem(config, data_dir)

material, disp, state = run_inverse_solve!(setup;
    solver=CachedDirectSolver(),
    gradient_backend=KAGradientBackend(CUDABackend()),
    cudss_batched_solver=CUDSSBatchedSolver(structure="G"),
    verbose=true)

Metal (Apple Silicon)

julia
using Metal, Sentinel, KernelAbstractions

material, disp, state = zone_decomposition_solve_batched_gpu!(
    material, grid, dh, gp2mtrs, meshes,
    meas, bcs, model, omega, rho, zone_config;
    solver=CachedDirectSolver(),
    gradient_backend=KAGradientBackend(MetalBackend()))

Phased Execution

The batched driver runs in four phases per CG iteration:

  1. Phase 0 (CPU, threaded): Build zone subproblems and concatenate element data for GPU. Runs once per global iteration.

  2. Phase 1 (GPU or CPU): Forward and adjoint solves. With cudss_batched_solver, all zones are factored and solved on the GPU simultaneously. Without it, zones are solved on the CPU with Threads.@threads.

  3. Phase 2 (GPU): Batched gradient computation using a KernelAbstractions.jl kernel across all zones in a single dispatch.

  4. Phase 3 (CPU, threaded): Per-zone CG direction and line search. With CUDA, trial forward solves use CUDSSDirectSolver (one per zone, each on its own CUDA stream). With Metal or CPU, uses CachedDirectSolver.

After all CG iterations complete, zone results are consolidated on the master thread.

The batched driver automatically sets BLAS.set_num_threads(1) for the CPU phases, matching the behavior of the serial and distributed drivers.

BLAS Threading Warning

BLAS.set_num_threads(1) is mandatory for zone solves

OpenBLAS multi-threaded mode is 24x slower for zone-sized linear systems (typically 500–5000 DOFs) due to thread-spawn overhead exceeding the computation time. This is the single most impactful performance setting.

The issue arises because each zone's direct solve involves small-to-moderate sparse matrices. OpenBLAS spawns threads for each BLAS call, but the matrix operations complete faster than the thread management overhead.

julia
# Bad: ~614 seconds/iteration with 8 BLAS threads
BLAS.set_num_threads(8)
run_inverse_solve!(setup)

# Good: ~233 seconds/iteration with 1 BLAS thread
BLAS.set_num_threads(1)
run_inverse_solve!(setup)

Sentinel handles this automatically: run_inverse_solve! saves the current BLAS thread count, sets it to 1 before zone solves, and restores it afterward. However, if you are calling zone-level functions directly, you must set BLAS.set_num_threads(1) yourself.

On macOS, AppleAccelerate handles threading correctly and does not suffer from this issue. See Performance Tuning for details.

Choosing a Mode

HardwareRecommended ModeExpected Performance
Laptop / single coreSerialBaseline
Multi-core serverDistributed.jl (N-1 workers)Near-linear speedup
Apple Silicon MacBatched GPU (Metal)2–3x over serial
NVIDIA GPU serverBatched GPU (CUDA)Gradient on GPU, solves on CPU

For most users, Distributed.jl on a multi-core machine provides the best balance of performance and simplicity. The GPU modes are beneficial when gradient computation dominates (many zones, large material meshes) and a capable GPU is available. See Performance Tuning for benchmark numbers.