Performance Tuning
This page covers the key settings and configuration choices that affect Sentinel's inverse solver performance, along with benchmark results on real brain MRE data.
BLAS Backend
The BLAS backend for dense linear algebra has a significant impact on solve time.
AppleAccelerate (macOS)
On macOS with Apple Silicon, the AppleAccelerate BLAS is loaded automatically when available. It provides approximately 13% faster sparse factorization and solve operations compared to OpenBLAS, and handles internal threading correctly without the overhead issues that plague OpenBLAS on small matrices.
No configuration is needed –- Julia 1.11+ detects and uses Accelerate automatically on supported systems.
OpenBLAS (Linux / default)
On Linux and other platforms, Julia uses OpenBLAS by default. The critical setting is to restrict BLAS to a single thread during zone solves:
using LinearAlgebra: BLAS
BLAS.set_num_threads(1)OpenBLAS multi-threaded mode incurs massive overhead for zone-sized problems (500–5000 DOFs). With 8 BLAS threads, zone iteration time increases by approximately 2.6x compared to single-threaded BLAS. This is because the thread-spawn cost exceeds the actual computation time for these small systems.
run_inverse_solve! sets BLAS.set_num_threads(1) automatically during zone solves and restores the original count afterward. When using Distributed.jl, also set threads on workers via @everywhere BLAS.set_num_threads(1).
Solver Selection
Sentinel provides three sparse direct solver backends via the AbstractLinearSolver interface:
| Solver | Use Case | Factorization | Re-solve |
|---|---|---|---|
DirectSolver | One-off solves | ~55 ms | ~55 ms (re-factors) |
CachedDirectSolver | Production inverse solves | ~55 ms | ~0.2 ms (reuses LU) |
MUMPSSolver | Very large systems (>500K DOFs) | Varies | Varies |
Recommendation: Always use CachedDirectSolver for inverse problems. Each zone iteration performs paired forward + adjoint solves on the same stiffness matrix K. The cached solver factors K once during the forward solve and reuses the LU factorization for the adjoint, eliminating redundant factorization.
run_inverse_solve! uses CachedDirectSolver by default. You can override this via the solver keyword argument:
# Default (recommended)
run_inverse_solve!(setup) # uses CachedDirectSolver internally
# Override with MUMPS for very large problems
using MUMPS
run_inverse_solve!(setup; solver=MUMPSSolver(sym=2))Profile Breakdown
Profiling on the MGH brain MRE dataset (isotropic incompressible, ~171K material nodes, ~600 zones per iteration) shows the following time distribution per zone iteration:
| Phase | Share | Notes |
|---|---|---|
| Gradient computation | 27% | Sensitivity matrix assembly + material derivatives |
| Line search | 24% | Forward solves at trial step sizes |
| Linear solve (forward) | 20% | Sparse LU factorization + solve |
| Adjoint solve | 10% | Reuses cached LU factorization |
| Stiffness assembly | 6% | Element loop over hex27 elements |
| Other (scatter, consolidate, I/O) | 13% | Zone setup, accumulation, convergence check |
The gradient and line search phases dominate. GPU acceleration targets the gradient computation (via batched KernelAbstractions.jl kernels), while the linear solve benefits most from LU caching.
Benchmark Results
All benchmarks use the MGH brain MRE dataset: isotropic incompressible model, ~171K material mesh nodes (Julia) / ~162K nodes (Fortran), ~290 zones per global iteration, 100 global iterations.
CPU and Metal (Mac Studio, M2 Ultra)
Per global iteration, CG=1 phase (first 10 iterations):
| Configuration | Machine | Time/Iter | Speedup vs Serial |
|---|---|---|---|
| Julia serial, BLAS=8t | ms-a2 (16c/32t, 92 GB) | ~614 s | 0.4x (slower) |
| Julia serial, BLAS=1t | ms-a2 | ~233 s | 1.0x (baseline) |
| Fortran MPI, 14 workers | ms-a2 | ~72 s | 3.2x |
| Julia Distributed, 14 workers | ms-a2 | ~72 s | 3.2x |
| Per-zone Metal GPU | Mac Studio (M2 Ultra) | ~167 s | 1.4x |
| Batched GPU, 8 CPU threads (OpenBLAS) | Mac Studio | ~112 s | 2.1x |
| Batched GPU, 8 CPU threads (Accelerate) | Mac Studio | ~98 s | 2.4x |
CUDA GPU (DGX Spark, GB10)
Full 100-iteration run totals (10 CPU threads):
| Configuration | Total Time | Per-iter (CG=2) | Speedup |
|---|---|---|---|
| CPU baseline (CachedDirectSolver) | ~330 min | ~200s | 1.0x |
| Fortran MPI, 14 workers | ~240 min | ~144s | 1.4x |
| cuDSS batched + CPU line search | 222 min | ~140s | 1.5x |
| cuDSS batched + GPU line search | 158 min | ~100s | 2.1x |
The GPU line search uses CUDSSDirectSolver with per-solver CUDA stream isolation for thread-safe concurrent zone solves. See CUDA Backend for implementation details.
Key Observations
Julia matches Fortran MPI at the same worker count (~72 s/iter with 14 workers), validating the port's computational efficiency.
BLAS threading matters enormously: 8 BLAS threads is 2.6x slower than 1 thread for serial execution. This is the most common performance pitfall.
Distributed.jl provides the best absolute performance on multi-core CPU servers, with near-linear scaling up to the number of physical cores.
CUDA GPU achieves 2.1x speedup on a single DGX Spark node, with the batched cuDSS solver providing the dominant acceleration in Phase 1 and GPU line search further reducing per-iteration time.
Batched GPU with AppleAccelerate achieves 2.4x speedup on a single Mac Studio, competitive with 6–7 CPU workers.
Julia vs Fortran Accuracy
At 100 iterations on the MGH brain MRE dataset, Julia and Fortran produce visually indistinguishable results with quantitative agreement:
| Metric | Value |
|---|---|
| Mean relative difference | 2.0% |
| Spatial distribution | Range and histogram match |
| Convergence trajectory | Parallel objective function curves |
The 2.0% mean difference is attributable to:
Material mesh size: Julia generates 171K nodes vs Fortran's 162K nodes due to slightly different bounding-box rounding in mesh generation. This changes the effective spatial resolution of the reconstruction.
Floating-point ordering: Zone solve order and accumulation order differ between the implementations, causing small differences that compound over 100 iterations.
Random seed propagation: Both use Park-Miller RNG for zone grid seeds, but minor initialization differences can shift zone boundaries.
These differences are well within the noise floor of MRE measurements and do not affect clinical or research interpretation of the results.