Performance Tuning

This page covers the key settings and configuration choices that affect Sentinel's inverse solver performance, along with benchmark results on real brain MRE data.

BLAS Backend

The BLAS backend for dense linear algebra has a significant impact on solve time.

AppleAccelerate (macOS)

On macOS with Apple Silicon, the AppleAccelerate BLAS is loaded automatically when available. It provides approximately 13% faster sparse factorization and solve operations compared to OpenBLAS, and handles internal threading correctly without the overhead issues that plague OpenBLAS on small matrices.

No configuration is needed: Julia 1.11+ detects and uses Accelerate automatically on supported systems.

OpenBLAS (Linux / default)

On Linux and other platforms, Julia uses OpenBLAS by default. The critical setting is to restrict BLAS to a single thread during zone solves:

```julia
using LinearAlgebra: BLAS
BLAS.set_num_threads(1)
```

OpenBLAS multi-threaded mode incurs massive overhead for zone-sized problems (500–5000 DOFs). With 8 BLAS threads, zone iteration time increases by approximately 2.6x compared to single-threaded BLAS. This is because the thread-spawn cost exceeds the actual computation time for these small systems.

`run_inverse_solve!` sets `BLAS.set_num_threads(1)` automatically during zone solves and restores the original thread count afterward. When using Distributed.jl, also set the thread count on each worker with `@everywhere BLAS.set_num_threads(1)`.
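The save-and-restore behaviour described above can be sketched as follows. This is a minimal illustration rather than Sentinel's implementation: `with_single_blas_thread` is a hypothetical helper name, and the only library calls used are `BLAS.get_num_threads` and `BLAS.set_num_threads` from the standard library.

```julia
using LinearAlgebra

# Hypothetical helper: run `f` with BLAS pinned to one thread, then restore
# whatever thread count was active before the call.
function with_single_blas_thread(f)
    previous = BLAS.get_num_threads()
    BLAS.set_num_threads(1)
    try
        return f()
    finally
        BLAS.set_num_threads(previous)  # restored even if `f` throws
    end
end

# Usage: a zone-sized dense solve (synthetic, well-conditioned matrix)
A = rand(500, 500) + 500I
b = rand(500)
x = with_single_blas_thread() do
    A \ b
end
```

The `try`/`finally` form matters when a zone solve can throw, for example on a singular matrix: without it, the process would be left in single-threaded BLAS mode for all later work.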

Solver Selection

Sentinel provides three sparse direct solver backends via the AbstractLinearSolver interface:

Solver	Use Case	Factorization	Re-solve
`DirectSolver`	One-off solves	~55 ms	~55 ms (re-factors)
`CachedDirectSolver`	Production inverse solves	~55 ms	~0.2 ms (reuses LU)
`MUMPSSolver`	Very large systems (>500K DOFs)	Varies	Varies

Recommendation: Always use CachedDirectSolver for inverse problems. Each zone iteration performs paired forward + adjoint solves on the same stiffness matrix K. The cached solver factors K once during the forward solve and reuses the LU factorization for the adjoint, eliminating redundant factorization.
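The factor-once idea can be illustrated with Julia's standard sparse LU. This is a sketch under stated assumptions, not the `CachedDirectSolver` code: the matrix is a small real stand-in for a zone stiffness matrix, and the variable names are illustrative.

```julia
using LinearAlgebra, SparseArrays

# Small stand-in for a zone stiffness matrix (real, nonsymmetric, diagonally dominant).
n = 200
K = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(4.0, n), 1 => fill(-1.5, n - 1))
f = ones(n)                               # forward right-hand side
g = collect(range(0.0, 1.0; length = n))  # adjoint right-hand side

F = lu(K)     # factor once (the expensive step)
u = F \ f     # forward solve: K u = f
λ = F' \ g    # adjoint solve: K' λ = g, reusing the same factors
```

Only the two triangular solves are repeated for the adjoint; the factorization cost is paid once, which is the behaviour reflected in the Re-solve column of the table above.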

run_inverse_solve! uses CachedDirectSolver by default. You can override this via the solver keyword argument:

```julia
# Default (recommended)
run_inverse_solve!(setup)  # uses CachedDirectSolver internally

# Override with MUMPS for very large problems
using MUMPS
run_inverse_solve!(setup; solver=MUMPSSolver(sym=2))
```

Static Pressure Condensation

For the isotropic incompressible model (Model 1), the four element-local pressure DOFs are statically condensed out of each zone stiffness matrix before factorization.

Condensation is on by default: the `condense_pressure` keyword defaults to `nothing`, which means automatic. Because condensation is applied to the stiffness matrix itself, leaving a displacement-sized system, it is solver-agnostic. The batched driver therefore condenses on both of its branches: the cuDSS batched solve and the threaded per-zone CPU-fallback solve used on Apple Silicon (where cuDSS is unavailable). On either branch, automatic mode enables condensation only for the validated case, a Model 1 reconstruction of the complex shear modulus μ with the bulk modulus κ held fixed, and silently skips it for any other configuration. No action is needed to benefit from it:

```julia
run_inverse_solve!(setup;
    cudss_batched_solver = CUDSSBatchedSolver(structure="G"))
    # condense_pressure defaults to automatic → on for Model-1 μ-only
```

Pass condense_pressure = false to force the full saddle-point system, or condense_pressure = true to require condensation (it raises an ArgumentError for an inapplicable configuration rather than silently running uncondensed). From the MGH benchmark script the same is controlled by CONDENSE=auto (default), CONDENSE=0, or CONDENSE=1.

What it does

The mixed $(u, p)$ formulation appends 4 discontinuous, element-local pressure DOFs per element to the displacement system. Because they are element-local, the pressure–pressure block is block-diagonal and can be eliminated with one $4 \times 4$ solve per element, leaving the exact Schur complement in the displacement unknowns alone. See Static Pressure Condensation in the Mathematical Reference for the derivation. Two consequences make this worthwhile:

Smaller factorization. The factorized matrix drops from $3 n_{\text{nodes}} + 4 n_{e}$ to $3 n_{\text{nodes}}$ DOFs, about 12% fewer DOFs on the MGH mesh, with no added fill (the condensed sparsity equals the pure-displacement stiffness).
Far better conditioning. The eliminated penalty block scales as $1 / \kappa^{*} \sim 1 / (2\ \text{GPa})$ against a shear block of $\mu^{*} \sim \text{kPa}$, giving the saddle system a diagonal-scale spread of order $10^{12}$. Condensation removes that block, so the condensed matrix is a well-conditioned displacement stiffness.
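The elimination can be sketched on a small synthetic system. The example below is an illustration under assumptions, not Sentinel's assembly code: the blocks are random and well-scaled rather than a real stiffness matrix, and `condense` is a hypothetical helper name. It checks that solving the condensed displacement system reproduces the displacement part of the full $(u, p)$ solve.

```julia
using LinearAlgebra, Random

# Hypothetical helper: form the Schur complement in the displacement unknowns,
# using one 4x4 solve per element-local pressure block.
function condense(Kuu, Kup, Kpu, Kpp, blocks)
    S = copy(Kuu)
    for idx in blocks
        S .-= Kup[:, idx] * (Kpp[idx, idx] \ Kpu[idx, :])
    end
    return S
end

rng = MersenneTwister(1)
nu, ne = 12, 3                    # displacement DOFs; elements with 4 pressure DOFs each
np = 4 * ne
blocks = [(4 * (e - 1) + 1):(4 * e) for e in 1:ne]

# Synthetic saddle-point blocks (assumed values, chosen to be well-conditioned).
A = randn(rng, nu, nu)
Kuu = A' * A + nu * I             # displacement block
Kup = randn(rng, nu, np)          # displacement-pressure coupling
Kpu = Matrix(Kup')
Kpp = zeros(np, np)               # block-diagonal pressure-pressure block
for idx in blocks
    M = randn(rng, 4, 4)
    Kpp[idx, idx] = -(M' * M + 4I)
end
f = randn(rng, nu)

full = [Kuu Kup; Kpu Kpp] \ [f; zeros(np)]   # reference: full (u, p) solve

S = condense(Kuu, Kup, Kpu, Kpp, blocks)
u = S \ f                                    # displacement-only solve
p = -(Kpp \ (Kpu * u))                       # pressure recovered afterwards, if needed
```

Because the pressure block is block-diagonal, the loop touches one $4 \times 4$ block at a time, so the cost of forming the condensed matrix grows linearly with the number of elements.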

Accuracy

Condensation is exact algebra, and the conditioning improvement makes it more accurate than the raw saddle-point solve, not merely equivalent. On the full 100-iteration MGH reconstruction, the condensed result is per-voxel identical (to $\sim 10^{-10}$, i.e. round-off) to the mixed-precision equilibrated solve. That solve is an algorithmically independent but equally well-conditioned path (FP32 factorization with Jacobi equilibration and FP64 iterative refinement), so the two corroborate each other:

Comparison	$\mu'$ correlation	$\mu''$ correlation	rel. L2
Condensed vs. equilibrated mixed-precision	1.0000000	1.0000000	$1.1 \times 10^{-10}$
Condensed vs. raw FP64 saddle solve	0.99914	0.99960	$1.4 \times 10^{-2}$
Equilibrated mixed vs. raw FP64 saddle solve	0.99914	0.99960	$1.4 \times 10^{-2}$

The condensed and equilibrated-mixed solutions deviate from the raw FP64 saddle solve by the same amount because the raw saddle solve is the inaccurate one — its $10^{12}$ conditioning loses several digits. Condensation and Jacobi equilibration are two independent routes to the same well-conditioned answer.
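For reference, comparison metrics of the kind tabulated above could be computed as in the sketch below. The exact definitions used for the table are not stated on this page, so this assumes a Pearson correlation on the real and imaginary parts and a relative L2 norm of the complex difference; `compare_fields` and the synthetic data are hypothetical.

```julia
using LinearAlgebra, Statistics

# Assumed definition: relative L2 difference of `a` with respect to `b`.
rel_l2(a, b) = norm(a - b) / norm(b)

# Hypothetical helper comparing two complex shear-modulus fields voxel by voxel.
function compare_fields(μ_a::AbstractVector{<:Complex}, μ_b::AbstractVector{<:Complex})
    return (storage_cor = cor(real.(μ_a), real.(μ_b)),   # μ′ correlation
            loss_cor    = cor(imag.(μ_a), imag.(μ_b)),   # μ″ correlation
            rel_l2      = rel_l2(μ_a, μ_b))
end

# Synthetic fields (illustrative values only, in Pa).
μ_ref = complex.(3000 .+ 100 .* sin.(1:1000), 1000 .+ 50 .* cos.(1:1000))
μ_new = μ_ref .* (1 + 1e-10)
m = compare_fields(μ_new, μ_ref)
```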

Speed

Holding everything else fixed (same branch, same FP64 path), condensation reduces per-iteration time modestly — the factorization is one of several costs, and the $4 \times 4$ element eliminations add a small assembly cost that partly offsets the smaller factorization. On the 2-iteration MGH benchmark (GB10 GPU):

Configuration	Total (2 iter)
FP64, full saddle system	63.0 s
FP64, condensed	60.7 s
Mixed precision, full saddle system	58.7 s
Mixed precision, condensed	57.7 s
Mixed precision, condensed, 1 IR step	56.5 s

Condensation composes with mixed precision: because the condensed matrix is well-conditioned, the mixed-precision iterative refinement needs only a single step (IR_STEPS=1) rather than two to reach FP64 accuracy.
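The interaction between conditioning and refinement steps can be seen in a generic mixed-precision solve. The sketch below is an illustration under assumptions, not the cuDSS path: it uses a dense Float32 LU with no equilibration, and `mixed_precision_solve` and its `ir_steps` keyword are hypothetical names.

```julia
using LinearAlgebra

# Sketch: factor in Float32, then refine the solution with Float64 residuals.
function mixed_precision_solve(A::Matrix{Float64}, b::Vector{Float64}; ir_steps::Int = 1)
    F32 = lu(Float32.(A))                    # low-precision factorization
    x = Float64.(F32 \ Float32.(b))          # initial low-precision solution
    for _ in 1:ir_steps
        r = b - A * x                        # residual evaluated in Float64
        x .+= Float64.(F32 \ Float32.(r))    # correction from the cached factors
    end
    return x
end

# Well-conditioned test problem with a known solution.
n = 50
A = Matrix(SymTridiagonal(fill(4.0, n), fill(-1.0, n - 1)))
x_true = sin.(1:n)
b = A * x_true

rel_err(x) = norm(x - x_true) / norm(x_true)
err0 = rel_err(mixed_precision_solve(A, b; ir_steps = 0))
err1 = rel_err(mixed_precision_solve(A, b; ir_steps = 1))
```

Each refinement step shrinks the error by a factor that depends on the condition number, so a well-conditioned matrix reaches double-precision accuracy in fewer steps than an ill-conditioned one.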

Applicability

Condensation requires Model 1 reconstructing the complex shear modulus $\mu^{*} = \mu' + i \mu''$ (storage and loss moduli), optionally with density $\rho$, with the bulk modulus $\kappa^{*}$ held fixed. This is the standard viscoelastic MRE reconstruction. The $\kappa^{*}$ gradient and the anisotropic models' gradients are the only terms that read the pressure solution, so they are not condensation-compatible; `run_inverse_solve!` raises an `ArgumentError` if `condense_pressure=true` is combined with a $\kappa$-active reconstruction or a non-Model-1 model, rather than silently producing a wrong gradient.

Batched Line Search (Apple Silicon)

The batched driver's line search re-solves each zone at several trial materials per CG iteration. The fallback path runs a per-zone secant: every trial re-runs a full `lu` (symbolic + numeric factorization) and computes its directional-derivative gradient on the CPU, one zone at a time. On a Mac this dominates the wall-clock time.
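For readers unfamiliar with the term, a secant line search looks for the step length at which the directional derivative of the objective vanishes. The sketch below is a generic textbook secant iteration, not Sentinel's line search: `secant_step_length` and its keywords are hypothetical, and the quadratic objective is a toy example. In the solver, each evaluation of `dφ` corresponds to a trial material and therefore to a factorization and solve.

```julia
# Generic secant iteration for a zero of dφ(α), the directional derivative
# of the objective along the search direction.
function secant_step_length(dφ; α0 = 0.0, α1 = 1.0, maxiter = 10, tol = 1e-8)
    g0, g1 = dφ(α0), dφ(α1)
    for _ in 1:maxiter
        (abs(g1) <= tol || g1 == g0) && break   # converged, or secant undefined
        α2 = α1 - g1 * (α1 - α0) / (g1 - g0)    # secant update
        α0, g0 = α1, g1
        α1, g1 = α2, dφ(α2)                     # one new trial evaluation
    end
    return α1
end

# Toy objective φ(α) = (α - 0.3)^2, so dφ(α) = 2(α - 0.3) with its zero at α = 0.3.
α = secant_step_length(α -> 2 * (α - 0.3))
```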

Passing a `CPUBatchedSolver` through the `cudss_batched_solver` keyword (the name is historical; it accepts any batched solver) switches the line search to the lockstep batched secant. All zones advance together, so each tick issues one batched Metal gradient dispatch instead of per-zone CPU gradients, and the threaded CPU factorization reuses the symbolic analysis (`lu!`) instead of re-analyzing on every trial. It is the Apple Silicon analogue of the cuDSS batched path. On the 2-iteration MGH benchmark (M2 Max, 8 threads):

Configuration	Total (2 iter)
Full saddle, per-zone secant line search	171 s
+ pressure condensation	168 s
+ batched line search (`CPUBatchedSolver`)	70 s
both	67 s (≈ 2.5×)

The reconstruction is bit-identical across all four — both changes are numerically exact. The batched line search is the dominant lever; condensation is a small, free addition that composes with it.
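The difference between re-running `lu` and reusing the symbolic analysis with `lu!` can be shown with the standard library's sparse LU. This is a sketch, not the `CPUBatchedSolver` code: `trial_matrix` is a hypothetical stand-in whose sparsity pattern is the same for every trial material, which is the property that makes reuse possible. It assumes a Julia version whose SparseArrays provides `lu!` for UMFPACK factorizations.

```julia
using LinearAlgebra, SparseArrays

n = 100
# Hypothetical stand-in: the sparsity pattern is fixed, only the values depend on μ.
trial_matrix(μ) = spdiagm(-1 => fill(-μ, n - 1), 0 => fill(2μ + 1.0, n), 1 => fill(-μ, n - 1))
b = ones(n)

F = lu(trial_matrix(1.0))        # symbolic + numeric factorization, done once
residuals = Float64[]
for μ in (1.5, 2.0, 3.0)         # trial materials along the search direction
    K = trial_matrix(μ)
    lu!(F, K)                    # numeric refactorization; symbolic analysis is reused
    x = F \ b
    push!(residuals, norm(K * x - b) / norm(b))
end
```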

Why CPU, not Metal, for the solve

Apple Metal has no sparse solver, so the factorizations run on the CPU (UMFPACK) regardless. An FP32 Metal iterative solver was investigated (benchmark/fp32_iterative_spike.jl) and found uncompetitive: the equilibrated saddle needs ILU-quality preconditioning, which on Metal means sequential triangular solves across hundreds of small zones — the classic poorly-parallel GPU-sparse problem. The batched line search above is the practical Mac win.

Profile Breakdown

Profiling on the MGH brain MRE dataset (isotropic incompressible, ~171K material nodes, ~600 zones per iteration) shows the following time distribution per zone iteration:

Phase	Share	Notes
Gradient computation	27%	Sensitivity matrix assembly + material derivatives
Line search	24%	Forward solves at trial step sizes
Linear solve (forward)	20%	Sparse LU factorization + solve
Adjoint solve	10%	Reuses cached LU factorization
Stiffness assembly	6%	Element loop over hex27 elements
Other (scatter, consolidate, I/O)	13%	Zone setup, accumulation, convergence check

The gradient and line search phases dominate. GPU acceleration targets the gradient computation (via batched KernelAbstractions.jl kernels), while the linear solve benefits most from LU caching.

Benchmark Results

All benchmarks use the MGH brain MRE dataset: isotropic incompressible model, ~171K material mesh nodes (Julia) / ~162K nodes (Fortran), ~290 zones per global iteration, 100 global iterations.

CPU and Metal (Mac Studio, M2 Ultra)

Per global iteration, CG=1 phase (first 10 iterations):

Configuration	Machine	Time/Iter	Speedup vs Serial
Julia serial, BLAS=8t	ms-a2 (16c/32t, 92 GB)	~614 s	0.4x (slower)
Julia serial, BLAS=1t	ms-a2	~233 s	1.0x (baseline)
Fortran MPI, 14 workers	ms-a2	~72 s	3.2x
Julia Distributed, 14 workers	ms-a2	~72 s	3.2x
Per-zone Metal GPU	Mac Studio (M2 Ultra)	~167 s	1.4x
Batched GPU, 8 CPU threads (OpenBLAS)	Mac Studio	~112 s	2.1x
Batched GPU, 8 CPU threads (Accelerate)	Mac Studio	~98 s	2.4x
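The Speedup column is the serial single-threaded time divided by each configuration's time. A short sketch that reproduces it from the times in the table (variable names are illustrative):

```julia
baseline = 233.0   # s/iter, Julia serial with BLAS=1t

times = [
    "Julia serial, BLAS=8t"                    => 614.0,
    "Julia serial, BLAS=1t"                    => 233.0,
    "Fortran MPI, 14 workers"                  => 72.0,
    "Julia Distributed, 14 workers"            => 72.0,
    "Per-zone Metal GPU"                       => 167.0,
    "Batched GPU, 8 CPU threads (OpenBLAS)"    => 112.0,
    "Batched GPU, 8 CPU threads (Accelerate)"  => 98.0,
]

# Speedup relative to the baseline, rounded to one decimal as in the table.
speedups = [name => round(baseline / t; digits = 1) for (name, t) in times]

for (name, s) in speedups
    println(rpad(name, 42), s, "x")
end
```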

CUDA GPU (DGX Spark, GB10)

Full 100-iteration run totals (10 CPU threads):

Configuration	Total Time	Per-iter (CG=2)	Speedup
CPU baseline (CachedDirectSolver)	~330 min	~200 s	1.0x
Fortran MPI, 14 workers	~240 min	~144 s	1.4x
cuDSS batched + CPU line search	222 min	~140 s	1.5x
cuDSS batched + GPU line search	158 min	~100 s	2.1x

The GPU line search uses CUDSSDirectSolver with per-solver CUDA stream isolation for thread-safe concurrent zone solves. See CUDA Backend for implementation details.

Key Observations

Julia matches Fortran MPI at the same worker count (~72 s/iter with 14 workers), validating the port's computational efficiency.
BLAS threading matters enormously: 8 BLAS threads is 2.6x slower than 1 thread for serial execution. This is the most common performance pitfall.
Distributed.jl provides the best absolute performance measured on a multi-core CPU server: 14 workers reach ~72 s/iter, a 3.2x speedup over serial Julia.
CUDA GPU achieves 2.1x speedup on a single DGX Spark node, with the batched cuDSS solver providing the dominant acceleration in Phase 1 and GPU line search further reducing per-iteration time.
Batched GPU with AppleAccelerate achieves 2.4x speedup on a single Mac Studio, competitive with 6–7 CPU workers.

Julia vs Fortran Accuracy

At 100 iterations on the MGH brain MRE dataset, Julia and Fortran produce visually indistinguishable results with quantitative agreement:

Metric	Value
Mean relative difference	2.0%
Spatial distribution	Range and histogram match
Convergence trajectory	Parallel objective function curves

The 2.0% mean difference is attributable to:

Material mesh size: Julia generates 171K nodes vs Fortran's 162K nodes due to slightly different bounding-box rounding in mesh generation. This changes the effective spatial resolution of the reconstruction.
Floating-point ordering: Zone solve order and accumulation order differ between the implementations, causing small differences that compound over 100 iterations.
Random seed propagation: Both use Park-Miller RNG for zone grid seeds, but minor initialization differences can shift zone boundaries.

These differences are well within the noise floor of MRE measurements and do not affect clinical or research interpretation of the results.
