# Performance

NTX now includes explicit scaling benchmarks and figure-generation helpers for
serial batched scans and the multiprocess throughput lane. It also now includes
workflow profilers for the archive-backed fixed-field closure audit and the
corrected integrated W7-X workflow.

## File-Backed Run Path

The TOML/CLI path prepares the geometry and derivative operators once, then
reuses that prepared system for the solve and output writer. This avoids the
old double geometry evaluation in file-backed single-case runs and lowers both
runtime and peak transient array pressure. Verbose CLI runs print separate
prepare, solve, write, plot, and total timings.

NetCDF and HDF5 outputs are written uncompressed for fast inspection and
cross-code exchange. Use `.npz` when smaller Python-only artifacts matter more
than write speed:

```bash
ntx input.toml --output outputs/run.nc --plot
ntx input.toml --output outputs/run.npz
```

## Benchmark Scripts

Collect scaling data:

```bash
python scripts/benchmark_scaling.py --backend cpu --surface dkes --sizes 8,16,32,64
python scripts/benchmark_scaling.py --backend gpu --surface dkes --sizes 16,32,64 --workers 2
python scripts/benchmark_strong_scaling.py --backend cpu --surface dkes --num-cases 64
```

Generate publication-style figures:

```bash
python examples/performance_scaling.py \
  --cpu-json docs/_static/performance_scaling_cpu_smoke.json \
  --gpu-json docs/_static/performance_scaling_gpu_smoke.json \
  --figure-title "Smoke-grid serial vs multiprocess scaling" \
  --output-prefix docs/_static/performance_scaling_smoke

python examples/performance_strong_scaling.py \
  --cpu-json docs/_static/performance_strong_scaling_cpu_production.json \
  --gpu-json docs/_static/performance_strong_scaling_gpu_production.json \
  --figure-title "Production fixed-workload strong scaling" \
  --output-prefix docs/_static/performance_strong_scaling_production
```

The example writes PNG, PDF, and JSON summary outputs. The summary JSON records
CPU/GPU crossover cases, process peak resident memory, device counts, and
serial-vs-parallel coefficient deltas.

For the committed production-grid map:

```bash
XLA_FLAGS=--xla_force_host_platform_device_count=4 \
python scripts/benchmark_scaling.py \
  --backend cpu --surface dkes --sizes 16,32,64,128 \
  --workers 4 --n-theta 17 --n-zeta 25 --n-xi 16 \
  --output-json docs/_static/performance_scaling_cpu_production.json

python examples/performance_scaling.py \
  --cpu-json docs/_static/performance_scaling_cpu_production.json \
  --gpu-json docs/_static/performance_scaling_gpu_production.json \
  --figure-title "Production-grid serial vs parallel scaling" \
  --output-prefix docs/_static/performance_scaling_production

XLA_FLAGS=--xla_force_host_platform_device_count=4 \
python scripts/benchmark_strong_scaling.py \
  --backend cpu --surface dkes --num-cases 128 \
  --worker-counts 1,2,4 --device-counts 1,2,4 \
  --n-theta 17 --n-zeta 25 --n-xi 16 \
  --output-json docs/_static/performance_strong_scaling_cpu_production.json
```

Profile the corrected integrated W7-X workflow:

```bash
python scripts/profile_w7x_integrated_workflow.py \
  --output-json examples/outputs/profile_w7x_integrated_workflow/profile.json \
  --cprofile-out examples/outputs/profile_w7x_integrated_workflow/profile.pstats \
  --trace-dir examples/outputs/profile_w7x_integrated_workflow/trace
```

The script records:

- cached scan/database timings
- first-call and steady-state closure timings
- resident memory
- a Python `cProfile` dump
- a TensorFlow/JAX trace that can be opened in TensorBoard or Perfetto

## Smoke-Grid Scaling

Figure assets:

```text
docs/_static/performance_scaling_smoke.png
docs/_static/performance_scaling_smoke.pdf
docs/_static/performance_scaling_smoke.json
```

![Smoke-grid scaling](_static/performance_scaling_smoke.png)

Interpretation:

- on the repository smoke grid `9 x 11 x 6`, serial batched JAX is the default
  choice on both CPU and GPU for small and medium scans
- the smallest GPU point is startup dominated and should not be interpreted as a
  real throughput crossover
- on the refreshed local CPU run, the multiprocess and single-process
  device-parallel lanes are numerically correct but still slower than serial
  over the tested smoke-grid range
- the refreshed CPU smoke artifact reports process peak resident memory of
  about `1.76 GB`
- the refreshed office GPU smoke artifact reports process peak resident memory
  of about `1.29 GB`, with one of two GPUs passing the single-process
  device-parallel smoke filter

## Heavier-Grid Scaling

Figure assets:

```text
docs/_static/performance_scaling_heavy.png
docs/_static/performance_scaling_heavy.pdf
docs/_static/performance_scaling_heavy.json
```

![Heavier-grid scaling](_static/performance_scaling_heavy.png)

Interpretation:

- on the heavier DKES grid `17 x 25 x 16`, the refreshed local CPU artifact
  shows the single-process device-parallel lane crossing serial by `32` cases,
  while the 4-worker CPU multiprocess lane remains slower through `64` cases
- on the same heavier grid, the office 2-GPU multiprocess lane remains slower
  than serial in the tested range under the current shared-office software and
  hardware stack
- the refreshed CPU heavy artifact reports process peak resident memory of
  about `2.70 GB`
- the refreshed office GPU heavy artifact reports process peak resident memory
  of about `1.42 GB`, again with one healthy single-process device
- the practical guidance from these measurements is:
  - use serial batched JAX for small and medium studies
  - use the single-process device-parallel lane on CPU only after checking that
    the target grid/scan size has crossed over
  - use the multiprocess lane only when a measured workload shows enough
    amortization of process startup on the target machine
  - treat office multi-GPU multiprocess execution as a robust isolation path
    first, and as a throughput path only after benchmarking the specific
    production workload

## Production-Grid Scaling

Figure assets:

```text
docs/_static/performance_scaling_production.png
docs/_static/performance_scaling_production.pdf
docs/_static/performance_scaling_production.json
```

![Production-grid scaling](_static/performance_scaling_production.png)

Interpretation:

- the committed production map uses the same `17 x 25 x 16` DKES-style grid as
  the heavier-grid artifact but extends the scan ladder to `128` cases
- with four logical CPU devices exposed to JAX, the single-process
  device-parallel lane crosses serial at `32` cases and reaches a best observed
  speedup of about `1.72x` at `128` cases
- the 4-worker CPU multiprocess lane remains below serial through `128` cases,
  reaching about `0.92x`; process startup and duplicated runtime state still
  dominate this workload
- the office two-GPU workstation run found two CUDA devices, but only one device
  passed the NTX smoke solve for single-process parallel execution under the
  tested software stack
- on that GPU workload, single-process device-parallel timing is characterized
  and numerically identical to serial for `D11`, but multiprocess remains below
  serial through `128` cases
- peak resident memory is about `4.39 GB` for the 4-device CPU run and
  `1.50 GB` for the tested GPU run

The production-grid guidance is therefore:

- use compiled prepared-geometry reuse first when the geometry and array shapes
  are fixed
- use single-process JAX device parallelism for CPU scan ladders after a local
  crossover measurement
- keep multiprocess and multi-GPU execution as workload-specific isolation or
  throughput paths until the exact target grid shows a measured win

## Production Strong Scaling

Figure assets:

```text
docs/_static/performance_strong_scaling_production.png
docs/_static/performance_strong_scaling_production.pdf
docs/_static/performance_strong_scaling_production.json
```

![Production strong scaling](_static/performance_strong_scaling_production.png)

Interpretation:

- the committed strong-scaling map fixes the workload at `128` cases on the
  `17 x 25 x 16` DKES-style grid, then varies workers or requested devices
- on CPU, single-process device parallelism scales from `1.01x` at one exposed
  device to `1.74x` at four devices; the corresponding efficiency drops from
  startup parity to about `0.43` at four devices, so this is useful but not
  ideal strong scaling
- on CPU, the multiprocess lane improves with more workers but remains below
  serial at `0.93x` for four workers, which confirms that process startup and
  duplicated runtime state are still too costly for this fixed workload
- on the tested two-GPU workstation, both CUDA devices are visible but only one
  passes the NTX single-process smoke solve; the strong-scaling artifact
  therefore records one healthy parallel GPU and does not promote multi-GPU
  speedup
- all CPU and GPU strong-scaling lanes reproduce serial `D11` to the committed
  numerical tolerance; the largest GPU multiprocess delta is about `2.34e-8`
- peak resident memory is about `2.83 GB` for the CPU strong-scaling run and
  `1.37 GB` for the GPU strong-scaling run

This closes the first artifact-backed strong-scaling lane. The next performance
work should target device-health reproducibility and larger VMEC-family
workloads before claiming general multi-GPU scaling.

## Prepared-Geometry Reuse

The prepared-geometry reuse artifact isolates the repeated fixed-geometry solve
path from the multiprocess throughput lane:

```bash
python examples/prepared_geometry_reuse_profile.py --preset paper
```

For targeted trace capture:

```bash
python examples/prepared_geometry_reuse_profile.py \
  --preset smoke --case-counts 3 \
  --trace-dir examples/outputs/ntx_prepared_geometry_profile/cpu_smoke_trace \
  --perfetto \
  --device-memory-profile examples/outputs/ntx_prepared_geometry_profile/cpu_smoke_trace/device_memory.prof
```

Figure assets:

```text
docs/_static/prepared_geometry_reuse_profile.png
docs/_static/prepared_geometry_reuse_profile.pdf
docs/_static/prepared_geometry_reuse_profile.json
```

![Prepared-geometry reuse](_static/prepared_geometry_reuse_profile.png)

Current local CPU interpretation:

- direct repeated solves and un-jitted prepared solves are near parity after
  one warmup solve, so hoisting geometry arrays alone is not the main win on
  this grid
- the compiled prepared steady path reaches a best observed speedup of about
  `1.50e2x` against direct repeated solves with maximum coefficient mismatch
  below `2e-9`
- the first compiled call is still visible at about `0.43 s`, which confirms
  that optimization workflows should compile once per fixed geometry and reuse
  stable shapes across collisionality, electric-field, species, and radial axes
- the process peak resident memory in this run is about `1.24 GB`

This turns the speed lane into a concrete engineering target: stabilize and
reuse prepared compiled closures before deeper linear-algebra rewrites or
multi-process orchestration.

## Finite-Beta RHSMode=1 Profile-Current Profiling

The finite-beta profile-current lane now has a dedicated handoff note for the
same-contract SFINCS-JAX RHSMode=1 bottleneck:

```text
docs/sfincs-jax-rhsmode1-profile-current-handoff.md
```

The current profiling result is:

- SFINCS-JAX `1.1.0` at `df0c70d` completes the `13 x 15 x 8, Nx=5`
  three-radius smoke profile-current artifact in `24.7 s` total on local CPU;
  all three HDF5 outputs pass the true-residual metadata gate
- the same checkout completes the `17 x 21 x 12, Nx=5` inner-radius HDF5 output
  in `9.90 s` wall time with `1.55 GB` max RSS, `sparse_pc_gmres`, and
  true-residual/target `8.45e-7`
- a three-radius `25 x 31 x 17, Nx=11` production ladder completes in
  `383.16 s` with about `9.46 GB` max RSS and true-residual gates passing at
  every radius
- the pitch-resolution audit shows the remaining RHSMode=1 profile-current
  discrepancy is not a residual failure: the accepted high-`Nxi` even/odd
  Legendre stress gap is `1.323e-1`, below the current `1.5e-1`
  reduced-closure tolerance
- a same-grid `collisionOperator=0` full-collision probe timed out after
  `901.76 s` and about `9.97 GB` max RSS without a completed current output

The old sparse-solver runtime lane and the reduced-closure pitch stress lane
are closed under the documented tolerances. The full-collision branch remains a
non-shipping feasibility diagnostic rather than a release blocker.

## QI Hires NEOPAX-Database Export

The downstream QI finite-beta hires database-generation command exercises the
largest public `examples/build_neopax_scan_from_ertilde.py` path used so far:
`25 x 25 x 60`, seven radial surfaces, and the default `16 x 12`
`(nu_v, Er_tilde)` scan per surface. The current script reports per-surface
timing and accepts `--scan-batch-size` to split that flattened scan into
fixed-size chunks.

Measured on the local CPU for one radial surface with the same QI hires
VMEC/Boozer files:

- full-surface batching: `64.7 s`, about `5.0 GB` peak RSS
- `--scan-batch-size 32`: `55.9 s`, about `1.46 GB` peak RSS
- `--scan-batch-size 16`: `75.3 s`, about `1.27 GB` peak RSS
- `XLA_FLAGS=--xla_force_host_platform_device_count=4` with
  `--parallel-devices 4 --scan-batch-size 32`: `47.7 s` for the same one-surface
  workload on the local CPU, with coefficient differences from the serial
  batched run at roundoff

For CPU runs of that example, start with `--scan-batch-size 32`. For GPU runs,
leave full-surface batching enabled when memory permits; add a batch size only
when the device runs out of memory at higher resolution.

`--scan-batch-size` primarily reduces peak memory; it is not a CPU parallelism
switch. For CPU-only laptops that are still too slow, expose multiple JAX host
devices before launch and request per-surface scan sharding:

```bash
XLA_FLAGS=--xla_force_host_platform_device_count=4 \
python examples/build_neopax_scan_from_ertilde.py \
  --wout examples/inputs/wout_QI_nfp2_newNT_opt_hires.nc \
  --booz examples/inputs/boozermn_wout_QI_nfp2_newNT_opt_hires.nc \
  --surface-backend vmec \
  --device-backend cpu \
  --parallel-devices 4 \
  --scan-batch-size 32 \
  --output examples/input/Dij_NTX.h5
```

The script reports the resolved batch size, requested parallel device count,
and visible backend. If the collaborator command still includes
`--device-backend gpu` on a CPU-only machine, it will fail before solving; use
`--device-backend cpu` or omit the flag on laptops without a configured JAX GPU.

## Reproducibility

The figure JSON payloads committed in `docs/_static/` are:

- `performance_scaling_cpu_smoke.json`
- `performance_scaling_gpu_smoke.json`
- `performance_scaling_cpu_heavy.json`
- `performance_scaling_gpu_heavy.json`
- `performance_scaling_cpu_production.json`
- `performance_scaling_gpu_production.json`
- `performance_scaling_production.json`
- `performance_strong_scaling_cpu_production.json`
- `performance_strong_scaling_gpu_production.json`
- `performance_strong_scaling_production.json`
- `prepared_geometry_reuse_profile.json`

Fresh runs of `scripts/benchmark_scaling.py` and
`scripts/profile_parallel_runtime.py` also record process peak resident memory
as `max_rss_mb`. That value is intentionally treated as a run-environment
metric rather than a parity target, but it keeps memory visible whenever timing
artifacts are regenerated. The committed CPU artifacts were refreshed locally;
the committed GPU artifacts were refreshed from a clean temporary checkout on
the office GPU workstation.
For CI smoke coverage, `scripts/profile_parallel_runtime.py` accepts
`--num-cases` and `--grid` so the serial/device-parallel correctness path can
run on a tiny grid while the default command remains the profiling workload.

They were collected on:

- local workstation CPU with `XLA_FLAGS=--xla_force_host_platform_device_count=4`
- office workstation GPU with `XLA_PYTHON_CLIENT_PREALLOCATE=false`

## Integrated W7-X Workflow

The corrected integrated W7-X raw branch is now the right profiling target
because the database normalization is closed there and the rebuilt workflow
matches the shipped reference current tightly.

Current local CPU profile, using the cached rebuilt W7-X scan:

- `reference_load_seconds`: `1.04e-2`
- `scan_prepare_seconds`: `2.94e-4`
- `rebuilt_scan_load_seconds`: `2.69e-3`
- `field_species_seconds`: `1.97`
- `database_seconds`: `2.55e-1`
- `no_momentum_first_seconds`: `8.64`
- `no_momentum_steady_seconds`: `2.63e-2`
- `momentum_correction_first_seconds`: `8.81`
- `momentum_correction_steady_seconds`: `1.58e-2`
- `current_reduction_seconds`: `3.29e-2`
- `max_rss_mb`: about `1847`

Interpretation:

- the corrected integrated workflow is compile-bound on first call, not
  arithmetic-bound
- the steady-state closure path is already fast on CPU once compiled
- the main performance priority is therefore to reduce recompiles and tracing,
  not to micro-optimize the final current reduction

The current `cProfile` dump is dominated by XLA compilation:

- about `15 s` in `backend_compile_and_load`
- about `20 s` total Python runtime

That points directly to the next speed lane:

- stabilize shapes and dtypes in the closure path
- hoist and reuse the compiled no-momentum and momentum-correction calls
- avoid retracing/vmap rebuilding across repeated workflow invocations
- then revisit deeper kernel/vectorization work only after those compile
  overheads are under control

A simple persistent compilation-cache experiment is now also bounded out as a
first-order fix. Re-running the same workflow in a fresh process with
`--compilation-cache-dir` enabled leaves the first-call latencies essentially
unchanged:

- cold cached process:
  - `no_momentum_first_seconds`: `1.17e+1`
  - `momentum_correction_first_seconds`: `1.24e+1`
- warm cached process:
  - `no_momentum_first_seconds`: `1.17e+1`
  - `momentum_correction_first_seconds`: `1.23e+1`

So the current integrated workflow is not being held back by a missing on-disk
compilation cache alone. The speed lane should stay focused on shape
stability, static-argument control, and reusable compiled closure calls rather
than on cache toggles by themselves.

## Research-Grade Performance Plan

The next performance work should stay evidence-driven:

1. measure compile time, first-call time, steady-state time, peak resident
   memory, and device memory separately;
2. keep small PR tests and large profiling campaigns separate;
3. profile the exact workload before changing linear algebra, vectorization, or
   dependencies;
4. prefer stable shapes and prepared data structures over dynamic Python control
   inside `jit`;
5. promote multi-process or multi-device paths only when a measured production
   grid crosses over from serial batched JAX.

JAX-specific rules for NTX:

- use `jax.vmap` for independent collisionality, electric-field, species, or
  radial scan axes when all mapped leaves have compatible shapes;
- use `jax.lax.scan` for fixed-length iterative loops that would otherwise be
  unrolled inside `jit`;
- keep static arguments hashable, immutable, and low-cardinality so they do not
  create unnecessary recompiles;
- consider buffer donation only at public call boundaries where the caller will
  not reuse the donated arrays;
- use `jax.profiler.trace` or XProf/Perfetto for targeted traces, and JAX memory
  profiling for OOM or retained-buffer investigations;
- for GPU sharing, set explicit memory policy such as
  `XLA_PYTHON_CLIENT_PREALLOCATE=false` or `XLA_PYTHON_CLIENT_MEM_FRACTION`
  before launching concurrent runs.

Lineax and Equinox are useful but not automatic wins:

- Lineax should be evaluated first on repeated structured solve or
  Jacobian-linear-operator workloads where reuse or memory reduction can be
  measured against the current prepared dense solve.
- Equinox should be evaluated for typed PyTree modules and filtered transforms
  only if it simplifies static-versus-dynamic argument handling or custom
  derivative APIs without destabilizing the public NTX API.

Do not use broad XLA dump passes as the default profiling loop on normal
workstations. They are useful for focused compiler investigations, but the
current project bottlenecks are better attacked with smaller traces, shape
audits, and cached closure-only profiling.