# SFINCS-JAX RHSMode=1 Profile-Current Handoff

This note records the finite-beta RHSMode=1 profile-current diagnostics used by the NTX validation lane. The coefficient-level finite-beta transport-matrix ladder is already below the order-`1e-1` coefficient gate at the profiled radii; the same-geometry profile-current calculation is now fast enough to run the pitch/velocity/radial convergence ladder used by the reduced-closure stress gate.

## Goal

Run SFINCS-JAX, Redl, and NTX+NEOPAX on the same owned finite-beta QA VMEC geometry and analytic profile contract, then use the completed RHSMode=1 SFINCS-JAX profile-current outputs to separate:

- profile-current normalization errors,
- radial/profile interpolation errors,
- reduced momentum-closure errors,
- and raw solver/convergence errors.

The finite-beta bootstrap-current comparison is closed as a reduced-closure stress benchmark. It is not promoted as a broad full-collision parity claim.

## Physics and Normalization Contract

The owned deck is generated by `examples/owned_finite_beta_sfincs_jax_profile_current_audit.py`.

Key settings:

- finite-beta QA VMEC wout: `/Users/rogeriojorge/local/single_stage_optimization_finite_beta/test/wout_LandremanPaul2021_QA_lowres_pressure_current.nc`
- `RHSMode = 1`
- `geometryScheme = 5`
- `inputRadialCoordinate = 3`
- `inputRadialCoordinateForGradients = 3`
- `includeXDotTerm = .true.`
- `includeElectricFieldTermInXiDot = .true.`
- `useDKESExBDrift = .false.`
- `includePhi1 = .false.`
- `dPhiHatdrN = 0.0`
- ion/electron species order: `Zs = 1, -1`
- `mHats = 2, 1/1836.15267343`
- density and temperature are normalized as `nHat = n / 1e20` and `THat = T / 1 keV`
- profile gradients are with respect to `rN = r/a = rho`; SFINCS-JAX converts internally to the radial coordinate used by the solver.

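
As a minimal sketch of the normalization in the last two bullets (not the audit script's actual code), the snippet below converts SI density and temperature to `nHat` and `THat` and takes profile gradients with respect to `rho`. The parabolic profile shape, its core/edge values, and the helper names are hypothetical placeholders; the owned analytic profile contract is defined by the audit script, not by this example.

```python
# Reference scales stated in the contract: nHat = n / 1e20, THat = T / 1 keV.
N_REF_M3 = 1.0e20   # m^-3
T_REF_EV = 1.0e3    # eV (1 keV)


def normalize(n_m3, t_ev):
    """Convert SI density [m^-3] and temperature [eV] to (nHat, THat)."""
    return n_m3 / N_REF_M3, t_ev / T_REF_EV


def parabolic(rho, core, edge):
    """Hypothetical analytic profile: edge + (core - edge) * (1 - rho^2)."""
    return edge + (core - edge) * (1.0 - rho**2)


def parabolic_drho(rho, core, edge):
    """Derivative of the hypothetical profile with respect to rho = r/a."""
    return -2.0 * (core - edge) * rho


rho = 0.5
# Hypothetical core/edge values, chosen only to make the arithmetic easy to follow.
n_si = parabolic(rho, 1.0e20, 0.2e20)
t_ev = parabolic(rho, 4.0e3, 0.4e3)

n_hat, t_hat = normalize(n_si, t_ev)
# Gradients are normalized with the same reference scales and stay in d/d(rho).
dn_hat_drho = parabolic_drho(rho, 1.0e20, 0.2e20) / N_REF_M3
dt_hat_drho = parabolic_drho(rho, 4.0e3, 0.4e3) / T_REF_EV

print(n_hat, t_hat, dn_hat_drho, dt_hat_drho)
```

The gradients are left in `rho`, matching `inputRadialCoordinateForGradients = 3`; any conversion to the solver's internal radial coordinate is done by SFINCS-JAX and is not reproduced here.
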
The SFINCS-JAX current observable is converted to SI units as

```text
<j.B>/sqrt(<B^2>) [A m^-2] = FSABjHatOverRootFSAB2 * e * 1e20 * sqrt(2 * 1keV / m_p)
```

For the current profiles here, the scale is `7.012642439557152e6 A m^-2`.

The SFINCS-JAX HDF5 output uses

```text
FSABjHat = sum_s Z_s * FSABFlow_s
FSABjHatOverRootFSAB2 = FSABjHat / sqrt(FSABHat2)
```

which matches the normalization used by the NTX diagnostic script.

## Completed Same-Contract Data

Refresh status on 2026-05-01:

- SFINCS-JAX worktree: `/Users/rogeriojorge/local/tests/sfincs_jax`
- SFINCS-JAX version: `1.1.0`
- SFINCS-JAX commit: `df0c70d` (`origin/main`)
- NTX used `NTX_SFINCS_JAX_ROOT=/Users/rogeriojorge/local/tests/sfincs_jax` and `JAX_ENABLE_X64=True` for the reruns.

Committed finite-beta RHSMode=1 low-resolution profile-current artifact:

- JSON: `docs/_static/owned_finite_beta_sfincs_jax_profile_current_audit.json`
- PNG/PDF: `docs/_static/owned_finite_beta_sfincs_jax_profile_current_audit.{png,pdf}`
- grid: `Ntheta=13`, `Nzeta=15`, `Nxi=8`, `Nx=5`
- radii: `rho = 1/7, 0.5, 13/14`
- `nu_n = 8.31565e-3`

Current status from that artifact:

- completed RHSMode=1 current points: `3`
- max SFINCS-JAX vs Redl current relative difference: `0.8478018522703108`
- max SFINCS-JAX vs NTX+NEOPAX current relative difference: `0.8740375383375442`
- max NTX+NEOPAX vs Redl current relative difference: `0.21926076611238907`
- all completed SFINCS-JAX solver residual gates pass
- maximum true residual over target: `1.2527267276319048e-4`

At `rho=1/7`, the completed low-resolution current is

```text
FSABjHatOverRootFSAB2 = -0.44600080476476256
<j.B>/sqrt(<B^2>) = -3.1276441715700175e6 A m^-2
NIterations = 1
```

The low-resolution current values are unchanged by the optimized SFINCS-JAX solver-policy update. This confirms that the smoke-grid current gap is not a linear-solver residual artifact.

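
The SI conversion from the normalization contract can be checked independently of SFINCS-JAX. The following sketch assumes CODATA 2018 values for `e` and `m_p`; the constants actually used by the NTX diagnostic script are not stated in this note, so agreement is expected only to within the rounding of those constants. The function names are illustrative.

```python
import math

# Assumed physical constants (CODATA 2018); not taken from the NTX script.
E_CHARGE = 1.602176634e-19     # elementary charge [C]
M_PROTON = 1.67262192369e-27   # proton mass [kg]

N_REF = 1.0e20                 # reference density [m^-3], from nHat = n / 1e20
T_REF_J = 1.0e3 * E_CHARGE     # reference temperature 1 keV, in joules


def current_scale_si():
    """Return e * 1e20 * sqrt(2 * 1keV / m_p) in A m^-2."""
    v_ref = math.sqrt(2.0 * T_REF_J / M_PROTON)  # reference speed [m/s]
    return E_CHARGE * N_REF * v_ref


def current_to_si(fsab_j_hat_over_root_fsab2):
    """Convert the dimensionless FSABjHatOverRootFSAB2 to A m^-2."""
    return fsab_j_hat_over_root_fsab2 * current_scale_si()


scale = current_scale_si()
# Low-resolution rho = 1/7 value quoted in this note.
j_si = current_to_si(-0.44600080476476256)

print(f"scale = {scale:.6e} A m^-2")
print(f"j_si  = {j_si:.6e} A m^-2")
```

Comparing `scale` against the quoted `7.012642439557152e6 A m^-2` is a quick way to catch a wrong reference mass or a missing factor of `1e20` before comparing currents across codes.
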
## Profiling Runs

The 2026-05-01 rerun used SFINCS-JAX default RHSMode=1 solver policy unless noted otherwise:

```bash
JAX_ENABLE_X64=True
PYTHONPATH=/Users/rogeriojorge/local/tests/sfincs_jax:$PYTHONPATH
```

The NTX profile-current audit now calls `write-output --compute-solution` and writes `sfincsOutput.solver_trace.json` beside each HDF5 output. The HDF5 summary must include `linearSolverAccepted=1`, `linearSolverTrueResidualConverged=1`, and `linearSolverResidualNorm <= linearSolverResidualTarget` before a current point is treated as a numerically converged comparison.

### Local CPU, 13 x 15 x 8, Nx=5

The committed three-radius smoke artifact now completes through the normal SFINCS-JAX CLI in `24.7 s` total on local CPU. All three points use the auto policy and pass the true-residual gate. The current amplitudes remain the same as before the solver-policy update.

### Local CPU, 17 x 21 x 12, Nx=5, Optimized Main

Generated single-point deck:

```bash
python examples/owned_finite_beta_sfincs_jax_profile_current_audit.py \
  --rho 0.14285714285714285 --nu-n 0.00831565 \
  --n-theta 17 --n-zeta 21 --n-xi 12 --nx 5 \
  --run-sfincs-jax \
  --output-dir examples/outputs/sfincs_jax_v1p1_profile_current/prod_17x21x12_deck \
  --output-prefix docs/_static/owned_finite_beta_sfincs_jax_profile_current_prod_17x21x12
```

Result:

- HDF5 output completes through the normal CLI path in `9.90 s`
- committed lightweight summary artifact: `docs/_static/owned_finite_beta_sfincs_jax_profile_current_prod_17x21x12.{json,png,pdf}`
- max resident set size: `1.55 GB`
- solver method: `sparse_pc_gmres`
- true residual over target: `8.45200712991399e-7`
- output:

  ```text
  FSABjHatOverRootFSAB2 = -1.3120713644649011
  <j.B>/sqrt(<B^2>) = -9.201087334174225e6 A m^-2
  ```

Comparison to the same-profile targets at `rho=1/7`:

- SFINCS-JAX vs Redl relative difference: `0.5522545492579316`
- SFINCS-JAX vs NTX+NEOPAX relative difference: `0.6294362315512264`
- NTX+NEOPAX vs Redl relative difference: `0.20828178269124106`

This closes the previous `17 x 21 x 12` residual/runtime blocker. It does not close finite-beta bootstrap-current parity because the current amplitude remains far from both reference currents at this pitch truncation.

### Local CPU, 25 x 31 x 17, Nx=11

The three-radius production ladder:

- completes in `383.16 s` wall time;
- uses `sparse_pc_gmres` at every radius;
- reaches a maximum true residual over target of `2.058374287457432e-5`;
- reaches max SFINCS-JAX vs Redl current relative difference `1.2860477350497015`;
- reaches max SFINCS-JAX vs NTX+NEOPAX current relative difference `1.442947235410411`.

This proves the remaining profile-current discrepancy is not caused by a failed linear solve. It is a pitch truncation / collision-model / physical-branch audit.

### Pitch-Resolution Audit

The new artifact `docs/_static/owned_finite_beta_sfincs_jax_profile_current_resolution_audit.{json,png,pdf}` summarizes a fixed-radius `Nxi` scan at `rho=1/7`, `Ntheta=17`, `Nzeta=21`, `Nx=5`.

Key metrics:

- scan rows: `18`
- solver-converged rows: `17`
- best SFINCS-JAX vs Redl relative difference: `0.024877277870714646`
- best SFINCS-JAX vs NTX+NEOPAX relative difference: `0.015201711479977497`
- high-`Nxi` even/odd tail relative gap: `0.13232636827032285`
- Redl `1e-1` gate pass count: `2`
- NTX+NEOPAX `1e-1` gate pass count: `4`

The scan shows a real terminal-Legendre-mode parity split. Even `Nxi` values move through both reference scales, while odd `Nxi` values sit on a larger current branch. The adjacent high-`Nxi` gap is `1.323e-1`, which is accepted under the documented `1.5e-1` reduced-closure stress tolerance.

### Full-Collision Probe

A same-grid `collisionOperator=0` probe at `17 x 21 x 20, Nx=5` was attempted as a physics discriminator. It timed out after `901.76 s`, used about `9.97 GB` max RSS, and did not write a completed current output.
The PAS profile-current branch is therefore the only completed SFINCS-JAX RHSMode=1 finite-beta current reference in this NTX artifact set.

### Production RHSMode=3 Coefficient Ladder, Optimized Main

The production coefficient ladder was rerun through the same clean SFINCS-JAX main checkout:

- grid: `35 x 43 x 48`
- six completed same-grid points
- max NTX/SFINCS-JAX `L13/L31/L33` relative difference: `0.02064719610195181`
- coefficient gate: `1e-1`, passed
- current-conditioned precision gate: still failed, with maximum precision gap `29.948023011811134`
- mean production point runtime: `5.999091423504676 s`

This closes the broad finite-beta coefficient-resolution lane for the current owned QA case. The remaining finite-beta current work is localized to the profile-current closure/observable layer, pitch Legendre truncation, and the full-collision production path.

## Current Bottleneck Hypotheses

1. The remaining finite-beta profile-current gap is not explained by the RHSMode=3 monoenergetic coefficient bridge. The optimized production coefficient ladder remains below `2.1e-2`.
2. The `17 x 21 x 12, Nx=5` RHSMode=1 profile-current point changes the current amplitude substantially relative to the `13 x 15 x 8` smoke grid, so the profile-current observable is resolution sensitive.
3. The `17 x 21 x 12, Nx=5` point now converges quickly with sparse-PC GMRES, so the old residual-stalling hypothesis is closed.
4. The `Nxi` scan shows a high-order even/odd Legendre truncation split in the current observable.
5. The full-collision RHSMode=1 point is not yet a practical production comparison at this grid on the local CPU.

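
Hypothesis 3 rests on the solver-metadata gate stated under Profiling Runs. A minimal sketch of that gate follows, assuming the HDF5 summary fields have already been read into a plain dictionary. Only the four field names come from this note; the function name and the sample values are illustrative, and reading the HDF5 file itself is left out.

```python
REQUIRED_FIELDS = (
    "linearSolverAccepted",
    "linearSolverTrueResidualConverged",
    "linearSolverResidualNorm",
    "linearSolverResidualTarget",
)


def passes_solver_gate(summary):
    """Return True only if the point counts as numerically converged.

    The gate requires linearSolverAccepted == 1,
    linearSolverTrueResidualConverged == 1, and
    linearSolverResidualNorm <= linearSolverResidualTarget.
    A missing field is treated as a failed gate.
    """
    if any(name not in summary for name in REQUIRED_FIELDS):
        return False
    accepted = int(summary["linearSolverAccepted"]) == 1
    converged = int(summary["linearSolverTrueResidualConverged"]) == 1
    # A NaN residual compares False here, so it also fails the gate.
    below_target = (
        float(summary["linearSolverResidualNorm"])
        <= float(summary["linearSolverResidualTarget"])
    )
    return accepted and converged and below_target


# Illustrative summaries (hypothetical numbers, not taken from an output file).
converged_point = {
    "linearSolverAccepted": 1,
    "linearSolverTrueResidualConverged": 1,
    "linearSolverResidualNorm": 8.0e-13,
    "linearSolverResidualTarget": 1.0e-6,
}
stalled_point = dict(converged_point, linearSolverTrueResidualConverged=0)
truncated_point = {"linearSolverAccepted": 1}

print(passes_solver_gate(converged_point))
print(passes_solver_gate(stalled_point))
print(passes_solver_gate(truncated_point))
```

Treating a missing field as a failure keeps a timed-out or partially written output from being counted as a converged comparison point.
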
## Requested SFINCS-JAX Developer Actions

Recommended implementation work:

- add a solve-complete/profile-finalization split so a successful HDF5 solve is not reported as a failed run if Perfetto/XPlane finalization fails;
- add timeout-safe profiling that periodically flushes phase timings, residual history, and partial diagnostics outside the JAX profiler context;
- write solver progress for RHSMode=1 every fixed number of Krylov iterations, including residual norm, preconditioner kind, fallback decisions, and elapsed time since last progress line;
- time and report these phases separately: geometry/output field build, operator build, RHS assembly, constraint projection, preconditioner probe, Schur/PAS build, Krylov iterations, diagnostics, and HDF5 write;
- expose a stable option to write partial HDF5 diagnostics on timeout when the state vector or current residual is available;
- document the expected even/odd `Nxi` behavior for the finite Legendre hierarchy and retain the current `1.5e-1` reduced-closure stress policy;
- make `collisionOperator=0` RHSMode=1 feasible for this finite-beta deck or document the memory/runtime ceiling and recommended reduced test;
- add a small finite-beta RHSMode=1 profile-current regression at `13 x 15 x 8, Nx=5`, using the current observable above;
- add a larger optional benchmark at `17 x 21 x 12, Nx=5` that is expected to complete HDF5 output within a documented budget and report whether the solve residual satisfies the requested tolerance.

Recommended validation sequence after those changes:

1. Re-run `13 x 15 x 8, Nx=5` CPU and GPU, warm and cold, and confirm the current remains `FSABjHatOverRootFSAB2 = -0.44600080476476256` at `rho=1/7`.
2. Re-run `17 x 21 x 12, Nx=5` with the solver metadata gate and require `sparse_pc_gmres` or another true-residual-converged method.
3. Re-run paired even/odd `Nxi` ladders when the pitch discretization or current observable changes, and require the gap to stay below `1.5e-1`.
4. Re-run the finite-beta radial/collisionality profile-current ladder when changing geometry loading, interpolation, or closure semantics.
5. Keep the bootstrap-current figure scoped as a reduced-closure stress result unless a feasible full-collision RHSMode=1 branch is added.

## NTX-Side Status

NTX now exposes Perfetto and device-memory profiling options in the prepared-geometry reuse profiler:

```bash
python examples/prepared_geometry_reuse_profile.py \
  --preset smoke --case-counts 3 \
  --output-prefix examples/outputs/ntx_prepared_geometry_profile/cpu_smoke \
  --trace-dir examples/outputs/ntx_prepared_geometry_profile/cpu_smoke_trace \
  --perfetto \
  --device-memory-profile examples/outputs/ntx_prepared_geometry_profile/cpu_smoke_trace/device_memory.prof
```

The smoke run reports:

- warmup solve: `2.091 s`
- direct three-case path: `0.464 s`
- compiled steady three-case path: `0.000968 s`
- compiled steady speedup vs direct: `4.79e2`
- max compiled relative mismatch: `6.85e-11`
- peak RSS: about `1.83 GB`

This supports the current NTX performance plan: keep fixed-shape prepared compiled closures as the first optimization target, and avoid broad XLA dump passes unless a small Perfetto/XPlane trace has localized the bottleneck.