# GPU Runs

NTX uses the same JAX solver path on CPU and GPU.

## GPU Test Targets

- `tests/test_gpu_smoke.py`
- `scripts/run_gpu_regression.py`
- `scripts/sh_office_gpu_smoke.sh`

## Typical Session

```bash
sh office
cd /path/to/NTX
python -m pip install -e ".[dev,docs,io]"
export XLA_PYTHON_CLIENT_PREALLOCATE=false
scripts/sh_office_gpu_smoke.sh
```

## What The Regression Script Reports

- backend and visible devices
- compile-plus-first-run timing
- steady-state timing
- solved coefficients
- max relative error against NTX-owned smoke references
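
The two timing figures and the error metric above can be illustrated with a minimal sketch. It is not the code in `scripts/run_gpu_regression.py`: the helper names `time_first_and_steady` and `max_relative_error` are hypothetical, and the sketch assumes every reference coefficient is nonzero. When timing a jitted JAX function, the callable should block on its result (for example by calling `.block_until_ready()` on the returned array); otherwise asynchronous dispatch makes the timings too optimistic.

```python
import statistics
import time


def time_first_and_steady(fn, repeats=5):
    """Return (first_call_seconds, median_steady_state_seconds) for fn().

    Hypothetical helper: the first call stands in for "compile plus first
    run", the median of the later calls for "steady state".
    """
    start = time.perf_counter()
    fn()  # for a jitted function this call includes compilation
    first = time.perf_counter() - start

    steady = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        steady.append(time.perf_counter() - start)
    return first, statistics.median(steady)


def max_relative_error(solved, reference):
    """Largest |solved - reference| / |reference| over paired coefficients.

    Assumes every reference value is nonzero.
    """
    if len(solved) != len(reference):
        raise ValueError("coefficient lists must have the same length")
    return max(abs(s - r) / abs(r) for s, r in zip(solved, reference))
```

Using the median for the steady-state figure keeps a single slow repeat from skewing the result.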

## Device-Parallel Scans

For larger scans, NTX also exposes a device-parallel scan path through
`solve_monoenergetic_parallel_scan(...)` and the profiling helper:

```bash
python scripts/profile_parallel_runtime.py --output-json parallel-runtime.json
```

This path is intended for multi-device CPU or GPU jobs where scan throughput
matters more than single-case latency.
For CI or quick local smoke checks, pass
`--num-cases 2 --grid 5,5,4` so that the serial/device-parallel equality check
runs quickly. Omit these flags to keep the default profiling behavior for real
measurements.

The helper runs an NTX smoke check on each local device before using it.
A visible device that fails the check is excluded from the parallel solve, so
the helper does not silently return bad coefficients.
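
The guard described above amounts to filtering the device list through a per-device check. The sketch below shows that pattern only. `select_healthy_devices` and `fake_check` are hypothetical names, the devices are plain strings standing in for JAX device objects, and the real NTX check (a small dense solve on each device) is replaced by a stand-in that simulates a failure.

```python
def select_healthy_devices(devices, smoke_check):
    """Split devices into (healthy, rejected) using smoke_check(device) -> bool."""
    healthy, rejected = [], []
    for device in devices:
        try:
            ok = bool(smoke_check(device))
        except Exception:
            # A device that raises during the check is treated as unhealthy.
            ok = False
        (healthy if ok else rejected).append(device)
    if not healthy:
        raise RuntimeError("no device passed the smoke check")
    return healthy, rejected


# Stand-in check: pretend that "cuda:1" fails its dense solve.
def fake_check(device):
    if device == "cuda:1":
        raise RuntimeError("simulated solver failure")
    return True


healthy, rejected = select_healthy_devices(["cuda:0", "cuda:1"], fake_check)
print(healthy, rejected)  # ['cuda:0'] ['cuda:1']
```

Raising when no device passes makes the failure explicit rather than letting a solve proceed on an empty device list.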

NTX also provides a separate multiprocess path:

```bash
python scripts/profile_multiprocess_runtime.py --backend gpu --workers 2
```

That path runs one Python worker per GPU with process-local
`CUDA_VISIBLE_DEVICES` pinning. It is the current robust route for office
hardware because it avoids the single-process cuSolver failure mode seen on
`cuda:1`.
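
Process-local pinning can be sketched with the standard library alone. The example below is an illustration, not the implementation in `scripts/profile_multiprocess_runtime.py`: `launch_pinned_workers` and `WORKER_CODE` are hypothetical, and each worker only echoes the variable it was given rather than running a solve. The key point is that `CUDA_VISIBLE_DEVICES` is set in the child's environment before the child starts, so it is in place before any CUDA initialization there.

```python
import os
import subprocess
import sys

# Stand-in worker body: a real worker would import JAX and run its share
# of the scan; this one just reports which GPU it was pinned to.
WORKER_CODE = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def launch_pinned_workers(gpu_ids, worker_code=WORKER_CODE):
    """Start one Python process per GPU id and return each worker's output."""
    procs = []
    for gpu_id in gpu_ids:
        env = dict(os.environ)  # copy, so the parent environment is untouched
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        procs.append(
            subprocess.Popen(
                [sys.executable, "-c", worker_code],
                env=env,
                stdout=subprocess.PIPE,
                text=True,
            )
        )

    outputs = []
    for gpu_id, proc in zip(gpu_ids, procs):
        out, _ = proc.communicate()  # workers run concurrently until collected
        if proc.returncode != 0:
            raise RuntimeError(f"worker for GPU {gpu_id} exited with {proc.returncode}")
        outputs.append(out.strip())
    return outputs


if __name__ == "__main__":
    print(launch_pinned_workers([0, 1]))
```

Because each worker sees exactly one device, a fault tied to a particular device index in a shared process cannot affect the other worker.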

## Current Hardware Interpretation

The current GPU lane is numerically stable and validated on office hardware.
For the small repository smoke cases, CPU remains faster in steady-state wall
time. That is expected: these grids are small enough that GPU launch and
transfer overheads dominate.

For the single-process profiler on office:

- JAX sees two GPUs
- only one passes the NTX dense-solve smoke check under the current stack
- the guarded parallel path therefore runs on the healthy subset and preserves
  correct coefficients

For the multiprocess profiler on office:

- both GPUs execute correctly when pinned to separate worker processes
- coefficient deltas are zero at the repository smoke-case tolerance
- wall time is still worse than the serial batched solve for the small smoke
  grids because process launch and per-worker compilation dominate

So the current guidance is:

- use the serial batched JAX scan for small and medium studies
- use the guarded single-process path only when all visible devices are healthy
- use the multiprocess path for larger multi-GPU throughput workloads or for
  platforms that need strict one-process-per-GPU isolation

The current scaling figures and JSON payloads are documented in
[`Performance`](performance.md).