GPU-Resident Workflows
**********************

Catalyst 2.0 supports zero-copy passthrough of GPU-resident simulation data.
Device pointers are passed directly into the Conduit node via ``set_external()``,
and the ParaView backend routes them through a Viskores pipeline without
a host copy. This page covers how device detection works, how to initialize
the Viskores backend, and how to write an adaptor for a GPU simulation.

For overlapping Catalyst processing with GPU compute, see :doc:`async_execution`.

A complete working example is available in the `catalyst-examples`_ repository
under ``ParaView/CxxCudaFullExample``.

.. _catalyst-examples: https://gitlab.kitware.com/paraview/catalyst-examples

How Device Detection Works
==========================

The ParaView backend probes each pointer individually using the CUDA Driver API::

    vtkConduitArrayUtilities::MCArrayToVTKArray()
      └─ IsDevicePointer(ptr, id, working)
           └─ vtkmDataArrayUtilities::IsDevicePointer(ptr, id)
                └─ cuPointerGetAttribute(CU_POINTER_ATTRIBUTE_MEMORY_TYPE)

If the pointer is device-resident, the backend creates a ``vtkmDataSet`` backed
by a Viskores ArrayHandle. If it is host-resident, a standard ``vtkDataSet`` is
created instead. This happens per-array, so mixed host/device channels are
supported.
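
The per-array rule can be pictured with the host-only sketch below. It is not
the backend's code: ``MemorySpace``, ``Container``, ``ChooseContainer``, and
``ClassifyChannel`` are hypothetical names, and the memory space is supplied by
the caller rather than probed through the CUDA driver.

.. code-block:: cpp

    #include <vector>

    // Hypothetical stand-in for the result of the pointer probe.
    enum class MemorySpace { Host, Device };

    // The two outcomes described above.
    enum class Container { VtkDataSet, VtkmDataSet };

    // Device-resident arrays take the Viskores-backed path, host arrays the
    // standard VTK path.
    inline Container ChooseContainer(MemorySpace space)
    {
      return space == MemorySpace::Device ? Container::VtkmDataSet
                                          : Container::VtkDataSet;
    }

    // Each array is classified on its own, so one channel may mix both kinds.
    inline std::vector<Container> ClassifyChannel(
      const std::vector<MemorySpace>& arrays)
    {
      std::vector<Container> result;
      result.reserve(arrays.size());
      for (MemorySpace space : arrays)
      {
        result.push_back(ChooseContainer(space));
      }
      return result;
    }

For a channel with one device array followed by one host array,
``ClassifyChannel({MemorySpace::Device, MemorySpace::Host})`` yields
``VtkmDataSet`` followed by ``VtkDataSet``.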

HIP (AMD GPU) pointers are detected via the equivalent ROCm runtime query.
Use ``--viskores-device=HIP`` at initialization for AMD targets.

Initializing the Viskores Backend
==================================

To enable GPU-resident processing, the Viskores backend must be initialized
at ``catalyst_initialize`` time by specifying the target device:

.. code-block:: cpp

    conduit_cpp::Node node;
    // ... script paths and implementation ...
    node["catalyst/viskores/args"].append() = "--viskores-device=CUDA";
    catalyst_initialize(conduit_cpp::c_node(&node));

Available device strings correspond to Viskores-supported backends:

* ``CUDA`` — NVIDIA GPUs
* ``HIP`` — AMD GPUs (ROCm)
* ``Kokkos`` — portable backend (requires Kokkos-enabled build)
* ``OpenMP`` — CPU parallelism via OpenMP
* ``TBB`` — Intel Threading Building Blocks (CPU)

Without this initialization step, device pointers passed via ``set_external()``
are still detected as device-resident, but the data is copied to the host and
processed on the CPU.
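
Because a mistyped device name is easy to miss, an adaptor may want to build
the argument string through a small checked helper. The sketch below is one
possible approach, not part of the Catalyst API: ``IsKnownDevice`` and
``MakeDeviceArg`` are hypothetical names, and the exact, case-sensitive match
against the device strings listed above is an assumption of this sketch.

.. code-block:: cpp

    #include <array>
    #include <stdexcept>
    #include <string>

    // Returns true if the name matches one of the device strings listed above.
    inline bool IsKnownDevice(const std::string& device)
    {
      static const std::array<const char*, 5> known = {
        "CUDA", "HIP", "Kokkos", "OpenMP", "TBB"
      };
      for (const char* name : known)
      {
        if (device == name)
        {
          return true;
        }
      }
      return false;
    }

    // Builds the "--viskores-device=<name>" argument, rejecting unknown names.
    inline std::string MakeDeviceArg(const std::string& device)
    {
      if (!IsKnownDevice(device))
      {
        throw std::invalid_argument("unknown Viskores device: " + device);
      }
      return "--viskores-device=" + device;
    }

The returned string can then be stored in the ``catalyst/viskores/args`` list
in place of the literal used in the listing above.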

Writing the Adaptor
===================

The adaptor code has the same structure as a host-memory adaptor. The only
difference is that the arrays returned by the simulation are device pointers
allocated with ``cudaMalloc``:

.. code-block:: cpp

    void Execute(int cycle, double time, Grid& grid, Attributes& attribs)
    {
      conduit_cpp::Node exec_params;
      auto channel = exec_params["catalyst/channels/grid"];
      channel["type"].set("mesh");

      auto mesh = channel["data"];
      mesh["coordsets/coords/type"].set("explicit");

      // Interleaved (AOS) points: x/y/z packed in a single float3 device buffer.
      // set_external passes the device pointer directly; no copy occurs here.
      mesh["coordsets/coords/values/x"].set_external(
        grid.GetPointsArray(), grid.GetNumberOfPoints(),
        /*offset=*/0, /*stride=*/3 * sizeof(float));
      mesh["coordsets/coords/values/y"].set_external(
        grid.GetPointsArray(), grid.GetNumberOfPoints(),
        /*offset=*/sizeof(float), /*stride=*/3 * sizeof(float));
      mesh["coordsets/coords/values/z"].set_external(
        grid.GetPointsArray(), grid.GetNumberOfPoints(),
        /*offset=*/2 * sizeof(float), /*stride=*/3 * sizeof(float));

      mesh["topologies/mesh/type"].set("unstructured");
      mesh["topologies/mesh/coordset"].set("coords");
      mesh["topologies/mesh/elements/shape"].set("hex");
      mesh["topologies/mesh/elements/connectivity"].set_external(
        grid.GetCellPoints(0), grid.GetNumberOfCells() * 8);

      // Velocity: non-interleaved (SOA) layout; the x, y, and z blocks sit
      // back to back in one device buffer.
      auto f = mesh["fields"];
      f["velocity/association"].set("vertex");
      f["velocity/topology"].set("mesh");
      f["velocity/volume_dependent"].set("false");
      f["velocity/values/x"].set_external(
        attribs.GetVelocityArray(), grid.GetNumberOfPoints(), /*offset=*/0);
      f["velocity/values/y"].set_external(
        attribs.GetVelocityArray(), grid.GetNumberOfPoints(),
        /*offset=*/grid.GetNumberOfPoints() * sizeof(float));
      f["velocity/values/z"].set_external(
        attribs.GetVelocityArray(), grid.GetNumberOfPoints(),
        /*offset=*/grid.GetNumberOfPoints() * sizeof(float) * 2);

      // Pressure: cell-centred scalar.
      f["pressure/association"].set("element");
      f["pressure/topology"].set("mesh");
      f["pressure/volume_dependent"].set("false");
      f["pressure/values"].set_external(
        attribs.GetPressureArray(), grid.GetNumberOfCells());

      catalyst_execute(conduit_cpp::c_node(&exec_params));
    }

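Always call ``cudaDeviceSynchronize()`` before ``catalyst_execute()`` so that
all device writes have completed before the backend reads the pointers. The
listing above does not make this call itself, so the caller of ``Execute``
must synchronize first.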
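
The offsets and strides in the listing above are byte counts. The sketch below
checks the same arithmetic on ordinary host memory so that it runs without a
GPU. It assumes ``float`` data, as the ``sizeof(float)`` terms in the listing
suggest. ``ReadStrided``, ``InterleavedComponent``, and ``PlanarComponent`` are
hypothetical helpers written for this illustration.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstring>
    #include <vector>

    // Reads element i of a strided float view; offset and stride are in bytes.
    inline float ReadStrided(const void* base, std::size_t offsetBytes,
      std::size_t strideBytes, std::size_t i)
    {
      assert(strideBytes >= sizeof(float));
      const auto* bytes = static_cast<const unsigned char*>(base);
      float value;
      std::memcpy(&value, bytes + offsetBytes + i * strideBytes, sizeof(float));
      return value;
    }

    // Interleaved layout (x0 y0 z0 x1 y1 z1 ...), as used for the coordinates:
    // component c starts c floats in and advances three floats per point.
    inline float InterleavedComponent(const std::vector<float>& xyz,
      std::size_t c, std::size_t i)
    {
      return ReadStrided(xyz.data(), c * sizeof(float), 3 * sizeof(float), i);
    }

    // Planar layout (x0..xn-1 y0..yn-1 z0..zn-1), as used for the velocity:
    // component c starts c * n floats in and advances one float per point.
    inline float PlanarComponent(const std::vector<float>& planar,
      std::size_t n, std::size_t c, std::size_t i)
    {
      return ReadStrided(planar.data(), c * n * sizeof(float), sizeof(float), i);
    }

With two points ``(0, 1, 2)`` and ``(10, 11, 12)``, both helpers return ``11``
for the y component of the second point, whichever layout stores the data.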

Pipeline Scripts
================

Use ``VTKm``-prefixed filter variants in pipeline scripts to keep data on the
device. Note that while the Python API uses the ``VTKm`` prefix for historical
reasons, these filters are backed by Viskores:

.. code-block:: python

    from paraview.simple import *

    LoadDistributedPlugin('VTKmFilters', remote=False, ns=globals())
    producer  = TrivialProducer(registrationName="grid")
    p2cell    = VTKmPointDatatoCellData(Input=producer)
    threshold = VTKmThreshold(Input=p2cell)
    threshold.Scalars = ['CELLS', 'velocity']
    threshold.LowerThreshold = 0
    threshold.UpperThreshold = 15

Using standard ParaView filters (e.g. plain ``Threshold``) will trigger an
implicit device-to-host copy before processing.

Known Limitation: Strided Connectivity
=======================================

Viskores requires a pointer to the end of the connectivity array for stride
computation. If your solver does not provide this directly, the backend falls
back to a device-to-host deep copy of the connectivity array only. Coordinate
and field arrays are not affected. This is tracked for a future fix.