GPU-Resident Workflows

Catalyst 2.0 supports zero-copy passthrough of GPU-resident simulation data. Device pointers are passed directly into the Conduit node via set_external() and the ParaView backend routes them through a Viskores pipeline without any host copy. This page covers how device detection works and how to write an adaptor for a GPU simulation.

For overlapping Catalyst processing with GPU compute, see Asynchronous Execution.

A complete working example is available in the catalyst-examples repository under ParaView/CxxCudaFullExample.

How Device Detection Works

The ParaView backend probes each pointer individually using the CUDA Driver API:

vtkConduitArrayUtilities::MCArrayToVTKArray()
  └─ IsDevicePointer(ptr, id, working)
       └─ vtkmDataArrayUtilities::IsDevicePointer(ptr, id)
            └─ cuPointerGetAttribute(CU_POINTER_ATTRIBUTE_MEMORY_TYPE)

If the pointer is device-resident, the backend creates a vtkmDataSet backed by a Viskores ArrayHandle. If it is host-resident, a standard vtkDataSet is created instead. This happens per-array, so mixed host/device channels are supported.

HIP (AMD GPU) pointers are detected via the equivalent ROCm runtime query. Use --viskores-device=HIP at initialization for AMD targets.
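
The per-array behaviour described above can be pictured with a small toy model. This is only a sketch and not the backend's code: ArrayRef, MemorySpace, PointerProbe and ClassifyArrays are hypothetical names introduced for illustration, and the probe is an injected callback standing in for the real driver query so that the example runs without a GPU.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Where an array's memory lives, as reported by the probe.
enum class MemorySpace { Host, Device };

// Hypothetical stand-in for the backend's pointer query.
// The real check goes through the GPU driver; here it is injected.
using PointerProbe = std::function<bool(const void*)>;

// One named array in a channel (illustrative type, not a Catalyst class).
struct ArrayRef
{
  std::string name;
  const void* data;
};

// Classify every array independently, so that a single channel
// may contain both host-resident and device-resident arrays.
std::vector<std::pair<std::string, MemorySpace>> ClassifyArrays(
  const std::vector<ArrayRef>& arrays, const PointerProbe& isDevicePointer)
{
  std::vector<std::pair<std::string, MemorySpace>> result;
  result.reserve(arrays.size());
  for (const ArrayRef& a : arrays)
  {
    const MemorySpace space =
      isDevicePointer(a.data) ? MemorySpace::Device : MemorySpace::Host;
    result.emplace_back(a.name, space);
  }
  return result;
}
```

Because the decision is made per pointer, a channel whose coordinates live on the device and whose auxiliary arrays live on the host needs no special handling in the adaptor.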

Initializing the Viskores Backend

To enable GPU-resident processing, initialize the Viskores backend in the catalyst_initialize call by specifying the target device:

conduit_cpp::Node node;
// ... script paths and implementation ...
node["catalyst/viskores/args"].append() = "--viskores-device=CUDA";
catalyst_initialize(conduit_cpp::c_node(&node));

Available device strings correspond to Viskores-supported backends:

CUDA — NVIDIA GPUs
HIP — AMD GPUs (ROCm)
Kokkos — portable backend (requires Kokkos-enabled build)
OpenMP — CPU parallelism via OpenMP
TBB — Intel Threading Building Blocks (CPU)

Without this initialization step, device pointers passed via set_external() are still detected correctly, but their data is copied from device to host and processed on the CPU.
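
Since a mistyped device name silently costs you the zero-copy path, it can help to build the argument from a checked helper. The sketch below is an assumption-laden illustration: IsKnownViskoresDevice and MakeViskoresDeviceArg are hypothetical helpers, not part of Catalyst or Viskores, and the exact, case-sensitive match against the names listed above is a choice made here rather than documented Viskores behaviour.

```cpp
#include <string>

// Returns true if `device` is one of the device names listed on this page.
// Matching is exact and case-sensitive (an assumption of this sketch).
bool IsKnownViskoresDevice(const std::string& device)
{
  static const char* const known[] = { "CUDA", "HIP", "Kokkos", "OpenMP", "TBB" };
  for (const char* name : known)
  {
    if (device == name)
    {
      return true;
    }
  }
  return false;
}

// Builds "--viskores-device=<name>", or an empty string for an unknown name
// so the caller can report the problem before calling catalyst_initialize.
std::string MakeViskoresDeviceArg(const std::string& device)
{
  return IsKnownViskoresDevice(device) ? "--viskores-device=" + device : std::string();
}
```

A non-empty result can then be appended to catalyst/viskores/args in the same way as the string literal in the initialization snippet.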

Writing the Adaptor

The adaptor code is identical to a host-memory adaptor. The only difference is that the pointers returned by the simulation refer to device memory allocated with cudaMalloc:

void Execute(int cycle, double time, Grid& grid, Attributes& attribs)
{
  conduit_cpp::Node exec_params;
  auto channel = exec_params["catalyst/channels/grid"];
  channel["type"].set("mesh");

  auto mesh = channel["data"];
  mesh["coordsets/coords/type"].set("explicit");

  // Strided AOS: x/y/z interleaved in a single float3 device buffer.
  // set_external passes the device pointer directly; no copy occurs here.
  // Offset and stride are given in bytes.
  mesh["coordsets/coords/values/x"].set_external(
    grid.GetPointsArray(), grid.GetNumberOfPoints(),
    /*offset=*/0, /*stride=*/3 * sizeof(float));
  mesh["coordsets/coords/values/y"].set_external(
    grid.GetPointsArray(), grid.GetNumberOfPoints(),
    /*offset=*/sizeof(float), /*stride=*/3 * sizeof(float));
  mesh["coordsets/coords/values/z"].set_external(
    grid.GetPointsArray(), grid.GetNumberOfPoints(),
    /*offset=*/2 * sizeof(float), /*stride=*/3 * sizeof(float));

  mesh["topologies/mesh/type"].set("unstructured");
  mesh["topologies/mesh/coordset"].set("coords");
  mesh["topologies/mesh/elements/shape"].set("hex");
  mesh["topologies/mesh/elements/connectivity"].set_external(
    grid.GetCellPoints(0), grid.GetNumberOfCells() * 8);

  // Velocity: non-interleaved (planar) SOA. One device buffer holds all x
  // values, then all y values, then all z values.
  auto f = mesh["fields"];
  f["velocity/association"].set("vertex");
  f["velocity/topology"].set("mesh");
  f["velocity/volume_dependent"].set("false");
  f["velocity/values/x"].set_external(
    attribs.GetVelocityArray(), grid.GetNumberOfPoints(), /*offset=*/0);
  f["velocity/values/y"].set_external(
    attribs.GetVelocityArray(), grid.GetNumberOfPoints(),
    /*offset=*/grid.GetNumberOfPoints() * sizeof(float));
  f["velocity/values/z"].set_external(
    attribs.GetVelocityArray(), grid.GetNumberOfPoints(),
    /*offset=*/grid.GetNumberOfPoints() * sizeof(float) * 2);

  // Pressure: cell-centred scalar.
  f["pressure/association"].set("element");
  f["pressure/topology"].set("mesh");
  f["pressure/volume_dependent"].set("false");
  f["pressure/values"].set_external(
    attribs.GetPressureArray(), grid.GetNumberOfCells());

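  // Make sure all pending device writes have finished before Catalyst
  // reads the pointers. Requires <cuda_runtime.h>.
  cudaDeviceSynchronize();
  catalyst_execute(conduit_cpp::c_node(&exec_params));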
}

Always call cudaDeviceSynchronize() before catalyst_execute() to ensure all device writes are complete before the pointers are read.
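
The offset and stride arguments used in the adaptor are byte counts, which is easy to get wrong when switching between the interleaved coordinate layout and the planar velocity layout. The following host-only sketch shows the address arithmetic those arguments describe. ReadStrided is a hypothetical helper written for this illustration. It runs on ordinary host arrays and must not be applied to device pointers, which cannot be dereferenced on the host. It also assumes that an omitted stride, as in the velocity calls, means one element (sizeof(float)).

```cpp
#include <cstddef>
#include <cstring>

// Reads element i of a logical float array described by a base pointer,
// a byte offset to element 0, and a byte stride between consecutive elements.
float ReadStrided(const void* base, std::size_t offsetBytes,
                  std::size_t strideBytes, std::size_t i)
{
  const unsigned char* p =
    static_cast<const unsigned char*>(base) + offsetBytes + i * strideBytes;
  float value;
  std::memcpy(&value, p, sizeof(float)); // avoids alignment and aliasing assumptions
  return value;
}
```

For an interleaved buffer {x0, y0, z0, x1, y1, z1, ...}, the y component of point i is ReadStrided(buf, sizeof(float), 3 * sizeof(float), i). For a planar buffer of n points {x0..x(n-1), y0..y(n-1), z0..z(n-1)}, it is ReadStrided(buf, n * sizeof(float), sizeof(float), i).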

Pipeline Scripts

Use VTKm-prefixed filter variants in pipeline scripts to keep data on the device. Note that while the Python API uses the VTKm prefix for historical reasons, these filters are backed by Viskores:

from paraview.simple import *

LoadDistributedPlugin('VTKmFilters', remote=False, ns=globals())
producer  = TrivialProducer(registrationName="grid")
p2cell    = VTKmPointDatatoCellData(Input=producer)
threshold = VTKmThreshold(Input=p2cell)
threshold.Scalars = ['CELLS', 'velocity']
threshold.LowerThreshold = 0
threshold.UpperThreshold = 15

Using standard ParaView filters (e.g. plain Threshold) will trigger an implicit device-to-host copy before processing.

Known Limitation: Strided Connectivity

Viskores requires a pointer to the end of the connectivity array for stride computation. If your solver does not provide this directly, the backend falls back to a device-to-host deep copy of the connectivity array only. Coordinate and field arrays are not affected. This is tracked for a future fix.