GPU-Resident Workflows
Catalyst 2.0 supports zero-copy passthrough of GPU-resident simulation data.
Device pointers are passed directly into the Conduit node via set_external()
and the ParaView backend routes them through a Viskores pipeline without
any host copy. This page covers how device detection works and how to write an
adaptor for a GPU simulation.
For overlapping Catalyst processing with GPU compute, see Asynchronous Execution.
A complete working example is available in the catalyst-examples repository
under ParaView/CxxCudaFullExample.
How Device Detection Works
The ParaView backend probes each pointer individually using the CUDA Driver API:
vtkConduitArrayUtilities::MCArrayToVTKArray()
└─ IsDevicePointer(ptr, id, working)
   └─ vtkmDataArrayUtilities::IsDevicePointer(ptr, id)
      └─ cuPointerGetAttribute(CUDA_POINTER_ATTRIBUTE_MEMORY_TYPE)
If the pointer is device-resident, the backend creates a vtkmDataSet backed
by a Viskores ArrayHandle. If it is host-resident, a standard vtkDataSet is
created instead. This happens per-array, so mixed host/device channels are
supported.
HIP (AMD GPU) pointers are detected via the equivalent ROCm runtime query.
Use --viskores-device=HIP at initialization for AMD targets.
Initializing the Viskores Backend
To enable GPU-resident processing, the Viskores backend must be initialized
at catalyst_initialize time by specifying the target device:
conduit_cpp::Node node;
// ... script paths and implementation ...
node["catalyst/viskores/args"].append() = "--viskores-device=CUDA";
catalyst_initialize(conduit_cpp::c_node(&node));
Available device strings correspond to Viskores-supported backends:
CUDA: NVIDIA GPUs
HIP: AMD GPUs (ROCm)
Kokkos: portable backend (requires a Kokkos-enabled build)
OpenMP: CPU parallelism via OpenMP
TBB: Intel Threading Building Blocks (CPU)
Without this initialization step, device pointers passed via set_external()
will be detected correctly but processed on the CPU via a device-to-host copy.
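As a minimal sketch of how an adaptor might build the device argument from a runtime setting, the helper below maps one of the device strings listed above to the corresponding `--viskores-device=` argument. `MakeViskoresDeviceArg` is a hypothetical name introduced here for illustration; it is not part of Catalyst or Viskores. It assumes an exact, case-sensitive match against the names on this page, which may be stricter than what the backend itself accepts.

```cpp
#include <string>

// Hypothetical helper (not part of Catalyst or Viskores).
// Returns "--viskores-device=<name>" for a device string listed on this page,
// or an empty string if the name is not one of them.
// Assumption: exact, case-sensitive matching.
std::string MakeViskoresDeviceArg(const std::string& device)
{
  static const char* const known[] = { "CUDA", "HIP", "Kokkos", "OpenMP", "TBB" };
  for (const char* name : known)
  {
    if (device == name)
    {
      return "--viskores-device=" + device;
    }
  }
  return std::string();
}
```

A non-empty result could then be appended to `catalyst/viskores/args` in the same way as the literal string in the initialization snippet above. Rejecting unknown names before `catalyst_initialize` is one way to avoid silently falling back to CPU processing.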
Writing the Adaptor
The adaptor code is identical to a host-memory adaptor. The only difference is
that the arrays returned by the simulation are cudaMalloc-allocated device
pointers:
void Execute(int cycle, double time, Grid& grid, Attributes& attribs)
{
  conduit_cpp::Node exec_params;
  auto channel = exec_params["catalyst/channels/grid"];
  channel["type"].set("mesh");
  auto mesh = channel["data"];

  mesh["coordsets/coords/type"].set("explicit");
  // Interleaved (AoS) layout: x/y/z packed as float3 in a single device buffer.
  // set_external passes the device pointer directly; no copy occurs here.
  // Offsets and strides are given in bytes.
  mesh["coordsets/coords/values/x"].set_external(
    grid.GetPointsArray(), grid.GetNumberOfPoints(),
    /*offset=*/0, /*stride=*/3 * sizeof(float));
  mesh["coordsets/coords/values/y"].set_external(
    grid.GetPointsArray(), grid.GetNumberOfPoints(),
    /*offset=*/sizeof(float), /*stride=*/3 * sizeof(float));
  mesh["coordsets/coords/values/z"].set_external(
    grid.GetPointsArray(), grid.GetNumberOfPoints(),
    /*offset=*/2 * sizeof(float), /*stride=*/3 * sizeof(float));

  mesh["topologies/mesh/type"].set("unstructured");
  mesh["topologies/mesh/coordset"].set("coords");
  mesh["topologies/mesh/elements/shape"].set("hex");
  mesh["topologies/mesh/elements/connectivity"].set_external(
    grid.GetCellPoints(0), grid.GetNumberOfCells() * 8);

  // Velocity: non-interleaved SoA. The x, y, and z components are stored as
  // consecutive blocks in one device buffer and addressed by byte offset.
  // operator[] returns a Node handle by value, so bind it with plain auto.
  auto f = mesh["fields"];
  f["velocity/association"].set("vertex");
  f["velocity/topology"].set("mesh");
  f["velocity/volume_dependent"].set("false");
  f["velocity/values/x"].set_external(
    attribs.GetVelocityArray(), grid.GetNumberOfPoints(), /*offset=*/0);
  f["velocity/values/y"].set_external(
    attribs.GetVelocityArray(), grid.GetNumberOfPoints(),
    /*offset=*/grid.GetNumberOfPoints() * sizeof(float));
  f["velocity/values/z"].set_external(
    attribs.GetVelocityArray(), grid.GetNumberOfPoints(),
    /*offset=*/grid.GetNumberOfPoints() * sizeof(float) * 2);

  // Pressure: cell-centred scalar.
  f["pressure/association"].set("element");
  f["pressure/topology"].set("mesh");
  f["pressure/volume_dependent"].set("false");
  f["pressure/values"].set_external(
    attribs.GetPressureArray(), grid.GetNumberOfCells());

  // All device writes to the arrays above must have completed before this call.
  catalyst_execute(conduit_cpp::c_node(&exec_params));
}
Always call cudaDeviceSynchronize() before catalyst_execute() to ensure
all device writes are complete before the pointers are read.
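The offset and stride arguments in the adaptor above are byte counts, which is easy to get wrong. The following sketch illustrates the addressing rule (element i lives at base + offset + i * stride) for both layouts used in the adaptor. It uses small host buffers so that it can run anywhere; the same arithmetic applies to device buffers, although device pointers cannot be dereferenced on the host like this. `ReadStrided`, `MakeInterleavedPoints`, and `MakeBlockedVelocity` are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: reads element i of a float array described by a base
// pointer, a byte offset, and a byte stride.
float ReadStrided(const void* base, std::size_t i, std::size_t offset, std::size_t stride)
{
  float value;
  const unsigned char* bytes = static_cast<const unsigned char*>(base);
  std::memcpy(&value, bytes + offset + i * stride, sizeof(float));
  return value;
}

// Two points in the interleaved coordinate layout: x0 y0 z0 x1 y1 z1.
std::vector<float> MakeInterleavedPoints()
{
  return { 0.0f, 10.0f, 20.0f, 1.0f, 11.0f, 21.0f };
}

// Two points in the blocked velocity layout: x0 x1 y0 y1 z0 z1.
std::vector<float> MakeBlockedVelocity()
{
  return { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f };
}
```

With these buffers, the y coordinate of point 1 is read with offset `sizeof(float)` and stride `3 * sizeof(float)`, while the y velocity of point 1 is read with offset `2 * sizeof(float)` (the number of points times the element size) and the default stride of `sizeof(float)`.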
Pipeline Scripts
Use VTKm-prefixed filter variants in pipeline scripts to keep data on the
device. Note that while the Python API uses the VTKm prefix for historical
reasons, these filters are backed by Viskores:
from paraview.simple import *

LoadDistributedPlugin('VTKmFilters', remote=False, ns=globals())
producer = TrivialProducer(registrationName="grid")
p2cell = VTKmPointDatatoCellData(Input=producer)
threshold = VTKmThreshold(Input=p2cell)
threshold.Scalars = ['CELLS', 'velocity']
threshold.LowerThreshold = 0
threshold.UpperThreshold = 15
Using standard ParaView filters (e.g. plain Threshold) will trigger an
implicit device-to-host copy before processing.
Known Limitation: Strided Connectivity
Viskores requires a pointer to the end of the connectivity array for stride computation. If your solver does not provide this directly, the backend falls back to a device-to-host deep copy of the connectivity array only. Coordinate and field arrays are not affected. This is tracked for a future fix.