Multi-GPU distributed joins are not shipped. The production join paths are still single-GPU CUDA routes such as hash_join_v2, WCOJ, and the main-only factorized routes described elsewhere in the Architecture tab.
Treat this page as design-state documentation. Do not promise distributed join execution, cross-device partitioning, or multi-GPU WCOJ in current release artifacts.

What Exists

xlog-cuda contains a MultiGpuMemoryManager substrate. It wraps a GpuDevicePool, builds one GpuMemoryManager per device, and supports:
  • device-count inspection;
  • allocation on a specified device;
  • round-robin allocation on the next device;
  • per-device remaining-byte reporting;
  • access to the underlying device pool.
CUDA certification also checks basic multi-GPU detection consistency when the test environment exposes multiple devices.
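
The bookkeeping behaviors listed above can be illustrated with a small host-side model. This is a sketch under stated assumptions, not the xlog-cuda API: the class and method names (DeviceBudget, MultiDeviceModel, allocate_on, allocate_round_robin, remaining_bytes) are hypothetical, no GPU memory is touched, and the choice to advance the round-robin cursor only after a successful allocation is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class DeviceBudget:
    """Hypothetical stand-in for one per-device memory manager."""
    capacity_bytes: int
    used_bytes: int = 0

    def allocate(self, size: int) -> None:
        # Reject requests that exceed what is left on this device.
        if size > self.remaining():
            raise MemoryError("device budget exceeded")
        self.used_bytes += size

    def remaining(self) -> int:
        return self.capacity_bytes - self.used_bytes


class MultiDeviceModel:
    """Toy model of multi-device bookkeeping; not the real substrate."""

    def __init__(self, capacities):
        # One budget per device, mirroring "one manager per device".
        self.devices = [DeviceBudget(c) for c in capacities]
        self._next = 0

    def device_count(self) -> int:
        return len(self.devices)

    def allocate_on(self, device: int, size: int) -> int:
        # Allocation on a caller-specified device.
        self.devices[device].allocate(size)
        return device

    def allocate_round_robin(self, size: int) -> int:
        # Allocation on the next device in rotation. The cursor moves
        # only if the allocation succeeded (an assumption of this sketch).
        device = self.allocate_on(self._next, size)
        self._next = (self._next + 1) % len(self.devices)
        return device

    def remaining_bytes(self):
        # Per-device remaining-byte report.
        return [d.remaining() for d in self.devices]


pool = MultiDeviceModel([1024, 1024])
first = pool.allocate_round_robin(256)   # lands on device 0
second = pool.allocate_round_robin(128)  # lands on device 1
pool.allocate_on(0, 64)                  # explicit placement on device 0
print(pool.remaining_bytes())            # [704, 896]
```

Nothing in this model partitions relations or moves rows between devices, which is the gap the next section describes.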

What Does Not Exist Yet

The repository does not currently ship:
  • a distributed relation buffer type for query execution;
  • partitioning kernels that route rows by hash key across devices;
  • peer-to-peer shuffle orchestration for joins;
  • distributed hash-join execution;
  • cross-device WCOJ or Free Join;
  • optimizer costing for multi-GPU partition plans.
Those pieces are future architecture work.

Design Direction

A future distributed hash join would likely use hash partitioning:
  1. compute a partition index for each row by hashing the join key;
  2. move the left and right partitions that share an index to the same device;
  3. run the normal local join kernel per device;
  4. concatenate or expose the partitioned result as a distributed relation.
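
The four steps above can be sketched on the host as follows. This is a minimal pure-Python model, not a shipped code path: "devices" are plain lists, the functions partition_of, partition_rows, local_hash_join, and partitioned_join are hypothetical names, and Python's built-in hash on integer keys stands in for whatever hash function a real partitioning kernel would use.

```python
from collections import defaultdict


def partition_of(key, num_devices):
    # Step 1: derive a partition index from the join key.
    return hash(key) % num_devices


def partition_rows(rows, num_devices):
    # Route each (key, payload) row to the bucket for its partition index.
    parts = [[] for _ in range(num_devices)]
    for key, payload in rows:
        parts[partition_of(key, num_devices)].append((key, payload))
    return parts


def local_hash_join(left, right):
    # Stand-in for the normal single-device join: build on left, probe with right.
    table = defaultdict(list)
    for key, payload in left:
        table[key].append(payload)
    return [(key, l, r) for key, r in right for l in table.get(key, ())]


def partitioned_join(left, right, num_devices):
    left_parts = partition_rows(left, num_devices)
    right_parts = partition_rows(right, num_devices)
    # Step 2 is modeled by pairing buckets with the same index;
    # step 3 runs the local join once per "device".
    per_device = [local_hash_join(l, r) for l, r in zip(left_parts, right_parts)]
    # Step 4: concatenate the per-device results.
    return [row for part in per_device for row in part]


left = [(1, "a"), (2, "b"), (5, "c")]
right = [(1, "x"), (5, "y"), (5, "z"), (7, "w")]
result = partitioned_join(left, right, num_devices=4)
```

Because equal keys always hash to the same partition, the partitioned result contains the same rows as a single local join over the full inputs. Row order across partitions is not guaranteed to match, so comparisons should sort both sides first.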
This hash-partitioning design still has unresolved production requirements:
  • skew handling for hot keys;
  • memory budgeting across devices;
  • peer-to-peer versus host-mediated copies;
  • deterministic result ordering or explicit unordered semantics;
  • relation-generation and cache invalidation across devices;
  • fallback behavior when only one GPU is present.
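
To make the skew requirement concrete, the sketch below measures how unevenly a key column would spread across devices under plain modulo hash partitioning. It is an illustration only: partition_sizes and skew_ratio are hypothetical helpers, and the metric (largest partition divided by the mean partition size) is one possible choice, not one the project has adopted.

```python
from collections import Counter


def partition_sizes(keys, num_devices):
    # Count how many rows each device would receive.
    counts = Counter(hash(k) % num_devices for k in keys)
    return [counts.get(d, 0) for d in range(num_devices)]


def skew_ratio(sizes):
    # Largest partition divided by the mean size; 1.0 means perfectly balanced.
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# One hot key (7) appears 90 times; ten other keys appear once each.
keys = [7] * 90 + list(range(10))
sizes = partition_sizes(keys, num_devices=4)
print(sizes)              # [3, 3, 2, 92]
print(skew_ratio(sizes))  # 3.68
```

In this example one device receives 92 of 100 rows, so the join runs at roughly the speed of that single device and the memory budget on it must absorb almost the whole input. A production design would need a strategy for such keys before hash partitioning could be relied on.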

User Guidance

For current workloads, plan around one CUDA device per executor. Use the single-GPU execution, WCOJ, adaptive-indexing, and factorized-execution pages for the routes that actually run today. If you see multi-GPU allocation APIs in source, treat them as substrate, not as evidence that distributed joins are available.
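
One common way to keep each executor on a single device is the CUDA_VISIBLE_DEVICES environment variable, which the CUDA runtime reads to restrict which GPUs a process can see. The sketch below only builds the environment for a child process. It assumes executors are launched as separate OS processes, which this page does not state, and executor_env is a hypothetical helper rather than part of the project.

```python
import os


def executor_env(device_index, base_env=None):
    # Copy the base environment so the caller's mapping is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # Expose exactly one GPU to the child; inside the child it appears as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


# Environments for two executors, each pinned to its own physical GPU.
envs = [executor_env(i, base_env={"PATH": "/usr/bin"}) for i in range(2)]
```

Each resulting mapping can be passed as the env argument of subprocess.Popen when starting an executor.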