Multi-GPU distributed joins are not shipped. The production join paths are still
single-GPU CUDA routes such as hash_join_v2, worst-case optimal join (WCOJ), and
the main-only factorized routes described elsewhere in the Architecture tab.
Treat this page as design-state documentation. Do not promise distributed join
execution, cross-device partitioning, or multi-GPU WCOJ in current release
artifacts.
What Exists
xlog-cuda contains a MultiGpuMemoryManager substrate. It wraps a
GpuDevicePool, builds one GpuMemoryManager per device, and supports:
- device-count inspection;
- allocation on a specified device;
- round-robin allocation on the next device;
- per-device remaining-byte reporting;
- access to the underlying device pool.
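To make the listed capabilities concrete, here is a minimal host-side sketch of the same bookkeeping in Python. It is a toy model, not the xlog-cuda API: ToyDeviceBudget, ToyMultiDeviceAllocator, and every method name are hypothetical, and no GPU memory is touched. It only illustrates how allocation on a specified device, round-robin allocation, and per-device remaining-byte reporting relate to each other.

```python
from dataclasses import dataclass


@dataclass
class ToyDeviceBudget:
    """Hypothetical per-device bookkeeping; no real GPU memory is involved."""
    capacity_bytes: int
    used_bytes: int = 0

    def allocate(self, nbytes: int) -> None:
        if nbytes < 0:
            raise ValueError("allocation size must be non-negative")
        if self.used_bytes + nbytes > self.capacity_bytes:
            raise MemoryError("device budget exceeded")
        self.used_bytes += nbytes

    @property
    def remaining_bytes(self) -> int:
        return self.capacity_bytes - self.used_bytes


class ToyMultiDeviceAllocator:
    """Hypothetical stand-in for a multi-device manager: one budget per device."""

    def __init__(self, capacities_bytes: list[int]) -> None:
        if not capacities_bytes:
            raise ValueError("at least one device is required")
        self._devices = [ToyDeviceBudget(c) for c in capacities_bytes]
        self._next = 0  # round-robin cursor

    def device_count(self) -> int:
        return len(self._devices)

    def allocate_on(self, device: int, nbytes: int) -> int:
        """Allocate on a specified device and return its index."""
        self._devices[device].allocate(nbytes)
        return device

    def allocate_next(self, nbytes: int) -> int:
        """Allocate on the next device in round-robin order."""
        device = self._next
        self._devices[device].allocate(nbytes)
        # Advance the cursor only after a successful allocation.
        self._next = (device + 1) % len(self._devices)
        return device

    def remaining_bytes(self) -> list[int]:
        """Report the remaining budget of every device, in device order."""
        return [d.remaining_bytes for d in self._devices]
```

Nothing in this model routes rows, shuffles data, or runs a join. It only tracks where bytes were placed, which is the sense in which the real manager is substrate rather than a distributed execution path.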
CUDA certification also checks basic multi-GPU detection consistency when the
test environment exposes multiple devices.
What Does Not Exist Yet
The repository does not currently ship:
- a distributed relation buffer type for query execution;
- partitioning kernels that route rows by hash key across devices;
- peer-to-peer shuffle orchestration for joins;
- distributed hash-join execution;
- cross-device WCOJ or Free Join;
- optimizer costing for multi-GPU partition plans.
Those pieces are future architecture work.
Design Direction
A future distributed hash join would likely use hash partitioning:
- compute a partition for each row from the join key;
- move left and right partitions to the same device;
- run the normal local join kernel per device;
- concatenate or expose the partitioned result as a distributed relation.
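The four steps above can be sketched on the host in Python. This is an illustration of the hash-partitioning idea only, under stated assumptions: rows are plain tuples, lists stand in for devices, and all function names are hypothetical. It is not the repository's join kernel and it performs no device transfers.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_devices: int) -> int:
    """Map a join key to a device index with a process-stable hash."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_devices


def partition_rows(rows, key_index: int, num_devices: int):
    """Steps 1 and 2: route each row to the partition that owns its join key."""
    parts = [[] for _ in range(num_devices)]
    for row in rows:
        parts[partition_of(row[key_index], num_devices)].append(row)
    return parts


def local_hash_join(left, right, left_key: int, right_key: int):
    """Step 3: ordinary build/probe hash join over one partition."""
    table = defaultdict(list)
    for row in left:
        table[row[left_key]].append(row)
    out = []
    for r in right:
        for l in table.get(r[right_key], []):
            out.append(l + r)  # concatenate the matching tuples
    return out


def partitioned_hash_join(left, right, left_key, right_key, num_devices):
    """Step 4: join matching partitions and concatenate the results."""
    left_parts = partition_rows(left, left_key, num_devices)
    right_parts = partition_rows(right, right_key, num_devices)
    result = []
    for l_part, r_part in zip(left_parts, right_parts):
        result.extend(local_hash_join(l_part, r_part, left_key, right_key))
    return result
```

Because both inputs are partitioned with the same function, rows with equal keys always land in the same partition, so the per-partition joins together produce the same rows as one local join. The output order depends on the partition layout, so a comparison against the single-partition result should sort both sides or treat them as multisets.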
That design still has unresolved production requirements:
- skew handling for hot keys;
- memory budgeting across devices;
- peer-to-peer versus host-mediated copies;
- deterministic result ordering or explicit unordered semantics;
- relation-generation and cache invalidation across devices;
- fallback behavior when only one GPU is present.
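Two of these items, skew handling and single-GPU fallback, can be illustrated with a small host-side check. The sketch below is an assumption-laden example, not a proposed policy for the project: the function names, the max-over-mean skew metric, and the 1.5 threshold are all hypothetical, chosen only to show what a planner would need to measure before committing to a partitioned plan.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_devices: int) -> list[int]:
    """Count how many rows each partition would receive."""
    counts = Counter(
        zlib.crc32(repr(k).encode("utf-8")) % num_devices for k in keys
    )
    return [counts.get(d, 0) for d in range(num_devices)]


def skew_ratio(sizes: list[int]) -> float:
    """Largest partition divided by the mean size (1.0 is perfectly even)."""
    total = sum(sizes)
    if total == 0:
        return 1.0
    return max(sizes) / (total / len(sizes))


def choose_plan(keys, visible_devices: int, max_skew: float = 1.5) -> str:
    """Hypothetical policy: stay local with one GPU or when a hot key dominates."""
    if visible_devices <= 1:
        return "single-gpu"
    if skew_ratio(partition_sizes(keys, visible_devices)) > max_skew:
        return "single-gpu"
    return "partitioned"
```

A column where every row shares one key sends all rows to a single partition, giving a ratio equal to the device count. That is the hot-key case in which plain hash partitioning gains nothing over the local join.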
User Guidance
For current workloads, plan around one CUDA device per executor. Use the
single-GPU execution, WCOJ, adaptive-indexing, and factorized-execution pages for
the routes that actually run today.
If you see multi-GPU allocation APIs in source, treat them as substrate, not as
evidence that distributed joins are available.