Struct CudaKernelProvider

pub struct CudaKernelProvider { /* private fields */ }

CUDA kernel provider for xlog GPU operations

Manages pre-compiled PTX modules for relational operations:

Join: Hash join with build/probe phases
Dedup: Sort-based deduplication with prefix-sum compaction
GroupBy: Sorted-input group aggregation (count, sum, min, max)

PTX modules are loaded at construction time and stored in the CUDA device. Kernel functions can be retrieved using device.get_func().

§Example

use std::sync::Arc;
use xlog_cuda::{CudaDevice, GpuMemoryManager, CudaKernelProvider};
use xlog_core::MemoryBudget;

let device = Arc::new(CudaDevice::new(0)?);
let memory = Arc::new(GpuMemoryManager::new(device.clone(), MemoryBudget::default()));
let provider = CudaKernelProvider::new(device, memory)?;

Implementations§

impl CudaKernelProvider

pub fn to_dlpack_table(&self, buffer: CudaBuffer) -> DlpackTable

Convert a CudaBuffer into a DLPack-exportable table without device↔host copies.

Export each column with DlpackTable::column(...).

pub fn from_dlpack_tensors( &self, tensors: Vec<DlpackManagedTensor>, ) -> Result<CudaBuffer>

Import one DLPack tensor per column as a zero-copy CudaBuffer.

The returned buffer owns the DLPack tensors and will call their deleters on drop.

pub fn from_dlpack_tensors_with_schema( &self, schema: Schema, tensors: Vec<DlpackManagedTensor>, ) -> Result<CudaBuffer>

Import DLPack column tensors with an explicit schema (type-checked).

impl CudaKernelProvider

pub fn create_constant_column( &self, value_bytes: &[u8], col_type: ScalarType, num_rows: u64, ) -> Result<CudaBuffer>

Create a column filled with a constant value

§Arguments

value_bytes - The raw bytes of the constant value (in little-endian format)
col_type - The ScalarType of the column
num_rows - Number of rows to create

§Returns

A new single-column CudaBuffer filled with the constant value

pub fn create_constant_column_with_device_count( &self, value_bytes: &[u8], col_type: ScalarType, row_cap: u64, d_num_rows_src: &TrackedCudaSlice<u32>, ) -> Result<CudaBuffer>

Create a constant column sized to row_cap while preserving device row count from d_num_rows_src.

pub fn add_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>

Element-wise addition of two single-column buffers

Performs element-wise addition using GPU kernels. Uses wrapping arithmetic for integer overflow.

§Arguments

a - First operand buffer (single column)
b - Second operand buffer (single column)

§Returns

A new CudaBuffer containing the element-wise sum

§Errors

Returns XlogError::Kernel if:

Row counts don’t match
Buffers are not single-column
Type is not supported for arithmetic
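
As a host-side illustration of the wrapping behaviour described above, the following is a minimal CPU sketch for i32 columns. The function name `add_columns_ref` is hypothetical and not part of this crate; it returns `None` where the GPU method would report `XlogError::Kernel` for mismatched row counts.

```rust
/// CPU reference (hypothetical helper): element-wise add with wrapping
/// overflow. `None` models the row-count mismatch error.
fn add_columns_ref(a: &[i32], b: &[i32]) -> Option<Vec<i32>> {
    if a.len() != b.len() {
        return None;
    }
    // wrapping_add never panics; i32::MAX + 1 wraps to i32::MIN.
    Some(a.iter().zip(b).map(|(x, y)| x.wrapping_add(*y)).collect())
}
```

The same shape applies to subtraction and multiplication with `wrapping_sub` and `wrapping_mul`.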

pub fn sub_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>

Element-wise subtraction of two single-column buffers

Performs element-wise subtraction using GPU kernels. Uses wrapping arithmetic for integer overflow.

§Arguments

a - First operand buffer (single column)
b - Second operand buffer (single column)

§Returns

A new CudaBuffer containing the element-wise difference

§Errors

Returns XlogError::Kernel if:

Row counts don’t match
Buffers are not single-column
Type is not supported for arithmetic

pub fn mul_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>

Element-wise multiplication of two single-column buffers

Performs element-wise multiplication using GPU kernels. Uses wrapping arithmetic for integer overflow.

§Arguments

a - First operand buffer (single column)
b - Second operand buffer (single column)

§Returns

A new CudaBuffer containing the element-wise product

§Errors

Returns XlogError::Kernel if:

Row counts don’t match
Buffers are not single-column
Type is not supported for arithmetic

pub fn div_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>

Element-wise division of two single-column buffers

Performs element-wise division using GPU kernels. For signed integers, division by zero returns i64::MAX/i32::MAX. For unsigned integers, division by zero returns u64::MAX/u32::MAX. For floats, division by zero produces Inf/NaN as per IEEE 754.

§Arguments

a - Dividend buffer (single column)
b - Divisor buffer (single column)

§Returns

A new CudaBuffer containing the element-wise quotient

§Errors

Returns XlogError::Kernel if:

Row counts don’t match
Buffers are not single-column
Type is not supported for arithmetic
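
The division-by-zero rules above can be restated as scalar CPU functions. This is a sketch only: the `*_ref` names are hypothetical, and the handling of `i64::MIN / -1` (wrapping here) is an assumption, since the documentation does not specify it.

```rust
/// Signed integers: division by zero yields the type's MAX.
/// Assumption: i64::MIN / -1 wraps (not stated in the docs).
fn div_i64_ref(a: i64, b: i64) -> i64 {
    if b == 0 { i64::MAX } else { a.wrapping_div(b) }
}

/// Unsigned integers: division by zero yields the type's MAX.
fn div_u32_ref(a: u32, b: u32) -> u32 {
    if b == 0 { u32::MAX } else { a / b }
}

/// Floats: plain IEEE 754 division (x/0 is +/-Inf, 0/0 is NaN).
fn div_f64_ref(a: f64, b: f64) -> f64 {
    a / b
}
```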

pub fn mod_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>

Element-wise modulo of two single-column buffers

Performs element-wise modulo using GPU kernels. For integers, modulo by zero returns 0. For floats, modulo by zero produces NaN as per IEEE 754.

§Arguments

a - Dividend buffer (single column)
b - Divisor buffer (single column)

§Returns

A new CudaBuffer containing the element-wise remainder

§Errors

Returns XlogError::Kernel if:

Row counts don’t match
Buffers are not single-column
Type is not supported for arithmetic

pub fn abs_column(&self, a: &CudaBuffer) -> Result<CudaBuffer>

Element-wise absolute value of a single-column buffer

Performs element-wise absolute value using GPU kernels.

§Arguments

a - Input buffer (single column)

§Returns

A new CudaBuffer containing the absolute values

§Errors

Returns XlogError::Kernel if:

Buffer is not single-column
Type is not supported for arithmetic

pub fn min_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>

Element-wise minimum of two single-column buffers

Performs element-wise minimum using GPU kernels.

§Arguments

a - First operand buffer (single column)
b - Second operand buffer (single column)

§Returns

A new CudaBuffer containing the element-wise minimums

§Errors

Returns XlogError::Kernel if:

Row counts don’t match
Buffers are not single-column
Type is not supported for arithmetic

pub fn max_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>

Element-wise maximum of two single-column buffers

Performs element-wise maximum using GPU kernels.

§Arguments

a - First operand buffer (single column)
b - Second operand buffer (single column)

§Returns

A new CudaBuffer containing the element-wise maximums

§Errors

Returns XlogError::Kernel if:

Row counts don’t match
Buffers are not single-column
Type is not supported for arithmetic

pub fn pow_columns( &self, base: &CudaBuffer, exp: &CudaBuffer, ) -> Result<CudaBuffer>

Element-wise power of two single-column buffers

Converts both operands to f64, computes x^y on the GPU, and returns f64 result. This matches the behavior of most database systems where pow() returns a float.

§Arguments

base - Base values buffer (single column)
exp - Exponent values buffer (single column)

§Returns

A new CudaBuffer containing the element-wise powers as f64

§Errors

Returns XlogError::Kernel if:

Row counts don’t match
Buffers are not single-column
Type is not supported for arithmetic
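
To make the "convert to f64, then compute x^y" rule concrete, here is a CPU sketch for i64 inputs. `pow_columns_ref` is a hypothetical name, and using `f64::powf` as the host-side stand-in for the GPU power function is an assumption.

```rust
/// CPU reference (hypothetical helper): both operands are widened to f64
/// and the result column is f64, regardless of the integer input type.
fn pow_columns_ref(base: &[i64], exp: &[i64]) -> Option<Vec<f64>> {
    if base.len() != exp.len() {
        return None; // models the row-count mismatch error
    }
    Some(
        base.iter()
            .zip(exp)
            .map(|(&b, &e)| (b as f64).powf(e as f64))
            .collect(),
    )
}
```

Negative exponents therefore yield fractional results rather than truncating to zero.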

pub fn select_columns( &self, mask: &CudaBuffer, then_vals: &CudaBuffer, else_vals: &CudaBuffer, ) -> Result<CudaBuffer>

Conditional select between two single-column buffers based on a boolean mask.

For each row: out[i] = mask[i] ? then_vals[i] : else_vals[i]

§Arguments

mask - Boolean mask buffer (single column, type Bool/u8)
then_vals - Values to select when mask is true
else_vals - Values to select when mask is false

§Returns

A new CudaBuffer with values selected based on the mask

§Errors

Returns XlogError::Kernel if:

Row counts don’t match
Buffers are not single-column
Types of then/else values don’t match
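
The per-row rule `out[i] = mask[i] ? then_vals[i] : else_vals[i]` can be modelled on the host as follows. `select_ref` is a hypothetical helper, and treating any non-zero mask byte as true is an assumption about the Bool/u8 encoding.

```rust
/// CPU reference (hypothetical helper) for the conditional select.
/// `None` models the row-count mismatch error.
fn select_ref<T: Copy>(mask: &[u8], then_vals: &[T], else_vals: &[T]) -> Option<Vec<T>> {
    if mask.len() != then_vals.len() || mask.len() != else_vals.len() {
        return None;
    }
    Some(
        mask.iter()
            .zip(then_vals.iter().zip(else_vals))
            // Assumption: non-zero means true.
            .map(|(&m, (&t, &e))| if m != 0 { t } else { e })
            .collect(),
    )
}
```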

pub fn cast_column( &self, a: &CudaBuffer, target: ScalarType, ) -> Result<CudaBuffer>

Cast a single-column buffer to a different type

Casts data on the GPU using the arithmetic cast kernel.

§Arguments

a - Input buffer (single column)
target - Target scalar type

§Returns

A new CudaBuffer with the cast values

§Errors

Returns XlogError::Kernel if:

Buffer is not single-column
Source or target type is not supported for casting

pub fn combine_columns( &self, columns: Vec<CudaBuffer>, types: Vec<ScalarType>, ) -> Result<CudaBuffer>

Combine multiple single-column buffers into a multi-column buffer

§Arguments

columns - Vector of single-column CudaBuffers to combine
types - Vector of ScalarTypes for each column

§Returns

A new CudaBuffer with all columns combined

impl CudaKernelProvider

pub fn compare_columns<T: GpuScalar>( &self, input: &CudaBuffer, left: usize, right: usize, op: CompareOp, ) -> Result<TrackedCudaSlice<u8>>

Generic compare-columns: produce a device mask for left <op> right.

Replaces: compare_columns_u32, compare_columns_i32, compare_columns_i64, compare_columns_u64, compare_columns_f32, compare_columns_f64, compare_columns_u8.

pub fn filter<T: GpuScalar>( &self, input: &CudaBuffer, col: usize, value: T, op: CompareOp, ) -> Result<CudaBuffer>

Generic filter: keep rows where column[col] <op> value.

Dispatches between fused compare+scan+compact (u32, f64) and mask+compact (all other types).

Replaces: filter_u32, filter_f64, filter_i32, filter_u64, filter_f32, filter_bool.

pub fn filter_recorded<T: GpuScalar>( &self, input: &CudaBuffer, col: usize, value: T, op: CompareOp, launch_stream: StreamId, ) -> Result<CudaBuffer>

Strict-recorder, end-to-end variant of Self::filter — the first composed migrated DATA path.

Composes Self::compare_const_mask_recorded and Self::compact_buffer_by_device_mask_counted_recorded on a single launch_stream. Each primitive builds its own crate::launch::LaunchRecorder, records its uses, preflights, runs its kernels, and commits independently.

Composition correctness rests on the runtime’s “record-all, wait-all” semantics: every record_block_use call APPENDS a fresh event to the live entry’s last_use_events: Vec<CudaEvent>, and deallocate waits on EVERY event in that vector before queueing cuMemFreeAsync. So the compare’s commit and the compact’s later commit each push their own event for input.column[i] (and other shared buffers), and the deallocate gates the free behind both — closing the cross-stream lifetime gap end-to-end. (Latest-event coalescing per (block, launch_stream) is a possible future optimization; today every recorded use is retained and waited on.)
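
The "record-all, wait-all" rule can be pictured with a toy host-side model. Everything below is a simplification for illustration: `LiveEntry` is a hypothetical type, a `u64` id stands in for a `CudaEvent`, and `deallocate` merely returns the events the real runtime would wait on before queueing the free.

```rust
/// Toy model of one live allocation (hypothetical, not the runtime's type).
#[derive(Default)]
struct LiveEntry {
    last_use_events: Vec<u64>,
}

impl LiveEntry {
    /// Every recorded use appends a fresh event; nothing is coalesced.
    fn record_block_use(&mut self, event: u64) {
        self.last_use_events.push(event);
    }

    /// Returns every event the free must be gated behind, in record order.
    fn deallocate(self) -> Vec<u64> {
        self.last_use_events
    }
}
```

With two independent commits (compare, then compact) each recording one use of the same column, the free is gated behind both events.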

§Dispatch

For types with a fused filter_compare_*_scan_phase1 kernel (u32, f64), routes to Self::filter_fused_scan_recorded — single-pass compare+scan+compact mirror of the legacy fast path. For all other types, composes Self::compare_const_mask_recorded + Self::compact_buffer_by_device_mask_counted_recorded.

§Errors

Propagates the structured XlogError::Kernel errors produced by either underlying recorded primitive (legacy manager, unresolved launch_stream, external column, preflight / commit failures, kernel launch failures, cu_stream.synchronize() before host scalar read).

pub fn filter_fused_scan_recorded<T: GpuScalar>( &self, input: &CudaBuffer, col: usize, value: T, op: CompareOp, launch_stream: StreamId, ) -> Result<CudaBuffer>

Strict-recorder variant of Self::filter_fused_scan — the migrated fused compare+scan+compact fast path for u32 and f64.

Mirrors the legacy chain on a single explicit launch_stream:

1. filter_compare_T_scan_phase1: fused compare + block-local scan that produces d_mask, d_prefix_sum, d_block_sums in one launch.
2. When num_blocks > 1, multiblock_scan_u32_inplace_on_stream on d_block_sums followed by multiblock_scan_phase3 to propagate block offsets into d_prefix_sum.
3. capture_compact_count: writes d_out_count for the masked total.
4. cu_stream.synchronize(): explicitly orders the host scalar read of d_out_count against the pending capture kernel.
5. dtoh_scalar_untracked(&d_out_count, 0) → output_rows.
6. Per-input-column compact_bytes_by_mask on the same launch_stream.
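
The block-local scan, block-sum scan, and offset propagation in the chain above follow the usual multi-block scan pattern. The sketch below is a sequential CPU model of that pattern, not the kernels themselves; `blocked_exclusive_scan` is a hypothetical name, and the choice of an exclusive (rather than inclusive) prefix sum is an assumption.

```rust
/// Three-phase exclusive scan over a 0/1 mask with `block` elements per
/// block. Returns (prefix_sum, total_kept).
fn blocked_exclusive_scan(mask: &[u8], block: usize) -> (Vec<u32>, u32) {
    assert!(block > 0, "block size must be positive");
    let mut prefix = vec![0u32; mask.len()];
    let mut block_sums = Vec::new();
    // Phase 1: scan each block locally and remember the block's total.
    for (chunk, out) in mask.chunks(block).zip(prefix.chunks_mut(block)) {
        let mut running = 0u32;
        for (m, p) in chunk.iter().zip(out.iter_mut()) {
            *p = running;
            running += u32::from(*m != 0);
        }
        block_sums.push(running);
    }
    // Phase 2: exclusive scan of the per-block totals.
    let mut total = 0u32;
    for s in block_sums.iter_mut() {
        let this_block = *s;
        *s = total;
        total += this_block;
    }
    // Phase 3: add each block's offset to its local prefix values.
    for (out, offset) in prefix.chunks_mut(block).zip(&block_sums) {
        for p in out {
            *p += *offset;
        }
    }
    (prefix, total)
}
```

The returned total plays the role of `d_out_count`, and `prefix[i]` is the output position of row `i` when `mask[i]` is set.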

§Strict-mode contract

Identical to Self::compact_buffer_by_device_mask_counted_recorded: input.num_rows_device() and every input.column(i) recorded as reads BEFORE preflight; every fresh runtime-backed allocation (d_mask, d_prefix_sum, d_block_sums, d_out_count, each dst_col) recorded via write BEFORE preflight; the recorder snapshots block identity at record time and drops the source borrow, so kernel &mut borrows after preflight remain valid before the kernels enqueue.

§Panics

Panics if T::filter_scan_phase1_kernel() is None. Callers should only reach this method for u32 / f64.

pub fn filter_columns_recorded<T: GpuScalar>( &self, input: &CudaBuffer, left: usize, right: usize, op: CompareOp, launch_stream: StreamId, ) -> Result<CudaBuffer>

Strict-recorder, end-to-end variant of column-column filter: keep rows where column[left] <op> column[right].

Composes Self::compare_columns_mask_recorded and Self::compact_buffer_by_device_mask_counted_recorded on a single launch_stream. Same composition contract as Self::filter_recorded: each primitive builds its own recorder and commits independently; the runtime appends every recorded event to last_use_events, and deallocate waits on every event, so input columns referenced by BOTH the compare AND the per-column compacts are correctly gated end-to-end.

§Errors

Propagates the structured XlogError::Kernel errors produced by either underlying recorded primitive (legacy manager, unresolved launch_stream, external column on either input side, preflight / commit failures, kernel launch failures, cu_stream.synchronize() before host scalar read).

pub fn compare_const_mask_recorded<T: GpuScalar>( &self, input: &CudaBuffer, col: usize, value: T, op: CompareOp, launch_stream: StreamId, ) -> Result<TrackedCudaSlice<u8>>

Strict-recorder variant of Self::compare_const_mask.

Runs the filter compare kernel on the caller-supplied launch_stream and threads the column read through the runtime via LaunchRecorder. This is the second migrated launch path (after memset_recorded) and the first kernel-driven one — it is intentionally a sibling of the legacy Self::compare_const_mask rather than a replacement. Existing callers stay on the legacy path until the broader filter migration lands.

§Strict-mode contract

Requires the provider’s manager to be built via crate::GpuMemoryManager::with_runtime; otherwise returns XlogError::Kernel before any allocation.
input.column(col) is recorded as a read; external (CudaColumn::Dlpack / CudaColumn::ArrowDevice) columns are rejected at preflight, before the kernel is enqueued.
d_mask is freshly allocated through the same runtime-backed manager. By construction its runtime_block() is Some, so its write recording cannot strict-reject. The write is therefore noted AFTER the kernel is enqueued — this sidesteps the borrow conflict between &mut d_mask (cudarc kernel param) and &d_mask (recorder). A future migration that may write to a buffer of unknown provenance must instead capture identity pre-launch (e.g. via a raw view) so strict rejection happens at preflight.

§Errors

XlogError::Kernel if the manager has no runtime, or if launch_stream does not resolve.
XlogError::Kernel from preflight (external column, unsupported active resource).
XlogError::Kernel from the underlying CUDA launch.
XlogError::Kernel from commit on transient record_block_use failure.

pub fn compare_columns_mask_recorded<T: GpuScalar>( &self, input: &CudaBuffer, left: usize, right: usize, op: CompareOp, launch_stream: StreamId, ) -> Result<TrackedCudaSlice<u8>>

Strict-recorder variant of Self::compare_columns_mask.

Runs the column-column compare kernel on the caller-supplied launch_stream and threads BOTH column reads through the runtime via LaunchRecorder. Sibling of the legacy Self::compare_columns_mask; existing callers stay on the legacy path.

§Strict-mode contract

Requires the provider’s manager to be built via crate::GpuMemoryManager::with_runtime; otherwise returns XlogError::Kernel before any allocation.
input.column(left) and input.column(right) are both recorded as reads BEFORE preflight. External (DLPack / Arrow) columns on either side are rejected at preflight, before the kernel is enqueued.
d_mask is freshly allocated by the same runtime-backed manager; its write is recorded via the standard write API BEFORE preflight (the recorder snapshots block identity, so the kernel &mut d_mask borrow after preflight is unaffected).

§Errors

XlogError::Kernel if the manager has no runtime, or if launch_stream does not resolve.
XlogError::Kernel from preflight (external column on either side, unsupported active resource).
XlogError::Kernel from the underlying CUDA launch.
XlogError::Kernel from commit on transient record_block_use failure.

pub fn prefix_sum_mask(&self, mask: &[u8]) -> Result<(Vec<u32>, u32)>

pub fn filter_by_mask( &self, input: &CudaBuffer, mask: &[u8], ) -> Result<CudaBuffer>

Filter buffer by pre-computed mask.

§Arguments

input - The input buffer to filter
mask - Mask slice where non-zero means keep the row

§Errors

Returns error if mask length doesn’t match buffer rows.
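
For a single column, the documented mask semantics (non-zero means keep the row, lengths must match) amount to the following CPU sketch. `filter_by_mask_ref` is a hypothetical helper; the real method operates on every column of a `CudaBuffer`.

```rust
/// CPU reference (hypothetical helper): keep `column[i]` when `mask[i]`
/// is non-zero. `None` models the length-mismatch error.
fn filter_by_mask_ref<T: Copy>(column: &[T], mask: &[u8]) -> Option<Vec<T>> {
    if column.len() != mask.len() {
        return None;
    }
    Some(
        column
            .iter()
            .zip(mask)
            .filter(|(_, m)| **m != 0)
            .map(|(&v, _)| v)
            .collect(),
    )
}
```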

pub fn compact_buffer_by_device_mask_counted_recorded( &self, input: &CudaBuffer, d_mask: &TrackedCudaSlice<u8>, launch_stream: StreamId, ) -> Result<CudaBuffer>

Compact a buffer using a device-resident mask.

Computes prefix sum and output count fully on-device. Strict-recorder variant of Self::compact_buffer_by_device_mask_counted — the first migrated COMPACT path.

The compact pipeline is a multi-kernel chain: mask_clamp_rows → multiblock_scan_phase1 → multiblock_scan_u32_inplace_on_stream (recursive, only when num_blocks > 1) → multiblock_scan_phase3 → capture_compact_count → host scalar read of d_out_count → per-column compact_bytes_by_mask. Every kernel runs on the same explicit launch_stream via launch_on_stream, and the host scalar read at the chain’s middle is explicitly ordered by cu_stream.synchronize() — non-blocking streams do NOT get default-stream implicit ordering.

§Strict-mode contract

Requires the provider’s manager to be built via crate::GpuMemoryManager::with_runtime; otherwise returns XlogError::Kernel before any allocation.
d_mask is recorded as a read.
input.num_rows_device() is recorded as a read.
Each input.column(i) is recorded as a read; external columns on any side are rejected at preflight, before any CUDA work is enqueued.
Every fresh runtime-backed allocation that this function makes (d_mask_clamped, d_prefix_sum, d_block_sums, d_out_count, each dst_col) is recorded via write BEFORE the kernel chain enqueues. Locals that drop at end-of-scope (d_mask_clamped, d_prefix_sum, d_block_sums) stay safe because the runtime’s deallocate queues cuStreamWaitEvent(alloc_stream, recorded_event) BEFORE cuMemFreeAsync, gating the free on the launch_stream chain.
Intermediate block_sums allocations created by the recursive scan helper are recorded directly inside the helper (they don’t outlive the helper call).

§Errors

XlogError::Kernel if the manager has no runtime, or if launch_stream does not resolve.
XlogError::Kernel from preflight (external column on any side, unsupported active resource).
XlogError::Kernel from any underlying CUDA launch or from the launch_stream synchronize before the host scalar read.
XlogError::Kernel from commit on transient record_block_use failure.

pub fn compact_buffer_by_device_mask_counted( &self, input: &CudaBuffer, d_mask: &TrackedCudaSlice<u8>, ) -> Result<CudaBuffer>

pub fn filter_by_device_mask( &self, input: &CudaBuffer, d_mask: &CudaSlice<u8>, ) -> Result<CudaBuffer>

impl CudaKernelProvider

pub fn free_join_execute_u32_recorded( &self, inputs: &[&CudaBuffer], plan: &FjPlan, launch_stream: StreamId, ) -> Result<CudaBuffer>

Execute a hand-built Free Join plan over u32/Symbol relations via the level-synchronous frontier engine. See the module docs for the algorithm and invariants; the plan contract is documented on FjPlan.

Inputs are layout-normalized per dispatch (sorted + deduped via the existing WCOJ layout entries — already-normalized inputs take the recorded fast-path check). The output contains one column per output_vars entry (all U32) holding the join’s projected row set under set semantics.

§Errors

XlogError::Kernel if the manager has no runtime, the launch stream does not resolve, an input violates the u32 width-class layout contract, the plan is invalid (unbound probe variables, over/under-consumed atom columns, rebound variables, unknown output variables), the frontier exceeds the u32 work-index space, or any kernel launch fails.

pub fn free_join_execute_u64_recorded( &self, inputs: &[&CudaBuffer], plan: &FjPlan, launch_stream: StreamId, ) -> Result<CudaBuffer>

u64 width-class twin of Self::free_join_execute_u32_recorded: identical pipeline, contract, and invariants; every input column must be U64 and the output columns are U64.

pub fn free_join_count_by_root_u32_recorded( &self, inputs: &[&CudaBuffer], plan: &FjPlan, launch_stream: StreamId, ) -> Result<CudaBuffer>

Design §2.4 factorized count-by-root over the Free Join frontier: runs the same pipeline but reduces to (group, count) instead of materializing rows. The plan’s output_vars must be exactly [group_var]; atoms may be PARTIALLY consumed — each surviving frontier row contributes the product of its remaining live trie-range lengths (the d-representation count), so trailing private variables never expand the frontier. Output schema: (group: U32, count: U64).

u32/Symbol width-class only: the reduction reuses the recorded groupby, whose KEY columns are bounded engine-wide to U32/Symbol (multi-type recorded sort is deferred there) — u64 bodies stay on the materialize path.
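
The per-row contribution described above, the product of the remaining live trie-range lengths, is simple arithmetic. The sketch below assumes each live range is a half-open `(lo, hi)` pair of row indices; that representation and the name `row_contribution` are illustrative assumptions, not the engine's actual layout.

```rust
/// Count contributed by one surviving frontier row: the product of the
/// lengths of its remaining live ranges (assumed half-open, lo..hi).
fn row_contribution(live_ranges: &[(u32, u32)]) -> u64 {
    live_ranges
        .iter()
        .map(|&(lo, hi)| u64::from(hi.saturating_sub(lo)))
        .product()
}
```

A row with ranges of length 3 and 2 contributes 6 to its group without those 6 combinations ever being materialized; a row with no remaining ranges contributes 1.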

impl CudaKernelProvider

pub fn fj_delta_novel_u32_recorded( &self, delta: &CudaBuffer, edge: &CudaBuffer, full_r: &CudaBuffer, cols: FjDeltaCols, domain: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

One factorized semi-naive delta step: returns the full-row-deduped novel set {head(carry, value) : delta(carry, key), edge(key, value), head ∉ full_r} with column roles given by cols (orientation- and head-order-agnostic; the buffer is built in full_r’s schema).

edge must be layout-normalized key-first (lex-sorted, deduped); delta and full_r are order-insensitive. All ids must be < domain (fail-closed in-kernel check).
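
The set comprehension above can be checked against a naive CPU model. The sketch fixes one orientation (delta rows are `(carry, key)`, edge rows are `(key, value)`, heads are `(carry, value)`), whereas the real method takes these roles from `cols`; `delta_novel_ref` is a hypothetical name and the nested loop is for clarity, not performance.

```rust
use std::collections::BTreeSet;

/// Naive reference for one semi-naive delta step: join delta with edge on
/// `key`, then keep only heads not already present in `full_r`.
fn delta_novel_ref(
    delta: &[(u32, u32)],
    edge: &[(u32, u32)],
    full_r: &BTreeSet<(u32, u32)>,
) -> BTreeSet<(u32, u32)> {
    let mut novel = BTreeSet::new();
    for &(carry, key) in delta {
        for &(k, value) in edge {
            if k == key && !full_r.contains(&(carry, value)) {
                novel.insert((carry, value)); // the set dedups full rows
            }
        }
    }
    novel
}
```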

pub fn fj_delta_columns_max_u32( &self, inputs: &[(&CudaBuffer, &[usize])], launch_stream: StreamId, ) -> Result<u32>

Max value over the given u32/Symbol columns of the given buffers (one atomicMax kernel launch per column into a single zeroed cell). Used once per SCC fixpoint to derive the factorized-delta domain bound. Returns 0 for all-empty inputs.
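
The "atomicMax into a single zeroed cell" reduction has a direct host-side analogue in `AtomicU32::fetch_max`. The sketch below is a sequential CPU model under that analogy; `columns_max_ref` is a hypothetical name and takes plain slices instead of buffers and column indices.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// CPU model: fold every value of every column into one cell that starts
/// at zero, so all-empty input yields 0.
fn columns_max_ref(columns: &[&[u32]]) -> u32 {
    let cell = AtomicU32::new(0);
    for col in columns {
        for &v in col.iter() {
            cell.fetch_max(v, Ordering::Relaxed);
        }
    }
    cell.load(Ordering::Relaxed)
}
```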

impl CudaKernelProvider

pub fn fj_delta_sparse_novel_u32_recorded( &self, delta: &CudaBuffer, edge: &CudaBuffer, full_r: &CudaBuffer, cols: FjDeltaCols, max_table_bytes: u64, launch_stream: StreamId, ) -> Result<Option<CudaBuffer>>

Sparse-domain twin of Self::fj_delta_novel_u32_recorded: one factorized semi-naive delta step over a hash set, with no domain cap. Forbids the single key (u32::MAX, u32::MAX) (its packed key+1 overflows the empty sentinel) — fails closed if present.

Returns Ok(None) when the distinct-sized hash table (2×(|R| + distinct-candidate estimate), power of two) would exceed max_table_bytes, or when an insert overflows an under-sized table — both are clean route-declines so the caller falls back to the legacy path. max_table_bytes == 0 disables the budget guard (standalone spike/parity tests).
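
Two of the rules above, the forbidden key and the table-size budget, can be illustrated with plain integer arithmetic. Several details here are assumptions made for the sketch and are not confirmed by the documentation: the packing order (first id in the high 32 bits), 0 as the empty sentinel, rounding up to the next power of two, and 8-byte slots. Both function names are hypothetical.

```rust
/// Stored slot value = packed key + 1, so that 0 can mean "empty".
/// Only (u32::MAX, u32::MAX) overflows and is rejected.
fn pack_slot(a: u32, b: u32) -> Option<u64> {
    let key = (u64::from(a) << 32) | u64::from(b);
    key.checked_add(1)
}

/// Slot count = 2 x (|R| + candidate estimate), rounded up to a power of
/// two. `None` models a clean route-decline; a budget of 0 disables the guard.
fn table_slots(r_rows: u64, candidate_estimate: u64, max_table_bytes: u64) -> Option<u64> {
    let slots = r_rows
        .checked_add(candidate_estimate)?
        .checked_mul(2)?
        .checked_next_power_of_two()?;
    let bytes = slots.checked_mul(8)?; // assumed 8-byte slots
    if max_table_bytes != 0 && bytes > max_table_bytes {
        None
    } else {
        Some(slots)
    }
}
```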

impl CudaKernelProvider

pub fn groupby_agg( &self, input: &CudaBuffer, key_cols: &[usize], agg: AggOp, value_col: usize, ) -> Result<CudaBuffer>

Perform groupby aggregation

Assumes input is already sorted by key columns.

§Arguments

input - The input buffer
key_cols - Column indices for grouping
agg - Aggregation operation to perform
value_col - Column index for the value to aggregate

§Returns

A buffer with one row per group, containing key columns and aggregated value

§Errors

Returns XlogError::Kernel if kernel execution fails
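
Because the input is assumed sorted by key, a group ends exactly where the key changes. The CPU sketch below shows that boundary-based aggregation for one u32 key column and one u64 value column. `Agg` and `groupby_sorted_ref` are hypothetical stand-ins (the crate's own enum is `AggOp`), and wrapping on sum overflow is an assumption.

```rust
#[derive(Clone, Copy)]
enum Agg { Count, Sum, Min, Max }

/// One output row per run of equal keys; input must already be sorted.
fn groupby_sorted_ref(keys: &[u32], values: &[u64], agg: Agg) -> Vec<(u32, u64)> {
    let mut out: Vec<(u32, u64)> = Vec::new();
    for (&k, &v) in keys.iter().zip(values) {
        // Count contributes 1 per row; the others contribute the value.
        let x = match agg { Agg::Count => 1, _ => v };
        if let Some((last_k, acc)) = out.last_mut() {
            if *last_k == k {
                // Same key as the previous row: fold into the open group.
                *acc = match agg {
                    Agg::Count | Agg::Sum => acc.wrapping_add(x),
                    Agg::Min => (*acc).min(x),
                    Agg::Max => (*acc).max(x),
                };
                continue;
            }
        }
        // First row or key change: a group boundary.
        out.push((k, x));
    }
    out
}
```

Unsorted input would split one logical group into several output rows, which is why the sortedness precondition matters.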

pub fn groupby_multi_agg( &self, buffer: &CudaBuffer, key_cols: &[usize], aggs: &[(usize, AggOp)], ) -> Result<CudaBuffer>

Multi-aggregation groupby

Performs groupby with multiple aggregation operations at once. This is more efficient than running separate groupby operations because it only sorts and computes group boundaries once.

§Arguments

buffer - The input buffer
key_cols - Column indices for grouping (currently only single-column supported)
aggs - A slice of (value_col, AggOp) pairs specifying which aggregations to perform

§Returns

A buffer with one row per group, containing key columns followed by aggregated values in the same order as the aggs parameter

§Errors

Returns XlogError::Kernel if kernel execution fails

§Example

let result = provider.groupby_multi_agg(
    &buffer,
    &[0],  // group by column 0
    &[(1, AggOp::Sum), (1, AggOp::Count), (1, AggOp::Min)],
)?;
// result has columns: key, sum, count, min

pub fn groupby_multi_agg_recorded( &self, buffer: &CudaBuffer, key_cols: &[usize], aggs: &[(usize, AggOp)], launch_stream: StreamId, ) -> Result<CudaBuffer>

Strict-recorder variant of Self::groupby_multi_agg.

Sort + pack + boundary detect + scan + capture-num-groups + group-id derivation + per-aggregation kernels + key gather/unpack: every kernel runs on the caller-supplied launch_stream via launch_on_stream. Composition with existing recorded primitives:

sort_recorded does the typed multi-column sort and commits its own LaunchRecorder.
pack_keys_gpu_on_stream runs the fused pack+hash kernel on launch_stream and records its buffers directly via record_block_use.
multiblock_scan_u32_inplace_on_stream drives the boundary-position scan tail.
The groupby-specific chain has its own LaunchRecorder for the boundary mask, group ids, group_first indices, num_groups scalar, per-aggregation outputs, and key gather/unpack outputs.

Composition correctness: each recorder commits independently; the runtime’s record-all + wait-all last_use_events: Vec<CudaEvent> semantics chain the deallocate safety end-to-end across the four primitive commits.

§Scope (narrow)

U32 / Symbol key columns only (sort_recorded constraint).
Aggs: Count, Sum, Min, Max. LogSumExp is rejected with a structured error — its multi-kernel chain is outside this recorded provider surface.
Manager must be runtime-backed.

pub fn groupby_agg_recorded( &self, input: &CudaBuffer, key_cols: &[usize], agg: AggOp, value_col: usize, launch_stream: StreamId, ) -> Result<CudaBuffer>

Convenience single-aggregation entry, mirrors Self::groupby_agg. Forwards to Self::groupby_multi_agg_recorded.

impl CudaKernelProvider

pub fn build_selected_id_mask( &self, ids_buf: &CudaBuffer, candidate_count: usize, ) -> Result<CudaBuffer>

pub fn validate_selected_ids( &self, ids_buf: &CudaBuffer, candidate_count: usize, ) -> Result<()>

pub fn filter_buffer_by_candidate_flag( &self, input: &CudaBuffer, candidate_flags: &CudaBuffer, candidate_idx: usize, ) -> Result<CudaBuffer>

pub fn ilp_coo_fill_launch( &self, compacted_fact_indices: &TrackedCudaSlice<u32>, cidx: u32, count: u32, offset: u32, coo_fact: &mut TrackedCudaSlice<u32>, coo_cand: &mut TrackedCudaSlice<u32>, ) -> Result<()>

Launch ilp_coo_fill kernel: writes (compacted_fact_indices[i], cidx) pairs at coo_fact[offset..] and coo_cand[offset..].

pub fn ilp_credit_forward_f32_launch( &self, row_offsets: &TrackedCudaSlice<u32>, col_indices: &TrackedCudaSlice<u32>, cand_probs: &CudaColumn, is_positive: &TrackedCudaSlice<u8>, num_facts: u32, eps: f32, ) -> Result<(TrackedCudaSlice<f32>, TrackedCudaSlice<f32>)>

Launch ilp_credit_forward_f32: CSR credit gather + clamp + NLL loss. Returns (credit_out, loss_contrib) device slices of length num_facts.

pub fn ilp_credit_forward_f64_launch( &self, row_offsets: &TrackedCudaSlice<u32>, col_indices: &TrackedCudaSlice<u32>, cand_probs: &CudaColumn, is_positive: &TrackedCudaSlice<u8>, num_facts: u32, eps: f64, ) -> Result<(TrackedCudaSlice<f64>, TrackedCudaSlice<f64>)>

Launch ilp_credit_forward_f64: CSR credit gather + clamp + NLL loss. Returns (credit_out, loss_contrib) device slices of length num_facts.

pub fn ilp_credit_backward_f32_launch( &self, row_offsets: &TrackedCudaSlice<u32>, col_indices: &TrackedCudaSlice<u32>, credit_out: &TrackedCudaSlice<f32>, is_positive: &TrackedCudaSlice<u8>, num_facts: u32, num_cands: u32, ) -> Result<TrackedCudaSlice<f32>>

Launch ilp_credit_backward_f32: gradient scatter via CSR + atomicAdd. Returns d_cand_probs gradient of length num_cands (zeroed, then accumulated).

pub fn ilp_credit_backward_f64_launch( &self, row_offsets: &TrackedCudaSlice<u32>, col_indices: &TrackedCudaSlice<u32>, credit_out: &TrackedCudaSlice<f64>, is_positive: &TrackedCudaSlice<u8>, num_facts: u32, num_cands: u32, ) -> Result<TrackedCudaSlice<f64>>

Launch ilp_credit_backward_f64: gradient scatter via CSR + atomicAdd. Returns d_cand_probs gradient of length num_cands (zeroed, then accumulated).

pub fn ilp_reduce_sum_f32_launch( &self, input: &TrackedCudaSlice<f32>, n: u32, ) -> Result<TrackedCudaSlice<f32>>

GPU-side sum reduction (f32).

Sums n elements of input on device and returns a single-element device buffer containing the result. The kernel expects a zeroed output buffer; this function zeroes it before launching, so the caller does not need to.

pub fn ilp_reduce_sum_f64_launch( &self, input: &TrackedCudaSlice<f64>, n: u32, ) -> Result<TrackedCudaSlice<f64>>

GPU-side sum reduction (f64).

Sums n elements of input on device and returns a single-element device buffer containing the result. Requires sm_60+ for double atomicAdd (this project targets sm_75 baseline).

pub fn ilp_coo_fill_from_mask_launch( &self, mask: &TrackedCudaSlice<u8>, prefix_sum: &TrackedCudaSlice<u32>, fact_indices: &TrackedCudaSlice<u32>, offset_idx: u32, cand_value: u32, num_query: u32, d_offsets: &TrackedCudaSlice<u32>, coo_fact: &mut TrackedCudaSlice<u32>, coo_cand: &mut TrackedCudaSlice<u32>, ) -> Result<()>

Fill COO arrays from a device-side mask and prefix-sum.

For each set bit in mask, writes the corresponding fact_indices entry into coo_fact and cand_value into coo_cand at the position determined by d_offsets[offset_idx] + prefix_sum[tid].

Parameters:

offset_idx: index into d_offsets for the write base position
cand_value: actual candidate index to write into coo_cand

This keeps COO assembly fully on device, eliminating the mask D2H transfer.
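
The write rule `d_offsets[offset_idx] + prefix_sum[tid]` can be traced with a sequential CPU model in which each loop iteration plays the role of one GPU thread. `coo_fill_from_mask_ref` is a hypothetical name, and the sketch assumes `prefix_sum` is an exclusive scan of the mask and that any non-zero mask byte counts as set.

```rust
/// CPU model of the COO fill: one loop iteration per `tid`.
fn coo_fill_from_mask_ref(
    mask: &[u8],
    prefix_sum: &[u32],
    fact_indices: &[u32],
    offset_idx: usize,
    cand_value: u32,
    d_offsets: &[u32],
    coo_fact: &mut [u32],
    coo_cand: &mut [u32],
) {
    let base = d_offsets[offset_idx] as usize;
    for tid in 0..mask.len() {
        if mask[tid] != 0 {
            // Kept rows land contiguously, starting at the write base.
            let pos = base + prefix_sum[tid] as usize;
            coo_fact[pos] = fact_indices[tid];
            coo_cand[pos] = cand_value;
        }
    }
}
```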

pub fn ilp_csr_histogram_launch( &self, sorted_facts: &TrackedCudaSlice<u32>, nnz: u32, num_facts: u32, ) -> Result<TrackedCudaSlice<u32>>

Build a histogram of fact indices from sorted COO data.

For each entry in sorted_facts[0..nnz], atomically increments the corresponding bin in the output histogram. The result is a device-side count array of length num_facts, suitable for prefix-sum to produce CSR row_offsets.

The caller provides sorted fact indices; the histogram is zero-initialized internally.
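
The histogram-then-prefix-sum route to CSR row_offsets can be sketched on the host as follows. `csr_row_offsets_ref` is a hypothetical helper that performs both steps (the method above only returns the histogram), and emitting `num_facts + 1` offsets with a leading zero is the conventional CSR layout, assumed here.

```rust
/// CPU model: count entries per fact, then scan the counts into offsets.
/// Returns (histogram, row_offsets).
fn csr_row_offsets_ref(sorted_facts: &[u32], num_facts: usize) -> (Vec<u32>, Vec<u32>) {
    // Zero-initialized histogram, one bin per fact.
    let mut counts = vec![0u32; num_facts];
    for &f in sorted_facts {
        counts[f as usize] += 1;
    }
    // Prefix sum of the histogram gives each row's start offset.
    let mut row_offsets = Vec::with_capacity(num_facts + 1);
    let mut running = 0u32;
    row_offsets.push(running);
    for &c in &counts {
        running += c;
        row_offsets.push(running);
    }
    (counts, row_offsets)
}
```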

impl CudaKernelProvider

pub fn ilp_exact_score_topk( &self, candidate_buffers: &[&CudaBuffer], positives: &CudaBuffer, negatives: &CudaBuffer, k_per_topology: u32, ) -> Result<Vec<IlpExactTopkCandidate>>

Score on GPU, reduce per-topology top-K on GPU, and transfer only the compact selected rows back to host.

impl CudaKernelProvider

pub fn create_buffer_from_u32_columns( &self, columns: &[&[u32]], schema: Schema, ) -> Result<CudaBuffer>

Create a CudaBuffer from multiple u32 column slices

§Arguments

columns - Slice of column data slices (each column as &[u32])
schema - The schema for the buffer

§Returns

A new CudaBuffer containing all columns

§Errors

Returns XlogError::Kernel if upload fails or columns have mismatched lengths

Source

pub fn create_buffer_from_slices( &self, slices: &[&[u8]], schema: Schema, ) -> Result<CudaBuffer>

Create a buffer from multiple column slices (raw bytes)

This is a generic version that works with any column type by accepting raw byte slices. Each slice should contain the column data in little-endian format with the correct size for the column’s type.

§Arguments

slices - Slice of raw byte slices, one per column
schema - The schema for the buffer

§Returns

A new CudaBuffer containing all columns

§Errors

Returns XlogError::Kernel if:

Number of slices doesn’t match schema arity
Upload fails
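A caller holding typed host data needs to flatten each column into little-endian bytes before passing it as a raw slice. The helpers below are a minimal sketch using only the standard library; their names are hypothetical and they are not part of xlog_cuda.

```rust
/// Hypothetical helper: serialize a u32 column as little-endian bytes.
fn u32_column_to_le_bytes(column: &[u32]) -> Vec<u8> {
    column.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Hypothetical helper: serialize an f64 column as little-endian bytes.
fn f64_column_to_le_bytes(column: &[f64]) -> Vec<u8> {
    column.iter().flat_map(|v| v.to_le_bytes()).collect()
}
```

Each resulting byte vector has length `rows * size_of::<T>()`, which is the size the description above asks for per column.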

Source

pub fn to_arrow_device_record_batch( &self, buffer: CudaBuffer, ) -> Result<ArrowDeviceArrayOwned>

Export CudaBuffer to Arrow C Data Interface (device-resident).

This is a zero-copy export: column buffers remain on device, and the returned ArrowDeviceArray describes CUDA-resident memory.

The export requires that the device row count matches the host row_cap.

Source

pub fn to_arrow_record_batch(&self, buffer: &CudaBuffer) -> Result<RecordBatch>

Export CudaBuffer to Arrow RecordBatch

Downloads data from GPU and converts it to an Arrow RecordBatch for interoperability with Arrow-based tools like cuDF, Polars, or DuckDB.

§Arguments

buffer - The CudaBuffer to export

§Returns

An Arrow RecordBatch containing all columns from the buffer

§Errors

Returns XlogError::Kernel if:

Column download fails
RecordBatch creation fails

Source

pub fn from_arrow_record_batch( &self, record_batch: &RecordBatch, ) -> Result<CudaBuffer>

Import Arrow RecordBatch to CudaBuffer

Uploads Arrow data to GPU memory.

§Arguments

record_batch - The Arrow RecordBatch to import

§Returns

A new CudaBuffer with the data on GPU

§Errors

Returns an error if the Arrow type is not supported or the upload fails

Source

pub fn to_arrow_ipc_stream(&self, buffer: &CudaBuffer) -> Result<Vec<u8>>

Export a CudaBuffer to an Arrow IPC stream (RecordBatchStream) as bytes.

This is a convenience wrapper around to_arrow_record_batch that enables interoperability with tools like cuDF via standard Arrow IPC readers.

Note: This is not zero-copy; data is downloaded from GPU to host memory.

Source

pub fn from_arrow_ipc_stream(&self, ipc: &[u8]) -> Result<CudaBuffer>

Import a single-batch Arrow IPC stream (RecordBatchStream) into a CudaBuffer.

Note: This uploads Arrow data from host to GPU memory.

Source

pub fn write_arrow_ipc_stream_file<P: AsRef<Path>>( &self, buffer: &CudaBuffer, path: P, ) -> Result<()>

Write a CudaBuffer to a file as an Arrow IPC stream (RecordBatchStream).

Source

pub fn read_arrow_ipc_stream_file<P: AsRef<Path>>( &self, path: P, ) -> Result<CudaBuffer>

Read a CudaBuffer from a file containing an Arrow IPC stream (RecordBatchStream).

Source §

impl CudaKernelProvider

Source

pub fn memset_recorded( &self, dst: &mut TrackedCudaSlice<u8>, value: u8, launch_stream: StreamId, ) -> Result<()>

Async memset of value into every byte of dst on launch_stream, then record the use against the runtime.

Requires the provider’s GpuMemoryManager to be built via crate::GpuMemoryManager::with_runtime (so dst.runtime_block() is Some and the runtime is reachable). On a legacy/no-runtime manager, returns XlogError::Kernel.

§Errors

XlogError::Kernel("memset_recorded requires runtime-backed manager") if the manager has no runtime attached.
XlogError::Kernel from cuMemsetD8Async/stream-resolution failure.
XlogError::Kernel wrapping any ResourceError::StreamMisuse from the recorder’s commit (notably when the active resource is DirectCudaResource — the trait default that intentionally rejects record_block_use).

Source

pub fn memset_column_recorded( &self, dst: &mut CudaColumn, value: u8, launch_stream: StreamId, ) -> Result<()>

Column-level variant of Self::memset_recorded — exercises the LaunchRecorder::write_column path. Used by tests that prove CudaColumn::Owned records its runtime block automatically; strict mode rejects CudaColumn::Dlpack / CudaColumn::ArrowDevice at preflight (no CUDA work queued).

Source §

impl CudaKernelProvider

Source

pub fn sample_bernoulli_matrix( &self, probs: &[f32], num_samples: usize, seed: u64, force_mask: &CudaView<'_, u8>, forced_value: &CudaView<'_, u8>, ) -> Result<Vec<u8>>

Sample independent Bernoulli variables on the GPU.

Returns a row-major (sample, var) matrix as a flat Vec<u8> of length num_samples * probs.len(), where each entry is 0/1.

Source

pub fn sample_bernoulli_matrix_device( &self, probs: &[f32], num_samples: usize, seed: u64, force_mask: &CudaView<'_, u8>, forced_value: &CudaView<'_, u8>, ) -> Result<TrackedCudaSlice<u8>>

Sample Bernoulli matrix on GPU and return device-resident output.

Returns a row-major [num_samples][num_vars] matrix of 0/1 bytes on device.

Source §

impl CudaKernelProvider

Source

pub fn hash_join( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], ) -> Result<CudaBuffer>

Perform a hash join between two buffers

Uses a two-phase hash join:

Build phase: Insert keys from right into a hash table
Probe phase: Match keys from left against the hash table

§Arguments

left - The left (probe) buffer
right - The right (build) buffer
left_keys - Column indices for join keys in left buffer
right_keys - Column indices for join keys in right buffer

§Returns

A buffer containing the joined rows with columns from both inputs

§Errors

Returns XlogError::Kernel if kernel execution fails
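The build/probe structure can be sketched on the host with a `HashMap`. This is not the GPU implementation: `hash_join_reference` is a hypothetical name, rows are modeled as `Vec<u32>`, and the output layout (left columns followed by right columns) is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical CPU model of a two-phase hash join over u32 rows.
fn hash_join_reference(
    left: &[Vec<u32>],
    right: &[Vec<u32>],
    left_keys: &[usize],
    right_keys: &[usize],
) -> Vec<Vec<u32>> {
    // Build phase: index right-side rows by their key tuple.
    let mut table: HashMap<Vec<u32>, Vec<usize>> = HashMap::new();
    for (row_idx, row) in right.iter().enumerate() {
        let key: Vec<u32> = right_keys.iter().map(|&c| row[c]).collect();
        table.entry(key).or_default().push(row_idx);
    }
    // Probe phase: look up each left-side key tuple and emit one
    // output row per match (assumed layout: left columns, then right).
    let mut out = Vec::new();
    for left_row in left {
        let key: Vec<u32> = left_keys.iter().map(|&c| left_row[c]).collect();
        if let Some(matches) = table.get(&key) {
            for &right_idx in matches {
                let mut joined = left_row.clone();
                joined.extend_from_slice(&right[right_idx]);
                out.push(joined);
            }
        }
    }
    out
}
```

The host model emits rows in probe order; a parallel GPU probe need not preserve that order.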

Source

pub fn hash_join_with_limit( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], max_output: Option<usize>, ) -> Result<CudaBuffer>

Hash join with configurable maximum output size

Uses a two-phase hash join:

Build phase: Insert keys from right into a hash table
Probe phase: Match keys from left against the hash table

§Arguments

left - The left (probe) buffer
right - The right (build) buffer
left_keys - Column indices for join keys in left buffer
right_keys - Column indices for join keys in right buffer
max_output - Maximum number of output rows (None defaults to DEFAULT_JOIN_MAX_OUTPUT)

§Returns

A buffer containing the joined rows with columns from both inputs

§Errors

Returns XlogError::Kernel if kernel execution fails

Source

pub fn dedup( &self, input: &CudaBuffer, key_cols: &[usize], ) -> Result<CudaBuffer>

Remove duplicate rows based on key columns

Sorts the input by the provided key columns, then removes adjacent duplicates.

§Arguments

input - The input buffer
key_cols - Column indices to use for duplicate detection

§Returns

A buffer containing one row per duplicate-equivalence class

§Errors

Returns XlogError::Kernel if kernel execution fails
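The sort-then-compact approach can be illustrated with a host-side sketch. `dedup_reference` is a hypothetical name; which row survives within each group of equal keys (here the first in stable sorted order) is an assumption and may differ on the GPU.

```rust
/// Extract the key tuple of a row (hypothetical helper).
fn key_of(row: &[u32], key_cols: &[usize]) -> Vec<u32> {
    key_cols.iter().map(|&c| row[c]).collect()
}

/// Hypothetical CPU model: sort by key columns, then drop rows whose
/// key equals that of the previous row.
fn dedup_reference(mut rows: Vec<Vec<u32>>, key_cols: &[usize]) -> Vec<Vec<u32>> {
    // Stable sort so equal-key rows keep their input order.
    rows.sort_by_key(|row| key_of(row, key_cols));
    // After sorting, duplicates are adjacent; keep the first of each run.
    rows.dedup_by_key(|row| key_of(row, key_cols));
    rows
}
```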

Source

pub fn dedup_sorted( &self, input: &CudaBuffer, key_cols: &[usize], ) -> Result<CudaBuffer>

Remove duplicate rows from a buffer that is already sorted by key columns

This is an optimized version of dedup that skips the sorting step. The caller must ensure the input is already sorted by the key columns.

§Arguments

input - The input buffer (must be sorted by key columns)
key_cols - Column indices to use for duplicate detection

§Returns

A buffer containing one row per duplicate-equivalence class

Source

pub fn union(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>

Compute union of two buffers (GPU-native, deduped)

§Arguments

a - First buffer
b - Second buffer

§Returns

A buffer containing the deduplicated union of both inputs

§Errors

Returns XlogError::Kernel if schemas don’t match or operation fails

Source

pub fn diff(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>

Compute set difference (a - b)

Returns rows from a that don’t exist in b. Uses hash-based approach: build hash table from b, probe with a.

§Arguments

a - Source buffer
b - Buffer to subtract

§Returns

A buffer containing rows in a but not in b

§Errors

Returns XlogError::Kernel if schemas don’t match or operation fails

Source

pub fn union_gpu(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>

GPU-native union (no host roundtrip)

Computes the union of two buffers entirely on the GPU using:

Concatenate arrays using concat_u32 kernel
Sort the concatenated result
Deduplicate using existing dedup()

§Arguments

a - First buffer
b - Second buffer

§Returns

A buffer containing deduplicated union of both inputs, sorted

§Errors

Returns XlogError::Kernel if schemas don’t match or operation fails

Source

pub fn diff_gpu(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>

Set difference (a - b) with deterministic set semantics.

Single-column u32 buffers use a GPU sorted-diff fast path. General multi-column buffers use a byte-exact host set fallback after GPU dedup; the hash anti-join implementation is intentionally not used for Datalog delta subtraction because its unordered parallel probe path can leak nondeterminism into recursive fixed-point convergence.

§Arguments

a - Source buffer
b - Buffer to subtract

§Returns

A buffer containing elements in a but not in b, sorted and deduped

§Errors

Returns XlogError::Kernel if schemas don’t match or operation fails

Source

pub fn dedup_full_row(&self, input: &CudaBuffer) -> Result<CudaBuffer>

Public deterministic full-row dedup with totalOrder-bytewise equality semantics for all arities (including single-column float buffers).

Differs from dedup(input, &[0]) for single-column float columns: the legacy single-column GPU kernel collapses +0/-0 (IEEE == says they’re equal) and treats two NaNs with different payloads as distinct. dedup_full_row instead uses totalOrder-bijective bytewise equality, so:

+0.0 and -0.0 are distinct.
Two NaNs collapse iff bit-identical.

Routing today:

dedup(input, &all_cols) with arity > 1 routes to the full-row pipeline (same semantics as this method).
dedup(input, &[0]) with arity == 1 keeps the legacy single-column GPU kernel — IEEE == for floats, so +0/-0 collapse and NaNs collapse iff bit-identical-or-IEEE-eq.
dedup_full_row(input) always uses bytewise totalOrder equality for all arities, so single-column float callers must use this method explicitly to get the totalOrder semantics.

Multi-column callers that pass the all-columns key vector to dedup already route through the same deterministic full-row pipeline; single-column callers that want totalOrder semantics must call dedup_full_row directly.
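The difference between IEEE == and bit-pattern equality can be seen on the host with plain f32 values. The sketch below only contrasts the two comparison rules; the function names are hypothetical and it does not model either GPU kernel.

```rust
/// IEEE comparison: +0.0 == -0.0, and NaN is never equal to anything.
fn ieee_equal(a: f32, b: f32) -> bool {
    a == b
}

/// Bit-pattern comparison: values are equal only if every bit matches.
fn bytewise_equal(a: f32, b: f32) -> bool {
    a.to_bits() == b.to_bits()
}
```

Under bit-pattern equality, +0.0 and -0.0 differ in the sign bit and two NaNs match only when their payloads are identical, which corresponds to the two bullet points listed for dedup_full_row.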

Source

pub fn diff_full_row( &self, a: &CudaBuffer, b: &CudaBuffer, ) -> Result<CudaBuffer>

Public deterministic full-row set difference. Equivalent to diff_gpu(a, b) for the multi-column path but named explicitly so callers cannot mistake it for the older first-column-key diff. a and b must have type-compatible schemas.

Source

pub fn sort(&self, input: &CudaBuffer, key_cols: &[usize]) -> Result<CudaBuffer>

Sort buffer by key columns.

Computes a stable row permutation on the GPU (supports multi-column and all scalar types), then applies the permutation on the GPU to reorder all columns.

§Arguments

input - The input buffer to sort
key_cols - Column indices to use for sorting (lexicographic, first key is most significant)

§Returns

A new buffer with rows sorted by the key columns

§Errors

Returns XlogError::Kernel if:

key_cols is empty or out of bounds
Input has more than u32::MAX rows
Download/upload or kernel execution fails
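The permutation-then-gather design can be sketched on the host as follows. Both function names are hypothetical, columns are restricted to u32 for brevity, and the standard library's stable sort stands in for the GPU sort.

```rust
/// Hypothetical CPU model: compute a stable row permutation ordered
/// lexicographically by the key columns (first key most significant).
fn sort_permutation_reference(columns: &[Vec<u32>], key_cols: &[usize]) -> Vec<u32> {
    let num_rows = columns.first().map_or(0, |c| c.len());
    let mut perm: Vec<u32> = (0..num_rows as u32).collect();
    // sort_by_key is stable, so ties keep their original row order.
    perm.sort_by_key(|&i| {
        key_cols
            .iter()
            .map(|&c| columns[c][i as usize])
            .collect::<Vec<u32>>()
    });
    perm
}

/// Apply a permutation to one column: out[i] = column[perm[i]].
fn apply_permutation(column: &[u32], perm: &[u32]) -> Vec<u32> {
    perm.iter().map(|&i| column[i as usize]).collect()
}
```

Computing the permutation once and then gathering every column with it keeps all columns row-aligned without sorting each column separately.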

Source

pub fn init_indices( &self, indices: &mut TrackedCudaSlice<u32>, n: u32, ) -> Result<()>

Initialize indices array with 0..n-1 on device.

Source

pub fn gather_u32_by_indices( &self, input: &TrackedCudaSlice<u32>, indices: &TrackedCudaSlice<u32>, output: &mut TrackedCudaSlice<u32>, n: u32, ) -> Result<()>

Gather u32 keys by permutation: out[i] = input[indices[i]].

Source

pub fn gather_u8_by_indices( &self, input: &TrackedCudaSlice<u8>, indices: &TrackedCudaSlice<u32>, output: &mut TrackedCudaSlice<u8>, n: u32, ) -> Result<()>

Gather u8 values by permutation: out[i] = input[indices[i]].

Source

pub fn gather_u64_lo_by_indices( &self, input: &TrackedCudaSlice<u64>, indices: &TrackedCudaSlice<u32>, output: &mut TrackedCudaSlice<u32>, n: u32, ) -> Result<()>

Gather low 32 bits of u64 values by permutation.

Source

pub fn gather_u64_hi_by_indices( &self, input: &TrackedCudaSlice<u64>, indices: &TrackedCudaSlice<u32>, output: &mut TrackedCudaSlice<u32>, n: u32, ) -> Result<()>

Gather high 32 bits of u64 values by permutation.
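A host-side sketch of the two u64 gathers is shown below, with hypothetical function names. Splitting a 64-bit key into two 32-bit halves is presumably what lets u32-only primitives such as radix_sort_u32_pairs process u64 keys, but that motivation is an assumption.

```rust
/// Hypothetical CPU model: out[i] = low 32 bits of input[indices[i]].
fn gather_u64_lo_reference(input: &[u64], indices: &[u32]) -> Vec<u32> {
    indices.iter().map(|&i| input[i as usize] as u32).collect()
}

/// Hypothetical CPU model: out[i] = high 32 bits of input[indices[i]].
fn gather_u64_hi_reference(input: &[u64], indices: &[u32]) -> Vec<u32> {
    indices
        .iter()
        .map(|&i| (input[i as usize] >> 32) as u32)
        .collect()
}
```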

Source

pub fn radix_sort_u32_pairs( &self, keys: &mut TrackedCudaSlice<u32>, values: &mut TrackedCudaSlice<u32>, n: u32, scratch: &mut RadixSortScratch, ) -> Result<()>

Stable radix sort of (key, value) u32 pairs using reusable scratch.

Source

pub fn scan_u8_mask_device( &self, mask: &TrackedCudaSlice<u8>, n: u32, ) -> Result<TrackedCudaSlice<u32>>

Compute exclusive prefix sum of u8 mask on device (no host reads).

Source

pub fn count_mask_device( &self, mask: &TrackedCudaSlice<u8>, n: u32, ) -> Result<TrackedCudaSlice<u32>>

Count non-zero entries in a u8 mask on device (no host reads).

Returns a 1-element device buffer containing the count.
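The relationship between the exclusive scan and the count can be sketched on the host. The function names are hypothetical, and treating every non-zero mask byte as 1 in the scan is an assumption (it is exact when the mask holds only 0/1 bytes).

```rust
/// Hypothetical CPU model: exclusive prefix sum over a u8 mask.
/// out[i] = number of non-zero entries strictly before position i.
fn exclusive_scan_mask_reference(mask: &[u8]) -> Vec<u32> {
    let mut running = 0u32;
    mask.iter()
        .map(|&m| {
            let before = running;
            running += u32::from(m != 0);
            before
        })
        .collect()
}

/// Hypothetical CPU model: number of non-zero entries in the mask.
fn count_mask_reference(mask: &[u8]) -> u32 {
    mask.iter().filter(|&&m| m != 0).count() as u32
}
```

For a set entry, the scan value is its destination index in a compacted output, and the count is the compacted length.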

Source

pub fn count_mask_into_slot( &self, mask: &TrackedCudaSlice<u8>, n: u32, task_counts: &mut TrackedCudaSlice<u32>, slot_idx: usize, ) -> Result<()>

Count non-zero entries in mask[0..n] and write the result into task_counts[slot_idx] via the existing count_mask kernel.

The caller MUST ensure task_counts[slot_idx] is zero before calling (e.g. by zeroing the whole array once).

This avoids allocating a fresh 1-element device buffer per call, which matters when iterating over hundreds of tasks.

Source

pub fn hash_join_v2( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], join_type: JoinType, ) -> Result<CudaBuffer>

Multi-column hash join with support for different join types.

§Arguments

left - The left (probe) buffer
right - The right (build) buffer
left_keys - Column indices for join keys in left buffer
right_keys - Column indices for join keys in right buffer
join_type - Type of join to perform (Inner, Semi, Anti, LeftOuter)

§Errors

Returns XlogError::Kernel if kernel execution fails or parameters are invalid

Source

pub fn hash_join_v2_with_limit( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], join_type: JoinType, max_output: Option<usize>, ) -> Result<CudaBuffer>

V2 hash join with configurable maximum output size

Multi-column join with typed key comparison, supporting different join types. Uses composite hashing (FNV-1a) for multi-column keys with full key verification.

§Arguments

left - The left (probe) buffer
right - The right (build) buffer
left_keys - Column indices for join keys in left buffer
right_keys - Column indices for join keys in right buffer
join_type - Type of join to perform (Inner, Semi, Anti, LeftOuter)
max_output - Optional maximum number of output rows (None = unlimited, subject to memory budget)

§Errors

Returns XlogError::Kernel if kernel execution fails or parameters are invalid
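For readers unfamiliar with FNV-1a, the sketch below shows the standard 64-bit variant applied to a packed multi-column key. The hash width, the little-endian packing of key values, and both function names are assumptions made for illustration; the kernel's actual packing is not documented here.

```rust
// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a: XOR each byte into the hash, then multiply by the prime.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical composite hash: pack the key columns of one row as
/// little-endian bytes (assumed layout) and hash the packed bytes.
fn composite_key_hash(key_values: &[u32]) -> u64 {
    let packed: Vec<u8> = key_values.iter().flat_map(|v| v.to_le_bytes()).collect();
    fnv1a_64(&packed)
}
```

Since distinct keys can share a hash, a hash match alone is not sufficient, which is why the description above mentions full key verification.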

Source

pub fn nested_loop_join_v2_inner_u32_1key( &self, left: &CudaBuffer, right: &CudaBuffer, left_key: usize, right_key: usize, ) -> Result<CudaBuffer>

Nested-loop inner join (emit-pairs design).

Drop-in compatible with hash_join_v2(_, _, &[left_key], &[right_key], JoinType::Inner): same input types, same output schema (combine_schemas(left, right)), same row set. Caller (the executor’s dispatch site) is responsible for choosing between hash_join_v2 and this fn based on the eligibility predicate + threshold check; this fn validates the same contract fail-closed and returns Err if a caller violates it.

§Eligibility (validated inside; Err on violation)

left.arity() > left_key && right.arity() > right_key.
Left and right key columns share the same ScalarType, and that shared type is U32 or Symbol (Symbol is u32 at the byte level — same kernel applies).
Each key column’s allocation is at least num_rows * 4 bytes (preflight lower-bound validation; mirrors the crates/xlog-cuda/src/provider/ilp.rs:18 codebase idiom col.num_bytes() < required_bytes). CudaColumn::num_bytes() reports the allocation size, which can exceed num_rows * 4 when the buffer has spare capacity (row_cap > num_rows); strict-equality validation would false-positive-reject normal over-allocated buffers reaching this path through Executor::execute_node.
num_left * num_right <= NESTED_LOOP_TOTAL_THRESHOLD (computed via checked_mul; release-mode wrapping multiply is forbidden).
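The overflow-safe threshold check in the last condition can be sketched as below. The function name is hypothetical and the threshold is passed as a parameter because the value of NESTED_LOOP_TOTAL_THRESHOLD is not stated here.

```rust
/// Hypothetical sketch: validate num_left * num_right against a
/// threshold without relying on wrapping multiplication.
fn checked_pair_count(num_left: u64, num_right: u64, threshold: u64) -> Result<u64, String> {
    match num_left.checked_mul(num_right) {
        Some(total) if total <= threshold => Ok(total),
        Some(total) => Err(format!(
            "pair count {} exceeds threshold {}",
            total, threshold
        )),
        // A wrapping multiply could produce a small value here and
        // wrongly pass the threshold check.
        None => Err("num_left * num_right overflows u64".to_string()),
    }
}
```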

§Implementation outline

Read logical row counts via device_row_count (NOT row_cap).
Empty-input fast path: if either side is empty, return create_empty_buffer(combine_schemas(...)) — mirrors hash_join_inner_v2’s pattern at crates/xlog-cuda/src/provider/relational.rs:3165-3170.
Validate eligibility (above).
Allocate two u32 index arrays of length num_left * num_right (bounded at 32 MB total under the threshold).
Launch nested_loop_join_inner_u32_1key_pairs with &CudaColumn key pointers (variant-agnostic).
D2H the output count.
Materialize via gather_buffer_by_indices for both sides + concatenate columns.

Source

pub fn is_sorted_ascending_u32( &self, buf: &CudaBuffer, key_col: usize, ) -> Result<bool>

Sort-merge sortedness-detection wrapper. Returns Ok(true) iff the column at key_col of buf is sorted ascending (keys[i] <= keys[i+1] for all i in [0, num_rows-1)), Ok(false) if a violation is detected, Err(_) on kernel-launch / D2H failure.

Empty / single-row fast path: n < 2 returns Ok(true) BEFORE allocation or kernel launch. The detection kernel’s grid size (n + 255) / 256 evaluates to 0 for n == 0, which is not a valid launch configuration, and single-row sequences are trivially sorted. This is the load-bearing invariant the empty-input sortedness checks verify.
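The predicate and its fast path can be expressed on the host in a few lines. `is_sorted_ascending_reference` is a hypothetical name and the sketch checks a plain slice rather than a device column.

```rust
/// Hypothetical CPU model of the sortedness check:
/// true iff keys[i] <= keys[i + 1] for every adjacent pair.
fn is_sorted_ascending_reference(keys: &[u32]) -> bool {
    // Fast path: empty and single-element inputs are trivially sorted,
    // so no further work (on the GPU: no allocation or launch) is needed.
    if keys.len() < 2 {
        return true;
    }
    keys.windows(2).all(|pair| pair[0] <= pair[1])
}
```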

Validation:

Key column index within arity bounds.
Key column type is U32 or Symbol (byte-identical at the kernel level).
Key column allocation >= num_rows * 4 bytes (mirrors the nested-loop byte-length lower-bound idiom).

Caller surface: this fn has no executor-dispatch caller after benchmark-backed unwiring. Its only callers are operator-level tests and the production sort-merge benchmark (sort-merge-with-detection timing). The provider returns the honest Result<bool> — the kernel can fail (allocation, launch, D2H), and Err(_) is preserved so callers can log or surface it at their abstraction level. There is no fail-closed dispatch contract anymore. Earlier fail-closed callers used matches!(_, Ok(true)); after the dispatch site was unwired, any later caller must decide its own Err-handling policy.

Source

pub fn sort_merge_join_v2_inner_u32_1key( &self, left: &CudaBuffer, right: &CudaBuffer, left_key: usize, right_key: usize, ) -> Result<CudaBuffer>

Sort-merge inner join (caller-asserted pre-sorted inputs). Drop-in compatible with hash_join_v2(_, _, &[left_key], &[right_key], JoinType::Inner): same input types, same output schema (combine_schemas(left, right)), same row set.

Caller surface: this fn has no executor-dispatch caller after benchmark-backed unwiring. Production benchmark evidence rejected default executor precedence for sort-merge at execute_join; this fn remains graduated operator work for direct provider callers and tests. Current callers: operator-level provider parity tests in crates/xlog-integration/tests/test_w43_sort_merge_dispatch.rs and the production sort-merge benchmark at crates/xlog-integration/benches/sort_merge_production_bench.rs (sort-merge-with-detection Path 1 timing).

Caller contract: both inputs are pre-sorted ascending by their respective key column. The kernel does NOT detect or enforce sortedness; callers may pre-check via is_sorted_ascending_u32. On unsorted inputs the row-set output is undefined; the dispatch-site fallback path no longer exists.

§Eligibility (validated inside; Err on violation)

left.arity() > left_key && right.arity() > right_key.
Left and right key columns share the same ScalarType, and that shared type is U32 or Symbol.
Each key column’s allocation is at least num_rows * 4 bytes (lower-bound check, mirrors the nested-loop byte-length guard).
num_left * num_right <= NESTED_LOOP_TOTAL_THRESHOLD (shared with the nested-loop operator; computed via checked_mul; release-mode wrapping multiply is forbidden).

§Implementation outline

Mirrors nested_loop_join_v2_inner_u32_1key implementation idioms: empty fast path with no ?, byte-length lower-bound < check, checked_mul for threshold, as u64 for row_cap, and variant-agnostic &CudaColumn launch.

Read logical row counts via device_row_count (NOT row_cap).
Empty-input fast path: if either side is empty, return create_empty_buffer(combine_schemas(...)) — mirrors hash_join_inner_v2 at relational.rs:3165-3170 AND nested_loop_join_v2_inner_u32_1key’s identical pattern.
Validate eligibility (above).
Allocate two u32 index arrays of length num_left * num_right (bounded at 32 MB total under the shared threshold).
Launch sort_merge_join_inner_u32_1key_pairs with &CudaColumn key pointers (variant-agnostic).
D2H the output count.
Materialize via gather_buffer_by_indices for both sides + concatenate columns.
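The pair-emitting merge over two pre-sorted key columns can be sketched sequentially on the host. `sort_merge_pairs_reference` is a hypothetical name; the GPU kernel is parallel and may emit pairs in a different order.

```rust
use std::cmp::Ordering;

/// Hypothetical CPU model: emit (left_idx, right_idx) for every pair
/// of rows with equal keys. Both inputs must be sorted ascending.
fn sort_merge_pairs_reference(left_keys: &[u32], right_keys: &[u32]) -> Vec<(u32, u32)> {
    let (mut i, mut j) = (0usize, 0usize);
    let mut pairs = Vec::new();
    while i < left_keys.len() && j < right_keys.len() {
        match left_keys[i].cmp(&right_keys[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                let key = left_keys[i];
                // Find the end of the equal-key run on each side.
                let i_end = i + left_keys[i..].iter().take_while(|&&k| k == key).count();
                let j_end = j + right_keys[j..].iter().take_while(|&&k| k == key).count();
                // Emit the cross product of the two runs.
                for li in i..i_end {
                    for rj in j..j_end {
                        pairs.push((li as u32, rj as u32));
                    }
                }
                i = i_end;
                j = j_end;
            }
        }
    }
    pairs
}
```

The merge only advances forward, so on unsorted input it silently skips matches, which illustrates why the caller contract requires pre-sorted inputs.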

Source

pub fn sort_merge_join_v2_inner_u32_1key_bounded( &self, left: &CudaBuffer, right: &CudaBuffer, left_key: usize, right_key: usize, output_capacity: usize, ) -> Result<CudaBuffer>

Sorted-chain variant of Self::sort_merge_join_v2_inner_u32_1key.

The sort-merge operator is product-thresholded because it allocates |left| * |right| candidate pairs. Chain routing uses this bounded variant only for sorted large inputs where the expected fanout is one-to-one; capacity is caller supplied and the kernel’s logical output counter is checked after launch. If duplicates make the true output exceed output_capacity, this returns an error so the caller can fail closed to the hash fallback.

Source

pub fn build_join_index_v2( &self, right: &CudaBuffer, right_keys: &[usize], ) -> Result<JoinIndexV2>

Build a cached join index for the right/build side of v2 hash join.

Source

pub fn build_join_index_v2_background( &self, right: &CudaBuffer, right_keys: &[usize], ) -> Result<JoinIndexV2>

Build a cached join index for background persistent-index mode.

When recorded hash joins are enabled and the provider has a runtime-backed manager, the build is enqueued on the provider’s recorded operation stream and dependency-recorded like the indexed join consumer path. Otherwise this falls back to the legacy synchronous builder.

Source

pub fn build_join_index_v2_recorded( &self, right: &CudaBuffer, right_keys: &[usize], launch_stream: StreamId, ) -> Result<JoinIndexV2>

Recorded-stream join-index builder used by persistent background builds.

The build side is packed and bucketized on launch_stream; the returned JoinIndexV2 carries runtime-tracked buffers whose writes were committed through the launch recorder / stream dependency machinery.

Source

pub fn hash_join_v2_with_index( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], join_type: JoinType, index: &JoinIndexV2, max_output: Option<usize>, ) -> Result<CudaBuffer>

Hash join using a cached build-side join index.

The index must have been built for the same right buffer and right_keys.

Source

pub fn build_hash_table_u64( &self, hashes: &TrackedCudaSlice<u64>, num_rows: u32, ) -> Result<HashTableU64>

Build a bucketed hash table from a u64 hash array.

Source

pub fn membership_mask_device( &self, probe: &CudaBuffer, build: &CudaBuffer, probe_keys: &[usize], build_keys: &[usize], ) -> Result<TrackedCudaSlice<u8>>

Compute a per-row membership mask on device: for each row in probe, check whether a matching row exists in build (by the specified key columns). Returns a TrackedCudaSlice<u8> of length = probe row count that stays GPU-resident (no D2H transfer).

Source

pub fn membership_mask( &self, probe: &CudaBuffer, build: &CudaBuffer, probe_keys: &[usize], build_keys: &[usize], ) -> Result<Vec<bool>>

Compute a per-row membership mask: for each row in probe, check whether a matching row exists in build (by the specified key columns). Returns a Vec<bool> of length = probe row count. This downloads only num_probe bytes (the mask), NOT column data.
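The mask semantics can be illustrated with a host-side `HashSet`. `membership_mask_reference` is a hypothetical name and rows are modeled as `Vec<u32>`; the device implementation is not shown here.

```rust
use std::collections::HashSet;

/// Hypothetical CPU model: mask[i] is true iff some build row has the
/// same key tuple as probe row i.
fn membership_mask_reference(
    probe: &[Vec<u32>],
    build: &[Vec<u32>],
    probe_keys: &[usize],
    build_keys: &[usize],
) -> Vec<bool> {
    // Collect the distinct key tuples on the build side.
    let build_set: HashSet<Vec<u32>> = build
        .iter()
        .map(|row| build_keys.iter().map(|&c| row[c]).collect())
        .collect();
    // One output entry per probe row, in probe order.
    probe
        .iter()
        .map(|row| {
            let key: Vec<u32> = probe_keys.iter().map(|&c| row[c]).collect();
            build_set.contains(&key)
        })
        .collect()
}
```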

Source

pub fn clone_buffer(&self, buffer: &CudaBuffer) -> Result<CudaBuffer>

Clone a buffer (deep copy) on-device.

This is primarily used when a caller needs owned buffer state for a separate runtime object while preserving the original relation store.

Source

pub fn extract_column( &self, buffer: &CudaBuffer, col_idx: usize, ) -> Result<CudaBuffer>

Extract a single column from a buffer as a new single-column buffer

§Arguments

buffer - The source buffer
col_idx - The column index to extract

§Returns

A new single-column CudaBuffer containing just the specified column

Source

pub fn extract_active_rule_indices( &self, mask_hard: &CudaBuffer, mask_soft: &CudaBuffer, n: usize, max_active: usize, ) -> Result<Vec<(u32, u32, u32)>>

Extract active (i,j,k) rule indices from a flattened N×N×N mask. Returns up to max_active entries sorted by soft-mask priority.

Source

pub fn sort_recorded( &self, input: &CudaBuffer, key_cols: &[usize], launch_stream: StreamId, ) -> Result<CudaBuffer>

Strict-recorder variant of Self::sort, narrowed to u32 / Symbol key columns. The whole sort chain (init → LSD radix passes → multi-column gather) runs on the caller-supplied launch_stream. Every input column and the input row-count buffer are recorded as reads before preflight. Every fresh runtime-backed allocation (scratch + output columns + output d_num_rows) is recorded via write BEFORE preflight and enqueue; the snapshot drops the borrow, so kernel &mut borrows after preflight remain valid.

§Errors

Manager not runtime-backed.
launch_stream does not resolve.
Empty key_cols or out-of-bounds index.
Any key column type other than U32 / Symbol (multi-type recorded sort is outside this API surface).
Preflight / kernel / commit failures.

Source

pub fn dedup_full_row_recorded( &self, input: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>

Strict-recorder variant of Self::dedup_full_row — narrow to U32 / Symbol / U64 columns.

Composes Self::sort_recorded (typed multi-column sort) → on-stream mark_unique_full_row_bytewise → Self::compact_buffer_by_device_mask_counted_recorded (gather kept rows). All three primitives commit independently; the runtime’s record-all + wait-all last_use_events: Vec<CudaEvent> semantics chain the deallocate safety end-to-end.

Source

pub fn hash_join_inner_v2_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>

Strict-recorder variant of hash_join_inner_v2. JoinType::Inner only. Same count-then-materialize algorithm as the legacy variant, but every kernel runs on the caller-supplied launch_stream and host scalar reads of the join output count are explicitly ordered against the stream.

Source

pub fn hash_join_inner_v2_count_scan_materialize_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>

Strict-recorder, deterministic-ordering Inner hash join using the deterministic binary-join path.

Algorithm: count → exclusive scan → device-resident total → host scalar read → materialize with per-probe-row offsets. Each probe row writes its local-th match to output[per_probe_offsets[tid] + local] directly — no global atomicAdd(output_count) on the materialize pass, so the output ordering is a deterministic function of (probe-row index, per-row match discovery order). Compare to Self::hash_join_inner_v2_recorded which uses the legacy count-then-atomic-materialize chain (correct but with atomic-induced order non-determinism across threads/blocks).

Sourced from the archived archive/gpu-resident-binary-join-prototype-* branches — three new kernels migrated: hash_join_probe_v2_count_per_row, hash_join_probe_v2_materialize, hash_join_total_from_scan. LeftOuter / Semi / Anti / indexed variants from the prototype are intentionally not migrated here.

Reuses the recorded helpers pack_keys_gpu_on_stream, build_hash_table_v2_on_stream, multiblock_scan_u32_inplace_on_stream, and gather_buffer_by_indices_on_stream. Inherits the compact / pack fixes via composition.
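The count → scan → materialize idea can be sketched on the host. In this hypothetical model (`count_scan_materialize_reference` is not part of xlog_cuda) a linear scan of the build keys replaces the hash table, and each outer loop iteration stands in for one probe thread `tid`.

```rust
/// Hypothetical CPU model of count -> exclusive scan -> materialize.
/// Returns (probe_idx, build_idx) pairs in a deterministic order.
fn count_scan_materialize_reference(probe_keys: &[u32], build_keys: &[u32]) -> Vec<(u32, u32)> {
    // Count: number of matches per probe row.
    let counts: Vec<u32> = probe_keys
        .iter()
        .map(|&p| build_keys.iter().filter(|&&b| b == p).count() as u32)
        .collect();
    // Exclusive scan: first output slot of each probe row, plus total.
    let mut offsets = Vec::with_capacity(counts.len());
    let mut total = 0u32;
    for &c in &counts {
        offsets.push(total);
        total += c;
    }
    // Materialize: each probe row writes only into its own slot range,
    // so no shared output counter is needed.
    let mut output = vec![(0u32, 0u32); total as usize];
    for (tid, &p) in probe_keys.iter().enumerate() {
        let mut local = 0u32;
        for (build_idx, &b) in build_keys.iter().enumerate() {
            if b == p {
                output[(offsets[tid] + local) as usize] = (tid as u32, build_idx as u32);
                local += 1;
            }
        }
    }
    output
}
```

Because each slot is determined by the probe-row index and the row's local match order, the result does not depend on the order in which probe rows are processed.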

Source

pub fn hash_join_left_outer_v2_count_scan_materialize_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>

Non-indexed LeftOuter CSM using the deterministic binary-join path.

Deterministic count → scan → materialize chain producing MATCHED (left_idx, right_idx) pairs first (Inner CSM machinery), then a per-probe-row unmatched mask (hash_join_csm_unmatched_mask) compacted via the recorded compact tail to produce unmatched_left. The final result is inner_left | unmatched_left per left column and inner_right | zeros per right column — matching the legacy hash_join_left_outer_v2_recorded row-ordering invariant downstream consumers depend on.

This path does not adopt the archived prototype’s hash_join_left_outer_count_per_row / hash_join_left_outer_materialize design — those kernels interleave matched and null-sentinel rows by probe-row index, which would change the legacy LeftOuter ordering downstream consumers depend on.

§Errors

Manager not runtime-backed.
launch_stream does not resolve.
left_keys/right_keys empty, mismatched length, or > 4 (pack_keys constraint).
Key column type mismatch.
Preflight / kernel / commit failures.
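The row-ordering invariant (matched rows first, then unmatched left rows with zero-filled right columns) can be illustrated with a small host-side model. The name `left_outer_matched_first_reference` and the two-column row shape are hypothetical, and a nested loop replaces the hash probe.

```rust
/// Hypothetical CPU model of the LeftOuter output layout.
/// Rows are [key, payload]; output rows are [l_key, l_payload, r_key, r_payload].
fn left_outer_matched_first_reference(left: &[[u32; 2]], right: &[[u32; 2]]) -> Vec<[u32; 4]> {
    let mut matched = Vec::new();
    let mut unmatched = Vec::new();
    for l in left {
        let mut any_match = false;
        for r in right {
            if l[0] == r[0] {
                matched.push([l[0], l[1], r[0], r[1]]);
                any_match = true;
            }
        }
        if !any_match {
            // Unmatched left rows carry zeros in the right columns.
            unmatched.push([l[0], l[1], 0, 0]);
        }
    }
    // Final layout: all matched rows, then all unmatched rows.
    matched.extend(unmatched);
    matched
}
```

An interleaved design would instead place each unmatched row at its probe position, which is the ordering change the description above says this path avoids.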

Source

pub fn hash_join_inner_v2_with_index_count_scan_materialize_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], index: &JoinIndexV2, max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>

Indexed-Inner CSM using the deterministic binary-join path.

Same deterministic count→scan→materialize algorithm as Self::hash_join_inner_v2_count_scan_materialize_recorded but skips pack-right + table-build — the cached crate::provider::JoinIndexV2 supplies index.packed_keys and &index.table. Only the probe (left) side is packed on launch_stream.

Reuses the three CSM kernels from the non-indexed inner path (hash_join_probe_v2_count_per_row, hash_join_probe_v2_materialize, hash_join_total_from_scan) — no new kernel additions. Composes pack_keys_gpu_on_stream, multiblock_scan_u32_inplace_on_stream, and gather_buffer_by_indices_on_stream unchanged from recorded helper paths.

Index buffers (packed_keys + 4 table buckets) are owned by the caller and recorded as reads on launch_stream for the count and materialize recorders — dropping the index after the call returns is correctly serialized through the runtime’s record-all + wait-all event chain.

Source

pub fn hash_join_left_outer_v2_with_index_count_scan_materialize_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], index: &JoinIndexV2, max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>

Indexed LeftOuter CSM using the indexed deterministic binary-join path.

Combines the indexed-Inner CSM Phases A+B (probe-only pack on launch_stream; cached crate::provider::JoinIndexV2 supplies the build side’s packed_keys and &index.table) with the non-indexed LeftOuter CSM Phases C–E (per-probe unmatched-mask via hash_join_csm_unmatched_mask → recorded compact tail → gather matched left + right → per-column inner | unmatched / inner | zeros concat). Same row-ordering invariant as Self::hash_join_left_outer_v2_count_scan_materialize_recorded: matched rows first, unmatched-with-zero-right second.

No new kernels — reuses the four already-migrated CSM kernels plus hash_join_csm_unmatched_mask from the non-indexed LeftOuter CSM path.

§Errors

Manager not runtime-backed.
launch_stream does not resolve.
left_keys/right_keys empty, mismatched length, or > 4 (pack_keys constraint).
Key column type mismatch.
index.right_num_rows() mismatches the right buffer’s logical row count.
index.right_keys() mismatches the requested right_keys.
left_packed.key_bytes mismatches index.key_bytes.
Preflight / kernel / commit failures.

Source

pub fn hash_join_v2_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], join_type: JoinType, max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>

Strict-recorder, launch_stream-routed variant of hash_join_v2. Covers all four join types (Inner / Semi / Anti / LeftOuter) via dedicated per-type recorded methods.

When Self::use_recorded_csm_env is on, Inner and LeftOuter route through the CSM (count-scan-materialize) methods; otherwise they route through the legacy recorded methods. Semi / Anti always route through their existing recorded methods — no CSM implementation exists for them. All eligibility checks (runtime-backed manager, ≤4 keys, key-type match, row-count caps) are validated upstream by the public hash_join_v2_with_limit and inside each per-type method.
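
The routing rule reads as a two-input decision table. The sketch below uses local stand-in enums (`JoinType` here is not the crate's type, and `Route` and `route` are hypothetical names) purely to make the table explicit.

```rust
/// Local stand-in for the four join types named in the text.
#[derive(Clone, Copy, Debug, PartialEq)]
enum JoinType {
    Inner,
    Semi,
    Anti,
    LeftOuter,
}

/// Hypothetical labels for the three families of per-type methods.
#[derive(Debug, PartialEq)]
enum Route {
    Csm,
    LegacyRecorded,
    ExistingRecorded,
}

/// Inner / LeftOuter follow the CSM flag; Semi / Anti ignore it.
fn route(join_type: JoinType, use_recorded_csm: bool) -> Route {
    match (join_type, use_recorded_csm) {
        (JoinType::Inner | JoinType::LeftOuter, true) => Route::Csm,
        (JoinType::Inner | JoinType::LeftOuter, false) => Route::LegacyRecorded,
        (JoinType::Semi | JoinType::Anti, _) => Route::ExistingRecorded,
    }
}
```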

Source

pub fn hash_join_v2_with_index_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], join_type: JoinType, index: &JoinIndexV2, max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>

Strict-recorder, launch_stream-routed variant of hash_join_v2_with_index. Supports all four join types — the indexed variants share the same (packed_keys, table) shape, so a single recorded surface covers them.

When Self::use_recorded_csm_env is on, Inner and LeftOuter route through the indexed CSM (count-scan-materialize) methods; otherwise they route through the legacy indexed recorded methods. Semi / Anti always route through their existing indexed recorded methods — no CSM implementation exists for them.

Source §

impl CudaKernelProvider

Source

pub fn download_column<T: GpuScalar>( &self, buffer: &CudaBuffer, col_idx: usize, ) -> Result<Vec<T>>

Download a single column from GPU to host as Vec<T>.

Replaces: download_column_u32, download_column_u64, download_column_i32, download_column_i64, download_column_f32, download_column_f64, download_column_u8, download_column_bool.

Increments d2h_transfer_count (the per-call ILP-style counter) and is checked by the strict deterministic-Datalog D2H guard when it is enabled (see enable_strict_deterministic_d2h).

Source

pub fn download_column_untracked<T: GpuScalar>( &self, buffer: &CudaBuffer, col_idx: usize, ) -> Result<Vec<T>>

Download a column WITHOUT incrementing the per-call d2h_transfer_count (the ILP-style counter). Records in transfer_tracker for byte/call profiling stats.

The download is still checked by the strict deterministic-Datalog D2H guard when that guard is enabled: “untracked” refers only to d2h_transfer_count, not to the deterministic gate. Use dtoh_scalar_untracked for metadata reads that must remain allowed under the gate.

Replaces: download_f64_untracked (now generic over T).

Source

pub fn create_buffer_from_slice<T: GpuScalar>( &self, data: &[T], schema: Schema, ) -> Result<CudaBuffer>

Upload a typed slice as a single-column GPU buffer.

Replaces: create_buffer_from_u32_slice, create_buffer_from_u64_slice, create_buffer_from_i32_slice, create_buffer_from_i64_slice, create_buffer_from_f32_slice, create_buffer_from_f64_slice, create_buffer_from_u8_slice.

Source §

impl CudaKernelProvider

Source

pub fn wcoj_layout_u32_recorded( &self, input: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>

Build the sorted+deduped WCOJ physical layout for a 2-column u32 relation.

Output: a 2-column u32 CudaBuffer sorted lexicographically by (col0, col1) and deduplicated. The output is suitable for direct consumption by Self::wcoj_triangle_u32_recorded in any of the three slot positions (e_xy, e_yz, e_xz); the caller chooses which logical relation each input represents by the slot it passes the layout into.

Fast-path: if the input is already strictly lex-sorted and full-row unique, a recorded checker proves that property and the method returns a recorded device-side clone. Otherwise it falls back to Self::dedup_full_row_recorded, which invokes Self::sort_recorded (typed multi-column radix sort on (col0, col1)) followed by an on-stream mark_unique_full_row_bytewise mask + counted compaction. Both paths are launch-recorder disciplined and preserve the sorted+deduped output contract.
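
A CPU sketch of the two paths, assuming rows are held as `(u32, u32)` tuples on the host. `is_strictly_lex_sorted` and `wcoj_layout` are hypothetical names, and the standard-library sort stands in for the recorded radix sort.

```rust
/// True when rows are strictly increasing in lexicographic (col0, col1) order.
/// Strict ordering implies both sortedness and full-row uniqueness.
fn is_strictly_lex_sorted(rows: &[(u32, u32)]) -> bool {
    rows.windows(2).all(|w| w[0] < w[1])
}

/// Fast path: clone when the checker proves the layout already holds.
/// Fallback: sort lexicographically, then drop adjacent duplicates.
fn wcoj_layout(rows: &[(u32, u32)]) -> Vec<(u32, u32)> {
    if is_strictly_lex_sorted(rows) {
        return rows.to_vec();
    }
    let mut out = rows.to_vec();
    out.sort_unstable(); // tuple ordering is lexicographic
    out.dedup(); // duplicates are adjacent after sorting
    out
}
```

Both branches return an owned copy, so callers see the same output contract regardless of which path ran.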

This entry exists for two reasons:

Narrowing the input contract to 2-column u32 lets the WCOJ-specific call site fail fast with a clear error rather than the more generic dedup error if the caller passes the wrong arity / type.
Naming the WCOJ pipeline boundary makes downstream callers (planner / executor wiring, cert harness) target the WCOJ-specific layout API rather than the general-purpose dedup primitive — separating concerns that may diverge as the WCOJ stack grows.

§Errors

XlogError::Kernel if the manager has no runtime (with_runtime is required), the input is not 2-column, any column is not ScalarType::U32 or ScalarType::Symbol (both share the same 4-byte physical layout, so the underlying sort/dedup primitives handle either with no kernel changes), or any inner sort/dedup primitive fails.

Source

pub fn wcoj_layout_sort_u32_recorded( &self, input: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>

Generic full-row WCOJ layout sort+dedup for relations of any arity ≥ 2 in the 4-byte width-class (U32, Symbol, mixable within the class).

Design: this entry point leaves the existing arity-2 Self::wcoj_layout_u32_recorded unchanged for the triangle / 4-cycle / project-then-layout callers; it retains its arity-2-specific fast-path branch. This generic surface delegates straight to Self::dedup_full_row_recorded for any arity ≥ 2.

Validation order (runtime → arity ≥ 2 → per-column width-class → delegate):

Manager runtime-backed.
input.arity() >= 2.
Every column type ∈ {U32, Symbol} (4-byte width-class). Mixed U32 + Symbol within one relation is permitted; U64 is rejected — use Self::wcoj_layout_sort_u64_recorded instead.
Delegate to dedup_full_row_recorded(input, launch_stream).

Stream resolution is owned by dedup_full_row_recorded and is NOT in this entry point’s validation list. The n == 0 short-circuit (returns create_empty_buffer(input.schema().clone())) is also owned downstream — single source of truth, no duplicated empty-buffer semantics.

Composition: dedup_full_row_recorded only — there is no fast-path branch for arity ≥ 3 in this generic full-row layout-sort accessor (the existing arity-2 fast-path stays untouched and reachable only via wcoj_layout_u32_recorded).
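
The validation order can be sketched as a chain of early returns. Everything here is illustrative: `ColType` is a local stand-in for the scalar types, `validate_layout_sort_u32` is a hypothetical name, and `String` errors stand in for XlogError::Kernel.

```rust
/// Local stand-in for the column scalar types mentioned in the text.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ColType {
    U32,
    Symbol,
    U64,
}

/// Checks run in the documented order: runtime, arity, then width-class.
/// Stream resolution and the empty-input case are left to the delegate.
fn validate_layout_sort_u32(has_runtime: bool, cols: &[ColType]) -> Result<(), String> {
    if !has_runtime {
        return Err("manager has no runtime".to_string());
    }
    if cols.len() < 2 {
        return Err(format!("arity {} < 2", cols.len()));
    }
    if let Some(bad) = cols
        .iter()
        .find(|c| !matches!(**c, ColType::U32 | ColType::Symbol))
    {
        return Err(format!("column type {:?} is not in the 4-byte width-class", bad));
    }
    Ok(())
}
```

The ordering matters for error reporting: an input that fails several checks always reports the earliest one.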

§Errors

XlogError::Kernel if the manager has no runtime (with_runtime is required).
XlogError::Kernel if input.arity() < 2.
XlogError::Kernel if any column is not U32 / Symbol.
Whatever dedup_full_row_recorded returns for stream-resolution / kernel-launch failures.

Source

pub fn wcoj_triangle_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>

Evaluate tri(X, Y, Z) :- e_xy(X,Y), e_yz(Y,Z), e_xz(X,Z) on already-sorted, already-deduped binary 4-byte-key relations. See module-level docs for the full contract.

Each column may be ScalarType::U32 or ScalarType::Symbol; both share the same 4-byte physical layout, so the kernel reads the bits unchanged. Cross-relation type compatibility (e.g., that Y is the same type in e_xy.col1 and e_yz.col0) is the planner’s responsibility upstream; this entry only enforces width.

The output schema preserves per-head-position scalar types from the inputs:

out.col0 = e_xy.col0 type (X)
out.col1 = e_xy.col1 type (Y)
out.col2 = e_yz.col1 type (Z)
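
Why the inputs must be sorted and deduplicated can be seen in a CPU sketch of the triangle rule. This is not the GPU kernel: `range_of` and `triangles` are hypothetical helpers that use binary search plus a merge-style intersection over host-side tuples.

```rust
use std::cmp::Ordering;

/// Rows whose first column equals `key`, in a lex-sorted relation.
fn range_of(rel: &[(u32, u32)], key: u32) -> &[(u32, u32)] {
    let lo = rel.partition_point(|r| r.0 < key);
    let hi = rel.partition_point(|r| r.0 <= key);
    &rel[lo..hi]
}

/// tri(X, Y, Z) :- e_xy(X,Y), e_yz(Y,Z), e_xz(X,Z) on sorted, deduped inputs.
fn triangles(
    e_xy: &[(u32, u32)],
    e_yz: &[(u32, u32)],
    e_xz: &[(u32, u32)],
) -> Vec<(u32, u32, u32)> {
    let mut out = Vec::new();
    for &(x, y) in e_xy {
        // Z candidates from each side; both are sorted by Z within the range.
        let from_y = range_of(e_yz, y);
        let from_x = range_of(e_xz, x);
        let (mut i, mut j) = (0, 0);
        while i < from_y.len() && j < from_x.len() {
            match from_y[i].1.cmp(&from_x[j].1) {
                Ordering::Less => i += 1,
                Ordering::Greater => j += 1,
                Ordering::Equal => {
                    out.push((x, y, from_y[i].1));
                    i += 1;
                    j += 1;
                }
            }
        }
    }
    out
}
```

The output columns come from e_xy.col0, e_xy.col1, and e_yz.col1, matching the schema mapping listed above.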

§Errors

XlogError::Kernel if the manager has no runtime (with_runtime is required), the launch stream does not resolve, an input is not 2-column with U32/Symbol columns, or any kernel launch fails.

Source

pub fn wcoj_4cycle_u32_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>

Evaluate cyc4(W, X, Y, Z) :- e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W) on already-sorted, already-deduped binary 4-byte-key relations. Structural mirror of Self::wcoj_triangle_u32_recorded for the 4-cycle case; see that entry’s contract and the module-level docs for the shared two-phase recorder discipline.

Each column may be ScalarType::U32 or ScalarType::Symbol; both share the same 4-byte physical layout, so the kernel reads the bits unchanged. Cross-relation type compatibility (e.g., that X is the same type in e1.col1 and e2.col0) is the planner’s responsibility upstream; this entry only enforces width.

The 4-cycle slot order is [e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W)]. The output schema preserves per-head-position scalar types from the inputs:

out.col0 = e1.col0 type (W)
out.col1 = e1.col1 type (X)
out.col2 = e2.col1 type (Y)
out.col3 = e3.col1 type (Z)

§Errors

XlogError::Kernel if the manager has no runtime (with_runtime is required), the launch stream does not resolve, an input is not 2-column with U32/Symbol columns, or any kernel launch fails.

Source

pub fn wcoj_layout_u64_recorded( &self, input: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>

Build the sorted+deduped WCOJ physical layout for a 2-column U64 relation. Output: a 2-column U64 CudaBuffer sorted lexicographically by (col0, col1) and deduplicated. Suitable for direct consumption by Self::wcoj_triangle_u64_recorded.

Composition mirrors Self::wcoj_layout_u32_recorded: already sorted+unique inputs take the recorded fast-path clone; other inputs fall back to Self::dedup_full_row_recorded, whose U64 sort_recorded path ports the legacy sort() hi/lo radix-pass strategy into recorded launch discipline.

§Errors

XlogError::Kernel if the manager has no runtime, the input is not 2-column, any column is not ScalarType::U64, or any inner sort/dedup primitive fails.

Source

pub fn wcoj_layout_sort_u64_recorded( &self, input: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>

Generic full-row WCOJ layout sort+dedup for relations of any arity ≥ 2 in the 8-byte width-class (U64 only).

Design: this entry point leaves the existing arity-2 Self::wcoj_layout_u64_recorded unchanged for the existing 2-column callers; it retains its arity-2-specific fast-path branch. This generic surface delegates straight to Self::dedup_full_row_recorded for any arity ≥ 2.

Mirrors Self::wcoj_layout_sort_u32_recorded’s contract at the 8-byte width-class — see that entry’s doc for the validation order, stream-resolution ownership, n==0 semantics, and “no fast-path for arity ≥ 3” lock.

Validation order (runtime → arity ≥ 2 → per-column width-class → delegate):

Manager runtime-backed.
input.arity() >= 2.
Every column type = U64. U32 / Symbol are rejected — use Self::wcoj_layout_sort_u32_recorded instead.
Delegate to dedup_full_row_recorded(input, launch_stream).

§Errors

XlogError::Kernel if the manager has no runtime (with_runtime is required).
XlogError::Kernel if input.arity() < 2.
XlogError::Kernel if any column is not U64.
Whatever dedup_full_row_recorded returns for stream-resolution / kernel-launch failures.

Source

pub fn wcoj_triangle_u64_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>

Evaluate tri(X, Y, Z) :- e_xy(X,Y), e_yz(Y,Z), e_xz(X,Z) on already-sorted, already-deduped binary U64 relations. Mirrors Self::wcoj_triangle_u32_recorded’s contract; the only differences are the 8-byte join-key reads/writes and the U64-specific count/materialize kernels. Counters and the total reducer remain u32.

§Errors

XlogError::Kernel if the manager has no runtime, the launch stream does not resolve, an input is not 2-column with U64 columns, or any kernel launch fails.

Source

pub fn wcoj_4cycle_u64_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>

Evaluate cycle4(W, X, Y, Z) :- e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W) on already-sorted, already-deduped binary U64 relations. Mirrors Self::wcoj_4cycle_u32_recorded’s contract; the only differences are the 8-byte join-key reads/writes and the U64 HG planner/count/materialize kernels. Counters and the total reducer remain u32 (bounded by the upstream host-side row-count guard).

4-cycle slot order: [e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W)].

§Errors

XlogError::Kernel if the manager has no runtime, the launch stream does not resolve, an input is not 2-column with U64 columns, or any kernel launch fails.

Source §

impl CudaKernelProvider

Source

pub fn wcoj_clique5_metadata_recorded_u32( &self, edges: &[&CudaBuffer; 10], leader_edge_idx: u32, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u32>>

Build leader-edge runtime metadata for a 5-clique 4-byte-width dispatch.

Source

pub fn wcoj_clique5_metadata_recorded_u64( &self, edges: &[&CudaBuffer; 10], leader_edge_idx: u32, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u64>>

Build leader-edge runtime metadata for a 5-clique 8-byte-width dispatch.

Source

pub fn wcoj_clique6_metadata_recorded_u32( &self, edges: &[&CudaBuffer; 15], leader_edge_idx: u32, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u32>>

Build leader-edge runtime metadata for a 6-clique 4-byte-width dispatch.

Source

pub fn wcoj_clique6_metadata_recorded_u64( &self, edges: &[&CudaBuffer; 15], leader_edge_idx: u32, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u64>>

Build leader-edge runtime metadata for a 6-clique 8-byte-width dispatch.

Source

pub fn wcoj_clique5_u32_recorded( &self, edges: &[&CudaBuffer; 10], launch_stream: StreamId, ) -> Result<CudaBuffer>

5-clique WCOJ at 4-byte width-class.

edges must contain exactly 10 2-column buffers in canonical lex (i, j) order (i < j): (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). Each column may be U32 or Symbol (mixable within the 4-byte width-class). All edges must be lex-sorted+deduped on (col0, col1) — the runtime dispatcher routes through wcoj_layout_sort_u32_recorded before calling here.
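
The canonical order is the row-major enumeration of vertex pairs (i, j) with i < j. A small sketch, using the hypothetical helper `canonical_clique_edges`, reproduces the list above and the array lengths used by the 6-, 7-, and 8-clique entry points.

```rust
/// Vertex pairs (i, j) with i < j, in lexicographic order, for a k-clique.
/// The result has k * (k - 1) / 2 entries.
fn canonical_clique_edges(k: usize) -> Vec<(usize, usize)> {
    let mut edges = Vec::with_capacity(k * k.saturating_sub(1) / 2);
    for i in 0..k {
        for j in (i + 1)..k {
            edges.push((i, j));
        }
    }
    edges
}
```

Slot `n` of the `edges` array is then expected to hold the relation for the n-th pair in this list.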

Source

pub fn wcoj_clique5_u32_recorded_planned( &self, edges: &[&CudaBuffer; 10], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>

5-clique WCOJ at 4-byte width-class using plan-derived launch params.

Source

pub fn wcoj_clique5_u64_recorded( &self, edges: &[&CudaBuffer; 10], launch_stream: StreamId, ) -> Result<CudaBuffer>

5-clique WCOJ at 8-byte width-class (U64 only).

Source

pub fn wcoj_clique5_u64_recorded_planned( &self, edges: &[&CudaBuffer; 10], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>

5-clique WCOJ at 8-byte width-class using plan-derived launch params.

Source

pub fn wcoj_clique6_u32_recorded( &self, edges: &[&CudaBuffer; 15], launch_stream: StreamId, ) -> Result<CudaBuffer>

6-clique WCOJ at 4-byte width-class.

edges must contain exactly 15 2-column buffers in canonical lex (i, j) order. Width-class + sort+dedup pre-condition match wcoj_clique5_u32_recorded.

Source

pub fn wcoj_clique6_u32_recorded_planned( &self, edges: &[&CudaBuffer; 15], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>

6-clique WCOJ at 4-byte width-class using plan-derived launch params.

Source

pub fn wcoj_clique6_u64_recorded( &self, edges: &[&CudaBuffer; 15], launch_stream: StreamId, ) -> Result<CudaBuffer>

6-clique WCOJ at 8-byte width-class (U64 only).

Source

pub fn wcoj_clique6_u64_recorded_planned( &self, edges: &[&CudaBuffer; 15], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>

6-clique WCOJ at 8-byte width-class using plan-derived launch params.

Source

pub fn wcoj_clique7_u32_recorded( &self, edges: &[&CudaBuffer; 21], launch_stream: StreamId, ) -> Result<CudaBuffer>

7-clique WCOJ at 4-byte width-class.

Source

pub fn wcoj_clique7_u32_recorded_planned( &self, edges: &[&CudaBuffer; 21], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>

7-clique WCOJ at 4-byte width-class using plan-derived launch params.

Source

pub fn wcoj_clique7_u64_recorded( &self, edges: &[&CudaBuffer; 21], launch_stream: StreamId, ) -> Result<CudaBuffer>

7-clique WCOJ at 8-byte width-class (U64 only).

Source

pub fn wcoj_clique7_u64_recorded_planned( &self, edges: &[&CudaBuffer; 21], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>

7-clique WCOJ at 8-byte width-class using plan-derived launch params.

Source

pub fn wcoj_clique8_u32_recorded( &self, edges: &[&CudaBuffer; 28], launch_stream: StreamId, ) -> Result<CudaBuffer>

8-clique WCOJ at 4-byte width-class.

Source

pub fn wcoj_clique8_u32_recorded_planned( &self, edges: &[&CudaBuffer; 28], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>

8-clique WCOJ at 4-byte width-class using plan-derived launch params.

Source

pub fn wcoj_clique8_u64_recorded( &self, edges: &[&CudaBuffer; 28], launch_stream: StreamId, ) -> Result<CudaBuffer>

8-clique WCOJ at 8-byte width-class (U64 only).

Source

pub fn wcoj_clique8_u64_recorded_planned( &self, edges: &[&CudaBuffer; 28], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>

8-clique WCOJ at 8-byte width-class using plan-derived launch params.

Source

pub fn wcoj_clique5_groupby_root_count_u32_recorded_planned( &self, edges: &[&CudaBuffer; 10], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>

Fused 5-clique count-by-root at the 4-byte width-class, using plan-derived launch params. See Self::wcoj_clique_groupby_root_count_recorded_inner.

Source

pub fn wcoj_clique6_groupby_root_count_u32_recorded_planned( &self, edges: &[&CudaBuffer; 15], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>

Fused 6-clique count-by-root at the 4-byte width-class, using plan-derived launch params. See Self::wcoj_clique_groupby_root_count_recorded_inner.

Source §

impl CudaKernelProvider

Source

pub fn wcoj_build_metadata_u32_recorded( &self, input: &CudaBuffer, key_col_idx: usize, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u32>>

Source

pub fn wcoj_build_metadata_u64_recorded( &self, input: &CudaBuffer, key_col_idx: usize, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u64>>

Source

pub fn wcoj_triangle_hg_work_plan_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<WcojTriangleHgWorkPlanU32>

Source

pub fn wcoj_triangle_count_hg_u32_recorded( &self, e_yz: &CudaBuffer, e_xz: &CudaBuffer, plan: &WcojTriangleHgWorkPlanU32, launch_stream: StreamId, ) -> Result<CudaBuffer>

Source

pub fn wcoj_triangle_hg_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

Source

pub fn wcoj_triangle_groupby_root_count_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

Aggregate-fused triangle group-by-root count: evaluate q(X, count) :- e_xy(X,Y), e_yz(Y,Z), e_xz(X,Z) grouped by the variable-order root X, WITHOUT materializing the triangle rows.

Pipeline (all recorded; the triangle result never exists as rows):

the standard histogram-guided work plan;
wcoj_triangle_groupby_root_count_hg_u32 accumulates per-e_xy-row match counts (integer atomicAdd — order-insensitive, deterministic values) into a zero-initialized n_xy-long array;
a 2-column (X, count) staging buffer over the input rows is compacted to count>0 rows (group-by over the join result must not emit roots with no completion) and reduced per X via the recorded groupby Sum (rows are already X-sorted because e_xy is lex-sorted).

All reduction work is O(n_xy) — input-sized, never join-output-sized.

Output schema matches the unfused materialize+groupby-count baseline: col0 = X (e_xy.col0 type, U32/Symbol), col1 = count (U64).
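
The compaction and per-X reduction steps can be sketched on the CPU. This is an illustration under assumptions: `groupby_root_count` is a hypothetical helper, and the per-row match counts that the fused kernel would produce are passed in directly.

```rust
/// `row_counts[i]` is the number of triangle completions for the i-th e_xy row.
/// `e_xy` must be lex-sorted so that equal X values are adjacent.
fn groupby_root_count(e_xy: &[(u32, u32)], row_counts: &[u32]) -> Vec<(u32, u64)> {
    let mut out: Vec<(u32, u64)> = Vec::new();
    for (&(x, _), &c) in e_xy.iter().zip(row_counts) {
        // Compaction: rows without completions contribute no group.
        if c == 0 {
            continue;
        }
        // Reduction: fold into the current group when X repeats.
        if let Some(last) = out.last_mut() {
            if last.0 == x {
                last.1 += u64::from(c);
                continue;
            }
        }
        out.push((x, u64::from(c)));
    }
    out
}
```

The loop touches each input row once, which is the O(n_xy) bound stated above; the number of triangles never appears as an allocation size.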

§Errors

XlogError::Kernel if the manager has no runtime, the launch stream does not resolve, an input is not 2-column U32/Symbol, or any kernel launch fails.

Source

pub fn wcoj_triangle_groupby_root_agg_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, agg_op: AggOp, value: WcojRootAggValue, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

Aggregate-fused triangle group-by-root sum/min/max: evaluate q(X, agg(V)) :- e_xy(X,Y), e_yz(Y,Z), e_xz(X,Z) with agg ∈ {Sum, Min, Max} and V ∈ {Y, Z} grouped by the variable-order root X, WITHOUT materializing the triangle rows.

Pipeline (all recorded; the triangle result never exists as rows):

the standard histogram-guided work plan;
the per-op fused kernel accumulates, per e_xy row, a match count (compaction mask) and the per-row partial aggregate (integer atomics — order-insensitive, deterministic values). Sum partials are u64 (a per-row partial can exceed u32::MAX); min partials start at u32::MAX, max partials at 0;
a 3-column (X, count, agg) staging buffer over the input rows is compacted to count>0 rows (groups with no completion must be absent) and reduced per X via the recorded groupby with the same AggOp (Sum over the u64 partials; Min/Max over u32).

All reduction work is O(n_xy) — input-sized, never join-output-sized.

Output schema matches the unfused materialize+groupby baseline: col0 = X (e_xy.col0 type, U32/Symbol), col1 = U64 for Sum, U32 for Min/Max.

Bag semantics: every (Y, Z) completion contributes its value, exactly like aggregating the materialized projection.
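
A CPU sketch of the partial-then-reduce structure, assuming the completions for each e_xy row are already known. `Agg`, `identity`, `fold`, and `groupby_root_agg` are hypothetical names, and for brevity the sketch holds every partial in a u64, whereas the text above keeps Min/Max partials as u32.

```rust
#[derive(Clone, Copy)]
enum Agg {
    Sum,
    Min,
    Max,
}

/// Starting value for a per-row partial.
fn identity(op: Agg) -> u64 {
    match op {
        Agg::Sum | Agg::Max => 0,
        Agg::Min => u64::from(u32::MAX),
    }
}

/// Combine two values under the chosen aggregate.
fn fold(op: Agg, acc: u64, v: u64) -> u64 {
    match op {
        Agg::Sum => acc.wrapping_add(v),
        Agg::Min => acc.min(v),
        Agg::Max => acc.max(v),
    }
}

/// `rows[i]` = (root X of the i-th e_xy row, V values of its completions).
/// Rows must be ordered so that equal X values are adjacent.
fn groupby_root_agg(rows: &[(u32, Vec<u32>)], op: Agg) -> Vec<(u32, u64)> {
    let mut out: Vec<(u32, u64)> = Vec::new();
    for (x, vals) in rows {
        // A match count of zero means the row is compacted away.
        if vals.is_empty() {
            continue;
        }
        // Per-row partial: bag semantics, every completion contributes.
        let partial = vals
            .iter()
            .fold(identity(op), |acc, &v| fold(op, acc, u64::from(v)));
        // Per-X reduction with the same operator.
        if let Some(last) = out.last_mut() {
            if last.0 == *x {
                last.1 = fold(op, last.1, partial);
                continue;
            }
        }
        out.push((*x, partial));
    }
    out
}
```

The match count is needed separately from the partial because a Max partial of 0 or a Min partial of u32::MAX is indistinguishable from "no completions" on its own.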

§Errors

XlogError::Kernel if agg_op is not Sum/Min/Max, the value columns are not plain U32, the manager has no runtime, the launch stream does not resolve, an input is not 2-column U32/Symbol, or any kernel launch fails.

Source

pub fn wcoj_triangle_groupby_root_count_u64_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

U64-key aggregate-fused triangle count sibling of Self::wcoj_triangle_groupby_root_count_u32_recorded: evaluate q(X, count) over the triangle shape grouped by the root X for U64 relations, WITHOUT materializing the triangle rows.

The recorded groupby is U32/Symbol-key only, so the per-X reduction reuses the WCOJ relation metadata instead: e_xy is lex-sorted, so wcoj_build_metadata_u64_recorded yields one (unique X, group start) pair per root, and wcoj_groupby_root_segment_sum_counts_u32 accumulates the per-row match counts into per-unique-root u64 totals (integer atomicAdd — deterministic). Roots with zero completions are compacted away. All reduction work is O(n_xy).

Output schema matches the unfused materialize+groupby baseline: col0 = X (U64), col1 = count (U64).
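
The metadata-driven reduction can be sketched as a segmented sum. Both helpers are hypothetical CPU stand-ins: `build_metadata` models the (unique X, group start) pairs, and `segment_sum_counts` models the per-root accumulation followed by compaction.

```rust
/// (unique X, index of the first row of its group) for a lex-sorted e_xy.
fn build_metadata(e_xy: &[(u64, u64)]) -> Vec<(u64, usize)> {
    let mut meta = Vec::new();
    for (i, &(x, _)) in e_xy.iter().enumerate() {
        if i == 0 || e_xy[i - 1].0 != x {
            meta.push((x, i));
        }
    }
    meta
}

/// Sum the u32 per-row counts of each group into a u64 total,
/// dropping roots whose total is zero.
fn segment_sum_counts(meta: &[(u64, usize)], row_counts: &[u32]) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    for (g, &(x, start)) in meta.iter().enumerate() {
        // A group ends where the next one starts, or at the end of the input.
        let end = meta.get(g + 1).map_or(row_counts.len(), |m| m.1);
        let total: u64 = row_counts[start..end].iter().map(|&c| u64::from(c)).sum();
        if total > 0 {
            out.push((x, total));
        }
    }
    out
}
```

Compared with the u32 path, no key-typed groupby is needed: the group boundaries come entirely from the sorted input.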

§Errors

XlogError::Kernel if the manager has no runtime, the launch stream does not resolve, an input is not 2-column U64, or any kernel launch fails.

Source

pub fn wcoj_triangle_groupby_root_agg_u64_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, agg_op: AggOp, value: WcojRootAggValue, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

U64-key aggregate-fused triangle sum/min/max sibling of Self::wcoj_triangle_groupby_root_agg_u32_recorded: evaluate q(X, agg(V)) :- e_xy(X,Y), e_yz(Y,Z), e_xz(X,Z) with agg ∈ {Sum, Min, Max} and V ∈ {Y, Z} over U64 relations, grouped by the variable-order root X, WITHOUT materializing the triangle rows.

The recorded groupby is U32/Symbol-key only, so the per-X reduction reuses the WCOJ relation metadata (one unique root per group, e_xy lex-sorted) like the u64 count path:

the per-op fused kernel accumulates, per e_xy row, a match count and a u64 aggregate partial (integer atomics — deterministic; sum wraps on overflow exactly like groupby_sum_u64; min partials start at u64::MAX, max partials at 0);
wcoj_groupby_root_segment_sum_counts_u32 reduces per-row match counts to per-unique-root totals (the presence mask), and the per-op wcoj_groupby_root_segment_{sum,min,max}_values_u64 kernel folds the per-row partials into per-unique-root u64 aggregates, skipping zero-match rows;
a (X, agg) staging buffer over the unique roots is compacted to count>0 groups.

All reduction work is O(n_xy) — input-sized, never join-output-sized.

Output schema matches the unfused materialize+groupby baseline (legacy groupby widened to u64 values): col0 = X (U64), col1 = U64 for sum, min and max alike.

§Errors

XlogError::Kernel if agg_op is not Sum/Min/Max, the manager has no runtime, the launch stream does not resolve, an input is not 2-column U64, or any kernel launch fails.

Source

pub fn wcoj_triangle_hg_count_phase_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, plan: &WcojTriangleHgWorkPlanU32, launch_stream: StreamId, ) -> Result<WcojTriangleHgCountPhaseU32>

Source

pub fn wcoj_triangle_hg_materialize_phase_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, plan: &WcojTriangleHgWorkPlanU32, count: WcojTriangleHgCountPhaseU32, launch_stream: StreamId, ) -> Result<CudaBuffer>

Source

pub fn wcoj_triangle_hg_u32_with_plan_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, plan: &WcojTriangleHgWorkPlanU32, launch_stream: StreamId, ) -> Result<CudaBuffer>

Source

pub fn wcoj_triangle_hg_work_plan_u64_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<WcojTriangleHgWorkPlanU64>

Source

pub fn wcoj_triangle_hg_u64_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

Source

pub fn wcoj_4cycle_hg_work_plan_u32_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<WcojCycle4HgWorkPlanU32>

Source

pub fn wcoj_4cycle_hg_u32_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

Source

pub fn wcoj_4cycle_groupby_root_count_u32_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

Aggregate-fused 4-cycle group-by-root count: evaluate q(W, count) :- e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W) grouped by the variable-order root W, WITHOUT materializing the 4-cycle rows.

Pipeline (all recorded; the 4-cycle result never exists as rows):

the standard 4-cycle histogram-guided work plan;
wcoj_4cycle_groupby_root_count_hg_u32 accumulates, per e1 row, a match count (integer atomicAdd — order-insensitive, deterministic values);
a (W, count) staging buffer over the input rows is compacted to count>0 rows (roots with no completion must be absent) and reduced per W via the recorded groupby Sum.

All reduction work is O(n_e1) — input-sized, never join-output-sized.

Output schema matches the unfused materialize+groupby baseline: col0 = W (e1.col0 type, U32/Symbol), col1 = count (U64).

§Errors

XlogError::Kernel if the manager has no runtime, the launch stream does not resolve, an input is not 2-column U32/Symbol, or any kernel launch fails.

Source

pub fn wcoj_4cycle_groupby_root_agg_u32_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, agg_op: AggOp, value: Wcoj4CycleRootAggValue, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

Aggregate-fused 4-cycle group-by-root sum/min/max: evaluate q(W, agg(V)) :- e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W) with agg ∈ {Sum, Min, Max} and V ∈ {X, Y, Z} grouped by the variable-order root W, WITHOUT materializing the 4-cycle rows.

Pipeline (all recorded; the 4-cycle result never exists as rows):

the standard 4-cycle histogram-guided work plan;
the per-op fused kernel accumulates, per e1 row, a match count (compaction mask) and the per-row partial aggregate (integer atomics — order-insensitive, deterministic values). Sum partials are u64 (a per-row partial can exceed u32::MAX); min partials start at u32::MAX, max partials at 0;
a 3-column (W, count, agg) staging buffer over the input rows is compacted to count>0 rows (roots with no completion must be absent) and reduced per W via the recorded groupby with the same AggOp (Sum over the u64 partials; Min/Max over u32).

All reduction work is O(n_e1) — input-sized, never join-output-sized.

Output schema matches the unfused materialize+groupby baseline: col0 = W (e1.col0 type, U32/Symbol), col1 = U64 for Sum, U32 for Min/Max.

Bag semantics: every (X, Y, Z) completion contributes its value, exactly like aggregating the materialized projection.

§Errors

XlogError::Kernel if agg_op is not Sum/Min/Max, the value column is not plain U32, the manager has no runtime, the launch stream does not resolve, an input is not 2-column U32/Symbol, or any kernel launch fails.

Source

pub fn wcoj_4cycle_groupby_root_count_u64_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

U64-key aggregate-fused 4-cycle count sibling of Self::wcoj_4cycle_groupby_root_count_u32_recorded: evaluate q(W, count) :- e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W) grouped by the variable-order root W for U64 relations, WITHOUT materializing the 4-cycle rows.

The recorded groupby is U32/Symbol-key only, so the per-W reduction reuses the WCOJ relation metadata instead (mirroring Self::wcoj_triangle_groupby_root_count_u64_recorded): e1 is lex-sorted, so wcoj_build_metadata_u64_recorded yields one (unique W, group start) pair per root, and wcoj_groupby_root_segment_sum_counts_u32 accumulates the per-row match counts into per-unique-root u64 totals (integer atomicAdd — deterministic). Roots with zero completions are compacted away. All reduction work is O(n_e1).

Output schema matches the unfused materialize+groupby baseline: col0 = W (U64), col1 = count (U64).

§Errors

XlogError::Kernel if the manager has no runtime, the launch stream does not resolve, an input is not 2-column U64, or any kernel launch fails.

Source

pub fn wcoj_4cycle_hg_work_plan_u64_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<WcojCycle4HgWorkPlanU64>

Source

pub fn wcoj_4cycle_hg_u64_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>

Source §

impl CudaKernelProvider

Source

pub fn wcoj_project_2col_swap_recorded( &self, src: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>

Produce an owned 2-col CudaBuffer whose columns are [src.col(1), src.col(0)]. See module docs for the full recorded / failure-drain contract.

Used by the variable-ordering dispatcher when a triangle non-default leader requires a col-swap before Self::wcoj_layout_u32_recorded sorts the result. The 4-cycle path is rotation-only and never invokes this helper.

§Errors

XlogError::Kernel if the manager has no runtime, the stream doesn’t resolve, the input isn’t 2-col, or any queued DtoD copy fails. On any failure after the first queued copy, the launch stream is synchronized before the function returns.

Source

pub fn wcoj_project_output_columns_recorded( &self, src: &CudaBuffer, perm: &[usize], head_schema: Schema, launch_stream: StreamId, ) -> Result<CudaBuffer>

Produce an owned N-col CudaBuffer with columns reordered per perm and the schema replaced with head_schema.

perm[i] is the source column index that becomes output column i. The dispatcher uses this post-kernel to remap the kernel-direct output (in leader’s (a, b, c[, d]) order) into the rule’s head order.

See module docs for the full recorded / failure-drain contract.

§Errors

XlogError::Kernel if the manager has no runtime, the stream doesn’t resolve, perm.len() != head_schema.arity(), any perm index is ≥ src.arity(), or any queued DtoD copy fails. On Err, the failure-drain contract described in the module docs applies.
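
The perm semantics and the two validation rules above can be modelled on the host as follows. This is a minimal sketch under stated assumptions: columns are plain Vec<u64> values, errors are Strings rather than XlogError, and project_columns is a hypothetical name.

```rust
/// Host-side model of column projection (illustrative only).
/// `perm[i]` is the index of the source column that becomes output column `i`.
/// `head_arity` stands in for `head_schema.arity()`.
fn project_columns(
    src: &[Vec<u64>],
    perm: &[usize],
    head_arity: usize,
) -> Result<Vec<Vec<u64>>, String> {
    // The permutation must name exactly one source column per head column.
    if perm.len() != head_arity {
        return Err(format!(
            "perm.len() = {} but head arity = {}",
            perm.len(),
            head_arity
        ));
    }
    // Every index must refer to an existing source column.
    if let Some(&bad) = perm.iter().find(|&&p| p >= src.len()) {
        return Err(format!(
            "perm index {} out of range for source arity {}",
            bad,
            src.len()
        ));
    }
    // Copy the columns in output order; the real method queues DtoD copies instead.
    Ok(perm.iter().map(|&p| src[p].clone()).collect())
}
```

With perm = [1, 0] on a 2-column source, this reduces to the column swap performed by wcoj_project_2col_swap_recorded.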

Source §

impl CudaKernelProvider

Source

pub const DTOH_SMALL_METADATA_MAX_BYTES: usize = 4096

Hard cap (in bytes) for Self::dtoh_small_metadata_untracked. Set deliberately small (4 KB) so the helper cannot become a general-purpose vector D2H escape hatch — it’s strictly for classifier histograms and similar small metadata round-trips.

Source

pub fn new( device: Arc<CudaDevice>, memory: Arc<GpuMemoryManager>, ) -> Result<Self>

Create a new CUDA kernel provider

Loads all kernel modules into the CUDA device, preferring a cubin for the detected SM architecture and falling back to portable PTX (sm_75+).

§Arguments

device - The CUDA device to load modules into
memory - The GPU memory manager for kernel allocations

§Errors

Returns XlogError::Kernel if PTX loading fails

§Example

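use std::sync::Arc;
use xlog_cuda::{CudaDevice, GpuMemoryManager, CudaKernelProvider};
use xlog_core::MemoryBudget;
let device = Arc::new(CudaDevice::new(0)?);
let memory = Arc::new(GpuMemoryManager::new(device.clone(), MemoryBudget::default()));
let provider = CudaKernelProvider::new(device, memory)?;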

Source

pub fn with_runtime( device: Arc<CudaDevice>, memory: Arc<GpuMemoryManager>, ) -> Result<Self>

Construct a provider whose GpuMemoryManager must already have a v0.6 crate::device_runtime::XlogDeviceRuntime attached via GpuMemoryManager::with_runtime.

Equivalent to Self::new in every respect — same kernel loading, same field initialization — but rejects managers that lack a runtime. This guards against the misconfiguration in which a caller asks for runtime-routed provider semantics (by calling with_runtime) but supplies a legacy manager built via GpuMemoryManager::new; without the check, the resulting provider would silently keep using the cudarc default allocator and the runtime budget/logging stack would never observe the allocations the caller expected to be routed through it.

Note: a runtime-routed manager passed to Self::new still routes correctly — alloc::<T> and alloc_raw consult memory.runtime() regardless of which provider constructor was used. with_runtime exists for callers that want the requirement enforced at construction time, not for correctness of the routing itself.

This is the opt-in runtime entry point for providers. Self::new continues to accept managers without a runtime (the legacy default) and remains the production constructor until the runtime stack is certified end-to-end.

§Errors

Returns XlogError::Kernel if memory.runtime() is None, or anything Self::new would return.

§Example

let device = Arc::new(CudaDevice::new(0)?);
let runtime = Arc::new(XlogDeviceRuntime::with_resource(
    Arc::clone(&device),
    0,
    Arc::new(StreamPool::with_defaults(Arc::clone(&device))),
    Box::new(AsyncCudaResource::new(/* ... */)),
));
let memory = Arc::new(GpuMemoryManager::with_runtime(
    Arc::clone(&device),
    MemoryBudget::default(),
    runtime,
));
let provider = CudaKernelProvider::with_runtime(device, memory)?;
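
The difference between the two constructors can be pictured with stand-in types. Everything in the sketch below (Runtime, Manager, Provider, the String error) is hypothetical and exists only to show the construction-time guard; it is not the crate's implementation.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the device runtime.
struct Runtime;

/// Hypothetical stand-in for the memory manager; `runtime` is None for a legacy manager.
struct Manager {
    runtime: Option<Arc<Runtime>>,
}

impl Manager {
    fn runtime(&self) -> Option<&Arc<Runtime>> {
        self.runtime.as_ref()
    }
}

/// Hypothetical stand-in for the provider.
struct Provider {
    memory: Arc<Manager>,
}

impl Provider {
    /// Accepts managers with or without a runtime.
    fn new(memory: Arc<Manager>) -> Result<Self, String> {
        Ok(Provider { memory })
    }

    /// Same construction as `new`, but fails fast when no runtime is attached.
    fn with_runtime(memory: Arc<Manager>) -> Result<Self, String> {
        if memory.runtime().is_none() {
            return Err("memory manager has no runtime attached".to_string());
        }
        Self::new(memory)
    }

    /// Allocation routing depends on the manager, not on which constructor was used.
    fn routes_through_runtime(&self) -> bool {
        self.memory.runtime().is_some()
    }
}
```

The guard only moves a misconfiguration from "silently unobserved allocations" to an error at construction time; routing itself is decided by the manager in both cases.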

Source

pub fn wcoj_layout_fast_path_hit_count(&self) -> u64

Number of times wcoj_layout_*_recorded short-circuited to the fast path (a recorded clone) instead of running dedup_full_row_recorded. The counter increments by 1 per fast-path hit, so a dispatch whose inputs are all already sorted and unique records 3 hits. Used by tests and the phase report to confirm that the fast path fired.
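
A host-side model of the counting behaviour, assuming the fast path is taken exactly when the input rows are already strictly increasing (sorted and unique). LayoutModel and its methods are hypothetical names for this sketch; the slow path here is an ordinary sort plus dedup rather than the GPU routine.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical model of a layout step with a fast-path hit counter.
struct LayoutModel {
    fast_path_hits: AtomicU64,
}

impl LayoutModel {
    fn new() -> Self {
        LayoutModel { fast_path_hits: AtomicU64::new(0) }
    }

    /// Returns the rows sorted and deduplicated.
    fn layout(&self, rows: &[(u64, u64)]) -> Vec<(u64, u64)> {
        // Strictly increasing adjacent pairs means sorted and unique: clone and count a hit.
        if rows.windows(2).all(|w| w[0] < w[1]) {
            self.fast_path_hits.fetch_add(1, Ordering::Relaxed);
            return rows.to_vec();
        }
        // Slow path: full sort followed by removal of adjacent duplicates.
        let mut out = rows.to_vec();
        out.sort_unstable();
        out.dedup();
        out
    }

    fn hit_count(&self) -> u64 {
        self.fast_path_hits.load(Ordering::Relaxed)
    }

    fn reset(&self) {
        self.fast_path_hits.store(0, Ordering::Relaxed);
    }
}
```

Resetting before a dispatch and reading the counter afterwards scopes the assertion to that dispatch, which is the pattern the reset method exists for.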

Source

pub fn wcoj_triangle_hg_dispatch_count(&self) -> u64

Test/diagnostic counter for the histogram-guided block-slice triangle WCOJ: the number of successful dispatches routed through the provider entry.

Source

pub fn reset_wcoj_layout_fast_path_hit_count(&self)

Reset the fast-path hit counter to 0. Tests use this to scope counter assertions to a single dispatch.

Source

pub fn wcoj_layout_sort_invocation_count(&self) -> u64

Number of calls to wcoj_layout_sort_*_recorded since the last reset. Diagnostic-only; used by dispatch-plan certification.

Source

pub fn reset_wcoj_layout_sort_invocation_count(&self)

Reset the WCOJ layout-sort invocation counter to 0.

Source

pub fn kclique_metadata_build_count(&self) -> u64

Number of K-clique leader-edge metadata builds since the last reset.

Source

pub fn kclique_metadata_build_nanos(&self) -> u64

Cumulative nanoseconds spent building K-clique leader-edge metadata since the last reset.

Source

pub fn reset_kclique_metadata_build_metrics(&self)

Reset K-clique metadata build diagnostics.

Source

pub fn device(&self) -> &Arc<CudaDevice>

Get the CUDA device

Source

pub fn memory(&self) -> &Arc<GpuMemoryManager>

Get the GPU memory manager

Source

pub fn ptx_load_profile(&self) -> Option<&PtxLoadProfile>

Get PTX load profiling data (only populated when XLOG_WARMUP_PROFILE=1).

Source

pub fn reset_host_transfer_stats(&self)

Reset tracked host transfer statistics.

Source

pub fn host_transfer_stats(&self) -> HostTransferStats

Snapshot tracked host transfer statistics.

Source

pub fn host_launch_metadata_transfer_stats( &self, ) -> HostLaunchMetadataTransferStats

Snapshot launch-parameter H2D uploads tracked separately from host_transfer_stats.

Source

pub fn d2h_transfer_count(&self) -> u64

Read the column-level D2H transfer counter.

This counter increments once per download_column_* call, enabling callers (e.g. the ILP trainer) to assert that no column downloads occurred during a performance-critical section.

Source

pub fn reset_d2h_transfer_count(&self)

Reset the column-level D2H transfer counter to zero.

Source

pub fn untracked_metadata_dtoh_count(&self) -> u64

Count of untracked control-plane metadata D2H reads (dtoh_scalar_untracked + dtoh_small_metadata_untracked).

Source

pub fn reset_untracked_metadata_dtoh_count(&self)

Reset the untracked metadata D2H read counter to zero.

Source

pub fn enable_strict_deterministic_d2h(&self)

Enable the strict deterministic-Datalog D2H gate.

While enabled, any data-plane device-to-host transfer (column downloads via download_column / download_column_untracked, and any internal transfer routed through dtoh_sync_copy_into_tracked) increments CudaKernelProvider::deterministic_d2h_violation_count and returns XlogError::Execution from the originating call.

Metadata reads via CudaKernelProvider::dtoh_scalar_untracked and CudaKernelProvider::dtoh_small_metadata_untracked are allowed and never trip the gate.

The gate is disabled by default; the runtime opts in via RuntimeConfig::strict_deterministic_d2h. v0.5.5 ships the gate as opt-in only: known-violating relational paths (set difference, join count/materialize) are scheduled for replacement before the default flips.
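
The gate's observable behaviour can be summarised with a small host-side model. D2hGate and its methods are hypothetical, and a String stands in for XlogError::Execution; the sketch only shows the enable/violate/count interaction described above.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Hypothetical model of the strict deterministic-D2H gate.
#[derive(Default)]
struct D2hGate {
    strict: AtomicBool,
    violations: AtomicU64,
}

impl D2hGate {
    fn enable(&self) {
        self.strict.store(true, Ordering::SeqCst);
    }

    fn disable(&self) {
        self.strict.store(false, Ordering::SeqCst);
    }

    /// Called at the start of a data-plane download (for example a column download).
    /// While the gate is enabled, the attempt is counted and rejected.
    fn check_data_plane(&self) -> Result<(), String> {
        if self.strict.load(Ordering::SeqCst) {
            self.violations.fetch_add(1, Ordering::SeqCst);
            return Err("data-plane D2H transfer under strict gate".to_string());
        }
        Ok(())
    }

    /// Metadata reads are always allowed and never counted as violations.
    fn check_metadata(&self) -> Result<(), String> {
        Ok(())
    }

    fn violation_count(&self) -> u64 {
        self.violations.load(Ordering::SeqCst)
    }
}
```

A test would typically enable the gate, run the section under scrutiny, and assert that the violation count is still zero.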

Source

pub fn disable_strict_deterministic_d2h(&self)

Disable the strict deterministic-Datalog D2H gate.

Source

pub fn strict_deterministic_d2h_enabled(&self) -> bool

Returns whether the strict deterministic-Datalog D2H gate is enabled.

Source

pub fn deterministic_d2h_violation_count(&self) -> u64

Cumulative deterministic-D2H gate violations since the last reset.

Source

pub fn reset_deterministic_d2h_violations(&self)

Reset the deterministic-D2H violation counter to zero.

Source

pub fn dtoh_small_metadata_untracked<T: DeviceRepr + Default + Copy>( &self, src: &TrackedCudaSlice<T>, count: usize, ) -> Result<Vec<T>>

Read a small metadata vector (≤ Self::DTOH_SMALL_METADATA_MAX_BYTES) from device to host WITHOUT updating the D2H transfer tracker.

Sibling of Self::dtoh_scalar_untracked for callers that need a few bucket counts (the WCOJ skew classifier reads a 3 × 64 × u32 = 768-byte histogram in one go) instead of count separate scalar reads. Like dtoh_scalar_untracked, this method is whitelisted by the strict deterministic-D2H gate (Self::enable_strict_deterministic_d2h) — it does NOT trip the gate, on purpose, because metadata reads are part of the determinism contract (just like a scalar total after a scan).

§Hard contract — DO NOT WIDEN THE CAP

The 4 KB cap is the contract. If a caller wants a larger D2H, it’s a data-plane transfer and must go through the tracked download_column* path. Widening this cap turns the helper into a backdoor for tracked-bypass column reads, which would silently invalidate the strict deterministic-D2H gate.

§Errors

XlogError::Kernel if count * size_of::<T>() exceeds DTOH_SMALL_METADATA_MAX_BYTES.
XlogError::Kernel if count exceeds the device slice’s length, or if the inner sync copy fails.
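
The two error conditions amount to a pair of bounds checks before any copy is issued. The sketch below models them on the host; check_small_metadata_read is a hypothetical helper, the constant is redeclared locally with the documented value, and a String stands in for XlogError::Kernel.

```rust
use std::mem::size_of;

/// Local copy of the documented 4096-byte cap, redeclared for this sketch.
const DTOH_SMALL_METADATA_MAX_BYTES: usize = 4096;

/// Validates a requested read of `count` elements of `T` from a device slice
/// holding `device_len` elements. Returns the byte size on success.
fn check_small_metadata_read<T>(count: usize, device_len: usize) -> Result<usize, String> {
    // Guard the multiplication so a huge `count` cannot wrap around the cap check.
    let bytes = count
        .checked_mul(size_of::<T>())
        .ok_or_else(|| "requested byte size overflows usize".to_string())?;
    if bytes > DTOH_SMALL_METADATA_MAX_BYTES {
        return Err(format!(
            "{} bytes exceeds the {}-byte metadata cap",
            bytes, DTOH_SMALL_METADATA_MAX_BYTES
        ));
    }
    if count > device_len {
        return Err(format!(
            "count {} exceeds device slice length {}",
            count, device_len
        ));
    }
    Ok(bytes)
}
```

For the skew-classifier histogram mentioned above, 3 × 64 u32 values come to 768 bytes, well under the cap; 1025 u32 values (4100 bytes) would be rejected.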

Source

pub fn dtoh_scalar_untracked<T: DeviceRepr + Default + Copy>( &self, src: &TrackedCudaSlice<T>, index: usize, ) -> Result<T>

Read a single scalar from device to host WITHOUT updating the D2H transfer tracker. Use ONLY for metadata reads (e.g. total_nnz after an exclusive scan), never for data-plane transfers.

This makes the “metadata != data-plane” contract explicit and auditable: callers that bypass tracking must call this method (which is grep-able) rather than reaching for device().inner().

Source

pub fn htod_sync_copy_into_tracked<T: DeviceRepr, Dst: DevicePtrMut<T>>( &self, src: &[T], dst: &mut Dst, ) -> Result<()>

Upload host data to device while recording data-plane H2D transfer stats.

Source

pub fn htod_sync_copy_tracked<T: DeviceRepr>( &self, src: &[T], ) -> Result<CudaSlice<T>>

Allocate a CUDA slice from host data while recording data-plane H2D transfer stats.

Source

pub fn htod_launch_metadata_sync_copy_into<T: DeviceRepr, Dst: DevicePtrMut<T>>( &self, src: &[T], dst: &mut Dst, ) -> Result<()>

Upload bounded launch metadata from host to device while recording it in the launch-metadata subcounter.

Source

pub fn exclusive_scan_u32_inplace( &self, data: &mut TrackedCudaSlice<u32>, n: u32, ) -> Result<()>

Compute the exclusive prefix sum of a u8 mask, returning (prefix_sum_vec, total_count).

This is useful for compaction operations where we need to know:

The output position for each input element (prefix sum)
The total number of elements that pass the mask (count)

§Arguments

mask - A slice of u8 values (0 or non-zero)

§Returns

A tuple of:

Vec<u32> containing the exclusive prefix sum
u32 containing the total count of non-zero mask elements

§Example

let mask = vec![1u8, 0, 1, 1, 0, 1];
let (prefix_sum, count) = provider.prefix_sum_mask(&mask)?;
// prefix_sum = [0, 1, 1, 2, 3, 3]
// count = 4

§Note

For small inputs (≤ 256 elements), a CPU scan is used for efficiency. For larger inputs, a three-phase multi-block GPU scan is used.

§Errors

Returns XlogError::Kernel if kernel execution fails
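
For reference, the mask scan described above and its use in compaction can be written as a plain CPU loop. This is a sketch of the technique, not of the provider's code path; prefix_sum_mask_cpu and compact are names chosen for this example.

```rust
/// CPU exclusive prefix sum over a u8 mask.
/// `prefix[i]` is the number of non-zero mask entries strictly before index `i`;
/// the second value is the total number of non-zero entries.
fn prefix_sum_mask_cpu(mask: &[u8]) -> (Vec<u32>, u32) {
    let mut prefix = Vec::with_capacity(mask.len());
    let mut running = 0u32;
    for &m in mask {
        // Exclusive scan: record the running total before counting this element.
        prefix.push(running);
        if m != 0 {
            running += 1;
        }
    }
    (prefix, running)
}

/// Stream compaction: keep `items[i]` where `mask[i]` is non-zero,
/// writing each kept element to the output slot given by the prefix sum.
fn compact(items: &[u32], mask: &[u8]) -> Vec<u32> {
    assert_eq!(items.len(), mask.len());
    let (prefix, count) = prefix_sum_mask_cpu(mask);
    let mut out = vec![0u32; count as usize];
    for i in 0..items.len() {
        if mask[i] != 0 {
            out[prefix[i] as usize] = items[i];
        }
    }
    out
}
```

Because each kept element's output slot is known from the prefix sum alone, the scatter step has no cross-element dependency, which is what makes the same scheme suitable for a parallel GPU implementation.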

Source

pub fn device_row_count(&self, buffer: &CudaBuffer) -> Result<usize>

Read a buffer’s logical row count, using the host cache when available and falling back to a metadata-only device-to-host read when needed.

Source

pub fn validated_logical_row_count(&self, buffer: &CudaBuffer) -> Result<usize>

Read and validate a buffer’s logical row count for outward-facing APIs.

This keeps exported/query-visible lengths tied to the device logical row count while still rejecting impossible metadata (logical_rows > row_cap).

Source

pub fn create_empty_buffer(&self, schema: Schema) -> Result<CudaBuffer>

Create an empty buffer with the given schema (all columns are empty slices)

§Arguments

schema - The schema for the empty buffer

§Returns

A new CudaBuffer with zero rows

§Errors

Returns XlogError::Kernel if allocation fails

Source

pub fn create_zero_arity_buffer( &self, schema: Schema, rows: u32, ) -> Result<CudaBuffer>

Create a zero-arity (nullary) relation buffer carrying rows unit tuples.

A nullary relation holds exactly when it has at least one row; its single possible tuple is the empty tuple (). create_buffer_from_slices with no column slices routes to create_empty_buffer (0 rows), which represents the relation as absent — wrong for an asserted nullary fact. Nullary facts must use this path so presence is materialized as one row.
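
The distinction between an absent and an asserted nullary relation comes down to the row count, as the following host-side model shows. RelationModel is a hypothetical type that tracks only arity and row count; it is not the CudaBuffer layout.

```rust
/// Hypothetical model of a relation buffer: only arity and row count are tracked.
#[derive(Debug, PartialEq)]
struct RelationModel {
    arity: usize,
    rows: u32,
}

impl RelationModel {
    /// Models the empty-buffer path: zero rows regardless of arity.
    fn empty(arity: usize) -> Self {
        RelationModel { arity, rows: 0 }
    }

    /// Models the zero-arity path: no columns, but `rows` unit tuples.
    fn zero_arity(rows: u32) -> Self {
        RelationModel { arity: 0, rows }
    }

    /// A relation holds when it contains at least one tuple.
    /// For arity 0 the only possible tuple is the empty tuple ().
    fn holds(&self) -> bool {
        self.rows >= 1
    }
}
```

Both constructors can produce an arity-0 relation, but only the second can represent an asserted nullary fact, because the first always has zero rows.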