(left, right) candidate pairs across four
fixed 2-body Datalog rule templates and returning top-K per template with full
structured metadata.
It replaces pyxlog.ilp.induce_exact(backend="python"), whose host-orchestrated
set_rule_mask/evaluate/batch_fact_membership_device loop is documented as a
throwaway prototype.
- Crate:
crates/xlog-induce/ - CUDA kernels:
kernels/ilp_exact.cu(modulexlog_ilp_exact, entriesilp_exact_scoreforU64andilp_exact_score_u32forU32/Symbol) - Provider launcher:
crates/xlog-cuda/src/provider/ilp_exact.rs - pyxlog bridge:
crates/pyxlog/src/ilp_exact.rs+crates/pyxlog/python/pyxlog/ilp/exact_induce.py - Parity tests:
python/tests/test_ilp_exact_induce.py - Related: Rule Learning (Differentiable ILP) — the gradient-trained counterpart; this engine is exact / bounded and does not use learnable masks.
Design Goals
- Zero host marshalling in the scoring loop. Setup host-to-device and device-to-device uploads are
permitted but constant-size; the
(topology, L, R)sweep itself performs no host/device round trips. The productionilp_exact_score_topkpath reduces on device and the tracked device-to-host counter onCudaKernelProviderticks exactly once perinduce_exactcall for the compact selected-row export, independent of candidate count. - Per-topology semantic isolation. Each
(topology, L, R)triple is scored against the topology’s rule template in isolation. There is no interference across topologies — a direct fix for the Python prototype’s stale-mask contamination bug (see Semantics below). - Deterministic output. Integer counts only; each kernel block owns one unique output slot (no cross-block atomics); host-side reduction is a lex sort that is already locked bit-for-bit by 16 unit tests.
- Surgical crate boundary.
xlog-inducespeaksRelId+CudaBufferhandles; name resolution and DLPack unwrapping happen at the pyxlog boundary.
Topologies
The engine scores against four fixed 2-body rule templates. For a head relationH(X, Y) and candidate relations L (left body) and R (right
body), a query pair (qx, qy) is covered by a topology as follows:
| Topology | Rule | Coverage predicate |
|---|---|---|
chain | H(X,Y) :- L(X,Z), R(Z,Y) | ∃ z such that (qx, z) ∈ L ∧ (z, qy) ∈ R |
star | H(X,Y) :- L(X,Y), R(X,Y) | (qx, qy) ∈ L ∧ (qx, qy) ∈ R |
fanout | H(X,Y) :- L(X,Z), R(X,Y) | ∃ _ such that (qx, _) ∈ L ∧ (qx, qy) ∈ R |
fanin | H(X,Y) :- L(X,Y), R(Z,Y) | (qx, qy) ∈ L ∧ ∃ _ such that (_, qy) ∈ R |
evaluate() / set_rule_mask. This is what guarantees per-topology
isolation.
Architecture
Launch Geometry
- Grid dims:
(C, C, 4)blocks addressed as(L, R, topology), whereCis the number of candidate relations in the request and topology enumeration ischain=0, star=1, fanout=2, fanin=3. - Block dims:
(256, 1, 1)threads. Each block cooperatively scans all positive and negative query pairs for its single(topology, L, R)triple. - Output slot:
slot = topology * (C * C) + L * C + R. Each block writes exactly one slot in each ofpos_coveredandneg_covered, so there are no cross-block atomics on the scoring path. Shared-memory pair-halving reduction is used within a block to sum per-thread coverage counts.
chain and O(|L|+|R|) for the other
three topologies — microseconds per block in practice.
Data Layout
| Buffer | Type / length | Contents |
|---|---|---|
cand_arg0, cand_arg1 | u64 or u32 × total_rows | Concatenated device-to-device-copied arg0 / arg1 columns of all candidate relations |
cand_offsets | u32 × (C+1) | Exclusive prefix-sum of candidate row counts; host-to-device uploaded once per call |
pos_arg0, pos_arg1 | u64 or u32 × num_pos | Device-resident positive query pairs (DLPack-imported from the caller’s torch tensors) |
neg_arg0, neg_arg1 | u64 or u32 × num_neg | Same for negatives. When the caller passes no negatives, the engine materializes a zero-row pair buffer with the positive pair type so the kernel signature stays uniform. |
pos_covered, neg_covered | u32 × (4·C·C) | Output count arrays; kernel writes each slot exactly once. |
xlog_induce::induce_exactvalidates that positives, negatives, and every candidate buffer are arity-2 pairs with one uniform logical type:U64,U32, orSymbol.CudaKernelProvider::ilp_exact_scorerepeats that validation at the provider seam before choosing the physical kernel.U64buffers launchilp_exact_scoreoveruint64_tcolumns.U32buffers launchilp_exact_score_u32overuint32_tcolumns.Symbolbuffers also launchilp_exact_score_u32, but theCudaBufferschema remainsSymboland mixedU32/Symbolrequests fail with typed diagnostics rather than silently narrowing.
symbol
schemas only through schema-checked import. This preserves public logical
types while allowing CUDA-resident symbol-id tensors to reach the native scorer
without host widening.
Semantics
The Python prototype’sbackend="python" loop calls set_rule_mask(mask_name, ...) per inner iteration, updating only the current topology’s mask
while leaving the other three topologies’ masks at whatever state they
last held. Because evaluate() derives facts from all currently-set
masks, stale masks from earlier outer-loop iterations leak derivations
into later topologies’ coverage counts.
The native kernel bypasses evaluate() and set_rule_mask entirely:
it implements the four topology predicates directly on the candidate
row sets. Per-topology isolation is a structural property, not a
runtime hygiene requirement.
The Python prototype gained an opt-in strict_per_topology: bool = False
parameter that zeroes the three “other” topology masks before each
topology’s inner loop, yielding the same per-topology-isolated scoring
that the kernel produces by construction. Default is False for
backward compatibility with prototype callers that are calibrated
against the historical prototype numbers. The parity test sets
strict_per_topology=True explicitly to match the kernel.
Device-to-Host Transfer Budget
The kernel is designed around the xlog-native device-to-host transfer counter (CudaKernelProvider::d2h_transfer_count, exposed to Python as
prog.d2h_transfer_count()).
- Counted transfers per production
induce_exactcall: 1 (one compact selected-row export) regardless of candidate count, query count, or topology count. The parity testtest_induce_exact_native_does_not_scale_d2h_with_candidate_pairsenforceslarge.d2h_transfer_count ≤ small.d2h_transfer_count + 2, which passes trivially. - Setup host-to-device (
cand_offsets) and device-to-device (candidate column concatenation) are not counted as device-to-host transfers. - Setup reads that would otherwise require a device-to-host transfer (e.g. relation row
counts) go through the host-side
CudaBuffer::cached_row_count()cache. This is a load-bearing invariant — see the next section.
clone_buffer cached-row-count propagation
Every relation buffer in CompiledIlpProgram’s executor store was
deep-copied via CudaKernelProvider::clone_buffer on insertion
(CompiledIlpProgram::put_relation clones live_buffer before
executor.put_relation). The previous clone_buffer implementation
built the clone via CudaBuffer::from_columns, which does not
populate the host-side cached_row_count. Downstream consumers that
needed the row count were forced to either read num_rows_device() from device to host
or trust the (misleading) num_rows() capacity accessor.
clone_buffer now calls set_cached_row_count_if_unset(source.cached_row_count())
on the clone when the source has a populated cache, preserving the
host-visible count across clones. Pinned by
test_clone_buffer_preserves_cached_row_count.
Validation Contract
Buffer-level invariants enforced byxlog_induce::induce_exact:
| Check | Location | Failure mode |
|---|---|---|
candidates.is_empty() | Engine top of induce_exact | Returns default ExactInductionResult (all zeros) — matches Python reference’s if not body_indices: return … |
| Buffer arity == 2 | validate_pair_buffer | XlogError::Execution |
Column type is exactly one of U64, U32, Symbol | validate_pair_buffer | XlogError::Type |
| Positive, negative, and candidate pair types match exactly | require_pair_type / provider layout check | XlogError::Type / XlogError::Kernel |
cached_row_count() populated | cached_rows helper | XlogError::Execution — caller fed a buffer that skipped the DLPack ingest path |
positive_count == 0 | classify_request (pure) | Returns result with zero candidates but retained neg/candidate counts |
classify_request) is pure and
unit-tested without CUDA via 5 tests in xlog-induce/src/validate.rs.
Public Types
Generated Rule Provenance
Generated-rule provenance adds audit records for generated/mined rules incrates/xlog-induce/src/provenance.rs. The scorer still returns
ExactInductionResult; provenance is the companion layer for callers that
promote a candidate into a generated rule and need to retain why it was selected.
InductionSupportRow records retained positive support by relation, row index,
and caller-supplied row hash. InductionAlternative records rejected candidate
rules with support and falsification counts. InductionProvenanceRegistry
stores the generated-rule records in memory for the caller’s audit path.
Python Surface
ExactInductionResult dataclass with candidates: list[ScoredCandidate].
Testing
- Host-only unit tests (
cargo test -p xlog-induce --lib): 23 tests — 16 reduce-layer, 5 classify_request, 2 topology string/order. No CUDA required. - CUDA-gated launcher tests (
cargo test -p xlog-cuda --lib ilp_exact): hand-derived coverage against C=2 candidates, determinism across runs, empty-negatives handling,U32andSymbolpair-buffer acceptance, and mixed-pair-type rejection. Skipped when no CUDA device is present. - Python parity test (
python -m pytest python/tests/test_ilp_exact_induce.py): native-vs-Python reference and the host-transfer budget parity test. Usesstrict_per_topology=Trueagainst the Python reference so both backends compute clean per-topology coverage. - Typed parity tests:
U32andSymbolfixtures match the Python reference, preserve relation type annotations, keep device-to-host transfer accounting at exactly two count-array transfers, and reject mixed logical pair types.
Type Dispatch And Packaging Policy
U64: supported throughilp_exact_score.U32: supported throughilp_exact_score_u32.Symbol: supported through the same physicalu32kernel while retainingSymbolschema identity and rejecting mixedU32/Symbolrequests.- Chain topology shared-memory caching: supported through
ilp_exact_score_chain_smemandilp_exact_score_chain_smem_u32whenXLOG_ILP_EXACT_CHAIN_SMEMis enabled and the candidate-row threshold is met. Non-chain topologies inside those entry points continue to call the baseline matcher. - PTX policy:
kernels/ilp_exact.cuis checked in; generatedilp_exact.portable.ptxand architecture-specific.cubinfiles are build and packaging artifacts, not source artifacts.crates/xlog-cuda/build.rsgenerates portable PTX for every manifest module, includingilp_exact;scripts/stage_pyxlog_kernels.shstages those artifacts intopyxlog/kernels/;scripts/install_pyxlog_for_python.pyrejects a wheel that lacks portable PTX.
Profile-Gated Chain Shared Memory
The profile-gated chain shared-memory scorer adds an A/B-controlled chain scorer that tiles left-relation rows into dynamic shared memory for the strict chain topology. It is enabled by default only for candidate relations with at leastXLOG_ILP_EXACT_CHAIN_SMEM_MIN_ROWS rows, defaulting to 256, and can be
disabled with XLOG_ILP_EXACT_CHAIN_SMEM=0.
The optimization preserves the public exact-induction contract:
- chain-smem and baseline dispatch produce identical coverage signatures on the certified small and chain-hot fixtures;
- median runtime improves by more than
1.2xon the certified chain-hot fixture; - small fixtures do not regress by more than five percent;
- the device-to-host transfer budget remains the same two count-array transfers used by the baseline native exact-induction path.
See Also
- CUDA Certification — CUDA certification suite
with numbered certification categories;
ilp_exactis not yet in the formal certification registry because its PTX is not committed (see above). - Diagnostics and Provenance — generated-rule provenance and diagnostics surfaces for generated rules, source rules, proof traces, deltas, temporal metadata, and neural hot loops.
- Roadmap → “Bounded Exact Induction (
xlog-induce)” — current status and planned integration.