This document describes the dILP training subsystem: a GPU-accelerated differentiable Inductive Logic Programming engine that learns Datalog rules from positive/negative examples via gradient descent.

Design Goals

  1. Learn rules, not weights — discover symbolic Datalog clauses (e.g., reach(X,Y) :- edge(X,Z), edge(Z,Y).) from data
  2. GPU-resident hot loop — no semantic column downloads in the training step loop
  3. Sparse by default — candidate-indexed soft-probs instead of materializing N³ tensors
  4. Transactional promotion — learned rules pass gate checks before entering the knowledge base
  5. Auditable transfer evidence — learned rules carry fold, held-out-domain, gate, and base-kernel checksum metadata

Core Idea: Tensorized Super-Graph Masking

Traditional ILP systems compile each candidate rule into its own executable program, which is impossible at millisecond timescales. XLOG instead pre-compiles a “super-graph” of all candidate rules and activates them via continuous mask tensors optimized with Gumbel-Softmax:
Candidate rules          Logit tensor W (C floats)
  ┌──────────┐              ┌───────────┐
  │ r1: A←B,C│              │ w1  w2 .. │ ─── Gumbel-Softmax(τ) ──►  soft mask
  │ r2: A←B,D│              │           │                               │
  │ ...      │              └───────────┘                               ▼
  └──────────┘                                              set_rule_mask_sparse()
                                                                        │
                                                                        ▼
                                                            xlog evaluate (GPU)
                                                                        │
                                                                        ▼
                                                              BCE loss + ∇W
At convergence, argmax(W) picks the winning rule. Temperature annealing (τ → τ_floor) drives the soft mask toward a one-hot selection.
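The masking step above can be illustrated with a minimal pure-Python sketch of Gumbel-Softmax over C logits. It is not the pyxlog implementation, which runs on the GPU through Torch; the function names `gumbel_softmax` and `decode` and the example logits are assumptions made for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Return a soft mask: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for w in logits:
        # Keep u strictly inside (0, 1) so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise by inverse transform: g = -log(-log(u)).
        noisy.append((w - math.log(-math.log(u))) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """argmax(W): index of the winning candidate rule."""
    return max(range(len(logits)), key=lambda i: logits[i])


rng = random.Random(0)
W = [0.2, 2.5, -1.0]                          # illustrative logits, one per candidate
warm = gumbel_softmax(W, tau=5.0, rng=rng)    # high temperature: flatter mask
cold = gumbel_softmax(W, tau=0.05, rng=rng)   # low temperature: close to one-hot
winner = decode(W)
```

Both masks sum to 1. As τ shrinks, the scaled differences between noisy logits grow, so the mask concentrates on a single candidate, which is what temperature annealing relies on.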

Architecture Overview

                           Python (pyxlog.ilp)
     ┌─────────────────────────────────────────────────────┐
     │  train_only()                                       │
     │    ├─ valid_candidates()  → candidate map           │
     │    ├─ MaskBackend.init_weights()                    │
     │    ├─ AdaptiveTempController                        │
     │    └─ step loop ──────────────────────────┐         │
     │         ├─ MaskBackend.apply_mask()       │         │
     │         ├─ program.evaluate_device()      │ GPU     │
     │         ├─ BCE loss (torch)               │ only    │
     │         ├─ loss.backward()                │         │
     │         └─ optimizer.step()               │         │
     │                                           │         │
     │  train_and_promote()                      │         │
     │    ├─ train_only()                        │         │
     │    ├─ trial compile (Rust)                │         │
     │    └─ promotion gates                     │         │
     └───────────────────────────────────────────┘         │
                        │                                   │
                        ▼                                   │
     ┌────────────────────────────────────┐                │
     │  Rust (xlog-runtime, xlog-cuda)   │                │
     │    ├─ set_rule_mask_sparse()       │◄───────────────┘
     │    ├─ IlpRegistry (mask storage)  │
     │    ├─ TensorMaskedJoin (executor) │
     │    ├─ batch_fact_membership()     │
     │    ├─ batch_fact_membership_device() │
     │    ├─ batch_tagged_credit()       │
     │    └─ batch_tagged_credit_device()│
     └────────────────────────────────────┘
                        │
                        ▼
     ┌────────────────────────────────────┐
     │  CUDA (kernels/ilp.cu)            │
     │    └─ extract_nonzero_indices()   │
     └────────────────────────────────────┘

Key Entry Points

Python (pyxlog.ilp)

| File | Purpose |
| --- | --- |
| pyxlog/ilp/trainer.py | train_only(): multi-start training loop |
| pyxlog/ilp/promoter.py | train_and_promote(): training + gate pipeline |
| pyxlog/ilp/neurosymbolic.py | train_neurosymbolic_program(): joint nn/4 and symbolic rule-weight training |
| pyxlog/ilp/inventory.py | build_rule_inventory(): selected/rejected clause inventory with transfer metadata |
| pyxlog/ilp/backend.py | MaskBackend protocol, SparseMaskBackend, DenseMaskBackend |
| pyxlog/ilp/temperature.py | AdaptiveTempController: cosine-annealed τ schedule |
| pyxlog/ilp/entropy.py | Entropy regularization helpers |
| pyxlog/ilp/holdout.py | holdout_f1_and_variance(): leave-one-out (<=20) and k-fold (>20) F1 scoring |
| pyxlog/ilp/types.py | TrainConfig, TrainResult, PromotionResult, LearnedArtifact, etc. |
| pyxlog/ilp/exceptions.py | IlpConfigError, IlpCandidateError, IlpTrainingError |

Rust (xlog-runtime, xlog-cuda)

| File | Purpose |
| --- | --- |
| crates/xlog-runtime/src/ilp_registry.rs | IlpRegistry: mask storage, IlpTaggedResult metadata |
| crates/xlog-runtime/tests/ilp_integration_tests.rs | Rust-side integration tests for mask round-trips |
| crates/xlog-cuda/tests/ilp_kernel_tests.rs | CUDA kernel unit tests (extract_nonzero_indices) |

CUDA Kernels

| File | Purpose |
| --- | --- |
| kernels/ilp.cu | extract_nonzero_indices(): N³ mask → sparse index extraction |

Mask Backends

The MaskBackend protocol abstracts how the learnable tensor W is applied to the XLOG executor:

SparseMaskBackend (default)

W (C logits) → Gumbel-Softmax(τ) → candidate_soft_probs (C,)
                    ▼
                    argsort/top-k on CUDA in Python/Torch
                    ▼
                    set_rule_mask_sparse_selected(selected_ids, selected_soft_probs)
                              │
                              ▼
                              Rust builds executor mask internally
                              (no N³ tensor materialized, no full soft-vector device-to-host transfer)
  • Learnable params: C floats (one per candidate rule)
  • Memory: O(C) — typically C < 100
  • Preferred hot-loop path calls set_rule_mask_sparse_selected() on the compiled program
  • Legacy compatibility path set_rule_mask_sparse() remains available when Rust-side ranking is desired
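The selection step of the sparse path can be sketched in plain Python as follows. In the real hot loop this ranking runs on CUDA through Torch; the helper name `select_top_k`, the value of `k`, and the absence of renormalization are assumptions for illustration, not documented pyxlog behavior.

```python
def select_top_k(soft_probs, k):
    """Return (selected_ids, selected_soft_probs) for the k largest entries."""
    k = min(k, len(soft_probs))
    # Rank candidate indices by descending soft probability.
    order = sorted(range(len(soft_probs)), key=lambda i: soft_probs[i], reverse=True)
    selected_ids = order[:k]
    selected_soft_probs = [soft_probs[i] for i in selected_ids]
    return selected_ids, selected_soft_probs


candidate_soft_probs = [0.05, 0.60, 0.10, 0.25]   # illustrative soft mask, C = 4
selected_ids, selected_soft_probs = select_top_k(candidate_soft_probs, k=2)
# selected_ids == [1, 3]; selected_soft_probs == [0.60, 0.25]
```

Only the selected ids and their probabilities would then be handed to set_rule_mask_sparse_selected(), which is why the full soft vector never needs to leave the device.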

DenseMaskBackend (alpha-compatible, debug)

W (N×N×N logits) → Gumbel-Softmax(τ) → 3D soft mask
                           ▼
                           flatten → DLPack → set_rule_mask()
                                    │
                                    ▼
                                    Rust imports N³ tensor
  • Learnable params: N³ floats (N = schema size)
  • Memory: O(N³) — expensive for large schemas
  • Enabled via TrainConfig(debug_dense_mask=True) for parity testing
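The gap between the two backends is easiest to see as arithmetic. The sketch below assumes 4-byte float32 parameters and an illustrative schema size of N = 256; neither figure comes from the XLOG source.

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters are stored as float32


def sparse_param_bytes(num_candidates):
    """O(C): one logit per candidate rule."""
    return num_candidates * BYTES_PER_FLOAT32


def dense_param_bytes(schema_size):
    """O(N^3): one logit per cell of the N x N x N mask."""
    return schema_size ** 3 * BYTES_PER_FLOAT32


sparse = sparse_param_bytes(100)   # 400 bytes
dense = dense_param_bytes(256)     # 256**3 * 4 = 67,108,864 bytes (64 MiB)
```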

Training Pipeline

train_only()

  1. Candidate enumeration: valid_candidates(source, mask_name) returns all syntactically legal body-pair assignments
  2. Multi-start — up to max_attempts independent restarts with fresh logits
  3. Step loop (per attempt, up to step_budget_per_attempt):
    • Apply mask via backend
    • Forward pass: program.evaluate_device() (GPU-only, no host reads)
    • BCE loss between predicted and target fact membership
    • Backward pass: loss.backward() through PyTorch autograd
    • Optimizer step on W
    • Temperature anneal: τ_start → τ_floor (cosine schedule)
    • Optional deterministic controls (deterministic=True) for reproducible attempt seeding
    • Early stopping: when argmax is stable and loss < threshold
  4. Decode: argmax(W) maps to winning candidate → discovered rule string
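Two pieces of the step loop, the cosine temperature anneal and the early-stopping check, can be sketched as below. This is a guess at the general shape only: AdaptiveTempController may adjust τ differently, and the names `cosine_tau`, `should_stop`, `patience`, and `loss_threshold` are hypothetical.

```python
import math


def cosine_tau(step, total_steps, tau_start, tau_floor):
    """Cosine-annealed temperature: tau_start at step 0, tau_floor at the last step."""
    if total_steps <= 1:
        return tau_floor
    progress = min(step, total_steps - 1) / (total_steps - 1)
    return tau_floor + 0.5 * (tau_start - tau_floor) * (1.0 + math.cos(math.pi * progress))


def should_stop(argmax_history, loss, patience, loss_threshold):
    """Stop when the last `patience` argmax values agree and loss is below threshold."""
    if len(argmax_history) < patience:
        return False
    recent = argmax_history[-patience:]
    return len(set(recent)) == 1 and loss < loss_threshold


# Five-step schedule from tau_start = 1.0 down to tau_floor = 0.1.
schedule = [cosine_tau(t, 5, tau_start=1.0, tau_floor=0.1) for t in range(5)]
stop = should_stop([2, 1, 1, 1], loss=0.01, patience=3, loss_threshold=0.05)
```

The schedule starts at τ_start, ends at τ_floor, and decreases monotonically in between, so early steps explore many candidates while late steps commit to one.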

train_and_promote()

  1. Call train_only() — get TrainResult
  2. If not converged → PromotionStatus.NOT_CONVERGED
  3. Trial compile — substitute discovered rule into source, compile via Rust
  4. Promotion gates (all must pass for PROMOTED):
    • Convergence gate — training converged (already checked)
    • Novel-rate gate — fraction of non-example derivations ≤ max_novel_rate
    • Protected-relation gate — no unwanted relation side-effects
    • Holdout F1 gate — F1 on held-out examples ≥ threshold
    • Ambiguity gate — top-M scan (or exhaustive mode) detects no alternative winning candidates
    • Typed-schema gate — optional hard gate requiring relation type metadata (or waiver-driven manual review)
  5. All pass → PromotionStatus.PROMOTED with committed_source
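The all-gates-must-pass logic can be sketched as follows. Everything here is a hypothetical stand-in: `Status` mimics PromotionStatus (the `REJECTED` member is an assumed name for a failed gate), `GateInputs` is an invented container, and the thresholds are illustrative defaults. The optional typed-schema gate is omitted.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):  # hypothetical stand-in for PromotionStatus
    PROMOTED = "promoted"
    NOT_CONVERGED = "not_converged"
    REJECTED = "rejected"  # assumed name; the real enum may differ


@dataclass
class GateInputs:  # hypothetical container for measurements taken after trial compile
    converged: bool
    novel_rate: float
    protected_relations_changed: bool
    holdout_f1: float
    ambiguous_alternatives: int


def evaluate_gates(x, max_novel_rate=0.1, min_holdout_f1=0.9):
    """Return {gate name: passed}; thresholds are illustrative assumptions."""
    return {
        "convergence": x.converged,
        "novel_rate": x.novel_rate <= max_novel_rate,
        "protected_relation": not x.protected_relations_changed,
        "holdout_f1": x.holdout_f1 >= min_holdout_f1,
        "ambiguity": x.ambiguous_alternatives == 0,
    }


def promote(x):
    """Short-circuit on non-convergence, otherwise require every gate to pass."""
    if not x.converged:
        return Status.NOT_CONVERGED, {}
    gates = evaluate_gates(x)
    status = Status.PROMOTED if all(gates.values()) else Status.REJECTED
    return status, gates


good = GateInputs(True, 0.02, False, 0.95, 0)
status, gates = promote(good)
```

Returning the per-gate outcomes alongside the status mirrors the idea that gate results are recorded for later audit.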

Higher-Level Neuro-Symbolic Training Surface

A higher-level training entry point handles sources that mix neural predicates and trainable symbolic clauses:
from pyxlog.ilp.neurosymbolic import NeuroSymbolicTrainingConfig, train_neurosymbolic_program

result = train_neurosymbolic_program(
    source,
    networks={"score": torch_module},
    examples=training_rows,
    config=NeuroSymbolicTrainingConfig(steps=16, learning_rate=0.05),
)
The source owns declarative nn(...), trainable_rule(...), and train(...) declarations. The result reports neural gradient norms, symbolic gradients, final symbolic weights, and a RuleInventory suitable for transfer audits.

Existential-join trainable bodies (Stage B)

A trainable_rule body may join a neural predicate to an ordinary relation on an existential (non-head) variable — the neural predicate is grounded over the real join domain inside the circuit and OR-aggregated at the head:
plastic(Edge) :- saliency(Event, strengthen), pre_before_post(Event, Edge).
Here Event appears only in the body. The engine materializes the join domain from pre_before_post’s ground facts, emits one neural leaf per joined event, and the differentiable provenance OR-aggregates the per-event contributions per head binding, yielding:

P(plastic(Edge)) = σ(w) · (1 − ∏_{e : pre_before_post(e,Edge)} (1 − p_saliency(e)))

Gradient flows into the neural predicate (all joined events) and the rule guard, but never into the deterministic join relation. The per-event features arrive through a domain_inputs={"net": features} channel (row i = the i-th join-domain constant in sorted order), and examples carry only per-head-binding targets. Because saliency is learned as a function of the event feature (not an id lookup), the trained predicate generalizes to unseen events.

Constraints:
  • The join domain must be ground facts (a derived relation is rejected, since its extension is not materialized)
  • Head-binding ids must be 0..N-1, row-aligned with targets
  • A single join network is supported
  • The exact d-DNNF compiler builds one circuit over all head-binding queries, so the planted graph must stay within the compiler’s fixed buffer (empirically ~6–7 events)

A worked example lives in examples/plasticity_incircuit/ with a CUDA-gated recovery test in python/tests/test_plasticity_incircuit.py. Head-variable (“hard filter”) joins remain supported as pre-filters; only the existential-join case is new.

train_and_promote(...) also accepts training_fold, held_out_domains, base_kernel_checksum_before, and base_kernel_checksum_after. These fields are recorded on PromotionResult.rule_inventory, along with selected and rejected candidate clauses and gate outcomes.
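The OR-aggregated head probability for existential joins, P(plastic(Edge)) = σ(w) · (1 − ∏ (1 − p_saliency(e))), can be checked numerically with a small sketch. It evaluates the closed form directly rather than compiling a circuit, and the function names are illustrative only.

```python
import math


def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))


def head_probability(rule_weight, event_probs):
    """sigmoid(w) * (1 - prod(1 - p_e)) over the events joined to one head binding."""
    none_fire = 1.0
    for p in event_probs:
        none_fire *= (1.0 - p)  # probability that no joined event fires so far
    return sigmoid(rule_weight) * (1.0 - none_fire)


p = head_probability(rule_weight=0.0, event_probs=[0.5, 0.5])
# sigmoid(0) = 0.5 and 1 - 0.5 * 0.5 = 0.75, so p = 0.375

empty = head_probability(rule_weight=3.0, event_probs=[])
# With no joined events the product is 1, so the probability is 0.0
```

Because every joined event appears in the product, each one receives gradient, while the join relation itself only decides which events are included.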

Artifact Persistence

LearnedArtifact captures the full training result for reproducibility:
artifact.save("artifact.json")   # JSON with SHA-256 candidate-map hash
loaded = LearnedArtifact.load("artifact.json", verify_hash=True)
Schema version: beta-v1. Fields: discovered rule, logits, candidate map, config, telemetry, precision/recall, metadata (timestamp, schema version, candidate map hash).
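The hash check behind verify_hash=True can be approximated with the standard library. This is a sketch under assumptions: the canonical JSON encoding, the field layout, and the helper names `candidate_map_hash`, `save_artifact`, and `load_artifact` are invented here and may not match LearnedArtifact.

```python
import hashlib
import json
import os
import tempfile


def candidate_map_hash(candidate_map):
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(candidate_map, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_artifact(path, discovered_rule, candidate_map):
    payload = {
        "discovered_rule": discovered_rule,
        "candidate_map": candidate_map,
        "metadata": {
            "schema_version": "beta-v1",
            "candidate_map_hash": candidate_map_hash(candidate_map),
        },
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)


def load_artifact(path, verify_hash=True):
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    if verify_hash:
        # Recompute the hash and compare it with the stored one.
        expected = payload["metadata"]["candidate_map_hash"]
        if candidate_map_hash(payload["candidate_map"]) != expected:
            raise ValueError("candidate map hash mismatch")
    return payload


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "artifact.json")
    rule = "reach(X,Y) :- edge(X,Z), edge(Z,Y)."
    save_artifact(path, rule, {"0": "edge,edge", "1": "edge,reach"})
    loaded = load_artifact(path)
```

Hashing a canonical encoding rather than the raw file means that key order and whitespace changes do not trigger a mismatch, while any change to the candidate map does.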

GPU Contract

The training step loop obeys XLOG’s GPU-resident contract:
  • evaluate_device() — no host reads for semantic results
  • batch_fact_membership_device() — returns a CUDA bool mask via DLPack with zero semantic-loop device-to-host transfer
  • batch_tagged_credit_device() — returns CSR-style CUDA credit data via DLPack with zero semantic-loop device-to-host transfer
  • batch_fact_membership() / batch_tagged_credit() remain available when host materialization is desired
  • AtomicU64 device-to-host counter on CudaKernelProvider — hard gate raises if download_column_* is observed during step loop
  • host_transfer_stats() / reset_host_transfer_stats() expose broader host transfer accounting for profiling
  • Legacy set_rule_mask_sparse() still performs a control-plane soft-probability download; the selected-candidate sparse path avoids it
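The hard gate on device-to-host downloads can be pictured with a Python analogue. The real counter is an AtomicU64 on the Rust side; the `HostTransferCounter` class and `no_host_downloads` context manager below are hypothetical illustrations of the pattern, not pyxlog APIs.

```python
import threading
from contextlib import contextmanager


class HostTransferCounter:
    """Thread-safe counter standing in for the Rust-side AtomicU64."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def record_download(self):
        with self._lock:
            self._count += 1

    def value(self):
        with self._lock:
            return self._count


@contextmanager
def no_host_downloads(counter):
    """Raise if any download was recorded while the guarded block ran."""
    before = counter.value()
    yield
    observed = counter.value() - before
    if observed:
        raise RuntimeError(f"{observed} device-to-host download(s) during step loop")


counter = HostTransferCounter()
with no_host_downloads(counter):
    pass  # a GPU-resident training step records no downloads
```

Comparing the counter before and after the block, instead of resetting it, leaves the running total intact for broader profiling.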

Testing

  • 86+ static test functions across ILP Python test files (expanded by parametrized GA/beta gates)
  • Reliability gate: 20 consecutive train_only() runs must all converge (20/20 pass)
  • GA reliability gate: default 50-seed statistical run (test_ilp_ga_reliability.py)
  • GA performance/transfer tests: forward_p95_us + host transfer accounting (test_ilp_performance.py)
  • Dense/sparse parity: every sparse-path test has a debug_dense_mask=True variant
  • Rust-side: ilp_integration_tests.rs, ilp_kernel_tests.rs
  • CUDA certification: extract_nonzero_indices covered by kernel test suite

See Also