Running Benchmarks
Prerequisites
- CUDA-capable NVIDIA GPU (compute capability 7.0+; development device: RTX PRO 3000 Blackwell, SM120)
- CUDA Toolkit 13.x
- Decision-DNNF knowledge compiler (for exact inference benchmarks that use external compilation)
- Sufficient GPU memory (4GB minimum, 12GB recommended for neural-symbolic training)
Quick Start
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `CUDA_VISIBLE_DEVICES` | GPU device ordinal | 0 |
| `XLOG_BENCH_MEMORY_MB` | GPU memory budget (MB) | 4096 |
| `WCOJ_BENCH_FULL` | Run the full WCOJ triangle matrix (adds 100K + 250K row sizes) | 0 |
| `XLOG_USE_WCOJ_TRIANGLE_U32` | Force-on the WCOJ triangle dispatch (bypasses adaptive classifier) | unset |
| `XLOG_USE_WCOJ_TRIANGLE_ADAPTIVE` | Adaptive (skew-classifier-gated) WCOJ dispatch; set to 0/false to opt out of default-on | unset (default-on) |
| `XLOG_DISABLE_WCOJ_TRIANGLE` | Hard kill switch: pins all WCOJ triangle dispatch off, beats every other flag | unset |
Benchmark Categories
GPU Logic Benchmarks (xlog-gpu)
Location: crates/xlog-gpu/benches/logic_bench.rs
Transitive Closure
Tests recursive query evaluation (semi-naive fixpoint iteration).

| Benchmark | Description | Metric |
|---|---|---|
| `tc_chain` | Chain graph 0→1→2→…→n | Iteration depth |
| `tc_random` | Random sparse graph | Rows/sec |
| `tc_dense` | Complete bipartite K_{n,n} | Output explosion |

- `tc_chain`: depth 100, 500, 1000, 2000
- `tc_random`: 10K, 100K, 1M edges
- `tc_dense`: K_{100,100}, K_{200,200}, K_{500,500}
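For readers unfamiliar with semi-naive evaluation, the following is a minimal CPU reference sketch, not the GPU implementation these benchmarks exercise. The function name `transitive_closure` and the nested-loop join are illustrative assumptions; the point is only that each round joins the newly derived pairs (the delta) against the edge relation rather than re-joining the whole result.

```rust
use std::collections::HashSet;

/// CPU reference for transitive closure using semi-naive evaluation.
/// Returns the closure and the number of fixpoint rounds executed.
/// (Illustrative sketch; the name and join strategy are assumptions.)
fn transitive_closure(edges: &[(u32, u32)]) -> (HashSet<(u32, u32)>, usize) {
    // Round 0: every edge is a path.
    let mut total: HashSet<(u32, u32)> = edges.iter().copied().collect();
    let mut delta: Vec<(u32, u32)> = total.iter().copied().collect();
    let mut rounds = 0;

    while !delta.is_empty() {
        rounds += 1;
        let mut next = Vec::new();
        for &(x, y) in &delta {
            for &(a, b) in edges {
                // path(x, b) :- delta_path(x, y), edge(y, b)
                // `insert` returns true only for pairs not seen before,
                // so `next` holds exactly the new facts of this round.
                if a == y && total.insert((x, b)) {
                    next.push((x, b));
                }
            }
        }
        delta = next;
    }
    (total, rounds)
}
```

On a chain graph the round count grows linearly with chain length, which is why `tc_chain` reports iteration depth rather than throughput.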
Hash Join Throughput
Tests GPU hash join kernel performance.

| Benchmark | Description | Metric |
|---|---|---|
| `join_throughput` | Varying cardinality | Rows/sec |
| `join_selectivity` | Varying key range | Output rows/input |
| `multiway_join` | 3-way join | Intermediate explosion |
- Cardinalities: 10Kx10K to 1Mx100K
- Key ranges: 100 (high selectivity) to 100K (low selectivity)
- Multi-way: 10K, 50K, 100K rows per relation
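To see why a small key range produces many output rows, here is a CPU build/probe sketch that counts join output. It is not the GPU kernel, and `cyclic_keys` is a hypothetical fixture (keys cycling through `0..key_range`) chosen so the output size can be checked by hand; the real fixtures may be generated differently.

```rust
use std::collections::HashMap;

/// Counts the rows produced by an equi-join of two key columns using a
/// build/probe hash join (CPU reference, not the GPU kernel).
fn hash_join_output_rows(build: &[u32], probe: &[u32]) -> u64 {
    // Build phase: key -> number of occurrences on the build side.
    let mut table: HashMap<u32, u64> = HashMap::new();
    for &k in build {
        *table.entry(k).or_insert(0) += 1;
    }
    // Probe phase: each probe row emits one output row per matching build row.
    probe
        .iter()
        .map(|k| table.get(k).copied().unwrap_or(0))
        .sum()
}

/// Hypothetical fixture: `rows` keys cycling through `0..key_range`.
fn cyclic_keys(rows: u32, key_range: u32) -> Vec<u32> {
    (0..rows).map(|i| i % key_range).collect()
}
```

With this fixture, when `key_range` divides `rows` each key occurs `rows / key_range` times per side, so the join emits `rows * rows / key_range` rows. Shrinking the key range therefore inflates the output, which is the effect `join_selectivity` measures.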
Aggregation
Tests GROUP BY with COUNT aggregate.

| Benchmark | Description | Metric |
|---|---|---|
| `aggregation` | COUNT by group | Groups/sec |
- 100K rows with 1K groups
- 100K rows with 10K groups
- 1M rows with 10K groups
- 1M rows with 100K groups
WCOJ Triangle (default-on adaptive, xlog-integration)
Location: crates/xlog-integration/benches/wcoj_triangle_bench.rs
Compares the GPU 3-way Worst-Case Optimal Join dispatch against the existing binary-join chain on identical fixtures, across u32, u64, and a Symbol sanity case. Three modes per cell — Off (binary), Force (WCOJ pipeline always), Adaptive (default-on: classifier runs and dispatches WCOJ on high-skew triangles only). The bench overrides each mode explicitly via RuntimeConfig::with_wcoj_triangle_dispatch[_adaptive] to keep the measured path process-global-free. Production callers can pin behavior via the env vars in the table at the top of this file.
Run:
| Bench Group | Fixture | Targets |
|---|---|---|
| `wcoj_triangle/uniform` | Uniform Erdős-Rényi (key range = rows/10) | Average-case baseline |
| `wcoj_triangle/superhub` | Deterministic super-hub (~50% of edges concentrated on one Y / one X) | Histogram-targetable per-thread workload imbalance |
| `wcoj_triangle/empty` | Three relations over disjoint key ranges | Count→scan→empty fast path |
| `wcoj_triangle/symbol_sanity` | One uniform 10K case for Symbol | Symbol shares u32's physical layout; sanity only |
- Timed region = `Executor::execute_plan` only. Driven via `b.iter_custom(...)` so the per-iteration loop is owned by the harness. Each cell builds ONE long-lived `Executor`; `put_relation` uploads + `store.remove("tri")` cleanup live OUTSIDE the timed region. The long-lived Executor is required so the executor's cached `wcoj_triangle_stream` (`OnceLock<StreamId>`) is acquired exactly once per cell and reused; a fresh Executor per iteration would drain the runtime's `StreamPool` (cap 16, grow-only) past iteration 16.
- Each `(width, fixture, size)` cell pre-runs an untimed correctness check: `gate=Some(false)` (binary-join) and `gate=Some(true)` (WCOJ) must produce identical row sets (host-side dedup of fixtures aligns the two paths to set semantics). Counter delta is also asserted inside `iter_custom`: gate=true must increment by `iters` over the loop, gate=false must increment by 0. A silent fallback anywhere in the hot loop fails the bench.
- Bench-only: the `StreamPool` cap is bumped to 1024 in `make_provider` (production default 16). The bench has many short-lived correctness-check executors that each acquire one stream; production runs at 16 because each long-lived process has one provider with one cached stream.
- Baseline numbers, adaptive default-on acceptance, phase-timing evidence, and the post-layout-fast-path results are indexed in the WCOJ bench baseline evidence bundle. Default-on adaptive WCOJ for eligible non-recursive triangle rules ships with `XLOG_DISABLE_WCOJ_TRIANGLE=1` as the hard kill switch; the WCOJ subsystem now covers triangles, cost-aware planning, recursive/SCC integration, and K-clique coverage. See the WCOJ architecture guide.
Probabilistic Benchmarks (xlog-prob)
Location: crates/xlog-prob/benches/prob_bench.rs
Exact Inference (Decision-DNNF)
Tests knowledge compilation and weighted model counting.

| Benchmark | Description | Metric |
|---|---|---|
| `exact_path` | Probabilistic path | Circuits/sec |
| `exact_grid` | Probabilistic grid | Cells/sec |
| `exact_bayesian` | Bayesian network | Variables/circuit |
| `exact_gradients` | With gradient computation | Grads/sec |
- Path lengths: 5, 10, 15, 20, 25 nodes
- Grid sizes: 3x3, 4x4, 5x5, 6x6
- Bayesian: 10, 20, 30, 50 variables
Monte Carlo Inference
Tests GPU-accelerated random sampling.

| Benchmark | Description | Metric |
|---|---|---|
| `mc_samples` | Sample count scaling | Samples/sec |
| `mc_vars` | Variable count scaling | Worlds/sec |
| `mc_path` | Probabilistic path | (samples × vars)/sec |
| `mc_grid` | Probabilistic grid | (samples × cells)/sec |
| `mc_bayesian` | Bayesian network | (samples × vars)/sec |
- Sample counts: 1K, 5K, 10K, 50K, 100K
- AD counts: 10, 50, 100, 500, 1000
- Path lengths: 10, 25, 50, 100, 200
- Grid sizes: 5x5, 10x10, 15x15, 20x20
- Bayesian: 50, 100, 200, 500 variables
Statistics Manager Benchmarks (xlog-stats)
Location: crates/xlog-stats/benches/stats_bench.rs
Tests relation registration, cardinality tracking, join estimation.
Solver Benchmarks (xlog-solve)
Location: crates/xlog-solve/benches/solver_bench.rs
Tests SAT solving, gradient computation, state management.
Methodology
Measurement Approach
XLOG benchmarks use Criterion.rs for statistically rigorous performance measurement.

| Setting | Value | Rationale |
|---|---|---|
| Sample size | 10-100 | GPU warmup + noise reduction |
| Warm-up | 3 iterations | JIT compilation, caching |
| Significance level | 0.1 | Detect 10% regressions |
| Noise threshold | 0.05 | Ignore <5% variance |
Throughput Calculation
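Criterion reports throughput once a benchmark declares how many elements each iteration processes (for example through its `Throughput::Elements` setting). The arithmetic behind a rows/sec figure can be sketched as follows; the helper name `throughput_per_sec` is an assumption for illustration, not a function from this repository.

```rust
use std::time::Duration;

/// Throughput in elements per second for `elements` processed in `elapsed`.
/// (Illustrative helper; the name is an assumption.)
fn throughput_per_sec(elements: u64, elapsed: Duration) -> f64 {
    elements as f64 / elapsed.as_secs_f64()
}
```

For instance, a join that consumes 1,000,000 input rows in 20 ms works out to roughly 50 million rows/sec.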
Warm-up Protocol
GPU benchmarks include warm-up to ensure:

- PTX modules are compiled and cached
- Memory pools are initialized
- CUDA context is established
Reproducibility
All random data generation uses deterministic seeding:

- LCG with fixed seed (no system entropy)
- Same seed produces identical graphs
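A minimal sketch of seeded generation with a linear congruential generator follows. The multiplier and increment are Knuth's MMIX constants, picked here purely for illustration; the constants, the `Lcg` type, and `random_edges` are assumptions and may not match what the benchmark fixtures use.

```rust
/// Minimal 64-bit linear congruential generator.
/// Constants are Knuth's MMIX parameters (an illustrative choice).
struct Lcg(u64);

impl Lcg {
    fn new(seed: u64) -> Self {
        Lcg(seed)
    }

    fn next_u32(&mut self) -> u32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // The high bits of an LCG are better distributed than the low bits.
        (self.0 >> 32) as u32
    }
}

/// Generates `edge_count` pseudo-random edges over `node_count` nodes.
/// The same seed always yields the same edge list.
fn random_edges(seed: u64, node_count: u32, edge_count: usize) -> Vec<(u32, u32)> {
    let mut rng = Lcg::new(seed);
    (0..edge_count)
        .map(|_| (rng.next_u32() % node_count, rng.next_u32() % node_count))
        .collect()
}
```

Because no system entropy is involved, two runs with the same seed build identical graphs, so differences between runs come from the code under test and not from the input.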
Baseline Metrics
Development hardware: NVIDIA RTX PRO 3000 Blackwell Generation Laptop GPU (12 GB, SM120, compute capability 12.0, driver 591.59). Throughput on desktop-class GPUs (e.g. RTX 4090, RTX 5090) will differ due to higher memory bandwidth and SM count.

Status of the tables below (audited 2026-06-10): the Transitive Closure, Hash Join, Exact Inference, and Monte Carlo tables are aspirational targets, not measured results. No published in-repo run backs them: the Criterion harnesses exist (`crates/xlog-gpu/benches/`, `crates/xlog-prob/benches/`), but their output is git-ignored and no baseline has been committed. Do not cite these numbers as evidence. Measured, source-backed results in this repo are the WCOJ super-hub speedups (10.5×–33.8×, `docs/evidence/2026-05-01-wcoj-bench-baseline/`) and the neural-symbolic cache ablation below (2.74×, CI-backed, measured 2026-02-18).
Transitive Closure (targets — unmeasured)
| Configuration | Target | Notes |
|---|---|---|
| 100K random edges | >1M rows/sec | Sparse graph |
| 1M random edges | >5M rows/sec | Medium graph |
| K_{500,500} bipartite | >10M rows/sec | Dense output |
Hash Join (targets — unmeasured)
| Configuration | Target | Notes |
|---|---|---|
| 100K × 100K | >50M rows/sec | Medium cardinality |
| 1M × 100K | >100M rows/sec | Large left relation |
| High selectivity | >20M rows/sec | Many output rows |
Exact Inference (targets — unmeasured)
| Configuration | Target | Notes |
|---|---|---|
| 20-variable path | <100ms | Small circuit |
| 50-variable Bayesian | <500ms | Medium complexity |
| With gradients | <2× base | Backward pass overhead |
Monte Carlo (targets — unmeasured)
| Configuration | Target | Notes |
|---|---|---|
| 100K samples, 100 vars | >10M worlds/sec | Throughput mode |
| 10K samples, 500 vars | >5M worlds/sec | Complexity mode |
Neural-Symbolic Training
Measured on development hardware with `01_minimal` (MNIST addition, 512 images, 5 epochs, batch_size=64).

| Metric | Value | Notes |
|---|---|---|
| PTX JIT (cold) | 0.02 s | Cubin loading (1750x speedup from ~35s) |
| `first_epoch_sec` | ~75 s | Cold-start (Decision-DNNF compile + verify), warm-starts drop to 0.26s |
| `steady_epoch_sec_mean` | ~0.25 s | Epochs 2-5 after warmup (Batched evaluation) |
| `per_query_ms` | ~1.0 ms | Per-query forward+backward through circuit |
| Cache speedup | 2.74x | Circuit caching vs no caching (95% CI: [2.29, 3.18]) |

Evidence: `examples/neural/results/evidence/cache_ablation_20260218.json`
Interpreting Results
Criterion Output
| Field | Meaning |
|---|---|
| `time` | [lower bound, estimate, upper bound] at 95% CI |
| `thrpt` | Throughput in million elements per second |
| `change` | Comparison vs baseline |
| `p` | Statistical significance |
Performance Regression Detection
A benchmark is flagged as a regression if:

- `change` lower bound > +5%
- `p` < 0.10
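Expressed as a predicate, the rule could look like the sketch below. It assumes both conditions must hold together, and that `change` is given as a fraction (0.05 for +5%); the function name and constants are illustrative and do not come from the repository.

```rust
/// Returns true when a benchmark should be flagged as a regression:
/// the lower bound of the relative change exceeds +5% AND p < 0.10.
/// (Assumes the two listed conditions are conjunctive.)
fn is_regression(change_lower_bound: f64, p_value: f64) -> bool {
    const NOISE_THRESHOLD: f64 = 0.05; // +5% expressed as a fraction
    const SIGNIFICANCE_LEVEL: f64 = 0.10;
    change_lower_bound > NOISE_THRESHOLD && p_value < SIGNIFICANCE_LEVEL
}
```

Testing the lower bound of the interval, instead of the point estimate, means a noisy run whose interval still overlaps the 5% band is not flagged.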
Common Issues
| Symptom | Cause | Solution |
|---|---|---|
| High variance | GPU thermal throttling | Cool-down period |
| First run slow | JIT compilation | Ignore first sample |
| OOM errors | Large input | Reduce memory budget |
| Missing benchmarks | No CUDA device | Check GPU availability |
CI Integration
GitHub Actions Workflow
Regression Alerts
CI fails if any benchmark shows:

- 10% regression vs main branch
- Statistical significance `p < 0.05`
Benchmark History
Historical results are stored in:

- `target/criterion/` (local)
- GitHub Actions artifacts (CI)
Contributing Benchmarks
Adding a New Benchmark
- Create benchmark function:
- Add to criterion group:
- Add to Cargo.toml:
Benchmark Guidelines
| Guideline | Rationale |
|---|---|
| Use `black_box()` | Prevent dead code elimination |
| Handle GPU errors gracefully | CI may lack GPU |
| Use deterministic data | Reproducibility |
| Document expected performance | Regression detection |
| Keep sample size reasonable | CI time budget |
Review Checklist
- Benchmark measures meaningful operation
- Throughput metric is appropriate
- Parameters cover realistic range
- Handles missing GPU gracefully
- Documentation updated
See Also
- Architecture — System design
- Roadmap — Development plans