Module device_runtime

Stream-ordered device memory runtime, RMM-inspired.

v0.6 architecture work. Replaces the per-CudaKernelProvider GpuMemoryManager model with a per-CUDA-ordinal singleton, XlogDeviceRuntime, composed of swappable DeviceMemoryResource adaptors. The old model cannot enforce a real per-device budget across parallel tests, Python users, or multiple executors sharing a single physical GPU. The runtime stack:

XlogDeviceRuntime per CUDA ordinal
  -> StreamPool of non-blocking streams
  -> GlobalDeviceBudget per physical GPU
  -> Logging / Debug adaptor (optional)
  -> AsyncCudaResource (production) | DirectCudaResource (sanitizer/cert)
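
The layering above can be sketched as a decorator chain. The following is a minimal illustration only: the trait `MemoryResource`, the types `MockBackend` and `BudgetAdaptor`, and all signatures are hypothetical stand-ins, not the crate's actual DeviceMemoryResource or GlobalDeviceBudget API, and the mock backend touches no GPU. It assumes the budget is a shared byte counter, so that independently built stacks for the same device observe one limit.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a device memory resource trait.
trait MemoryResource {
    /// Returns an opaque block id on success.
    fn alloc(&self, bytes: usize, stream: u32) -> Result<u64, String>;
    fn dealloc(&self, block: u64, bytes: usize, stream: u32);
}

/// Mock backend: hands out increasing ids and never talks to CUDA.
#[derive(Default)]
struct MockBackend {
    next: AtomicU64,
}

impl MemoryResource for MockBackend {
    fn alloc(&self, _bytes: usize, _stream: u32) -> Result<u64, String> {
        Ok(self.next.fetch_add(1, Ordering::Relaxed))
    }
    fn dealloc(&self, _block: u64, _bytes: usize, _stream: u32) {}
}

/// Budget adaptor: wraps any resource and charges a counter shared per device.
struct BudgetAdaptor<R> {
    inner: R,
    limit: usize,
    used: Arc<AtomicUsize>,
}

impl<R: MemoryResource> MemoryResource for BudgetAdaptor<R> {
    fn alloc(&self, bytes: usize, stream: u32) -> Result<u64, String> {
        // Reserve the bytes with a compare-and-swap loop before calling the backend.
        let mut cur = self.used.load(Ordering::Acquire);
        loop {
            let new = cur
                .checked_add(bytes)
                .filter(|n| *n <= self.limit)
                .ok_or_else(|| format!("budget exceeded: {} + {} > {}", cur, bytes, self.limit))?;
            match self.used.compare_exchange_weak(cur, new, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => break,
                Err(actual) => cur = actual,
            }
        }
        // Roll the reservation back if the wrapped resource fails.
        self.inner.alloc(bytes, stream).map_err(|e| {
            self.used.fetch_sub(bytes, Ordering::AcqRel);
            e
        })
    }

    fn dealloc(&self, block: u64, bytes: usize, stream: u32) {
        self.inner.dealloc(block, bytes, stream);
        self.used.fetch_sub(bytes, Ordering::AcqRel);
    }
}

fn main() {
    // Two independent stacks share one counter, as two executors on one GPU would.
    let used = Arc::new(AtomicUsize::new(0));
    let a = BudgetAdaptor { inner: MockBackend::default(), limit: 1024, used: used.clone() };
    let b = BudgetAdaptor { inner: MockBackend::default(), limit: 1024, used: used.clone() };

    let blk = a.alloc(768, 0).unwrap();
    assert!(b.alloc(512, 1).is_err()); // 768 + 512 exceeds the shared 1024-byte limit
    a.dealloc(blk, 768, 0);
    assert!(b.alloc(512, 1).is_ok());
}
```

Because each adaptor only needs the trait of the layer beneath it, a logging or debug layer could be slotted in the same way without the backend knowing about it.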

Required resources:

DirectCudaResource — cudarc default (non-pooled) allocation backend (CudaDeviceInner::alloc::<u8> / drop, which on async-alloc hosts forwards to cuMemAllocAsync). Candidate for the sanitizer/cert role because there is no xlog-level pool suballocation hiding out-of-bounds access from Compute Sanitizer; the sanitizer-visibility property itself is unproven until the manual Compute Sanitizer acceptance gate runs on a supported host. A genuine raw-driver cuMemAlloc/cuMemFree backend is a separate future commit.
AsyncCudaResource — cuMemAllocAsync/cuMemFreeAsync bound to a caller-supplied stream via the stream pool; production default when supported.
PoolResource — performance tier, not part of this PR; gated behind correctness certification of the direct/async backends.
DebugGuardResource — optional canary/poison/quarantine layer.
LoggingResource — CSV allocation log: thread, time, action, ptr, bytes, stream, device, tag, query id.
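
As an illustration of the CSV allocation log listed above, the sketch below formats one row with the nine named columns. The struct `CsvRecord`, its field types, the nanosecond time unit, the hexadecimal pointer rendering, and the quoting rule are all assumptions made for the example; the crate's own LogRecord and sink types may look different.

```rust
/// Hypothetical log row; field names follow the column list, types are assumed.
struct CsvRecord<'a> {
    thread: u64,
    time_ns: u128,
    action: &'a str,
    ptr: u64,
    bytes: usize,
    stream: u32,
    device: u32,
    tag: &'a str,
    query_id: Option<u64>,
}

/// Header row matching the field order used by `to_csv`.
const HEADER: &str = "thread,time_ns,action,ptr,bytes,stream,device,tag,query_id";

/// Quote a field only when it contains a comma, a quote, or a newline.
fn escape(field: &str) -> String {
    if field.contains(&[',', '"', '\n'][..]) {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

impl CsvRecord<'_> {
    fn to_csv(&self) -> String {
        // A missing query id becomes an empty trailing column.
        let query = self.query_id.map(|q| q.to_string()).unwrap_or_default();
        format!(
            "{},{},{},{:#x},{},{},{},{},{}",
            self.thread,
            self.time_ns,
            self.action,
            self.ptr,
            self.bytes,
            self.stream,
            self.device,
            escape(self.tag),
            query
        )
    }
}

fn main() {
    let rec = CsvRecord {
        thread: 1,
        time_ns: 42,
        action: "alloc",
        ptr: 0x1000,
        bytes: 256,
        stream: 3,
        device: 0,
        tag: "join,out",
        query_id: Some(7),
    };
    println!("{}", HEADER);
    println!("{}", rec.to_csv()); // prints 1,42,alloc,0x1000,256,3,0,"join,out",7
}
```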

Stream-ordered contract: every alloc / dealloc names a stream; reusing a block on a different stream requires an explicit event or sync. No reliance on the CUDA legacy null/default stream. Mirrors RMM’s stream-ordered rule; see https://github.com/rapidsai/RMM.
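
The contract can be modelled on the host without any CUDA calls. The sketch below is a simplified illustration, not the crate's implementation: `Stream`, `Block`, and their methods are hypothetical names, and `sync_to` merely stands in for recording an event on the owning stream (cuEventRecord) and having the other stream wait on it (cuStreamWaitEvent). It also ignores that later work on the owning stream would require a fresh sync.

```rust
use std::collections::HashSet;

/// Hypothetical stream handle; a real one would wrap a CUDA stream.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Stream(u32);

/// A block remembers the stream it was allocated on and who has synced with it.
#[derive(Debug)]
struct Block {
    owner: Stream,
    synced: HashSet<Stream>,
}

impl Block {
    /// Every allocation names a stream; there is no implicit default stream.
    fn alloc_on(stream: Stream) -> Self {
        Block { owner: stream, synced: HashSet::new() }
    }

    /// Models "record an event on the owner, make `waiter` wait on it".
    fn sync_to(&mut self, waiter: Stream) {
        self.synced.insert(waiter);
    }

    /// Use is legal on the owning stream or on a stream that has synced.
    fn use_on(&self, stream: Stream) -> Result<(), String> {
        if stream == self.owner || self.synced.contains(&stream) {
            Ok(())
        } else {
            Err(format!("{:?} used a block owned by {:?} without sync", stream, self.owner))
        }
    }

    /// Deallocation names a stream too and obeys the same rule.
    fn dealloc_on(self, stream: Stream) -> Result<(), String> {
        self.use_on(stream)
    }
}

fn main() {
    let (s0, s1) = (Stream(0), Stream(1));
    let mut block = Block::alloc_on(s0);

    assert!(block.use_on(s0).is_ok());  // same stream: ordered by construction
    assert!(block.use_on(s1).is_err()); // other stream, no sync: rejected
    block.sync_to(s1);
    assert!(block.use_on(s1).is_ok());  // other stream after explicit sync
    assert!(block.dealloc_on(s1).is_ok());
}
```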

v0.5.5 closed at PRs #49 / #50 / #52 (metadata-read state for binary-join output counts). The fully GPU-resident binary-join materialization rebase is gated on this allocator landing first.

Re-exports

pub use async_resource::AsyncCudaResource;
pub use budget::GlobalDeviceBudget;
pub use direct::DirectCudaResource;
pub use logging::InMemorySink;
pub use logging::LogAction;
pub use logging::LogRecord;
pub use logging::LogResult;
pub use logging::LoggingResource;
pub use logging::LoggingSink;
pub use logging::NullSink;
pub use logging::SinkError;
pub use resource::Access;
pub use resource::AllocTag;
pub use resource::BlockId;
pub use resource::BlockState;
pub use resource::DeviceBlock;
pub use resource::DeviceMemoryResource;
pub use resource::Generation;
pub use resource::ResourceError;
pub use resource::ResourceResult;
pub use resource::StreamId;
pub use runtime::XlogDeviceRuntime;
pub use runtime::MAX_DEVICE_ORDINALS;
pub use stream_pool::StreamPool;
pub use stream_pool::StreamPoolError;
pub use stream_pool::DEFAULT_MAX_STREAMS;

Modules

async_resource: AsyncCudaResource — stream-ordered allocation backed by cudarc’s CudaStream::alloc (which forwards to cuMemAllocAsync when the context supports it).
budget: GlobalDeviceBudget — per-runtime byte-limit decorator.
direct: DirectCudaResource — cudarc default (non-pooled) allocation backend.
logging: LoggingResource — telemetry decorator for any DeviceMemoryResource.
resource: Core DeviceMemoryResource trait and supporting types.
runtime: XlogDeviceRuntime — per-CUDA-ordinal singleton hosting the device-runtime allocator stack.
stream_pool: StreamPool — owned non-blocking CUDA streams indexed by StreamId.