Stream-ordered device memory runtime, RMM-inspired.
v0.6 architecture work. Replaces the per-CudaKernelProvider
GpuMemoryManager model (which cannot enforce a real per-device
budget across parallel tests, Python users, or multiple executors
on a single physical GPU) with a per-CUDA-ordinal singleton
XlogDeviceRuntime composed of swappable
DeviceMemoryResource adaptors:
XlogDeviceRuntime per CUDA ordinal
  -> StreamPool of non-blocking streams
  -> GlobalDeviceBudget per physical GPU
  -> Logging / Debug adaptor (optional)
  -> AsyncCudaResource (production) | DirectCudaResource (sanitizer/cert)

Required resources:
- DirectCudaResource: cudarc default (non-pooled) allocation backend (CudaDeviceInner::alloc::<u8> / drop, which on async-alloc hosts forwards to cuMemAllocAsync). Candidate for the sanitizer/cert role because there is no xlog-level pool suballocation hiding out-of-bounds access from Compute Sanitizer; the sanitizer-visibility property itself is unproven until the manual Compute Sanitizer acceptance gate runs on a supported host. A genuine raw-driver cuMemAlloc / cuMemFree backend is a separate future commit.
- AsyncCudaResource: cuMemAllocAsync / cuMemFreeAsync bound to a caller-supplied stream via the stream pool; production default when supported.
- PoolResource: performance tier, not part of this PR; gated behind correctness certification of the direct/async backends.
- DebugGuardResource: optional canary/poison/quarantine layer.
- LoggingResource: CSV allocation log: thread, time, action, ptr, bytes, stream, device, tag, query id.
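The adaptor stack above can be illustrated with a small, GPU-free sketch of one layer: a budget decorator wrapping a backend. Everything here is hypothetical and chosen for illustration (`MemoryResource`, `Block`, `AllocError`, `MockBackend`, `SharedBudget`, `Budgeted`); the crate's real `DeviceMemoryResource` trait and `GlobalDeviceBudget` type may have different signatures. The sketch only shows why a single shared counter can enforce a limit that per-provider managers cannot.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical block handle (the real crate exports `DeviceBlock`).
#[derive(Debug, PartialEq, Eq)]
pub struct Block {
    pub id: u64,
    pub bytes: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AllocError {
    OverBudget { requested: usize, available: usize },
}

/// Hypothetical stand-in for `DeviceMemoryResource`; every call names a stream.
pub trait MemoryResource {
    fn alloc(&self, bytes: usize, stream: u32) -> Result<Block, AllocError>;
    fn dealloc(&self, block: Block, stream: u32);
}

/// Mock backend: hands out ids and never touches a GPU.
#[derive(Default)]
pub struct MockBackend {
    next_id: AtomicU64,
}

impl MemoryResource for MockBackend {
    fn alloc(&self, bytes: usize, _stream: u32) -> Result<Block, AllocError> {
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        Ok(Block { id, bytes })
    }
    fn dealloc(&self, _block: Block, _stream: u32) {}
}

/// One byte counter per physical GPU, shared by every user of that GPU.
pub struct SharedBudget {
    limit: usize,
    used: AtomicUsize,
}

impl SharedBudget {
    pub fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self { limit, used: AtomicUsize::new(0) })
    }

    pub fn used(&self) -> usize {
        self.used.load(Ordering::Acquire)
    }

    /// Reserve `bytes` with a compare-and-swap loop so concurrent callers
    /// can never push `used` past `limit`.
    fn try_reserve(&self, bytes: usize) -> Result<(), AllocError> {
        let mut cur = self.used.load(Ordering::Relaxed);
        loop {
            let next = cur
                .checked_add(bytes)
                .filter(|n| *n <= self.limit)
                .ok_or(AllocError::OverBudget {
                    requested: bytes,
                    available: self.limit - cur,
                })?;
            match self.used.compare_exchange_weak(cur, next, Ordering::AcqRel, Ordering::Relaxed) {
                Ok(_) => return Ok(()),
                Err(actual) => cur = actual,
            }
        }
    }

    fn release(&self, bytes: usize) {
        self.used.fetch_sub(bytes, Ordering::AcqRel);
    }
}

/// Decorator: checks the shared budget, then delegates to the wrapped resource.
pub struct Budgeted<R> {
    inner: R,
    budget: Arc<SharedBudget>,
}

impl<R> Budgeted<R> {
    pub fn new(inner: R, budget: Arc<SharedBudget>) -> Self {
        Self { inner, budget }
    }
}

impl<R: MemoryResource> MemoryResource for Budgeted<R> {
    fn alloc(&self, bytes: usize, stream: u32) -> Result<Block, AllocError> {
        self.budget.try_reserve(bytes)?;
        // Give the reservation back if the wrapped resource fails.
        self.inner.alloc(bytes, stream).map_err(|e| {
            self.budget.release(bytes);
            e
        })
    }

    fn dealloc(&self, block: Block, stream: u32) {
        let bytes = block.bytes;
        self.inner.dealloc(block, stream);
        self.budget.release(bytes);
    }
}
```

In this sketch, two `Budgeted` wrappers built from the same `Arc<SharedBudget>` (for example, two executors on one physical GPU) draw from one limit: with a limit of 100 bytes, a 60-byte allocation through the first wrapper causes a 60-byte request through the second to fail with `OverBudget` until the first block is released.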
Stream-ordered contract: every alloc / dealloc names a stream; reuse across streams requires explicit event/sync. No reliance on the CUDA legacy null/default stream. Mirrors RMM’s stream-ordered rule — see https://github.com/rapidsai/RMM .
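As a rough illustration of the contract, the following GPU-free model tracks which stream owns a block and rejects cross-stream use unless an explicit synchronization was recorded first. All names (`Stream`, `OrderError`, `TrackedBlock`, `record_and_wait`) are hypothetical, and `Stream(0)` merely models the legacy default stream; the crate's actual `StreamId` and `BlockState` types may enforce this differently.

```rust
use std::collections::HashSet;

/// Hypothetical stream id; `Stream(0)` models the legacy null/default stream.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Stream(pub u32);

#[derive(Debug, PartialEq, Eq)]
pub enum OrderError {
    /// The caller tried to use the legacy default stream.
    NullStream,
    /// The block was touched on a stream that never synchronized with its owner.
    MissingSync { owner: Stream, requested: Stream },
}

/// A block that remembers its owning stream and which streams have waited on it.
pub struct TrackedBlock {
    owner: Stream,
    synced_with: HashSet<Stream>,
}

fn reject_null(stream: Stream) -> Result<(), OrderError> {
    if stream.0 == 0 {
        Err(OrderError::NullStream)
    } else {
        Ok(())
    }
}

impl TrackedBlock {
    /// Every allocation names a non-default stream.
    pub fn alloc_on(stream: Stream) -> Result<Self, OrderError> {
        reject_null(stream)?;
        Ok(Self { owner: stream, synced_with: HashSet::new() })
    }

    /// Models "record an event on the owner stream, make `other` wait on it".
    pub fn record_and_wait(&mut self, other: Stream) -> Result<(), OrderError> {
        reject_null(other)?;
        self.synced_with.insert(other);
        Ok(())
    }

    /// Use is allowed on the owner stream, or on a stream that has synchronized.
    pub fn use_on(&self, stream: Stream) -> Result<(), OrderError> {
        reject_null(stream)?;
        if stream == self.owner || self.synced_with.contains(&stream) {
            Ok(())
        } else {
            Err(OrderError::MissingSync { owner: self.owner, requested: stream })
        }
    }

    /// Deallocation also names a stream and follows the same ordering rule.
    pub fn dealloc_on(self, stream: Stream) -> Result<(), OrderError> {
        self.use_on(stream)
    }
}
```

In this model, a block allocated on `Stream(1)` can be used on `Stream(1)` immediately, while use on `Stream(2)` returns `MissingSync` until `record_and_wait(Stream(2))` has been called. A real implementation would back that call with CUDA events rather than a set lookup.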
v0.5.5 closed at PRs #49 / #50 / #52 (metadata-read state for binary-join output counts). The fully GPU-resident binary-join materialization rebase is gated on this allocator landing first.
Re-exports
pub use async_resource::AsyncCudaResource;
pub use budget::GlobalDeviceBudget;
pub use direct::DirectCudaResource;
pub use logging::InMemorySink;
pub use logging::LogAction;
pub use logging::LogRecord;
pub use logging::LogResult;
pub use logging::LoggingResource;
pub use logging::LoggingSink;
pub use logging::NullSink;
pub use logging::SinkError;
pub use resource::Access;
pub use resource::AllocTag;
pub use resource::BlockId;
pub use resource::BlockState;
pub use resource::DeviceBlock;
pub use resource::DeviceMemoryResource;
pub use resource::Generation;
pub use resource::ResourceError;
pub use resource::ResourceResult;
pub use resource::StreamId;
pub use runtime::XlogDeviceRuntime;
pub use runtime::MAX_DEVICE_ORDINALS;
pub use stream_pool::StreamPool;
pub use stream_pool::StreamPoolError;
pub use stream_pool::DEFAULT_MAX_STREAMS;
Modules
- async_resource
  AsyncCudaResource: stream-ordered allocation backed by cudarc's CudaStream::alloc (which forwards to cuMemAllocAsync when the context supports it).
- budget
  GlobalDeviceBudget: per-runtime byte-limit decorator.
- direct
  DirectCudaResource: cudarc default (non-pooled) allocation backend.
- logging
  LoggingResource: telemetry decorator for any DeviceMemoryResource.
- resource
  Core DeviceMemoryResource trait and supporting types.
- runtime
  XlogDeviceRuntime: per-CUDA-ordinal singleton hosting the device-runtime allocator stack.
- stream_pool
  StreamPool: owned non-blocking CUDA streams indexed by StreamId.