Optimemory:
Hardware-aware memory virtualization.
Scale model capacity up to 2.5x with research-backed virtual memory stitching: a specialized VMM layer that optimizes memory allocation across fragmented hardware.
2.5x
Model scale increase
-65%
Memory allocation overhead reduced
+42%
Training stability uplift
Built for Every AI Workload
Optimemory is not tied to any model architecture. If it runs on PyTorch and CUDA, it benefits from VMM-backed memory pooling.
Large Language Models
GPT, LLaMA, Mistral, Megatron. Pre-allocate batch buffers once and reuse them across tens of thousands of training steps.
Vision Transformers
ViT, CLIP, DINO, SigLIP. Image patch buffers stay resident in a fixed VMM pool through the full training run.
Diffusion Models
Stable Diffusion, FLUX, DiT. Noisy latent buffers across denoising timesteps share the same physical VRAM pool.
Inference Servers
ResNet, BERT, EfficientNet. Pre-allocate I/O buffers at startup and serve every request from reusable VMM slots with zero allocation on the hot path.
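The pattern behind each of these workload cards is the same: allocate a fixed set of buffers once, then serve every step or request from a reusable slot. Here is a minimal pure-Python sketch of that pattern; `SlotPool` and its methods are illustrative names, not the deep_variance API, and real VMM slots would live in GPU VRAM rather than host bytearrays.

```python
# Hypothetical sketch of "pre-allocate once, reuse per request".
# SlotPool is an illustrative stand-in, not the deep_variance API.

class SlotPool:
    """Fixed set of reusable I/O buffers, created once at startup."""

    def __init__(self, n_slots: int, slot_bytes: int):
        # All allocation happens here, before any request is served.
        self._slots = [bytearray(slot_bytes) for _ in range(n_slots)]
        self._free = list(range(n_slots))

    def acquire(self) -> int:
        # Hot path: pop a free slot index; no new allocation occurs.
        return self._free.pop()

    def release(self, idx: int) -> None:
        # Return the slot so the next request can reuse it.
        self._free.append(idx)


pool = SlotPool(n_slots=4, slot_bytes=1024)
idx = pool.acquire()
pool._slots[idx][:4] = b"data"   # serve a request from the reused slot
pool.release(idx)
```

The key property is that the hot path only moves indices around; the number of live buffers never changes after startup.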
Hardware-Aware Memory for Any Architecture
Every deep learning workload is memory-bound. Optimemory decouples physical hardware limitations from model capacity by operating below the framework level, invisible to the model and the optimizer.
VMM Stitching Layer
Research-backed virtual memory stitching that presents fragmented physical VRAM as a contiguous address space.
Physical Memory Pooling
Pool and reuse freed physical VRAM chunks across training steps, eliminating repeated allocation overhead without stalling compute kernels.
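Conceptually, pooling means freed chunks go onto a size-keyed free list and are handed out again instead of triggering a fresh allocation. A minimal sketch of that idea, assuming host bytearrays stand in for physical VRAM chunks (`ChunkPool` is an illustrative name, not the deep_variance internals):

```python
# Hypothetical sketch of physical chunk pooling: freed chunks are
# cached by size and reused on the next request. Bytearrays stand in
# for real physical VRAM handles.

from collections import defaultdict

class ChunkPool:
    def __init__(self):
        self._free = defaultdict(list)   # size -> cached chunks
        self.fresh_allocs = 0            # how often a real alloc ran

    def alloc(self, size: int) -> bytearray:
        cached = self._free[size]
        if cached:
            return cached.pop()          # reuse: no allocation overhead
        self.fresh_allocs += 1
        return bytearray(size)           # stand-in for a real VRAM alloc

    def free(self, chunk: bytearray) -> None:
        self._free[len(chunk)].append(chunk)  # cache for later reuse


pool = ChunkPool()
buf = pool.alloc(1 << 20)
pool.free(buf)
buf2 = pool.alloc(1 << 20)   # served from the pool, not a new alloc
```

Across tens of thousands of training steps, the same few physical chunks cycle through the pool, so allocator overhead drops out of the steady state.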
Hardware-Aware Fragmenting
Queries the CUDA driver for hardware-specific allocation granularity and aligns chunk sizes accordingly. Works on any NVIDIA GPU with Compute Capability 6.0 or higher (Pascal through Hopper).
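Alignment here means rounding every request up to a multiple of the driver-reported granularity. A minimal sketch, assuming a 2 MiB granularity for illustration (the real value comes from the CUDA driver at runtime, e.g. via `cuMemGetAllocationGranularity`, rather than being hard-coded):

```python
# Hypothetical sketch of granularity-aligned chunk sizing. The 2 MiB
# constant stands in for the value the CUDA driver reports; it is an
# assumption for illustration only.

GRANULARITY = 2 * 1024 * 1024  # assumed driver-reported granularity

def aligned_size(requested_bytes: int) -> int:
    """Round a request up to a multiple of the allocation granularity."""
    return -(-requested_bytes // GRANULARITY) * GRANULARITY

# A 5 MB tensor request rounds up to three 2 MiB-aligned chunks:
print(aligned_size(5_000_000))   # -> 6291456
```

Requests that are already aligned pass through unchanged, so the only cost is bounded internal padding at the tail of each chunk.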
Upcoming in v2.4
Native FP8 quantization and weight optimization kernels.
Multi-GPU & NVLink Support
We are currently researching cross-GPU virtual address space stitching via high-speed NVLink interconnects.
Drop-in integration
Replace standard tensor allocation with a single call. No changes to your training loop required.
Install via pip: deep-variance
Pre-allocate a reusable GPU buffer once with vmm_empty_nd, backed by physical CUDA memory pooling
Copy into the buffer each step with zero allocation overhead. Inspect pool health anytime via cache_stats()
from deep_variance import vmm_empty_nd, cache_stats
import torch

batch_size = 32  # example batch size

# Pre-allocate a reusable GPU buffer once
img_buf = vmm_empty_nd(
    (batch_size, 3, 224, 224),
    dtype=torch.float32,
)

# Reuse it across every training step, zero overhead
# (dataloader is your existing torch DataLoader)
for imgs, labels in dataloader:
    img_buf.copy_(imgs.cuda(non_blocking=True))

print(cache_stats())
Works Everywhere PyTorch Runs
Designed to fit your infrastructure, not the other way around.
HPC and SLURM
Built-in module-load support. Validated on Perlmutter and Summit with deep-variance-check environment diagnostics.
Multi-GPU Training
Process-local by design. Each DDP rank manages its own VMM pool independently on its assigned device, mapping cleanly to PyTorch's multi-process data-parallel patterns.
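"Process-local" means each launched process discovers its own rank and builds its pool on its assigned device, with no cross-process coordination. A minimal sketch of that pattern, assuming the `LOCAL_RANK` environment variable that `torchrun` sets; `RankLocalPool` is an illustrative name, not the deep_variance API:

```python
# Hypothetical sketch of a process-local pool: each DDP rank reads its
# own LOCAL_RANK (set by torchrun) and binds an independent pool to
# that device. RankLocalPool is an illustrative stand-in.

import os

class RankLocalPool:
    def __init__(self):
        # Each process sees only its own rank; nothing is shared.
        self.rank = int(os.environ.get("LOCAL_RANK", "0"))
        self.device = f"cuda:{self.rank}"   # pool bound to one GPU

pool = RankLocalPool()
print(pool.device)
```

Because every rank owns its pool outright, no locking or cross-device bookkeeping is needed, which is what lets the design map cleanly onto PyTorch's one-process-per-GPU DDP convention.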
Mixed Precision
VMM tensors participate fully in FP16 and BF16 autocast regions. Autograd and nn.Module work without modification.
Pre-Compiled Wheel
Ships as a pre-compiled Python wheel for CUDA 12.x and Linux x86_64. No compiler, no build tools. One pip install and you are running.
Run massive models on the hardware you already have.
Drop Optimemory into your training loop and reclaim VRAM you're already paying for.