Stop guessing run configurations.
DeepTuner uses intermediate code analysis to predict energy-efficient run configurations before any code runs. Up to 50% less energy and up to 2x throughput on multi-head attention kernels, with no runtime profiling required.
Up to 50%
Less energy
Up to 2x
Throughput on MHA
Up to 70%
Search space saved
Static analysis. No runtime overhead.
Existing kernel tuners require exhaustive runtime profiling, and the benchmark count grows exponentially with the number of configuration parameters: O(2ⁿ) runs for n binary parameters. For a production cluster running continuous training, that profiling tax is paid again on every hardware migration and workload change.
DeepTuner analyzes intermediate GPU code before any execution, extracting memory access patterns, control flow, and instruction mix to predict the energy-efficient run configuration and GPU power cap. A one-time microbenchmark per GPU generation is all it needs.
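To make the static-analysis idea concrete, here is a minimal, hypothetical sketch of one such feature: an instruction-mix ratio derived from a PTX listing without launching a kernel. The opcode lists and the `instruction_mix` helper are illustrative assumptions, not DeepTuner's actual pipeline or feature set.

```python
import re

# Illustrative opcode groupings (assumption, not DeepTuner's real taxonomy).
MEM_OPS = {"ld", "st", "ldmatrix", "cp"}
MATH_OPS = {"fma", "mad", "mma", "mul", "add"}

def instruction_mix(ptx: str) -> dict:
    """Return the fraction of memory, arithmetic, and other instructions
    in a PTX listing, computed purely from the text."""
    counts = {"mem": 0, "math": 0, "other": 0}
    for line in ptx.splitlines():
        # PTX instructions start with an opcode followed by type suffixes,
        # e.g. "ld.global.f32" or "fma.rn.f32".
        m = re.match(r"([a-z]+)\.", line.strip())
        if not m:
            continue
        op = m.group(1)
        if op in MEM_OPS:
            counts["mem"] += 1
        elif op in MATH_OPS:
            counts["math"] += 1
        else:
            counts["other"] += 1
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}

sample = """
    ld.global.f32 %f1, [%rd1];
    fma.rn.f32 %f2, %f1, %f1, %f0;
    st.global.f32 [%rd2], %f2;
"""
mix = instruction_mix(sample)
```

A calibrated per-generation model would consume features like this (together with locality and divergence scores) to predict a run configuration and power cap before execution.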
Validated on NVIDIA RTX 5000 Ada (Ada Lovelace) and RTX 3070 (Ampere) across multi-head attention, convolution, and matrix multiplication kernels.
Intermediate code analysis
Analyzes intermediate GPU code to extract memory locality scores, register pressure, warp divergence, and instruction mix ratios, without launching a single kernel.
Architecture-agnostic calibration
Calibrated once per GPU generation. Optimal kernel configs are predicted without re-profiling when you migrate from Ampere to Hopper or Blackwell.
Joint shape and power-cap tuning
Jointly tunes run configuration and GPU power cap for minimum energy per token, while keeping throughput at 95% of peak or better.
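The selection criterion above can be sketched in a few lines: among candidate (config, power-cap) pairs with predicted power and throughput, keep those within 95% of peak throughput and pick the one with the lowest energy per token. The candidate dictionaries and the `select` helper are hypothetical, for illustration only.

```python
def select(candidates):
    """Pick the candidate minimizing energy per token (J/token = W / tok/s),
    subject to a 95%-of-peak throughput floor."""
    peak = max(c["tokens_per_s"] for c in candidates)
    feasible = [c for c in candidates if c["tokens_per_s"] >= 0.95 * peak]
    return min(feasible, key=lambda c: c["power_w"] / c["tokens_per_s"])

# Example with made-up numbers: B trades 4% throughput for a much
# lower power cap, so it wins on energy per token; C misses the floor.
cands = [
    {"config": "A", "power_w": 300.0, "tokens_per_s": 1000.0},
    {"config": "B", "power_w": 220.0, "tokens_per_s": 960.0},
    {"config": "C", "power_w": 180.0, "tokens_per_s": 900.0},
]
best = select(cands)
```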
Validated on HPC NVIDIA systems and consumer-grade GPUs.
Expanding beyond NVIDIA
DeepTuner currently runs on NVIDIA GPUs. Work is underway to bring the same intermediate code analysis approach to other hardware targets.
AMD ROCm
In research. Porting the intermediate code analysis pipeline to AMD's ROCm stack and CDNA architecture. Coming soon.
Google TPUs
In research. Adapting energy-aware run configuration search to XLA's compilation model for TPU v4 and v5 workloads. Coming soon.
* DeepTuner is architecture-agnostic in principle. Production support is currently NVIDIA-only.
Join the DeepTuner beta
We're onboarding HPC teams with active training or inference infrastructure. Tell us your hardware setup and we'll scope a pilot.
Get early access