DeepTuner:
The FP8 engine.
State-of-the-art quantization techniques for large language models. Currently supporting FP8 with near-zero perplexity loss.
Native FP8 Precision
Our research-backed FP8 kernels cut fine-tuning memory by 50% relative to BF16 while retaining 99.9% of baseline accuracy.
- Reduced VRAM requirements
- Higher throughput kernels
- Minimal perplexity shift
Sparse-Aware Fine-Tuning
DeepTuner exploits activation sparsity during the backward pass to further accelerate training on H100 hardware.
- 2:4 Sparsity integration
- Structured mask optimization
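The 2:4 pattern keeps the two largest-magnitude values in every contiguous group of four weights, which NVIDIA sparse tensor cores can then skip in hardware. A pure-Python illustration of the masking rule (DeepTuner's actual kernels are not shown here):

```python
def apply_2_4_sparsity(weights):
    """Zero out the 2 smallest-magnitude values in every group of 4.

    Illustrative sketch of the 2:4 structured-sparsity pattern only;
    production kernels operate on packed tensors, not Python lists.
    """
    assert len(weights) % 4 == 0
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest magnitudes survive; the rest become zero.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

print(apply_2_4_sparsity([0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.2, 1.5]))
# → [0.0, -0.9, 0.4, 0.0, 2.0, 0.0, 0.0, 1.5]
```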
How DeepTuner Achieves Near-Zero Perplexity Loss
Three precision-aware techniques work in concert to maintain convergence quality while halving memory requirements.
Dual-Format FP8
Forward pass uses E4M3 (4-bit exponent, 3-bit mantissa) for high accuracy. Backward pass uses E5M2 (5-bit exponent, 2-bit mantissa) for wider dynamic range during gradient flow.
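The two formats trade precision for range: E4M3 resolves finer steps but saturates near 448, while E5M2 reaches 57,344 at coarser granularity. A pure-Python check of the largest finite value each layout can represent, assuming the "FN" convention common in FP8 training for E4M3 (only the all-ones code is NaN):

```python
def fp8_max(exp_bits, man_bits, ieee_like):
    """Largest finite value of an FP8 format, derived from its bit layout.

    ieee_like=True  -> top exponent code reserved for inf/NaN (E5M2 style)
    ieee_like=False -> 'FN' convention: only mantissa=all-ones in the top
                       exponent is NaN, so the finite max is larger (E4M3 style)
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp = (2 ** exp_bits - 2) - bias      # top exponent code is inf/NaN
        max_man = 2 - 2 ** -man_bits              # mantissa 1.11...1
    else:
        max_exp = (2 ** exp_bits - 1) - bias      # top exponent code still finite
        max_man = 2 - 2 ** -(man_bits - 1)        # mantissa 1.11...0
    return max_man * 2 ** max_exp

print(fp8_max(4, 3, ieee_like=False), fp8_max(5, 2, ieee_like=True))
# → 448.0 57344.0
```

The ~128x larger dynamic range of E5M2 is exactly why it suits gradients, whose magnitudes vary far more than activations.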
Adaptive Loss Scaling
An auto-scaling policy monitors per-tensor saturation ratios at every step. When the saturated fraction exceeds 0.001%, the scale mu is halved immediately; when training stays stable, mu grows back toward its maximum over a 1,000-step window.
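The policy can be sketched as a small stateful class. The class name, constructor defaults, and growth rule below are illustrative, not DeepTuner's actual API:

```python
class AdaptiveLossScaler:
    """Sketch of the adaptive mu policy described above.

    mu is halved the moment the fraction of saturated elements in a step
    exceeds the threshold, and doubles back toward mu_max after a full
    window of steps passes without saturation.
    """

    def __init__(self, mu=2.0 ** 15, mu_max=2.0 ** 16,
                 threshold=1e-5, window=1000):
        self.mu, self.mu_max = mu, mu_max
        self.threshold = threshold      # 0.001% expressed as a fraction
        self.window = window
        self.stable_steps = 0

    def update(self, saturated, total):
        if saturated / total > self.threshold:
            self.mu /= 2                # back off immediately on overflow risk
            self.stable_steps = 0
        else:
            self.stable_steps += 1
            if self.stable_steps >= self.window:
                self.mu = min(self.mu * 2, self.mu_max)
                self.stable_steps = 0
        return self.mu
```

A saturated step cuts mu at once, while recovery is deliberately slow; this asymmetry keeps rare overflows from corrupting gradients without permanently sacrificing dynamic range.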
Compressed Optimizer States
First-order momentum stored in FP8, second-order variance in FP16, master weights in FP16. Reduces optimizer memory from 16 bytes per parameter to under 7 bytes.
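The byte figures work out as follows under one plausible accounting. The composition of the 16-byte baseline and the FP8 gradient entry are assumptions for illustration, since the text does not itemize them:

```python
# Hypothetical per-parameter accounting consistent with the figures above;
# the baseline breakdown is an assumption, not taken from DeepTuner docs.
BYTES = {"fp32": 4, "fp16": 2, "fp8": 1}

baseline = {          # classic mixed-precision Adam, all states in FP32
    "master_weight": "fp32", "momentum": "fp32",
    "variance": "fp32", "gradient": "fp32",
}
compressed = {        # the scheme described above, gradients assumed FP8
    "master_weight": "fp16", "momentum": "fp8",
    "variance": "fp16", "gradient": "fp8",
}

def total_bytes(states):
    return sum(BYTES[fmt] for fmt in states.values())

print(total_bytes(baseline), total_bytes(compressed))  # → 16 6
```

Per-tensor scale factors add a small constant overhead on top of the 6 bytes, which is why the claim is stated as "under 7" rather than exactly 6.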
Three Levels of FP8 Optimization
O1: FP8 Gradient Communication
FP8 all-reduce for DDP gradient synchronization. Compresses inter-GPU bandwidth by up to 2x with no change to the model or optimizer code.
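The idea can be simulated in a few lines: each rank casts its gradients to an E4M3-like value before the reduction, halving bytes on the wire relative to BF16, and the result differs from the exact sum only by quantization error. This software emulation (function names are illustrative) ignores subnormals and per-block scaling:

```python
import math

def quantize_e4m3(x, scale):
    """Crude emulation of an E4M3 cast: scale, clamp to +/-448, then
    round to 4 significant bits. Real kernels use hardware casts."""
    y = max(-448.0, min(448.0, x * scale))
    if y == 0.0:
        return 0.0
    m, e = math.frexp(y)            # y = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16          # 4 significant bits = implicit 1 + 3 mantissa
    return m * 2 ** e

def fp8_allreduce(per_rank_grads):
    """Simulated FP8 gradient all-reduce: every rank's gradients are cast
    to E4M3 before the sum, then dequantized with a shared scale."""
    amax = max(abs(g) for grads in per_rank_grads for g in grads) or 1.0
    scale = 448.0 / amax
    out = [0.0] * len(per_rank_grads[0])
    for grads in per_rank_grads:
        for i, g in enumerate(grads):
            out[i] += quantize_e4m3(g, scale) / scale
    return out
```

In a real DDP setup the cast happens inside a communication hook, so, as stated above, neither model nor optimizer code changes.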
O2: FP8 Optimizer States
Includes O1 plus first-order momentum compression to FP8. Cuts peak optimizer memory by over 2x, enabling larger models or larger batch sizes on the same GPU.
O3: Full FP8 Pipeline with ZeRO
Includes O2 plus ZeRO-aware FP8 weight partitioning for multi-GPU setups. Enables the full distributed FP8 training pipeline with minimal precision loss across all ranks.
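Because each level is cumulative, the mapping from level to enabled features can be pictured as below. The flag names are hypothetical, not DeepTuner's actual configuration keys:

```python
# Hypothetical feature map: each level strictly extends the one below it.
FEATURES = {
    "O1": {"fp8_grad_allreduce"},
    "O2": {"fp8_grad_allreduce", "fp8_momentum"},
    "O3": {"fp8_grad_allreduce", "fp8_momentum", "zero_fp8_weight_partition"},
}

def enabled(level):
    """Return the features active at a given optimization level."""
    return sorted(FEATURES[level])

print(enabled("O2"))  # → ['fp8_grad_allreduce', 'fp8_momentum']
```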