@ -14,12 +14,6 @@ Apart from the widely adopted Adam and SGD, many modern optimizers require layer
## Optimizers
Adafactor is a first-order Adam variant that uses Non-negative Matrix Factorization (NMF) to reduce its memory footprint. CAME improves on Adafactor by introducing a confidence matrix to correct the NMF approximation. GaLore further reduces memory by projecting gradients into a low-rank space and applying 8-bit block-wise quantization. Lamb allows huge batch sizes without losing accuracy via a layer-wise adaptive update bounded by the inverse of its Lipschitz constant.
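
To make the memory saving concrete, here is a minimal sketch of the factored second-moment idea behind Adafactor, written in plain Python. It is an illustration only, not the ColossalAI implementation: the helper name `factored_second_moment` is hypothetical, and a real optimizer keeps exponential moving averages of these statistics rather than a single snapshot.

```python
def factored_second_moment(grad_sq):
    """Approximate a non-negative matrix from its row and column sums.

    Hypothetical helper for illustration: for an n x m matrix, only
    n + m numbers are stored instead of n * m.
    """
    row_sums = [sum(row) for row in grad_sq]
    col_sums = [sum(col) for col in zip(*grad_sq)]
    total = sum(row_sums)
    # Rank-1 reconstruction: entry (i, j) is row_sums[i] * col_sums[j] / total.
    approx = [[r * c / total for c in col_sums] for r in row_sums]
    return row_sums, col_sums, approx


# A toy 2 x 3 gradient; its element-wise square is the second-moment estimate.
grad = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
grad_sq = [[g * g for g in row] for row in grad]

row_sums, col_sums, approx = factored_second_moment(grad_sq)

full_entries = len(grad_sq) * len(grad_sq[0])    # entries a full second moment would store
factored_entries = len(row_sums) + len(col_sums) # entries the factored form stores
print(f"full: {full_entries} values, factored: {factored_entries} values")
```

The toy gradient is rank-1, so the reconstruction is exact here; for general matrices it is only an approximation, which is the error CAME's confidence matrix is meant to address.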
We now demonstrate how to use Distributed Adafactor with the Booster API, combining Tensor Parallel and ZeRO 2 on 4 GPUs. **Note that even if you're not aware of distributed optimizers, the plugins automatically cast your optimizer to its distributed version for convenience.**
@ -140,3 +134,10 @@ optim = DistGaloreAwamW(
</table>
<!-- doc-test-command: colossalai run --nproc_per_node 4 distributed_optimizers.py -->