Latest commit: Edenzzzz 43995ee436 [Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694)	2024-05-14 13:52:45 +08:00
README.md	[hotfix] fix typo s/keywrods/keywords etc. (#5429 )	2024-03-12 11:25:16 +08:00
__init__.py	[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694 )	2024-05-14 13:52:45 +08:00
adafactor.py	[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694 )	2024-05-14 13:52:45 +08:00
came.py	[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694 )	2024-05-14 13:52:45 +08:00
cpu_adam.py	[devops] fix extention building (#5427 )	2024-03-05 15:35:54 +08:00
distributed_adafactor.py	[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694 )	2024-05-14 13:52:45 +08:00
distributed_came.py	[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694 )	2024-05-14 13:52:45 +08:00
distributed_galore.py	[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694 )	2024-05-14 13:52:45 +08:00
distributed_lamb.py	[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694 )	2024-05-14 13:52:45 +08:00
fused_adam.py	[misc] refactor launch API and tensor constructor (#5666 )	2024-04-29 10:40:11 +08:00
fused_lamb.py	[feat] refactored extension module (#5298 )	2024-01-25 17:01:48 +08:00
fused_sgd.py	[feat] refactored extension module (#5298 )	2024-01-25 17:01:48 +08:00
galore.py	[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694 )	2024-05-14 13:52:45 +08:00
hybrid_adam.py	[misc] refactor launch API and tensor constructor (#5666 )	2024-04-29 10:40:11 +08:00
lamb.py	[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694 )	2024-05-14 13:52:45 +08:00
lars.py	[misc] update pre-commit and run all files (#4752 )	2023-09-19 14:20:26 +08:00
nvme_optimizer.py	[misc] update pre-commit and run all files (#4752 )	2023-09-19 14:20:26 +08:00

README.md

Colossal-AI Optimization Techniques

Introduction

Welcome to the large-scale deep learning optimization techniques of Colossal-AI, which have been accepted as official tutorials at top conferences including NeurIPS, SC, AAAI, PPoPP, CVPR, ISC, and NVIDIA GTC.

Colossal-AI, a unified deep learning system for the big model era, integrates many advanced technologies such as multi-dimensional tensor parallelism, sequence parallelism, heterogeneous memory management, large-scale optimization, and adaptive task scheduling. Colossal-AI helps users deploy large AI model training and inference quickly and efficiently, reducing training budgets and the labor cost of learning and deployment.

🚀 Quick Links

Colossal-AI | Paper | Documentation | Forum | Slack

Table of Contents

Large transformer models display promising performance on a wide spectrum of AI applications. Both academia and industry are scaling DL training on larger clusters. However, degrading generalization performance, non-negligible communication overhead, and increasing model size prevent DL researchers and engineers from exploring large-scale AI models.

We aim to provide a clear sketch of the optimizations for large-scale deep learning with regard to model accuracy and model efficiency. One way to maintain or improve model accuracy in the large-scale setting while preserving compute efficiency is to design algorithms that require less communication and memory. Notably, accuracy and efficiency are not mutually exclusive; they can be optimized jointly to further speed up training.

Model Accuracy
- Gradient Descent Optimization
  - Gradient Descent Variants
  - Momentum
  - Adaptive Gradient
- Large Batch Training Optimization
  - LARS
  - LAMB
  - Generalization Gap
- Second-Order Optimization
  - Hessian-Free
  - K-FAC
  - Shampoo
Model Efficiency
- Communication Efficiency
  - Reduce Volume of Comm.
  - Reduce Frequency of Comm.
- Memory Efficiency
  - Mixed-Precision Training
  - Memory-Efficient Methods, e.g. ZeRO, Gemini, etc.

Some of the above are still under development. If you wish to make a contribution to this repository, please read the Contributing section below.

Discussion

Discussion about the Colossal-AI project is always welcome! We would love to exchange ideas with the community to help this project grow. If you think something needs discussing, you can join us on Slack.

If you encounter any problem while running these optimizers, you may want to raise an issue in this repository.

Contributing

This project welcomes constructive ideas and implementations from the community.

Update an Optimizer

If you find that an optimizer is broken or not user-friendly, you may open a pull request against this repository to update it.

Add a New Optimizer

If you wish to add an optimizer for a specific application, please follow the steps below.

- Create the new optimizer file in the current folder
- Prepare the corresponding example files in the Examples repository to prove the effectiveness of the new optimizer
- Prepare a detailed README on environment setup, dataset preparation, code execution, etc. in your example folder
- Update the table of contents above in this README file
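
The first step above can be sketched as follows. This is only a minimal illustration of what a new optimizer file might contain, assuming the optimizer subclasses PyTorch's `torch.optim.Optimizer`. The class name `ToySGD` and its hyperparameters are hypothetical and do not exist in this folder; the update rule is plain SGD with optional weight decay, chosen only because it is short.

```python
import torch
from torch.optim import Optimizer


class ToySGD(Optimizer):
    """Hypothetical example optimizer: plain SGD with optional weight decay."""

    def __init__(self, params, lr=1e-3, weight_decay=0.0):
        if lr <= 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        # Per-group defaults; PyTorch copies them into every param group.
        defaults = dict(lr=lr, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        # An optional closure re-evaluates the model and returns the loss.
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue  # parameter did not take part in the backward pass
                grad = p.grad
                if group["weight_decay"] != 0.0:
                    # L2 regularization: grad <- grad + weight_decay * p
                    grad = grad.add(p, alpha=group["weight_decay"])
                # Parameter update: p <- p - lr * grad
                p.add_(grad, alpha=-group["lr"])
        return loss
```

Such a class would be used like any other PyTorch optimizer: construct it with `model.parameters()`, call `loss.backward()`, then `optimizer.step()` and `optimizer.zero_grad()`. Exposing it from this package's `__init__.py` is a further assumption about how the folder is organized.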

If your PR is accepted, we may invite you to contribute a tutorial or blog post to the Colossal-AI Documentation.