* [feat] Add distributed lamb; minor fixes in DeviceMesh (#5476)
* init: add dist lamb; add debiasing for lamb
* dist lamb tester mostly done
* all tests passed
* add comments
* all tests passed. Removed debugging statements
* moved setup_distributed inside plugin. Added dist layout caching
* organize better
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
* [hotfix] Improve tester precision by removing ZeRO on vanilla lamb (#5576)
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
* [optim] add distributed came (#5526)
* test CAME under LowLevelZeroOptimizer wrapper
* test CAME TP row and col pass
* test CAME zero pass
* came zero add master and worker param id convert
* came zero test pass
* came zero test pass
* test distributed came passed
* reform code, Modify some expressions and add comments
* minor fix of test came
* minor fix of dist_came and test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* minor fix of dist_came and test
* rebase dist-optim
* rebase dist-optim
* fix remaining comments
* add test dist came using booster api
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [optim] Distributed Adafactor (#5484)
* [feature] solve conflict; update optimizer readme;
* [feature] update optimize readme;
* [fix] fix testcase;
* [feature] Add transformer-bert to testcase;solve a bug related to indivisible shape (induction in use_zero and tp is row parallel);
* [feature] Add transformers_bert model zoo in testcase;
* [feature] add user documentation to docs/source/feature.
* [feature] add API Reference & Sample to optimizer Readme; add state check for bert exam;
* [feature] modify user documentation;
* [fix] fix readme format issue;
* [fix] add zero=0 in testcase; cached augment in dict;
* [fix] fix percision issue;
* [feature] add distributed rms;
* [feature] remove useless comment in testcase;
* [fix] Remove useless test; open zero test; remove fp16 test in bert exam;
* [feature] Extract distributed rms function;
* [feature] add booster + lowlevelzeroPlugin in test;
* [feature] add Start_with_booster_API case in md; add Supporting Information in md;
* [fix] Also remove state movement in base adafactor;
* [feature] extract factor function;
* [feature] add LowLevelZeroPlugin test;
* [fix] add tp=False and zero=True in logic;
* [fix] fix use zero logic;
* [feature] add row residue logic in column parallel factor;
* [feature] add check optim state func;
* [feature] Remove duplicate logic;
* [feature] update optim state check func and percision test bug;
* [fix] update/fix optim state; Still exist percision issue;
* [fix] Add use_zero check in _rms; Add plugin support info in Readme; Add Dist Adafactor init Info;
* [feature] removed print & comments in utils;
* [feature] uodate Readme;
* [feature] add LowLevelZeroPlugin test with Bert model zoo;
* [fix] fix logic in _rms;
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [fix] remove comments in testcase;
* [feature] add zh-Han Readme;
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Feature] refractor dist came; fix percision error; add low level zero test with bert model zoo; (#5676)
* [feature] daily update;
* [fix] fix dist came;
* [feature] refractor dist came; fix percision error; add low level zero test with bert model zoo;
* [fix] open rms; fix low level zero test; fix dist came test function name;
* [fix] remove redundant test;
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Feature] Add Galore (Adam, Adafactor) and distributed GaloreAdamW8bit (#5570)
* init: add dist lamb; add debiasing for lamb
* dist lamb tester mostly done
* all tests passed
* add comments
* all tests passed. Removed debugging statements
* moved setup_distributed inside plugin. Added dist layout caching
* organize better
* update comments
* add initial distributed galore
* add initial distributed galore
* add galore set param utils; change setup_distributed interface
* projected grad precision passed
* basic precision tests passed
* tests passed; located svd precision issue in fwd-bwd; banned these tests
* Plugin DP + TP tests passed
* move get_shard_dim to d_tensor
* add comments
* remove useless files
* remove useless files
* fix zero typo
* improve interface
* remove moe changes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix import
* fix deepcopy
* update came & adafactor to main
* fix param map
* fix typo
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Hotfix] Remove one buggy test case from dist_adafactor for now (#5692)
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: chongqichuizi875 <107315010+chongqichuizi875@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: duanjunwen <54985467+duanjunwen@users.noreply.github.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
* add zero optimizer
* torch ok
* unit test ok
* polish code
* fix bugs
* polish unit test
* polish zero optim
* polish colo ddp v2
* refactor folder structure
* add comment
* polish unit test
* polish zero optim
* polish unit test
* Added CPU Adam
* finished the cpu adam
* updated the license
* delete useless parameters, removed resnet
* modified the method off cpu adam unittest
* deleted some useless codes
* removed useless codes
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"
This reverts commit 2e0b0b7699.
* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
* Split conv2d, class token, positional embedding in 2d, Fix random number in ddp
Fix convergence in cifar10, Imagenet1000
* Integrate 1d tensor parallel in Colossal-AI (#39)
* fixed 1D and 2D convergence (#38)
* optimized 2D operations
* fixed 1D ViT convergence problem
* Feature/ddp (#49)
* remove redundancy func in setup (#19) (#20)
* use env to control the language of doc (#24) (#25)
* Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"
This reverts commit 2e0b0b7699.
* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)
* add explanation for ViT example (#35) (#36)
* support torch ddp
* fix loss accumulation
* add log for ddp
* change seed
* modify timing hook
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
* Feature/pipeline (#40)
* remove redundancy func in setup (#19) (#20)
* use env to control the language of doc (#24) (#25)
* Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"
This reverts commit 2e0b0b7699.
* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)
* add explanation for ViT example (#35) (#36)
* optimize communication of pipeline parallel
* fix grad clip for pipeline
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
* optimized 3d layer to fix slow computation ; tested imagenet performance with 3d; reworked lr_scheduler config definition; fixed launch args; fixed some printing issues; simplified apis of 3d layers (#51)
* Update 2.5d layer code to get a similar accuracy on imagenet-1k dataset
* update api for better usability (#58)
update api for better usability
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>