Commit Graph

6 Commits (0d1fa037ddd3c899e3c42fbb9c013b17c4dd03dc)

Author SHA1 Message Date
Wenwen Qu 582ee000bd
feat(moe): support zero for expert local dp (#404)
* support zero for expert local dp

* fix the code above:
    * treat optim.zero_world_size and optim.zero_local_rank as lists in model_checkpoint.py and test_model_checkpoint.py
    * add overlap and zero checks for moe in args_sanity_check(.)
2023-10-09 17:45:26 +08:00
Wenwen Qu 375240e039
feat(moe): add local data parallel support for experts (#376)
* add local data parallel support for experts

* fix model checkpoint for local dp mode of expert

* do not set ep size from config
2023-09-28 13:38:02 +08:00
Sun Peng b7a8af8133
Feat/sync grad use async op (#277)
* fix/broadcast should not be in commu stream

* fix/broadcast should not be in commu stream

* feat: support allreduce grad using async op

* fix bug of async op

* use reduceop.avg

* use torch flat

* delete unused stream

* delete unused stream

* feat: overlap allreduce with memcpy

---------

Co-authored-by: yingtongxiong <974106207@qq.com>
2023-09-07 21:51:30 +08:00
huangting4201 db13bc46bc
fix(ci): fix ci train error (#199)
2023-08-15 20:09:54 +08:00
Sun Peng ef851d16c6
Feat/optimizer (#194)
* feat(optimizer.py): reduce memory footprint and avoid _check_overflow call

* feat(optimizer.py): reduce memory footprint and avoid _check_overflow call

* feat(optimizer.py): overlap compute norm with allreduce

* update var and function name

* update function compute norm (#197)

Co-authored-by: ChenQiaoling00 <qiaoling_chen@u.nus.edu>

* feat(optimizer/hybrid_zero_optim.py): overlap gradients last bucket allreduce and compute norm (#196)

* support gradients allreduce and compute norm overlap

* fix para set error

* remove timer cal_norm for testing

* feat(optimizer/hybrid_zero_optim.py): support group global norm

* format(lint): fix lint error

* feat(optimizer/store.py): update code based on comment

---------

Co-authored-by: ChenQiaoling00 <qiaoling_chen@u.nus.edu>
Co-authored-by: huangting4201 <1538303371@qq.com>
2023-08-15 18:55:10 +08:00
Sun Peng fa7337b37b
initial commit
2023-07-06 12:55:23 +08:00