ColossalAI/tests/test_zero/test_gemini
Latest commit: 21ba89cab6 by Baizhou Zhang, 2023-10-17 14:07:21 +08:00
[gemini] support gradient accumulation (#4869)

* add test
* fix no_sync bug in low level zero plugin
* fix test
* add argument for grad accum
* add grad accum in backward hook for gemini
* finish implementation, rewrite tests
* fix test
* skip stuck model in low level zero test
* update doc
* optimize communication & fix gradient checkpoint
* modify doc
* clean up code
* update cpu adam fp16 case
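The headline change here, PR #4869, adds gradient accumulation support to the Gemini plugin; per the commit notes it is applied in Gemini's backward hook rather than through a DDP-style no_sync() context (the same PR fixes a no_sync bug in the low level zero plugin), and it is exercised by test_grad_accum.py below. What follows is a minimal sketch of how the feature might be used, not the repo's own test code: the enable_gradient_accumulation keyword is an assumption inferred from the commit line "add argument for grad accum", so verify it against the GeminiPlugin signature in your ColossalAI version.

```python
# A sketch of gradient accumulation with GeminiPlugin, assuming the
# flag added by #4869 is named `enable_gradient_accumulation`.
# Launch with e.g. `torchrun --nproc_per_node=1 train.py` (Gemini needs CUDA).
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

colossalai.launch_from_torch(config={})  # initialize the distributed environment

model = torch.nn.Linear(32, 32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

plugin = GeminiPlugin(enable_gradient_accumulation=True)  # assumed flag name
booster = Booster(plugin=plugin)
model, optimizer, *_ = booster.boost(model, optimizer)

accum_steps = 4
for step in range(4 * accum_steps):
    batch = torch.rand(8, 32).cuda()             # synthetic micro-batch
    loss = model(batch).sum() / accum_steps      # scale loss by the accumulation factor
    booster.backward(loss, optimizer)            # grads accumulate inside Gemini chunks
    if (step + 1) % accum_steps == 0:            # update only every accum_steps micro-batches
        optimizer.step()
        optimizer.zero_grad()
```

The scale-then-step pattern is the standard one: each micro-batch contributes loss / accum_steps, and the optimizer steps once per accum_steps backward passes, so the update matches a single large batch.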
File                          Last commit                                                                      Date
test_chunk_mgrv2.py           [misc] update pre-commit and run all files (#4752)                               2023-09-19 14:20:26 +08:00
test_chunkv2.py               [misc] update pre-commit and run all files (#4752)                               2023-09-19 14:20:26 +08:00
test_fwd_bwd.py               [gemini] support amp o3 for gemini (#4872)                                       2023-10-12 10:39:08 +08:00
test_gemini_use_rmt.py        [misc] update pre-commit and run all files (#4752)                               2023-09-19 14:20:26 +08:00
test_grad_accum.py            [gemini] support gradient accumulation (#4869)                                   2023-10-17 14:07:21 +08:00
test_grad_clip.py             [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921)    2023-10-16 21:56:53 +08:00
test_inference.py             [misc] update pre-commit and run all files (#4752)                               2023-09-19 14:20:26 +08:00
test_optim.py                 [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921)    2023-10-16 21:56:53 +08:00
test_runtime_mem_tracer.py    [misc] update pre-commit and run all files (#4752)                               2023-09-19 14:20:26 +08:00
test_search.py                [misc] update pre-commit and run all files (#4752)                               2023-09-19 14:20:26 +08:00
test_zeroddp_state_dict.py    [gemini] support amp o3 for gemini (#4872)                                       2023-10-12 10:39:08 +08:00
test_zerooptim_state_dict.py  [hotfix] fix lr scheduler bug in torch 2.0 (#4864)                               2023-10-12 14:04:24 +08:00