ColossalAI/colossalai
Frank Lee cf6d1c9284
[CLI] refactored the launch CLI and fixed bugs in multi-node launching (#844)
* [cli] fixed multi-node job launching

* [cli] fixed a bug in version comparison

* [cli] support launching with env var

* added docstring

* [cli] added extra launch arguments

* [cli] added default launch rdzv args

* [cli] fixed version comparison

* [cli] added docstring examples and requirement

* polish docstring

* polish code
2022-04-24 13:26:26 +08:00
amp [hotfix] fix memory leak in zero (#781) 2022-04-18 13:57:03 +08:00
builder modified the pp build for ckpt adaptation (#803) 2022-04-24 12:23:16 +08:00
cli [CLI] refactored the launch CLI and fixed bugs in multi-node launching (#844) 2022-04-24 13:26:26 +08:00
communication [util] fixed communication API depth with PyTorch 1.9 (#721) 2022-04-12 09:44:40 +08:00
context [compatibility] used backward-compatible API for global process group (#758) 2022-04-14 17:20:35 +08:00
engine [refactor] moving grad acc logic to engine (#804) 2022-04-19 14:03:21 +08:00
gemini [gemini] add GeminiMemoryManger (#832) 2022-04-24 13:08:48 +08:00
kernel Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806) 2022-04-19 14:40:02 +08:00
logging Refactored docstring to google style 2022-03-29 17:17:47 +08:00
nn [gemini] add GeminiMemoryManger (#832) 2022-04-24 13:08:48 +08:00
registry [dependency] removed torchvision (#833) 2022-04-22 15:24:35 +08:00
tensor [hotfix] the bug of numel() in ColoTensor (#845) 2022-04-24 12:32:10 +08:00
testing [test] added a decorator for address already in use error with backward compatibility (#760) 2022-04-14 16:48:44 +08:00
trainer [log] local throughput metrics (#811) 2022-04-20 10:05:39 +08:00
utils [pipelinable] use pipelinable context to initialize non-pipeline model (#816) 2022-04-24 13:03:12 +08:00
zero [gemini] add GeminiMemoryManger (#832) 2022-04-24 13:08:48 +08:00
__init__.py Develop/experiments (#59) 2021-12-09 15:08:29 +08:00
constants.py fix format constants.py (#358) 2022-03-11 15:50:28 +08:00
core.py [polish] polish singleton and global context (#500) 2022-03-23 18:03:39 +08:00
global_variables.py [MOE] add unit test for MOE experts layout, gradient handler and kernel (#469) 2022-03-21 13:35:04 +08:00
initialize.py modified the pp build for ckpt adaptation (#803) 2022-04-24 12:23:16 +08:00