digger yu
09fe9dc704
[nfc]fix ColossalaiOptimizer is not defined ( #4122 )
1 year ago
digger yu
7f8203af69
fix typo colossalai/auto_parallel autochunk fx/passes etc. ( #3808 )
2 years ago
Hakjin Lee
46c009dba4
[format] Run lint on colossalai.engine ( #3367 )
2 years ago
ver217
26b7aac0be
[zero] reorganize zero/gemini folder structure ( #3424 )
* [zero] refactor low-level zero folder structure
* [zero] fix legacy zero import path
* [zero] fix legacy zero import path
* [zero] remove useless import
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] fix test import path
* [zero] fix test
* [zero] fix circular import
* [zero] update import
2 years ago
Ofey Chan
8706a8c66c
[NFC] polish colossalai/engine/gradient_handler/__init__.py code style ( #3329 )
2 years ago
Sze-qq
94eec1c5ad
[NFC] polish colossalai/engine/gradient_accumulation/_gradient_accumulation.py code style ( #3277 )
Co-authored-by: siqi <siqi@siqis-MacBook-Pro.local>
2 years ago
Zirui Zhu
1168b50e33
[NFC] polish colossalai/engine/schedule/_pipeline_schedule_v2.py code style ( #3275 )
2 years ago
LuGY
1ff7d5bfa5
[NFC] polish colossalai/engine/gradient_handler/_moe_gradient_handler.py ( #3260 )
2 years ago
ver217
823f3b9cf4
[doc] add deepspeed citation and copyright ( #2996 )
* [doc] add deepspeed citation and copyright
* [doc] add deepspeed citation and copyright
* [doc] add deepspeed citation and copyright
2 years ago
Michelle
c008d4ad0c
[NFC] polish colossalai/engine/schedule/_pipeline_schedule.py code style ( #2744 )
2 years ago
CZYCW
4ac8bfb072
[NFC] polish colossalai/engine/gradient_handler/utils.py code style ( #2708 )
2 years ago
Kirigaya Kazuto
e9460b45c8
[engin/schedule] use p2p_v2 to recontruct pipeline_schedule ( #1408 )
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [communication] add p2p_v2.py to support communication with List[Any]
* Delete _pipeline_schedule_v2.py
* Delete test_cifar_with_data_pipeline_tensor_v2.py
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* Delete p2p_v2.py
* Delete test_boardcast_send_recv_v2.py
* Delete test_object_list_p2p_v2.py
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [communication] remove print code
* [communication] remove print code
* [engin/schedule] shorten the running time of testing file to prevent cancelling in CI
2 years ago
ver217
7c70bfbefa
[hotfix] fix PipelineSharedModuleGradientHandler ( #1314 )
2 years ago
Jiarui Fang
4165eabb1e
[hotfix] remove potiential circle import ( #1307 )
* make it faster
* [hotfix] remove circle import
2 years ago
Kai Wang (Victor Kai)
50f2ad213f
[NFC] polish colossalai/engine/ophooks/utils.py code style ( #1256 )
2 years ago
YuliangLiu0306
17ed33350b
[hotfix] fix an assertion bug in base schedule. ( #1250 )
2 years ago
YuliangLiu0306
f1f51990b9
[hotfix]fix some bugs caused by refactored schedule. ( #1148 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fix some bugs caused by refactored schedule.
2 years ago
YuliangLiu0306
18091581c0
[pipeline]support more flexible pipeline ( #1138 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]support more flexible pipeline
2 years ago
YuliangLiu0306
946dbd629d
[hotfix]fix bugs caused by refactored pipeline ( #1133 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fix bugs caused by refactored pipeline
2 years ago
YuliangLiu0306
3175bcb4d8
[pipeline]support List of Dict data ( #1125 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]support List of Dict data
* polish
2 years ago
Frank Lee
6f82ac9bcb
[pipeline] supported more flexible dataflow control for pipeline parallel training ( #1108 )
* [pipeline] supported more flexible dataflow control for pipeline parallel training
* polish code
* polish code
* polish code
3 years ago
YuliangLiu0306
1e9f9c227f
[hotfix]change to fit latest p2p ( #1100 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]change to fit latest p2p
* polish
* polish
3 years ago
Frank Lee
7f2d2b2b5b
[engine] fixed empty op hook check ( #1096 )
* [engine] fixed empty op hook check
* polish code
3 years ago
YuliangLiu0306
b167258b6a
[pipeline]refactor ppschedule to support tensor list ( #1050 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* refactor ppschedule to support tensor list
* polish
3 years ago
Frank Lee
e4685832f8
[engine] fixed bug in gradient accumulation dataloader to keep the last step ( #1030 )
3 years ago
YuliangLiu0306
32a45cd7ef
[pipelinable]use pipelinable to support GPT model. ( #903 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipelinable]use pipelinable to support GPT model.
* fix a bug caused by ShardedModel
* polish
* fix front func list
3 years ago
Frank Lee
11f54c7b6b
[doc] improved docstring and assertion messages for the engine module ( #871 )
3 years ago
Jiarui Fang
681addb512
[refactor] moving grad acc logic to engine ( #804 )
3 years ago
Jiarui Fang
4d9332b4c5
[refactor] moving memtracer to gemini ( #801 )
3 years ago
HELSON
84c6700b2a
[zero] refactor memstats_collector ( #746 )
3 years ago
Jiarui Fang
4d90a7b513
[refactor] zero directory ( #724 )
3 years ago
Jiarui Fang
193dc8dacb
[refactor] refactor the memory utils ( #715 )
3 years ago
HELSON
ee112fe1da
[zero] adapt zero hooks for unsharded module ( #699 )
3 years ago
ver217
3c9cd5bb5e
[zero] stateful tensor manager ( #687 )
* [WIP] stateful tensor manager
* add eviction strategy
* polish code
* polish code
* polish comment
* add unit test
* fix sampler bug
* polish code
* fix max sampling cnt resetting bug
* fix sampler bug
* polish code
* fix bug
* fix unit test
Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
3 years ago
YuliangLiu0306
0ed7042f42
[pipeline] refactor pipeline ( #679 )
* refactor pipeline---put runtime schedule into engine.
* add type hint for schedule Optional[BaseSchedule]
* preprocess schedule during engine initializing
* infer pipeline schedule params from config
3 years ago
RichardoLuo
ad1e7ab2b2
'[NFC] polish <colossalai/engine/_base_engine.py> code style' ( #631 )
Co-authored-by: RichardoLuo <14049555596@qq.com>
3 years ago
doubleHU
f2da21a827
fix format ( #586 )
3 years ago
fanjinfucool
ffad81e1d1
fix format ( #585 )
Co-authored-by: fanjifu <FAN>
3 years ago
Maruyama_Aya
d2dc6049b5
fix format ( #580 )
3 years ago
yuxuan-lou
cfb41297ff
'fix/format' ( #573 )
3 years ago
YuliangLiu0306
ade05a5d83
[refactor] pipeline, put runtime schedule into engine. ( #627 )
3 years ago
Jiarui Fang
e956d93ac2
[refactor] memory utils ( #577 )
3 years ago
HELSON
e6d50ec107
[zero] adapt zero for unsharded parameters ( #561 )
* support existing sharded and unsharded parameters in zero
* add unitest for moe-zero model init
* polish moe gradient handler
3 years ago
Jiarui Fang
7675366fce
[polish] rename col_attr -> colo_attr ( #558 )
3 years ago
ver217
014bac0c49
[zero] hijack p.grad in sharded model ( #554 )
* hijack p.grad in sharded model
* polish comments
* polish comments
3 years ago
Jiarui Fang
f552b11294
[zero] label state for param fp16 and grad ( #551 )
3 years ago
Jiarui Fang
214da761d4
[zero] add stateful tensor ( #549 )
3 years ago
Liang Bowen
ec5086c49c
Refactored docstring to google style
3 years ago
Jie Zhu
73d36618a6
[profiler] add MemProfiler ( #356 )
* add memory trainer hook
* fix bug
* add memory trainer hook
* fix import bug
* fix import bug
* add trainer hook
* fix #370 git log bug
* modify `to_tensorboard` function to support better output
* remove useless output
* change the name of `MemProfiler`
* complete memory profiler
* replace error with warning
* finish trainer hook
* modify interface of MemProfiler
* modify `__init__.py` in profiler
* remove unnecessary pass statement
* add usage to doc string
* add usage to trainer hook
* new location to store temp data file
3 years ago
HELSON
a30e2b4c24
[zero] adapt for no-leaf module in zero ( #535 )
only process module's own parameters in Zero context
add zero hooks for all modules that contrain parameters
gather parameters only belonging to module itself
3 years ago