flybird11111
0a94fcd351
[shardformer] update bert finetune example with HybridParallelPlugin ( #4584 )
* [shardformer] fix opt test hanging
* fix
* test
* fix test
* remove print
* add fix
* [shardformer] add bert finetune example
* [shardformer] fix epoch change
* [shardformer] broadcast add pp group
* [shardformer] zero1+pp and the corresponding tests (#4517 )
* pause
* finish pp+zero1
* Update test_shard_vit.py
* [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516 )
* fix overlap bug and support bert, add overlap as an option in shardconfig
* support overlap for chatglm and bloom
* [shardformer] fix emerged bugs after updating transformers (#4526 )
* test
* fix test
* remove print
* add fix
* [shardformer] add bert finetune example
* [shardformer] Add overlap support for gpt2 (#4535 )
* add overlap support for gpt2
* remove unused code
* [shardformer] support pp+tp+zero1 tests (#4531 )
* [shardformer] fix opt test hanging
* fix
* test
* fix test
* remove print
* add fix
* [shardformer] pp+tp+zero1
* [shardformer] fix submodule replacement bug when enabling pp (#4544 )
* [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540 )
* implement sharded optimizer saving
* add more param info
* finish implementation of sharded optimizer saving
* fix bugs in optimizer sharded saving
* add pp+zero test
* param group loading
* greedy loading of optimizer
* fix bug when loading
* implement optimizer sharded saving
* add optimizer test & arrange checkpointIO utils
* fix gemini sharding state_dict
* add verbose option
* add loading of master params
* fix typehint
* fix master/working mapping in fp16 amp
* [shardformer] add bert finetune example
* [shardformer] fix epoch change
* [shardformer] broadcast add pp group
* rebase feature/shardformer
* update pipeline
* [shardformer] fix
* [shardformer] bert finetune fix
* [shardformer] add all_reduce operation to loss
* [shardformer] make compatible with pytree
* [shardformer] disable tp
* [shardformer] add 3d plugin to ci test
* [shardformer] update num_microbatches to None
* [shardformer] update microbatchsize
* [shardformer] update assert
* update scheduler
---------
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
1 year ago
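The pipeline-related bullets above adjust `num_microbatches` and the microbatch size for the HybridParallelPlugin bert finetune example. As background, here is a hedged pure-Python sketch of how a pipeline schedule might derive microbatches from a global batch when exactly one of the two knobs is set; the helper name is illustrative and is not Colossal-AI's actual API:

```python
def split_microbatches(batch, num_microbatches=None, microbatch_size=None):
    """Split a global batch (a list of samples) into microbatches.

    Exactly one of num_microbatches / microbatch_size should be given,
    mirroring the plugin options the commits above adjust. Sketch only.
    """
    if (num_microbatches is None) == (microbatch_size is None):
        raise ValueError("specify exactly one of num_microbatches or microbatch_size")
    n = len(batch)
    if num_microbatches is not None:
        if n % num_microbatches != 0:
            raise ValueError("batch size must divide evenly into microbatches")
        microbatch_size = n // num_microbatches
    # contiguous, equally sized slices that a pipeline schedule can feed stage 0
    return [batch[i:i + microbatch_size] for i in range(0, n, microbatch_size)]
```

With a global batch of 8 samples, `num_microbatches=4` and `microbatch_size=2` describe the same split; asserting on exactly one of them avoids the ambiguous-configuration case the commits guard against.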
binmakeswell
ef4b99ebcd
add llama example CI
1 year ago
binmakeswell
7ff11b5537
[example] add llama pretraining ( #4257 )
1 year ago
digger yu
2d40759a53
fix #3852 path error ( #4058 )
1 year ago
Baizhou Zhang
4da324cd60
[hotfix]fix argument naming in docs and examples ( #4083 )
1 year ago
LuGY
160c64c645
[example] fix bucket size in example of gpt gemini ( #4028 )
1 year ago
Baizhou Zhang
b3ab7fbabf
[example] update ViT example using booster api ( #3940 )
1 year ago
digger yu
33eef714db
fix typo examples and docs ( #3932 )
1 year ago
Baizhou Zhang
e417dd004e
[example] update opt example using booster api ( #3918 )
1 year ago
Liu Ziming
b306cecf28
[example] Modify palm example with the new booster API ( #3913 )
* Modify torch version requirement to adapt to torch 2.0
* modify palm example using new booster API
* roll back
* fix port
* polish
1 year ago
wukong1992
a55fb00c18
[booster] update bert example, using booster api ( #3885 )
1 year ago
jiangmingyan
5f79008c4a
[example] update gemini examples ( #3868 )
* [example] update gemini examples
2 years ago
digger yu
518b31c059
[docs] change placememt_policy to placement_policy ( #3829 )
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
2 years ago
binmakeswell
15024e40d9
[auto] fix install cmd ( #3772 )
2 years ago
digger-yu
b9a8dff7e5
[doc] Fix typo under colossalai and doc ( #3618 )
* Fixed several spelling errors under colossalai
* Fix the spelling error in colossalai and docs directory
* Cautiously changed spelling errors under the example folder
* Update runtime_preparation_pass.py
revert autograft to autograd
* Update search_chunk.py
utile to until
* Update check_installation.py
change misteach to mismatch in line 91
* Update 1D_tensor_parallel.md
revert to perceptron
* Update 2D_tensor_parallel.md
revert to perceptron in line 73
* Update 2p5D_tensor_parallel.md
revert to perceptron in line 71
* Update 3D_tensor_parallel.md
revert to perceptron in line 80
* Update README.md
revert to resnet in line 42
* Update reorder_graph.py
revert to indice in line 7
* Update p2p.py
revert to megatron in line 94
* Update initialize.py
revert to torchrun in line 198
* Update routers.py
change to detailed in line 63
* Update routers.py
change to detailed in line 146
* Update README.md
revert random number in line 402
2 years ago
binmakeswell
f1b3d60cae
[example] reorganize for community examples ( #3557 )
2 years ago
mandoxzhang
8f2c55f9c9
[example] remove redundant texts & update roberta ( #3493 )
* update roberta example
* modify conflict & update roberta
2 years ago
mandoxzhang
ab5fd127e3
[example] update roberta with newer ColossalAI ( #3472 )
* update roberta example
2 years ago
Frank Lee
80eba05b0a
[test] refactor tests with spawn ( #3452 )
* [test] added spawn decorator
* polish code
2 years ago
ver217
573af84184
[example] update examples related to zero/gemini ( #3431 )
* [zero] update legacy import
* [zero] update examples
* [example] fix opt tutorial
* [example] fix import
2 years ago
ver217
26b7aac0be
[zero] reorganize zero/gemini folder structure ( #3424 )
* [zero] refactor low-level zero folder structure
* [zero] fix legacy zero import path
* [zero] remove useless import
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] fix test import path
* [zero] fix test
* [zero] fix circular import
* [zero] update import
2 years ago
Yan Fang
189347963a
[auto] fix requirements typo for issue #3125 ( #3209 )
2 years ago
Zihao
18dbe76cae
[auto-parallel] add auto-offload feature ( #3154 )
* add auto-offload feature
* polish code
* fix syn offload runtime pass bug
* add offload example
* fix offload testing bug
* fix example testing bug
2 years ago
binmakeswell
360674283d
[example] fix redundant note ( #3065 )
2 years ago
Tomek
af3888481d
[example] fixed opt model downloading from huggingface
2 years ago
ramos
2ef855c798
support shardinit option to avoid OOM when initializing OPT ( #3037 )
Co-authored-by: poe <poe@nemoramo>
2 years ago
Ziyue Jiang
400f63012e
[pipeline] Add Simplified Alpa DP Partition ( #2507 )
* add alpa dp split
* use fwd+bwd instead of fwd only
---------
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2 years ago
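The "Simplified Alpa DP Partition" commit above adds a dynamic-programming split of pipeline stages. A hedged sketch of the classic DP it is in the spirit of: partition a sequence of per-layer costs into contiguous stages so that the most expensive stage is as cheap as possible. This is illustrative and not the repository's actual algorithm; names are hypothetical.

```python
def partition_stages(costs, num_stages):
    """Split layer costs into contiguous pipeline stages, minimizing the
    maximum per-stage cost. Returns (best bottleneck, stage bounds)."""
    n = len(costs)
    INF = float("inf")
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    # dp[k][i]: best bottleneck using k stages for the first i layers
    dp = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    dp[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                # last stage covers layers j..i-1; bottleneck is the worse of
                # the earlier stages and this stage's total cost
                bottleneck = max(dp[k - 1][j], prefix[i] - prefix[j])
                if bottleneck < dp[k][i]:
                    dp[k][i] = bottleneck
                    cut[k][i] = j
    # walk the cut table backwards to recover (start, end) per stage
    bounds, i = [], n
    for k in range(num_stages, 0, -1):
        j = cut[k][i]
        bounds.append((j, i))
        i = j
    return dp[num_stages][n], list(reversed(bounds))
```

For costs `[1, 2, 3, 4]` and two stages, the best split puts layers 0..2 on stage 0 and layer 3 on stage 1, giving a bottleneck of 6; real systems replace the scalar costs with profiled compute and communication times.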
github-actions[bot]
da056285f2
[format] applied code formatting on changed files in pull request 2922 ( #2923 )
Co-authored-by: github-actions <github-actions@github.com>
2 years ago
binmakeswell
12bafe057f
[doc] update installation for GPT ( #2922 )
2 years ago
Alex_996
a4fc125c34
Fix typos ( #2863 )
Fix typos, `6.7 -> 6.7b`
2 years ago
dawei-wang
55424a16a5
[doc] fix GPT tutorial ( #2860 )
Fix hpcaitech/ColossalAI#2851
2 years ago
Jiarui Fang
bf0204604f
[example] add bert and albert ( #2824 )
2 years ago
cloudhuang
43dffdaba5
[doc] fixed a typo in GPT readme ( #2736 )
2 years ago
Jiatong (Julius) Han
a255a38f7f
[example] Polish README.md ( #2658 )
* [tutorial] polish readme.md
* [example] Update README.md
2 years ago
HELSON
6e0faa70e0
[gemini] add profiler in the demo ( #2534 )
2 years ago
HELSON
66dfcf5281
[gemini] update the gpt example ( #2527 )
2 years ago
HELSON
707b11d4a0
[gemini] update ddp strict mode ( #2518 )
* [zero] add strict ddp mode for chunk init
* [gemini] update gpt example
2 years ago
HELSON
2d1a7dfe5f
[zero] add strict ddp mode ( #2508 )
* [zero] add strict ddp mode
* [polish] add comments for strict ddp mode
* [zero] fix test error
2 years ago
Jiarui Fang
e327e95144
[hotfix] gpt example titans bug #2493 ( #2494 )
2 years ago
binmakeswell
fcc6d61d92
[example] fix requirements ( #2488 )
2 years ago
Jiarui Fang
3a21485ead
[example] titans for gpt ( #2484 )
2 years ago
Jiarui Fang
7c31706227
[CI] add test_ci.sh for palm, opt and gpt ( #2475 )
2 years ago
ver217
f525d1f528
[example] update gpt gemini example ci test ( #2477 )
2 years ago
Ziyue Jiang
fef5c949c3
polish pp middleware ( #2476 )
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2 years ago
Jiarui Fang
867c8c2d3a
[zero] low level optim supports ProcessGroup ( #2464 )
2 years ago
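The commit above lets the low-level ZeRO optimizer work with an arbitrary ProcessGroup, i.e. optimizer states are partitioned across the ranks of that group. A hedged sketch of the usual greedy balancing idea (assign each parameter to the currently lightest rank); this is illustrative only and not the repository's partitioning code:

```python
import heapq

def assign_params_to_ranks(param_sizes, world_size):
    """Greedily assign parameters (by index) to the rank with the smallest
    accumulated size, the common way ZeRO-style partitioning balances
    optimizer-state memory across a process group. Sketch only."""
    # min-heap of (current load, rank)
    heap = [(0, rank) for rank in range(world_size)]
    heapq.heapify(heap)
    assignment = {}
    # placing larger parameters first tends to give a tighter balance
    for idx, size in sorted(enumerate(param_sizes), key=lambda t: -t[1]):
        load, rank = heapq.heappop(heap)
        assignment[idx] = rank
        heapq.heappush(heap, (load + size, rank))
    return assignment
```

For sizes `[4, 3, 2, 1]` on two ranks this yields loads of 5 and 5; the same largest-first greedy heuristic underlies many bucket- and shard-assignment schemes.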
YuliangLiu0306
2731531bc2
[autoparallel] integrate device mesh initialization into autoparallelize ( #2393 )
* [autoparallel] integrate device mesh initialization into autoparallelize
* add megatron solution
* update gpt autoparallel examples with latest api
* adapt beta value to fit the current computation cost
2 years ago
ZijianYY
fe0f7970a2
[examples] adding tflops to PaLM ( #2365 )
2 years ago
HELSON
d84e747975
[hotfix] add DISTPAN argument for benchmark ( #2412 )
* change the benchmark config file
* change config
* revert config file
* rename distpan to distplan
2 years ago
HELSON
498b5ca993
[hotfix] fix gpt gemini example ( #2404 )
* [hotfix] fix gpt gemini example
* [example] add new assertions
2 years ago
Jiarui Fang
12c8bf38d7
[Pipeline] Refine GPT PP Example
2 years ago