Frank Lee
70c58cfd4f
[shardformer] supported fused qkv checkpoint ( #4073 )
1 year ago
FoolPlayer
0803a61412
[shardformer] add linearconv1d test ( #4067 )
* add linearconv1d test
* add linearconv1d test
1 year ago
Frank Lee
8eb09a4c69
[shardformer] support module saving and loading ( #4062 )
* [shardformer] support module saving and loading
* polish code
1 year ago
FoolPlayer
7740c55c55
support kit use for bert/gpt test ( #4055 )
* support kit use for bert test
* support kit test for gpt2
1 year ago
Frank Lee
f22ddacef0
[shardformer] refactored the shardformer layer structure ( #4053 )
1 year ago
Frank Lee
58df720570
[shardformer] adapted T5 and LLaMa test to use kit ( #4049 )
* [shardformer] adapted T5 and LLaMa test to use kit
* polish code
1 year ago
FoolPlayer
4021b9a8a2
[shardformer] add gpt2 test and layer class refactor ( #4041 )
* add gpt2 test and layer class refactor
* add dropout in gpt2 policy
1 year ago
Frank Lee
d857f3dbba
[shardformer] supported T5 and its variants ( #4045 )
1 year ago
Frank Lee
c1d5453e9f
[shardformer] adapted llama to the new API ( #4036 )
1 year ago
FoolPlayer
74d176c8d8
[shardformer] fix bert and gpt downstream with new api ( #4024 )
* fix bert downstream with new api
* remove comment line
1 year ago
Frank Lee
e253a07007
[shardformer] updated doc ( #4016 )
1 year ago
FoolPlayer
df018fc305
support bert with new api
1 year ago
FoolPlayer
507c0ad368
add vocabembedding layer
1 year ago
Frank Lee
45d9384346
[shardformer] removed inplace tensor sharding ( #4018 )
1 year ago
Frank Lee
3893fa1a8d
[shardformer] refactored embedding and dropout to parallel module ( #4013 )
* [shardformer] refactored embedding and dropout to parallel module
* polish code
1 year ago
FoolPlayer
dfca9678fa
integrate with dist layer ( #4011 )
1 year ago
Frank Lee
015af592f8
[shardformer] integrated linear 1D with dtensor ( #3996 )
* [shardformer] integrated linear 1D with dtensor
* polish code
1 year ago
FoolPlayer
d3bc530849
[shardformer] Refactor shardformer api ( #4001 )
* fix an error in readme
* simplify code
* refactor shardformer
* add todo
* remove slicer
* resolve code review
1 year ago
Frank Lee
611971248c
[device] support init device mesh from process group ( #3990 )
1 year ago
FoolPlayer
a2f9af810d
[shardformer] fix an error in readme ( #3988 )
* fix an error in readme
* simplify code
1 year ago
FoolPlayer
f7774ec0f3
[Shardformer] Downstream bert ( #3979 )
* add dist dropout in model
* update docstring and bert policy with dropout
* refactor basepolicy and sharded, update bert
* update format
* update gpt2 policy
* update bert policy
* remove unused code
* update readme for new policy usage
* add downstream model of bert
* remove unused code
1 year ago
wukong1992
c1c672d0f0
[shardformer] shardformer support t5 model ( #3994 )
test t5
1 year ago
wukong1992
6b30dfb7ce
[shardformer] support llama model using shardformer ( #3969 )
adjust layer attr
1 year ago
FoolPlayer
45927d5527
[shardformer] Add dropout layer in shard model and refactor policy api ( #3949 )
* add dist dropout in model
* update docstring and bert policy with dropout
* refactor basepolicy and sharded, update bert
* update format
* update gpt2 policy
* update bert policy
* remove unused code
* update readme for new policy usage
1 year ago
FoolPlayer
a73130482d
[shardformer] Unit test ( #3928 )
* fix bug in slicer, add slicer unit test
* add dropout test
* use pid as dropout seed
* update dropout test with local pattern
* add todo
1 year ago
FoolPlayer
f1cb5ac6bf
[shardformer] Align bert value ( #3907 )
* add bert align test, fix dist loss bug
* forward and backward align
* add ignore index
* add shardformer CI
* add gather_output optional for user in shardconfig
* update readme with optional gather_output
* add dist crossentropy loss test, remove unused files
* remove unused file
* remove unused file
* rename the file
* polish code
1 year ago
FoolPlayer
79f8d5d54b
[shardformer] add gpt2 policy and modify shard and slicer to support ( #3883 )
* add gpt2 policy and modify shard and slicer to support
* remove unused code
* polish code
1 year ago
FoolPlayer
70173e3123
update README ( #3909 )
1 year ago
FoolPlayer
ab8a47f830
[shardformer] add Dropout layer support different dropout pattern ( #3856 )
* add dropout layer, add dropout test
* modify seed manager as context manager
* add a copy of col_nn.layer
* add dist_crossentropy loss; separate module test
* polish the code
* fix dist crossentropy loss
1 year ago
FoolPlayer
c594dc2f1c
[shardformer] update readme with modules implement doc ( #3834 )
* update readme with modules content
* remove img
1 year ago
Frank Lee
4972e1f40e
[shardformer] refactored the user api ( #3828 )
* [shardformer] refactored the user api
* polish code
1 year ago
Frank Lee
235792f170
[shardformer] updated readme ( #3827 )
1 year ago
FoolPlayer
8cc11235c0
[shardformer]: Feature/shardformer, add some docstring and readme ( #3816 )
* init shardformer code structure
* add implement of sharder (inject and replace)
* add implement of replace layer to colossal layer
* separate different layer policy, add some notion
* implement 1d and 2d slicer, can tell col or row
* fix bug when slicing and inject model
* fix some bug; add inference test example
* add share weight and train example
* add train
* add docstring and readme
* add docstring for other files
* pre-commit
1 year ago
FoolPlayer
8d68de767d
[shardformer] init shardformer code structure ( #3731 )
* init shardformer code structure
* add implement of sharder (inject and replace)
* add implement of replace layer to colossal layer
* separate different layer policy, add some notion
* implement 1d and 2d slicer, can tell col or row
* fix bug when slicing and inject model
* fix some bug; add inference test example
1 year ago
Wenhao Chen
3d8d5d0d58
[chat] use official transformers and fix some issues ( #4117 )
* feat: remove on_learn_epoch fn as not used
* revert: add _on_learn_epoch fn
* feat: remove NaiveStrategy
* test: update train_prompts tests
* fix: remove prepare_llama_tokenizer_and_embedding
* test: add lora arg
* feat: remove roberta support in train_prompts due to runtime errs
* feat: remove deberta & roberta in rm as not used
* test: remove deberta and roberta tests
* feat: remove deberta and roberta models as not used
* fix: remove calls to roberta
* fix: remove prepare_llama_tokenizer_and_embedding
* chore: update transformers version
* docs: update transformers version
* fix: fix actor inference
* fix: fix ci
* feat: change llama pad token to unk
* revert: revert ddp setup_distributed
* fix: change llama pad token to unk
* revert: undo unnecessary changes
* fix: use pip to install transformers
1 year ago
Baizhou Zhang
1350ece492
[hotfix] fix import bug in checkpoint_io ( #4142 )
1 year ago
digger yu
8abc87798f
fix Tensor is not defined ( #4129 )
1 year ago
digger yu
7e46bc87b6
fix CheckpointIndexFile is not defined ( #4109 )
1 year ago
digger yu
09fe9dc704
[nfc]fix ColossalaiOptimizer is not defined ( #4122 )
1 year ago
Wenhao Chen
edd75a59ea
[chat] remove naive strategy and split colossalai strategy ( #4094 )
* feat: remove on_learn_epoch fn as not used
* revert: add _on_learn_epoch fn
* to: remove the use of NaiveStrategy
* test: remove NaiveStrategy tests
* feat: remove NaiveStrategy
* style: modify comments and params
* feat: split ColossalAIStrategy into LowLevelZeroStrategy and GeminiStrategy
* fix: remove naive
* fix: align with modified colossal strategy
* fix: fix ddp _try_init_dist arg
1 year ago
Wenhao Chen
b03d64d010
[chat] refactor trainer class ( #4080 )
* to: add SLTrainer
* refactor: refactor RMTrainer and SFTTrainer
* fix: fix init file
* feat: remove on_learn_epoch fn as not used
* fix: align with modified gemini arguments
* to: add OnPolicyTrainer
* revert: add _on_learn_epoch fn
* refactor: refactor PPOTrainer
* style: rename PPOTrainer argument
* fix: align with modified PPO arguments
* test: align with modified train_prompts arguments
* chore: modify train_prompts
* docs: align with modified arguments
* fix: remove unnecessary output
* fix: move dataloader to fit fn of SLTrainer
* fix: move dataloader to fit fn of OnPolicyTrainer
* fix: modify usage of prompt and pretrain dataloader
1 year ago
Jianghai
711e2b4c00
[doc] update and revise some typos and errs in docs ( #4107 )
* fix some typos and problems in doc
* fix some typos and problems in doc
* add doc test
1 year ago
digger yu
769cddcb2c
fix typo docs/ ( #4033 )
1 year ago
digger yu
2d40759a53
fix #3852 path error ( #4058 )
1 year ago
Frank Lee
1ee947f617
[workflow] added status check for test coverage workflow ( #4106 )
1 year ago
Jianghai
31dc302017
[examples] copy resnet example to image ( #4090 )
* copy resnet example
* add pytest package
* skip test_ci
* skip test_ci
* skip test_ci
1 year ago
Frank Lee
95e95b6d58
[testing] move pytest to be inside the function ( #4087 )
1 year ago
Baizhou Zhang
4da324cd60
[hotfix]fix argument naming in docs and examples ( #4083 )
1 year ago
Michelle
e89b127d8e
[chat]: fix chat evaluation possible bug ( #4064 )
* fix chat eval
* fix utils
* fix utils
* add comment
---------
Co-authored-by: Qianran Ma <qianranm@luchentech.com>
1 year ago
Baizhou Zhang
2c8ae37f61
Merge pull request #4056 from Fridge003/hotfix/fix_gemini_chunk_config_searching
[gemini] Rename arguments in chunk configuration searching
1 year ago