Wenhao Chen
7b9b86441f
[chat]: update rm, add wandb and fix bugs ( #4471 )
...
* feat: modify forward fn of critic and reward model
* feat: modify calc_action_log_probs
* to: add wandb in sft and rm trainer
* feat: update train_sft
* feat: update train_rm
* style: modify type annotation and add warning
* feat: pass tokenizer to ppo trainer
* to: modify trainer base and maker base
* feat: add wandb in ppo trainer
* feat: pass tokenizer to generate
* test: update generate fn tests
* test: update train tests
* fix: remove action_mask
* feat: remove unused code
* fix: fix wrong ignore_index
* fix: fix mock tokenizer
* chore: update requirements
* revert: modify make_experience
* fix: fix inference
* fix: add padding side
* style: modify _on_learn_batch_end
* test: use mock tokenizer
* fix: use bf16 to avoid overflow
* fix: fix workflow
* [chat] fix gemini strategy
* [chat] fix
* sync: update colossalai strategy
* fix: fix args and model dtype
* fix: fix checkpoint test
* fix: fix requirements
* fix: fix missing import and wrong arg
* fix: temporarily skip gemini test in stage 3
* style: apply pre-commit
* fix: temporarily skip gemini test in stage 1&2
---------
Co-authored-by: Mingyan Jiang <1829166702@qq.com>
1 year ago
ppt0011
07c2e3d09c
Merge pull request #4757 from ppt0011/main
...
[doc] explain suitable use case for each plugin
1 year ago
Pengtai Xu
4d7537ba25
[doc] put native colossalai plugins first in description section
1 year ago
Pengtai Xu
e10d9f087e
[doc] add model examples for each plugin
1 year ago
Pengtai Xu
a04337bfc3
[doc] put individual plugin explanation in front
1 year ago
Pengtai Xu
10513f203c
[doc] explain suitable use case for each plugin
1 year ago
Hongxin Liu
079bf3cb26
[misc] update pre-commit and run all files ( #4752 )
...
* [misc] update pre-commit
* [misc] run pre-commit
* [misc] remove useless configuration files
* [misc] ignore cuda for clang-format
1 year ago
github-actions[bot]
3c6b831c26
[format] applied code formatting on changed files in pull request 4743 ( #4750 )
...
Co-authored-by: github-actions <github-actions@github.com>
1 year ago
Hongxin Liu
b5f9e37c70
[legacy] clean up legacy code ( #4743 )
...
* [legacy] remove outdated codes of pipeline (#4692 )
* [legacy] remove cli of benchmark and update optim (#4690 )
* [legacy] remove cli of benchmark and update optim
* [doc] fix cli doc test
* [legacy] fix engine clip grad norm
* [legacy] remove outdated colo tensor (#4694 )
* [legacy] remove outdated colo tensor
* [test] fix test import
* [legacy] move outdated zero to legacy (#4696 )
* [legacy] clean up utils (#4700 )
* [legacy] clean up utils
* [example] update examples
* [legacy] clean up amp
* [legacy] fix amp module
* [legacy] clean up gpc (#4742 )
* [legacy] clean up context
* [legacy] clean core, constants and global vars
* [legacy] refactor initialize
* [example] fix examples ci
* [example] fix examples ci
* [legacy] fix tests
* [example] fix gpt example
* [example] fix examples ci
* [devops] fix ci installation
* [example] fix examples ci
1 year ago
Xuanlei Zhao
32e7f99416
[kernel] update triton init #4740 ( #4740 )
1 year ago
Baizhou Zhang
d151dcab74
[doc] explanation of loading large pretrained models ( #4741 )
1 year ago
flybird11111
4c4482f3ad
[example] llama2 add fine-tune example ( #4673 )
...
* [shardformer] update shardformer readme
[shardformer] update shardformer readme
[shardformer] update shardformer readme
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] change dataset
* [shardformer] change dataset
* [shardformer] fix CI
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
[example] update opt example
[example] resolve comments
fix
fix
* [example] llama2 add finetune example
* [example] llama2 add finetune example
* [example] llama2 add finetune example
* [example] llama2 add finetune example
* fix
* update llama2 example
* update llama2 example
* fix
* update llama2 example
* update llama2 example
* update llama2 example
* update llama2 example
* update llama2 example
* update llama2 example
* Update requirements.txt
* update llama2 example
* update llama2 example
* update llama2 example
1 year ago
Xuanlei Zhao
ac2797996b
[shardformer] add custom policy in hybrid parallel plugin ( #4718 )
...
* add custom policy
* update assert
1 year ago
Baizhou Zhang
451c3465fb
[doc] polish shardformer doc ( #4735 )
...
* arrange position of chapters
* fix typos in seq parallel doc
1 year ago
ppt0011
73eb3e8862
Merge pull request #4738 from ppt0011/main
...
[legacy] remove deterministic data loader test
1 year ago
Bin Jia
608cffaed3
[example] add gpt2 HybridParallelPlugin example ( #4653 )
...
* add gpt2 HybridParallelPlugin example
* update readme and testci
* update test ci
* fix test_ci bug
* update requirements
* add requirements
* update requirements
* add requirement
* rename file
1 year ago
Bin Jia
6a03c933a0
[shardformer] update seq parallel document ( #4730 )
...
* update doc of seq parallel
* fix typo
1 year ago
Pengtai Xu
cd4e61d149
[legacy] remove deterministic data loader test
1 year ago
flybird11111
46162632e5
[shardformer] update pipeline parallel document ( #4725 )
...
* [shardformer] update pipeline parallel document
* [shardformer] update pipeline parallel document
* [shardformer] update pipeline parallel document
* [shardformer] update pipeline parallel document
* [shardformer] update pipeline parallel document
* [shardformer] update pipeline parallel document
* [shardformer] update pipeline parallel document
* [shardformer] update pipeline parallel document
1 year ago
digger yu
e4fc57c3de
Optimized some syntax errors in the documentation and code under applications/ ( #4127 )
...
Co-authored-by: flybird11111 <1829166702@qq.com>
1 year ago
Baizhou Zhang
50e5602c2d
[doc] add shardformer support matrix/update tensor parallel documents ( #4728 )
...
* add compatibility matrix for shardformer doc
* update tp doc
1 year ago
github-actions[bot]
8c2dda7410
[format] applied code formatting on changed files in pull request 4726 ( #4727 )
...
Co-authored-by: github-actions <github-actions@github.com>
1 year ago
Baizhou Zhang
f911d5b09d
[doc] Add user document for Shardformer ( #4702 )
...
* create shardformer doc files
* add docstring for seq-parallel
* update ShardConfig docstring
* add links to llama example
* add outdated message
* finish introduction & supporting information
* finish 'how shardformer works'
* finish shardformer.md English doc
* fix doctest fail
* add Chinese document
1 year ago
binmakeswell
ce97790ed7
[doc] fix llama2 code link ( #4726 )
...
* [doc] fix llama2 code link
* [doc] fix llama2 code link
* [doc] fix llama2 code link
1 year ago
flybird11111
20190b49a5
[shardformer] to fix whisper test failed due to significant accuracy differences. ( #4710 )
...
* [shardformer] fix whisper test failed
* [shardformer] fix whisper test failed
* [shardformer] fix whisper test failed
* [shardformer] fix whisper test failed
1 year ago
Yuanheng Zhao
e2c0e7f92a
[hotfix] Fix import error: colossal.kernel without triton installed ( #4722 )
...
* [hotfix] remove triton kernels from kernel init
* revise bloom/llama kernel imports for infer
1 year ago
flybird11111
c7d6975d29
[shardformer] fix GPT2DoubleHeadsModel ( #4703 )
1 year ago
Baizhou Zhang
068372a738
[doc] add potential solution for OOM in llama2 example ( #4699 )
1 year ago
digger yu
9c2feb2f0b
fix some typo with colossalai/device colossalai/tensor/ etc. ( #4171 )
...
Co-authored-by: flybird11111 <1829166702@qq.com>
1 year ago
Baizhou Zhang
d8ceeac14e
[hotfix] fix typo in hybrid parallel io ( #4697 )
1 year ago
flybird11111
8844691f4b
[shardformer] update shardformer readme ( #4689 )
...
* [shardformer] update shardformer readme
* [shardformer] update shardformer readme
* [shardformer] update shardformer readme
* [shardformer] update shardformer readme
* [shardformer] update shardformer readme
1 year ago
Baizhou Zhang
1d454733c4
[doc] Update booster user documents. ( #4669 )
...
* update booster_api.md
* update booster_checkpoint.md
* update booster_plugins.md
* move transformers importing inside function
* fix Dict typing
* fix autodoc bug
* small fix
1 year ago
Cuiqing Li
bce0f16702
[Feature] The first PR to Add TP inference engine, kv-cache manager and related kernels for our inference system ( #4577 )
...
* [infer] Infer/llama demo (#4503 )
* add
* add infer example
* finish
* finish
* stash
* fix
* [Kernels] add inference token attention kernel (#4505 )
* add token forward
* fix tests
* fix comments
* add try import triton
* add adapted license
* add tests check
* [Kernels] add necessary kernels (llama & bloom) for attention forward and kv-cache manager (#4485 )
* added _vllm_rms_norm
* change place
* added tests
* added tests
* modify
* adding kernels
* added tests:
* adding kernels
* modify
* added
* updating kernels
* adding tests
* added tests
* kernel change
* submit
* modify
* added
* edit comments
* change name
* change comments and fix import
* add
* added
* combine codes (#4509 )
* [feature] add KV cache manager for llama & bloom inference (#4495 )
* add kv cache memory manager
* add stateinfo during inference
* format
* format
* rename file
* add kv cache test
* revise on BatchInferState
* file dir change
* [Bug Fix] import llama context ops fix (#4524 )
* added _vllm_rms_norm
* change place
* added tests
* added tests
* modify
* adding kernels
* added tests:
* adding kernels
* modify
* added
* updating kernels
* adding tests
* added tests
* kernel change
* submit
* modify
* added
* edit comments
* change name
* change comments and fix import
* add
* added
* fix
* add ops into init.py
* add
* [Infer] Add TPInferEngine and fix file path (#4532 )
* add engine for TP inference
* move file path
* update path
* fix TPInferEngine
* remove unused file
* add engine test demo
* revise TPInferEngine
* fix TPInferEngine, add test
* fix
* Add Inference test for llama (#4508 )
* add kv cache memory manager
* add stateinfo during inference
* add
* add infer example
* finish
* finish
* format
* format
* rename file
* add kv cache test
* revise on BatchInferState
* add inference test for llama
* fix conflict
* feature: add some new features for llama engine
* adapt colossalai triton interface
* Change the parent class of llama policy
* add nvtx
* move llama inference code to tensor_parallel
* fix __init__.py
* rm tensor_parallel
* fix: fix bugs in auto_policy.py
* fix:rm some unused codes
* mv colossalai/tpinference to colossalai/inference/tensor_parallel
* change __init__.py
* save change
* fix engine
* Bug fix: Fix hang
* remove llama_infer_engine.py
---------
Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [infer] Add Bloom inference policy and replaced methods (#4512 )
* add bloom inference methods and policy
* enable pass BatchInferState from model forward
* revise bloom infer layers/policies
* add engine for inference (draft)
* add test for bloom infer
* fix bloom infer policy and flow
* revise bloom test
* fix bloom file path
* remove unused codes
* fix bloom modeling
* fix dir typo
* fix trivial
* fix policy
* clean pr
* trivial fix
* Revert "[infer] Add Bloom inference policy and replaced methods (#4512 )" (#4552 )
This reverts commit 17cfa57140.
* [Doc] Add colossal inference doc (#4549 )
* create readme
* add readme.md
* fix typos
* [infer] Add Bloom inference policy and replaced methods (#4553 )
* add bloom inference methods and policy
* enable pass BatchInferState from model forward
* revise bloom infer layers/policies
* add engine for inference (draft)
* add test for bloom infer
* fix bloom infer policy and flow
* revise bloom test
* fix bloom file path
* remove unused codes
* fix bloom modeling
* fix dir typo
* fix trivial
* fix policy
* clean pr
* trivial fix
* trivial
* Fix Bugs In Llama Model Forward (#4550 )
* add kv cache memory manager
* add stateinfo during inference
* add
* add infer example
* finish
* finish
* format
* format
* rename file
* add kv cache test
* revise on BatchInferState
* add inference test for llama
* fix conflict
* feature: add some new features for llama engine
* adapt colossalai triton interface
* Change the parent class of llama policy
* add nvtx
* move llama inference code to tensor_parallel
* fix __init__.py
* rm tensor_parallel
* fix: fix bugs in auto_policy.py
* fix:rm some unused codes
* mv colossalai/tpinference to colossalai/inference/tensor_parallel
* change __init__.py
* save change
* fix engine
* Bug fix: Fix hang
* remove llama_infer_engine.py
* bug fix: fix bugs about infer_state.is_context_stage
* remove policies
* fix: delete unused code
* fix: delete unused code
* remove unused code
* fix conflict
---------
Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [doc] add colossal inference fig (#4554 )
* create readme
* add readme.md
* fix typos
* upload fig
* [NFC] fix docstring for colossal inference (#4555 )
Fix docstring and comments in kv cache manager and bloom modeling
* fix docstring in llama modeling (#4557 )
* [Infer] check import vllm (#4559 )
* change import vllm
* import apply_rotary_pos_emb
* change import location
* [DOC] add installation req (#4561 )
* add installation req
* fix
* slight change
* remove empty
* [Feature] rms-norm transfer into inference llama.py (#4563 )
* add installation req
* fix
* slight change
* remove empty
* add rmsnorm policy
* add
* clean codes
* [infer] Fix tp inference engine (#4564 )
* fix engine prepare data
* add engine test
* use bloom for testing
* revise on test
* revise on test
* reset shardformer llama (#4569 )
* [infer] Fix engine - tensors on different devices (#4570 )
* fix diff device in engine
* [codefactor] Feature/colossal inference (#4579 )
* code factors
* remove
* change coding (#4581 )
* [doc] complete README of colossal inference (#4585 )
* complete fig
* Update README.md
* [doc]update readme (#4586 )
* update readme
* Update README.md
* bug fix: fix bugs in llama and bloom (#4588 )
* [BUG FIX]Fix test engine in CI and non-vllm kernels llama forward (#4592 )
* fix tests
* clean
* clean
* fix bugs
* add
* fix llama non-vllm kernels bug
* modify
* clean codes
* [Kernel]Rmsnorm fix (#4598 )
* fix tests
* clean
* clean
* fix bugs
* add
* fix llama non-vllm kernels bug
* modify
* clean codes
* add triton rmsnorm
* delete vllm kernel flag
* [Bug Fix]Fix bugs in llama (#4601 )
* fix tests
* clean
* clean
* fix bugs
* add
* fix llama non-vllm kernels bug
* modify
* clean codes
* bug fix: remove rotary_positions_ids
---------
Co-authored-by: cuiqing.li <lixx3527@gmail.com>
* [kernel] Add triton layer norm & replace norm for bloom (#4609 )
* add layernorm for inference
* add test for layernorm kernel
* add bloom layernorm replacement policy
* trivial: path
* [Infer] Bug fix rotary embedding in llama (#4608 )
* fix rotary embedding
* delete print
* fix init seq len bug
* rename pytest
* add benchmark for llama
* refactor codes
* delete useless code
* [bench] Add bloom inference benchmark (#4621 )
* add bloom benchmark
* readme - update benchmark res
* trivial - uncomment for testing (#4622 )
* [Infer] add check triton and cuda version for tests (#4627 )
* fix rotary embedding
* delete print
* fix init seq len bug
* rename pytest
* add benchmark for llama
* refactor codes
* delete useless code
* add check triton and cuda
* Update sharder.py (#4629 )
* [Inference] Hot fix some bugs and typos (#4632 )
* fix
* fix test
* fix conflicts
* [typo]Comments fix (#4633 )
* fallback
* fix comments
* bug fix: fix some bugs in test_llama and test_bloom (#4635 )
* [Infer] delete benchmark in tests and fix bug for llama and bloom (#4636 )
* fix rotary embedding
* delete print
* fix init seq len bug
* rename pytest
* add benchmark for llama
* refactor codes
* delete useless code
* add check triton and cuda
* delete benchmark and fix infer bugs
* delete benchmark for tests
* delete useless code
* delete benchmark function in utils
* [Fix] Revise TPInferEngine, inference tests and benchmarks (#4642 )
* [Fix] revise TPInferEngine methods and inference tests
* fix llama/bloom infer benchmarks
* fix infer tests
* trivial fix: benchmarks
* trivial
* trivial: rm print
* modify utils filename for infer ops test (#4657 )
* [Infer] Fix TPInferEngine init & inference tests, benchmarks (#4670 )
* fix engine funcs
* TPInferEngine: receive shard config in init
* benchmarks: revise TPInferEngine init
* benchmarks: remove pytest decorator
* trivial fix
* use small model for tests
* [NFC] use args for infer benchmarks (#4674 )
* revise infer default (#4683 )
* [Fix] optimize/shard model in TPInferEngine init (#4684 )
* remove using orig model in engine
* revise inference tests
* trivial: rename
---------
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
1 year ago
flybird11111
eedaa3e1ef
[shardformer]fix gpt2 double head ( #4663 )
...
* [shardformer]fix gpt2 test
[shardformer]fix gpt2 test
[shardformer]fix gpt2 test
* fix
* [shardformer] add todo
* [shardformer] add todo
1 year ago
Hongxin Liu
554aa9592e
[legacy] move communication and nn to legacy and refactor logger ( #4671 )
...
* [legacy] move communication to legacy (#4640 )
* [legacy] refactor logger and clean up legacy codes (#4654 )
* [legacy] make logger independent to gpc
* [legacy] make optim independent to registry
* [legacy] move test engine to legacy
* [legacy] move nn to legacy (#4656 )
* [legacy] move nn to legacy
* [checkpointio] fix save hf config
* [test] remove useless rpc pp test
* [legacy] fix nn init
* [example] skip tutorial hybrid parallel example
* [devops] test doc check
* [devops] test doc check
1 year ago
Hongxin Liu
536397cc95
[devops] fix concurrency group ( #4667 )
1 year ago
flybird11111
7486ed7d3a
[shardformer] update llama2/opt finetune example and fix llama2 policy ( #4645 )
...
* [shardformer] update shardformer readme
[shardformer] update shardformer readme
[shardformer] update shardformer readme
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] change dataset
* [shardformer] change dataset
* [shardformer] fix CI
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
[example] update opt example
[example] resolve comments
fix
fix
1 year ago
Hongxin Liu
a686f9ddc8
[devops] fix concurrency group and compatibility test ( #4665 )
...
* [devops] fix concurrency group
* [devops] fix compatibility test
* [devops] fix tensornvme install
* [devops] fix tensornvme install
* [devops] fix colossalai install
1 year ago
Baizhou Zhang
295b38fecf
[example] update vit example for hybrid parallel plugin ( #4641 )
...
* update vit example for hybrid plugin
* reset tp/pp size
* fix dataloader iteration bug
* update optimizer passing in evaluation/add grad_accum
* change criterion
* wrap tqdm
* change grad_accum to grad_checkpoint
* fix pbar
1 year ago
Baizhou Zhang
660eed9124
[pipeline] set optimizer to optional in execute_pipeline ( #4630 )
...
* set optimizer to optional in execute_pipeline
* arrange device and mixed precision in booster init
* fix execute_pipeline in booster.py
1 year ago
eric8607242
c3d5fa3bac
[shardformer] Support customized policy for llamav2 based model with HybridParallelPlugin ( #4624 )
...
* Enable policy assignment in HybridPlugin and enable llama policy for llamav2
* Remove Policy from Plugin
* revert changes of plugin
HybridParallelModule
* revert changes in plugin
* upgrade transformers
* revert transformers version
---------
Co-authored-by: flybird11111 <1829166702@qq.com>
1 year ago
Hongxin Liu
9709b8f502
[release] update version ( #4623 )
1 year ago
Hongxin Liu
efba0f44b9
Merge pull request #4612 from hpcaitech/feature/shardformer
...
[shardformer] update hybrid parallel plugin and fix bugs
1 year ago
Hongxin Liu
fae6c92ead
Merge branch 'main' into feature/shardformer
1 year ago
Hongxin Liu
ac178ca5c1
[legacy] move builder and registry to legacy ( #4603 )
1 year ago
Hongxin Liu
8accecd55b
[legacy] move engine to legacy ( #4560 )
...
* [legacy] move engine to legacy
* [example] fix seq parallel example
* [example] fix seq parallel example
* [test] test gemini plugin hang
* [test] test gemini plugin hang
* [test] test gemini plugin hang
* [test] test gemini plugin hang
* [test] test gemini plugin hang
* [example] update seq parallel requirements
1 year ago
Hongxin Liu
89fe027787
[legacy] move trainer to legacy ( #4545 )
...
* [legacy] move trainer to legacy
* [doc] update docs related to trainer
* [test] ignore legacy test
1 year ago
Hongxin Liu
bd18678478
[test] fix gemini checkpoint and gpt test ( #4620 )
1 year ago
Hongxin Liu
807e01a4ba
[zero] hotfix master param sync ( #4618 )
...
* [zero] add method to update master params
* [zero] update zero plugin
* [plugin] update low level zero plugin
1 year ago
Hongxin Liu
e71d245293
[test] ignore gpt2 shardformer test ( #4619 )
1 year ago