Hongxin Liu
b8e770c832
[test] merge old components_to_test into model zoo ( #4945 )
* [test] add custom models in model zoo
* [test] update legacy test
* [test] update model zoo
* [test] update gemini test
* [test] remove components to test
2023-10-20 10:35:08 +08:00
Cuiqing Li
3a41e8304e
[Refactor] Integrated some lightllm kernels into token-attention ( #4946 )
* add some req for inference
* clean codes
* add codes
* add some lightllm deps
* clean codes
* hello
* delete rms files
* add some comments
* add comments
* add doc
* add lightllm deps
* add lightllm chatglm2 kernels
* add lightllm chatglm2 kernels
* replace rotary embedding with lightllm kernel
* add some comments
* add some comments
* add some comments
* add
* replace fwd kernel att1
* fix an arg
* add
* add
* fix token attention
* add some comments
* clean codes
* modify comments
* fix readme
* fix bug
* fix bug
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
2023-10-19 22:22:47 +08:00
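The commit above swaps the decode-phase attention over to lightllm's token-attention kernels. For orientation, here is a plain-PyTorch sketch of what such a kernel computes at each decoding step (one new query token attending over the cached K/V); names and shapes are illustrative, not lightllm's API:

```python
import math
import torch

def token_attention_ref(q, k_cache, v_cache):
    """Reference decode-step attention: a single new query token
    attends over all cached keys/values.
    q: (heads, dim); k_cache, v_cache: (seq_len, heads, dim)."""
    scores = torch.einsum("hd,shd->hs", q, k_cache) / math.sqrt(q.shape[-1])
    probs = torch.softmax(scores, dim=-1)               # (heads, seq_len)
    return torch.einsum("hs,shd->hd", probs, v_cache)   # (heads, dim)

q = torch.randn(8, 64)
k_cache = torch.randn(128, 8, 64)
v_cache = torch.randn(128, 8, 64)
out = token_attention_ref(q, k_cache, v_cache)          # (8, 64)
```

The fused kernel's advantage is computing this for a whole batch without materializing the intermediate score matrices; the sketch only shows the math being fused.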
github-actions[bot]
486d06a2d5
[format] applied code formatting on changed files in pull request 4820 ( #4886 )
Co-authored-by: github-actions <github-actions@github.com>
2023-10-18 11:46:37 +08:00
Zhongkai Zhao
c7aa319ba0
[test] add no master test for low level zero plugin ( #4934 )
2023-10-18 11:41:23 +08:00
Hongxin Liu
1f5d2e8062
[hotfix] fix torch 2.0 compatibility ( #4936 )
* [hotfix] fix launch
* [test] fix test gemini optim
* [shardformer] fix vit
2023-10-18 11:05:25 +08:00
Baizhou Zhang
21ba89cab6
[gemini] support gradient accumulation ( #4869 )
* add test
* fix no_sync bug in low level zero plugin
* fix test
* add argument for grad accum
* add grad accum in backward hook for gemini
* finish implementation, rewrite tests
* fix test
* skip stuck model in low level zero test
* update doc
* optimize communication & fix gradient checkpoint
* modify doc
* cleaning codes
* update cpu adam fp16 case
2023-10-17 14:07:21 +08:00
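Gradient accumulation as added here runs several backward passes before a single optimizer step; Gemini wires this up through backward hooks, but the underlying loop is the familiar one. A minimal generic sketch (not the plugin's API):

```python
import torch

def train_with_grad_accum(model, optimizer, data_iter, accum_steps=4):
    # Scale each loss by 1/accum_steps so the summed gradient matches
    # one large-batch step; only step/zero every accum_steps batches.
    optimizer.zero_grad()
    for step, (x, y) in enumerate(data_iter, start=1):
        loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
        loss.backward()  # gradients accumulate into p.grad
        if step % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]
train_with_grad_accum(model, opt, iter(data))
```

The `no_sync` fix in the same PR addresses the distributed variant of this loop: gradient synchronization is skipped on the intermediate accumulation steps and performed only on the stepping one.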
Hongxin Liu
4f68b3f10c
[kernel] support pure fp16 for cpu adam and update gemini optim tests ( #4921 )
* [kernel] support pure fp16 for cpu adam (#4896 )
* [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919 )
* [kernel] fix cpu adam
* [test] update gemini optim test
2023-10-16 21:56:53 +08:00
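"Pure fp16" here means parameters, gradients, and optimizer states are all stored in fp16, with the kernel widening to fp32 internally for the update. A sketch of that numeric pattern (conceptual only; the real CPU Adam kernel is vectorized C++):

```python
import torch

def adam_step_fp16(p, g, m, v, step, lr=1e-3,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    # Widen fp16 inputs to fp32, do the Adam math, cast back to fp16.
    p32, g32, m32, v32 = (t.float() for t in (p, g, m, v))
    m32 = beta1 * m32 + (1 - beta1) * g32
    v32 = beta2 * v32 + (1 - beta2) * g32 * g32
    m_hat = m32 / (1 - beta1 ** step)   # bias correction
    v_hat = v32 / (1 - beta2 ** step)
    p32 = p32 - lr * m_hat / (v_hat.sqrt() + eps)
    return p32.half(), m32.half(), v32.half()
```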
Xu Kai
611a5a80ca
[inference] Add smoothquant for llama ( #4904 )
* [inference] add int8 rotary embedding kernel for smoothquant (#4843 )
* [inference] add smoothquant llama attention (#4850 )
* add smoothquant llama attention
* remove useless code
* remove useless code
* fix import error
* rename file name
* [inference] add silu linear fusion for smoothquant llama mlp (#4853 )
* add silu linear
* update skip condition
* catch smoothquant cuda lib exception
* process exception for tests
* [inference] add llama mlp for smoothquant (#4854 )
* add llama mlp for smoothquant
* fix down out scale
* remove duplicate lines
* add llama mlp check
* delete useless code
* [inference] add smoothquant llama (#4861 )
* add smoothquant llama
* fix attention accuracy
* fix accuracy
* add kv cache and save pretrained
* refactor example
* delete smooth
* refactor code
* [inference] add smooth function and delete useless code for smoothquant (#4895 )
* add smooth function and delete useless code
* update datasets
* remove duplicate import
* delete useless file
* refactor codes (#4902 )
* refactor code
* add license
* add torch-int and smoothquant license
2023-10-16 11:28:44 +08:00
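For context on the kernels being added: SmoothQuant's core trick is migrating activation outliers into the weights with a per-input-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha), so that (X/s) @ (s*W) equals X @ W while both factors quantize well to int8. A sketch of the published method, not ColossalAI's exact code:

```python
import torch

def smooth(x_absmax, w, alpha=0.5):
    """x_absmax: per-channel activation max |X_j|, shape (c_in,)
    w: weight of a linear layer, shape (c_in, c_out)."""
    w_absmax = w.abs().amax(dim=1)   # max per input channel
    return (x_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

x = torch.randn(4, 8).abs() * 10     # activations with outliers
w = torch.randn(8, 16)
s = smooth(x.amax(dim=0), w)
y_ref = x @ w
y_smoothed = (x / s) @ (w * s.unsqueeze(1))   # mathematically identical
assert torch.allclose(y_ref, y_smoothed, atol=1e-4)
```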
Xu Kai
77a9328304
[inference] add llama2 support ( #4898 )
* add llama2 support
* fix multi group bug
2023-10-13 13:09:23 +08:00
Baizhou Zhang
39f2582e98
[hotfix] fix lr scheduler bug in torch 2.0 ( #4864 )
2023-10-12 14:04:24 +08:00
littsk
83b52c56cd
[feature] Add clip_grad_norm for hybrid_parallel_plugin ( #4837 )
* Add clip_grad_norm for hybrid_parallel_plugin
* polish code
* add unittests
* Move tp to a higher-level optimizer interface.
* bug fix
* polish code
2023-10-12 11:32:37 +08:00
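Under hybrid parallelism each rank only holds a shard of the gradients, so the clipping threshold must be computed against the global norm: all-reduce the local squared norms first, then scale. A sketch of the technique (the plugin's actual interface lives on its optimizer wrapper):

```python
import torch
import torch.distributed as dist

def clip_grad_norm_(params, max_norm, process_group=None):
    grads = [p.grad for p in params if p.grad is not None]
    assert grads, "no gradients to clip"
    # Sum of squared norms over the local shard...
    local_sq = sum(g.float().pow(2).sum() for g in grads)
    # ...summed across ranks to get the global squared norm.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(local_sq, group=process_group)
    total_norm = local_sq.sqrt()
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for g in grads:
            g.mul_(clip_coef)
    return total_norm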
Hongxin Liu
df63564184
[gemini] support amp o3 for gemini ( #4872 )
* [gemini] support no reuse fp16 chunk
* [gemini] support no master weight for optim
* [gemini] support no master weight for gemini ddp
* [test] update gemini tests
* [test] update gemini tests
* [plugin] update gemini plugin
* [test] fix gemini checkpointio test
* [test] fix gemini checkpoint io
2023-10-12 10:39:08 +08:00
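"AMP O3" denotes running without fp32 master weights: the optimizer updates the fp16 parameters directly, halving the optimizer-side parameter memory at some precision cost. Illustratively (not Gemini's API):

```python
import torch

model = torch.nn.Linear(16, 16).half()   # pure fp16 parameters

# O2-style: keep fp32 master copies, step on those, copy back to fp16.
masters = [p.detach().float() for p in model.parameters()]

# O3-style ("no master weight", what this PR enables): step on the
# fp16 params themselves -- no fp32 copies, no copy-back.
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
```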
littsk
ffd9a3cbc9
[hotfix] fix bug in sequence parallel test ( #4887 )
2023-10-11 19:30:41 +08:00
Xu Kai
fdec650bb4
fix test llama ( #4884 )
2023-10-11 17:43:01 +08:00
Bin Jia
08a9f76b2f
[Pipeline Inference] Sync pipeline inference branch to main ( #4820 )
* [pipeline inference] pipeline inference (#4492 )
* add pp stage manager as circle stage
* fix a bug when create process group
* add ppinfer basic framework
* add micro batch manager and support kvcache-pp gpt2 fwd
* add generate schedule
* use mb size to control mb number
* support generate with kv cache
* add output, remove unused code
* add test
* reuse shardformer to build model
* refactor some code and use the same attribute name of hf
* fix review and add test for generation
* remove unused file
* fix CI
* add cache clear
* fix code error
* fix typo
* [Pipeline inference] Modify to tieweight (#4599 )
* add pp stage manager as circle stage
* fix a bug when create process group
* add ppinfer basic framework
* add micro batch manager and support kvcache-pp gpt2 fwd
* add generate schedule
* use mb size to control mb number
* support generate with kv cache
* add output, remove unused code
* add test
* reuse shardformer to build model
* refactor some code and use the same attribute name of hf
* fix review and add test for generation
* remove unused file
* modify the way of saving newtokens
* modify to tieweight
* modify test
* remove unused file
* solve review
* add docstring
* [Pipeline inference] support llama pipeline inference (#4647 )
* support llama pipeline inference
* remove tie weight operation
* [pipeline inference] Fix the blocking of communication when ppsize is 2 (#4708 )
* add benchmark verbose
* fix export tokens
* fix benchmark verbose
* add P2POp style to do p2p communication
* modify schedule as p2p type when ppsize is 2
* remove unused code and add docstring
* [Pipeline inference] Refactor code, add docstring, fix bug (#4790 )
* add benchmark script
* update argparse
* fix fp16 load
* refactor code style
* add docstring
* polish code
* fix test bug
* [Pipeline inference] Add pipeline inference docs (#4817 )
* add readme doc
* add an icon
* Add performance
* update table of contents
* refactor code (#4873 )
2023-10-11 11:40:06 +08:00
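One fix in this branch replaces blocking send/recv with batched `P2POp`s, which matters when ppsize is 2 because each stage is simultaneously the other's sender and receiver. A sketch of the pattern with `torch.distributed` (assumes an initialized process group; tensors and rank arguments are placeholders):

```python
import torch.distributed as dist

def exchange_hidden_states(send_tensor, recv_tensor, next_rank, prev_rank):
    # Posting both operations together lets the backend schedule them
    # without the deadlock a blocking send/recv pair can hit when the
    # two stages are each other's only neighbor.
    ops = [
        dist.P2POp(dist.isend, send_tensor, next_rank),
        dist.P2POp(dist.irecv, recv_tensor, prev_rank),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
```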
Hongxin Liu
cb3a25a062
[checkpointio] hotfix torch 2.0 compatibility ( #4824 )
2023-10-07 10:45:52 +08:00
Zhongkai Zhao
db40e086c8
[test] modify model supporting part of low_level_zero plugin (including corresponding docs)
2023-10-05 15:10:31 +08:00
Xu Kai
d1fcc0fa4d
[infer] fix test bug ( #4838 )
* fix test bug
* delete useless code
* fix typo
2023-10-04 10:01:03 +08:00
Jianghai
013a4bedf0
[inference] fix import bug and delete useless init ( #4830 )
* fix import bug and release useless init
* fix
* fix
* fix
2023-10-04 09:18:45 +08:00
Hongxin Liu
4965c0dabd
[lazy] support from_pretrained ( #4801 )
* [lazy] patch from pretrained
* [lazy] fix from pretrained and add tests
* [devops] update ci
2023-09-26 11:04:11 +08:00
Baizhou Zhang
64a08b2dc3
[checkpointio] support unsharded checkpointIO for hybrid parallel ( #4774 )
* support unsharded saving/loading for model
* support optimizer unsharded saving
* update doc
* support unsharded loading for optimizer
* small fix
2023-09-26 10:58:03 +08:00
Jianghai
ce7ade3882
[inference] chatglm2 infer demo ( #4724 )
* add chatglm2
* add
* gather needed kernels
* fix some bugs
* finish context forward
* finish context stage
* fix
* add
* pause
* add
* fix bugs
* finish chatglm
* fix bug
* change some logic
* fix bugs
* change some logics
* add
* add
* add
* fix
* fix tests
* fix
2023-09-22 11:12:50 +08:00
Xu Kai
946ab56c48
[feature] add gptq for inference ( #4754 )
* [gptq] add gptq kernel (#4416 )
* add gptq
* refactor code
* fix tests
* replace auto-gptq
* rename inference/quant
* refactor test
* add auto-gptq as an option
* reset requirements
* change assert and check auto-gptq
* add import warnings
* change test flash attn version
* remove example
* change requirements of flash_attn
* modify tests
* [skip ci] change requirements-test
* [gptq] faster gptq cuda kernel (#4494 )
* [skip ci] add cuda kernels
* add license
* [skip ci] fix max_input_len
* format files & change test size
* [skip ci]
* [gptq] add gptq tensor parallel (#4538 )
* add gptq tensor parallel
* add gptq tp
* delete print
* add test gptq check
* add test auto gptq check
* [gptq] combine gptq and kv cache manager (#4706 )
* combine gptq and kv cache manager
* add init bits
* delete useless code
* add model path
* delete useless print and update test
* delete useless import
* move option gptq to shard config
* change replace linear to shardformer
* update bloom policy
* delete useless code
* fix import bug and delete useless code
* change colossalai/gptq to colossalai/quant/gptq
* update import linear for tests
* delete useless code and mv gptq_kernel to kernel directory
* fix triton kernel
* add triton import
2023-09-22 11:02:50 +08:00
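For context on the kernels: GPTQ stores weights group-wise at a few bits plus a per-group scale and zero-point, and the CUDA/Triton kernels dequantize on the fly inside the matmul. A sketch of that storage format only; the GPTQ error-compensating rounding solver is omitted and all names are illustrative:

```python
import torch

def quantize_groupwise(w, group_size=128, bits=4):
    """Affine quantization of a flat weight into groups of group_size."""
    qmax = 2 ** bits - 1
    w = w.reshape(-1, group_size)
    wmin = w.amin(dim=1, keepdim=True)
    wmax = w.amax(dim=1, keepdim=True)
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = (-wmin / scale).round()
    q = (w / scale + zero).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, zero

def dequantize_groupwise(q, scale, zero):
    return (q.float() - zero) * scale

w = torch.randn(1024)
q, s, z = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s, z).reshape(-1)
print((w - w_hat).abs().max())   # small per-group quantization error
```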
Hongxin Liu
3e05c07bb8
[lazy] support torch 2.0 ( #4763 )
* [lazy] support _like methods and clamp
* [lazy] pass transformers models
* [lazy] fix device move and requires grad
* [lazy] fix requires grad and refactor api
* [lazy] fix requires grad
2023-09-21 16:30:23 +08:00
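Lazy initialization in the torch 2.0 era leans on the meta device: construct modules without allocating storage, then materialize and initialize later (for instance, after sharding decisions). A generic sketch of that pattern, not `LazyInitContext`'s API:

```python
import torch

# Build on the meta device: shapes and dtypes exist, storage does not.
with torch.device("meta"):
    model = torch.nn.Linear(4096, 4096)

model = model.to_empty(device="cpu")   # allocate real, uninitialized storage
with torch.no_grad():                  # then run the real initialization
    for p in model.parameters():
        torch.nn.init.normal_(p, std=0.02)
```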
Baizhou Zhang
c0a033700c
[shardformer] fix master param sync for hybrid plugin/rewrite unwrapping logic ( #4758 )
* fix master param sync for hybrid plugin
* rewrite unwrap for ddp/fsdp
* rewrite unwrap for zero/gemini
* rewrite unwrap for hybrid plugin
* fix gemini unwrap
* fix bugs
2023-09-20 18:29:37 +08:00
Hongxin Liu
079bf3cb26
[misc] update pre-commit and run all files ( #4752 )
* [misc] update pre-commit
* [misc] run pre-commit
* [misc] remove useless configuration files
* [misc] ignore cuda for clang-format
2023-09-19 14:20:26 +08:00
Hongxin Liu
b5f9e37c70
[legacy] clean up legacy code ( #4743 )
* [legacy] remove outdated codes of pipeline (#4692 )
* [legacy] remove cli of benchmark and update optim (#4690 )
* [legacy] remove cli of benchmark and update optim
* [doc] fix cli doc test
* [legacy] fix engine clip grad norm
* [legacy] remove outdated colo tensor (#4694 )
* [legacy] remove outdated colo tensor
* [test] fix test import
* [legacy] move outdated zero to legacy (#4696 )
* [legacy] clean up utils (#4700 )
* [legacy] clean up utils
* [example] update examples
* [legacy] clean up amp
* [legacy] fix amp module
* [legacy] clean up gpc (#4742 )
* [legacy] clean up context
* [legacy] clean core, constants and global vars
* [legacy] refactor initialize
* [example] fix examples ci
* [example] fix examples ci
* [legacy] fix tests
* [example] fix gpt example
* [example] fix examples ci
* [devops] fix ci installation
* [example] fix examples ci
2023-09-18 16:31:06 +08:00
Pengtai Xu
cd4e61d149
[legacy] remove deterministic data loader test
2023-09-15 15:52:18 +08:00
digger yu
9c2feb2f0b
fix some typos in colossalai/device, colossalai/tensor, etc. ( #4171 )
Co-authored-by: flybird11111 <1829166702@qq.com>
2023-09-12 17:41:52 +08:00
Cuiqing Li
bce0f16702
[Feature] The first PR to Add TP inference engine, kv-cache manager and related kernels for our inference system ( #4577 )
* [infer] Infer/llama demo (#4503 )
* add
* add infer example
* finish
* finish
* stash
* fix
* [Kernels] add inference token attention kernel (#4505 )
* add token forward
* fix tests
* fix comments
* add try import triton
* add adapted license
* add tests check
* [Kernels] add necessary kernels (llama & bloom) for attention forward and kv-cache manager (#4485 )
* added _vllm_rms_norm
* change place
* added tests
* added tests
* modify
* adding kernels
* added tests
* adding kernels
* modify
* added
* updating kernels
* adding tests
* added tests
* kernel change
* submit
* modify
* added
* edit comments
* change name
* change comments and fix import
* add
* added
* combine codes (#4509 )
* [feature] add KV cache manager for llama & bloom inference (#4495 )
* add kv cache memory manager
* add stateinfo during inference
* format
* format
* rename file
* add kv cache test
* revise on BatchInferState
* file dir change
* [Bug FIx] import llama context ops fix (#4524 )
* added _vllm_rms_norm
* change place
* added tests
* added tests
* modify
* adding kernels
* added tests
* adding kernels
* modify
* added
* updating kernels
* adding tests
* added tests
* kernel change
* submit
* modify
* added
* edit comments
* change name
* change comments and fix import
* add
* added
* fix
* add ops into init.py
* add
* [Infer] Add TPInferEngine and fix file path (#4532 )
* add engine for TP inference
* move file path
* update path
* fix TPInferEngine
* remove unused file
* add engine test demo
* revise TPInferEngine
* fix TPInferEngine, add test
* fix
* Add Inference test for llama (#4508 )
* add kv cache memory manager
* add stateinfo during inference
* add
* add infer example
* finish
* finish
* format
* format
* rename file
* add kv cache test
* revise on BatchInferState
* add inference test for llama
* fix conflict
* feature: add some new features for llama engine
* adapt colossalai triton interface
* Change the parent class of llama policy
* add nvtx
* move llama inference code to tensor_parallel
* fix __init__.py
* rm tensor_parallel
* fix: fix bugs in auto_policy.py
* fix: rm some unused codes
* mv colossalai/tpinference to colossalai/inference/tensor_parallel
* change __init__.py
* save change
* fix engine
* Bug fix: Fix hang
* remove llama_infer_engine.py
---------
Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [infer] Add Bloom inference policy and replaced methods (#4512 )
* add bloom inference methods and policy
* enable pass BatchInferState from model forward
* revise bloom infer layers/policies
* add engine for inference (draft)
* add test for bloom infer
* fix bloom infer policy and flow
* revise bloom test
* fix bloom file path
* remove unused codes
* fix bloom modeling
* fix dir typo
* fix trivial
* fix policy
* clean pr
* trivial fix
* Revert "[infer] Add Bloom inference policy and replaced methods (#4512 )" (#4552 )
This reverts commit 17cfa57140.
* [Doc] Add colossal inference doc (#4549 )
* create readme
* add readme.md
* fix typos
* [infer] Add Bloom inference policy and replaced methods (#4553 )
* add bloom inference methods and policy
* enable pass BatchInferState from model forward
* revise bloom infer layers/policies
* add engine for inference (draft)
* add test for bloom infer
* fix bloom infer policy and flow
* revise bloom test
* fix bloom file path
* remove unused codes
* fix bloom modeling
* fix dir typo
* fix trivial
* fix policy
* clean pr
* trivial fix
* trivial
* Fix Bugs In Llama Model Forward (#4550 )
* add kv cache memory manager
* add stateinfo during inference
* add
* add infer example
* finish
* finish
* format
* format
* rename file
* add kv cache test
* revise on BatchInferState
* add inference test for llama
* fix conflict
* feature: add some new features for llama engine
* adapt colossalai triton interface
* Change the parent class of llama policy
* add nvtx
* move llama inference code to tensor_parallel
* fix __init__.py
* rm tensor_parallel
* fix: fix bugs in auto_policy.py
* fix: rm some unused codes
* mv colossalai/tpinference to colossalai/inference/tensor_parallel
* change __init__.py
* save change
* fix engine
* Bug fix: Fix hang
* remove llama_infer_engine.py
* bug fix: fix bugs about infer_state.is_context_stage
* remove policies
* fix: delete unused code
* fix: delete unused code
* remove unused code
* fix conflict
---------
Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [doc] add colossal inference fig (#4554 )
* create readme
* add readme.md
* fix typos
* upload fig
* [NFC] fix docstring for colossal inference (#4555 )
Fix docstring and comments in kv cache manager and bloom modeling
* fix docstring in llama modeling (#4557 )
* [Infer] check import vllm (#4559 )
* change import vllm
* import apply_rotary_pos_emb
* change import location
* [DOC] add installation req (#4561 )
* add installation req
* fix
* slight change
* remove empty
* [Feature] rms-norm transfer into inference llama.py (#4563 )
* add installation req
* fix
* slight change
* remove empty
* add rmsnorm policy
* add
* clean codes
* [infer] Fix tp inference engine (#4564 )
* fix engine prepare data
* add engine test
* use bloom for testing
* revise on test
* revise on test
* reset shardformer llama (#4569 )
* [infer] Fix engine - tensors on different devices (#4570 )
* fix diff device in engine
* [codefactor] Feature/colossal inference (#4579 )
* code factors
* remove
* change coding (#4581 )
* [doc] complete README of colossal inference (#4585 )
* complete fig
* Update README.md
* [doc]update readme (#4586 )
* update readme
* Update README.md
* bug fix: fix bugs in llama and bloom (#4588 )
* [BUG FIX]Fix test engine in CI and non-vllm kernels llama forward (#4592 )
* fix tests
* clean
* clean
* fix bugs
* add
* fix llama non-vllm kernels bug
* modify
* clean codes
* [Kernel]Rmsnorm fix (#4598 )
* fix tests
* clean
* clean
* fix bugs
* add
* fix llama non-vllm kernels bug
* modify
* clean codes
* add triton rmsnorm
* delete vllm kernel flag
* [Bug Fix]Fix bugs in llama (#4601 )
* fix tests
* clean
* clean
* fix bugs
* add
* fix llama non-vllm kernels bug
* modify
* clean codes
* bug fix: remove rotary_positions_ids
---------
Co-authored-by: cuiqing.li <lixx3527@gmail.com>
* [kernel] Add triton layer norm & replace norm for bloom (#4609 )
* add layernorm for inference
* add test for layernorm kernel
* add bloom layernorm replacement policy
* trivial: path
* [Infer] Bug fix rotary embedding in llama (#4608 )
* fix rotary embedding
* delete print
* fix init seq len bug
* rename pytest
* add benchmark for llama
* refactor codes
* delete useless code
* [bench] Add bloom inference benchmark (#4621 )
* add bloom benchmark
* readme - update benchmark res
* trivial - uncomment for testing (#4622 )
* [Infer] add check triton and cuda version for tests (#4627 )
* fix rotary embedding
* delete print
* fix init seq len bug
* rename pytest
* add benchmark for llama
* refactor codes
* delete useless code
* add check triton and cuda
* Update sharder.py (#4629 )
* [Inference] Hot fix some bugs and typos (#4632 )
* fix
* fix test
* fix conflicts
* [typo]Comments fix (#4633 )
* fallback
* fix comments
* bug fix: fix some bugs in test_llama and test_bloom (#4635 )
* [Infer] delete benchmark in tests and fix bug for llama and bloom (#4636 )
* fix rotary embedding
* delete print
* fix init seq len bug
* rename pytest
* add benchmark for llama
* refactor codes
* delete useless code
* add check triton and cuda
* delete benchmark and fix infer bugs
* delete benchmark for tests
* delete useless code
* delete benchmark function in utils
* [Fix] Revise TPInferEngine, inference tests and benchmarks (#4642 )
* [Fix] revise TPInferEngine methods and inference tests
* fix llama/bloom infer benchmarks
* fix infer tests
* trivial fix: benchmarks
* trivial
* trivial: rm print
* modify utils filename for infer ops test (#4657 )
* [Infer] Fix TPInferEngine init & inference tests, benchmarks (#4670 )
* fix engine funcs
* TPInferEngine: receive shard config in init
* benchmarks: revise TPInferEngine init
* benchmarks: remove pytest decorator
* trivial fix
* use small model for tests
* [NFC] use args for infer benchmarks (#4674 )
* revise infer default (#4683 )
* [Fix] optimize/shard model in TPInferEngine init (#4684 )
* remove using orig model in engine
* revise inference tests
* trivial: rename
---------
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
2023-09-12 01:22:56 +08:00
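The kv-cache manager introduced here preallocates the cache up front and hands out token slots to in-flight requests. A toy sketch of the idea (layout and names are illustrative, not `BatchInferState`'s):

```python
import torch

class KVCacheManager:
    """Preallocated KV cache with free-slot allocation."""

    def __init__(self, max_tokens, num_layers, num_heads, head_dim,
                 dtype=torch.float16):
        self.k = torch.zeros(num_layers, max_tokens, num_heads, head_dim,
                             dtype=dtype)
        self.v = torch.zeros_like(self.k)
        self.free = list(range(max_tokens))   # free physical token slots

    def alloc(self, n):
        """Grab n free token slots; raises if the cache is exhausted."""
        if n > len(self.free):
            raise RuntimeError("KV cache exhausted")
        slots, self.free = self.free[:n], self.free[n:]
        return torch.tensor(slots, dtype=torch.long)

    def release(self, slots):
        self.free.extend(slots.tolist())

mgr = KVCacheManager(max_tokens=2048, num_layers=32, num_heads=32, head_dim=128)
slots = mgr.alloc(16)   # cache positions for 16 new tokens
mgr.release(slots)      # return them when the request finishes
```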
flybird11111
eedaa3e1ef
[shardformer] fix gpt2 double head ( #4663 )
* [shardformer]fix gpt2 test
[shardformer]fix gpt2 test
[shardformer]fix gpt2 test
* fix
* [shardformer] add todo
* [shardformer] add todo
2023-09-11 18:35:03 +08:00
Hongxin Liu
554aa9592e
[legacy] move communication and nn to legacy and refactor logger ( #4671 )
* [legacy] move communication to legacy (#4640 )
* [legacy] refactor logger and clean up legacy codes (#4654 )
* [legacy] make logger independent to gpc
* [legacy] make optim independent to registry
* [legacy] move test engine to legacy
* [legacy] move nn to legacy (#4656 )
* [legacy] move nn to legacy
* [checkpointio] fix save hf config
* [test] remove useless rpc pp test
* [legacy] fix nn init
* [example] skip tutorial hybrid parallel example
* [devops] test doc check
* [devops] test doc check
2023-09-11 16:24:28 +08:00
flybird11111
7486ed7d3a
[shardformer] update llama2/opt finetune example and fix llama2 policy ( #4645 )
* [shardformer] update shardformer readme
[shardformer] update shardformer readme
[shardformer] update shardformer readme
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] change dataset
* [shardformer] change dataset
* [shardformer] fix CI
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
[example] update opt example
[example] resolve comments
fix
fix
2023-09-09 22:45:36 +08:00
Baizhou Zhang
660eed9124
[pipeline] set optimizer to optional in execute_pipeline ( #4630 )
* set optimizer to optional in execute_pipeline
* arrange device and mixed precision in booster init
* fix execute_pipeline in booster.py
2023-09-07 10:42:59 +08:00
Hongxin Liu
fae6c92ead
Merge branch 'main' into feature/shardformer
2023-09-05 21:54:08 +08:00
Hongxin Liu
8accecd55b
[legacy] move engine to legacy ( #4560 )
* [legacy] move engine to legacy
* [example] fix seq parallel example
* [example] fix seq parallel example
* [test] test gemini plugin hang
* [test] test gemini plugin hang
* [test] test gemini plugin hang
* [test] test gemini plugin hang
* [test] test gemini plugin hang
* [example] update seq parallel requirements
2023-09-05 21:53:10 +08:00
Hongxin Liu
89fe027787
[legacy] move trainer to legacy ( #4545 )
* [legacy] move trainer to legacy
* [doc] update docs related to trainer
* [test] ignore legacy test
2023-09-05 21:53:10 +08:00
Hongxin Liu
bd18678478
[test] fix gemini checkpoint and gpt test ( #4620 )
2023-09-05 16:02:23 +08:00
Hongxin Liu
807e01a4ba
[zero] hotfix master param sync ( #4618 )
* [zero] add method to update master params
* [zero] update zero plugin
* [plugin] update low level zero plugin
2023-09-05 15:04:02 +08:00
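The new "update master params" method goes in the working-to-master direction: refreshing the optimizer's fp32 copies after the fp16 model weights have changed (say, after loading a checkpoint), so the next step starts from the loaded values. A minimal sketch, assuming paired parameter lists:

```python
import torch

def update_master_params(working_params, master_params):
    # Copy current fp16 working weights into the fp32 master copies.
    with torch.no_grad():
        for w, m in zip(working_params, master_params):
            m.copy_(w.detach().float())
```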
Hongxin Liu
e71d245293
[test] ignore gpt2 shardformer test ( #4619 )
2023-09-05 14:21:31 +08:00
Hongxin Liu
a39a5c66fe
Merge branch 'main' into feature/shardformer
2023-09-04 23:43:13 +08:00
Baizhou Zhang
e79b1e80e2
[checkpointio] support huggingface from_pretrained for all plugins ( #4606 )
2023-09-04 23:25:01 +08:00
Jianghai
24c0768795
[shardformer] Pytree fix ( #4533 )
* pytree test
* test bert
* test bert
* test bert
* revise
* add register
* add register
2023-09-04 17:52:23 +08:00
Hongxin Liu
508ca36fe3
[pipeline] 1f1b schedule receive microbatch size ( #4589 )
2023-09-01 21:45:14 +08:00
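Letting the 1F1B schedule take a microbatch size instead of a fixed count means deriving the count per global batch; a sketch of the arithmetic (names are illustrative):

```python
def num_microbatches(batch_size: int, microbatch_size: int) -> int:
    # The schedule needs an integral number of microbatches per step.
    assert batch_size % microbatch_size == 0, \
        "global batch must divide evenly into microbatches"
    return batch_size // microbatch_size

assert num_microbatches(batch_size=32, microbatch_size=4) == 8
```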
LuGY
cbac782254
[zero] fix zero ckptIO with offload ( #4529 )
* fix zero ckptio with offload
* fix load device
* saved tensors in ckpt should be on CPU
* fix unit test
* fix unit test
* add clear cache
* save memory for CI
2023-09-01 17:41:19 +08:00
Baizhou Zhang
38ccb8b1a3
[shardformer] support from_pretrained when loading model with HybridParallelPlugin ( #4575 )
* hybrid plugin support huggingface from_pretrained
* add huggingface compatibility tests
* add folder cleaning
* fix bugs
2023-09-01 17:40:01 +08:00
Baizhou Zhang
c9625dbb63
[shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin ( #4540 )
* implement sharded optimizer saving
* add more param info
* finish implementation of sharded optimizer saving
* fix bugs in optimizer sharded saving
* add pp+zero test
* param group loading
* greedy loading of optimizer
* fix bug when loading
* implement optimizer sharded saving
* add optimizer test & arrange checkpointIO utils
* fix gemini sharding state_dict
* add verbose option
* add loading of master params
* fix typehint
* fix master/working mapping in fp16 amp
2023-08-31 14:50:47 +08:00
Baizhou Zhang
2c787d7f47
[shardformer] fix submodule replacement bug when enabling pp ( #4544 )
2023-08-31 09:57:18 +08:00
flybird11111
ec18fc7340
[shardformer] support pp+tp+zero1 tests ( #4531 )
* [shardformer] fix opt test hanging
* fix
* test
* test
* test
* fix test
* fix test
* remove print
* add fix
* [shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
2023-08-30 21:29:18 +08:00
flybird11111
d367b88785
[shardformer] fix opt test hanging ( #4521 )
* [shardformer] fix opt test hanging
* fix
* test
* test
* test
* fix test
* fix test
* remove print
* add fix
2023-08-30 14:50:34 +08:00