flybird11111
97cd0cd559
[shardformer] fix llama error when transformers upgraded. ( #5055 )
...
* fix-llama
* Update llama.py
1 year ago
flybird11111
3e02154710
[gemini] gemini support extra-dp ( #5043 )
...
* support ddp
* fix
* fix
* fix
fix
* support ddp
* fix
* fix
* fix
fix
* simplify tests
* fix
* fix
* fix
fix
fix
* fix
1 year ago
Elsa Granger
b2ad0d9e8f
[pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark ( #5017 )
...
* Use p2p
* Cannot bidirectonal send p2p
* Refactor tensor creation and serialization in P2P
communication
* Fix llama forward args in flash attention
* Add flop estimate from megatron
* Support loading weight not in weight_map when strict=False in hybrid_parallel
* Use send_forward_recv_backward, etc in 1f1b
* Use dataclass for metdata
Remove torch.cuda.synchronize() as suggested
* Add comment about the torch.cuda.synchronize for potential error
* Typo
* Update hybrid_parallel_checkpoint_io.py
* Update p2p.py
* Update one_f_one_b.py
* Update p2p.py
---------
Co-authored-by: flybird11111 <1829166702@qq.com>
1 year ago
Cuiqing Li (李崔卿)
28052a71fb
[Kernels]Update triton kernels into 2.1.0 ( #5046 )
...
* update flash-context-attention
* adding kernels
* fix
* reset
* add build script
* add building process
* add llama2 exmaple
* add colossal-llama2 test
* clean
* fall back test setting
* fix test file
* clean
* clean
* clean
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
1 year ago
Orion-Zheng
43ad0d9ef0
fix wrong EOS token in ColossalChat
1 year ago
Zhongkai Zhao
70885d707d
[hotfix] Suport extra_kwargs in ShardConfig ( #5031 )
...
* [refactor]: replace inference args with extra_kwargs in ShardConfig
* modify shardconfig
* polish code
* fix policy bug in llama
* fix bug in auto policy
* remove setattr in ShardConfig
1 year ago
flybird11111
576a2f7b10
[gemini] gemini support tensor parallelism. ( #4942 )
...
* [colossalai]fix typo
* [inference] Add smmoothquant for llama (#4904 )
* [inference] add int8 rotary embedding kernel for smoothquant (#4843 )
* [inference] add smoothquant llama attention (#4850 )
* add smoothquant llama attention
* remove uselss code
* remove useless code
* fix import error
* rename file name
* [inference] add silu linear fusion for smoothquant llama mlp (#4853 )
* add silu linear
* update skip condition
* catch smoothquant cuda lib exception
* prcocess exception for tests
* [inference] add llama mlp for smoothquant (#4854 )
* add llama mlp for smoothquant
* fix down out scale
* remove duplicate lines
* add llama mlp check
* delete useless code
* [inference] add smoothquant llama (#4861 )
* add smoothquant llama
* fix attention accuracy
* fix accuracy
* add kv cache and save pretrained
* refactor example
* delete smooth
* refactor code
* [inference] add smooth function and delete useless code for smoothquant (#4895 )
* add smooth function and delete useless code
* update datasets
* remove duplicate import
* delete useless file
* refactor codes (#4902 )
* rafactor code
* add license
* add torch-int and smoothquant license
* Update flash_attention_patch.py
To be compatible with the new change in the Transformers library, where a new argument 'padding_mask' was added to forward function of attention layer.
https://github.com/huggingface/transformers/pull/25598
* [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921 )
* [kernel] support pure fp16 for cpu adam (#4896 )
* [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919 )
* [kernel] fix cpu adam
* [test] update gemini optim test
* [format] applied code formatting on changed files in pull request 4908 (#4918 )
Co-authored-by: github-actions <github-actions@github.com>
* [gemini] support gradient accumulation (#4869 )
* add test
* fix no_sync bug in low level zero plugin
* fix test
* add argument for grad accum
* add grad accum in backward hook for gemini
* finish implementation, rewrite tests
* fix test
* skip stuck model in low level zero test
* update doc
* optimize communication & fix gradient checkpoint
* modify doc
* cleaning codes
* update cpu adam fp16 case
* [hotfix] fix torch 2.0 compatibility (#4936 )
* [hotfix] fix launch
* [test] fix test gemini optim
* [shardformer] fix vit
* [test] add no master test for low level zero plugin (#4934 )
* [format] applied code formatting on changed files in pull request 4820 (#4886 )
Co-authored-by: github-actions <github-actions@github.com>
* [nfc] fix some typo with colossalai/ docs/ etc. (#4920 )
* [Refactor] Integrated some lightllm kernels into token-attention (#4946 )
* add some req for inference
* clean codes
* add codes
* add some lightllm deps
* clean codes
* hello
* delete rms files
* add some comments
* add comments
* add doc
* add lightllm deps
* add lightllm cahtglm2 kernels
* add lightllm cahtglm2 kernels
* replace rotary embedding with lightllm kernel
* add some commnets
* add some comments
* add some comments
* add
* replace fwd kernel att1
* fix a arg
* add
* add
* fix token attention
* add some comments
* clean codes
* modify comments
* fix readme
* fix bug
* fix bug
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [test] merge old components to test to model zoo (#4945 )
* [test] add custom models in model zoo
* [test] update legacy test
* [test] update model zoo
* [test] update gemini test
* [test] remove components to test
* [inference] add reference and fix some bugs (#4937 )
* add reference and fix some bugs
* update gptq init
---------
Co-authored-by: Xu Kai <xukai16@foxamil.com>
* [Inference]ADD Bench Chatglm2 script (#4963 )
* add bench chatglm
* fix bug and make utils
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
* [Pipeline inference] Combine kvcache with pipeline inference (#4938 )
* merge kvcache with pipeline inference and refactor the code structure
* support ppsize > 2
* refactor pipeline code
* do pre-commit
* modify benchmark
* fix bench mark
* polish code
* add docstring and update readme
* refactor the code
* fix some logic bug of ppinfer
* polish readme
* fix typo
* skip infer test
* updated c++17 compiler flags (#4983 )
* [Inference] Dynamic Batching Inference, online and offline (#4953 )
* [inference] Dynamic Batching for Single and Multiple GPUs (#4831 )
* finish batch manager
* 1
* first
* fix
* fix dynamic batching
* llama infer
* finish test
* support different lengths generating
* del prints
* del prints
* fix
* fix bug
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
* [inference] Async dynamic batching (#4894 )
* finish input and output logic
* add generate
* test forward
* 1
* [inference]Re push async dynamic batching (#4901 )
* adapt to ray server
* finish async
* finish test
* del test
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
* Revert "[inference]Re push async dynamic batching (#4901 )" (#4905 )
This reverts commit fbf3c09e67
.
* Revert "[inference] Async dynamic batching (#4894 )"
This reverts commit fced140250
.
* Revert "[inference] Async dynamic batching (#4894 )" (#4909 )
This reverts commit fced140250
.
* Add Ray Distributed Environment Init Scripts
* support DynamicBatchManager base function
* revert _set_tokenizer version
* add driver async generate
* add async test
* fix bugs in test_ray_dist.py
* add get_tokenizer.py
* fix code style
* fix bugs about No module named 'pydantic' in ci test
* fix bugs in ci test
* fix bugs in ci test
* fix bugs in ci test
* [infer]Add Ray Distributed Environment Init Scripts (#4911 )
* Revert "[inference] Async dynamic batching (#4894 )"
This reverts commit fced140250
.
* Add Ray Distributed Environment Init Scripts
* support DynamicBatchManager base function
* revert _set_tokenizer version
* add driver async generate
* add async test
* fix bugs in test_ray_dist.py
* add get_tokenizer.py
* fix code style
* fix bugs about No module named 'pydantic' in ci test
* fix bugs in ci test
* fix bugs in ci test
* fix bugs in ci test
* support dynamic batch for bloom model and is_running function
* [Inference]Test for new Async engine (#4935 )
* infer engine
* infer engine
* test engine
* test engine
* new manager
* change step
* add
* test
* fix
* fix
* finish test
* finish test
* finish test
* finish test
* add license
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
* add assertion for config (#4947 )
* [Inference] Finish dynamic batching offline test (#4948 )
* test
* fix test
* fix quant
* add default
* fix
* fix some bugs
* fix some bugs
* fix
* fix bug
* fix bugs
* reset param
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
* [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965 )
* adding flash-decoding
* clean
* adding kernel
* adding flash-decoding
* add integration
* add
* adding kernel
* adding kernel
* adding triton 2.1.0 features for inference
* update bloom triton kernel
* remove useless vllm kernels
* clean codes
* fix
* adding files
* fix readme
* update llama flash-decoding
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
* fix ColossalEval (#4992 )
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
* [doc]Update doc for colossal-inference (#4989 )
* update doc
* Update README.md
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
* [hotfix] Fix the bug where process groups were not being properly released. (#4940 )
* Fix the bug where process groups were not being properly released.
* test
* Revert "test"
This reverts commit 479900c139
.
* [hotfix] fix the bug of repeatedly storing param group (#4951 )
* [doc] add supported feature diagram for hybrid parallel plugin (#4996 )
* [Pipeline Inference] Merge pp with tp (#4993 )
* refactor pipeline into new CaiInferEngine
* updata llama modeling forward
* merge tp with pp
* update docstring
* optimize test workflow and example
* fix typo
* add assert and todo
* [release] update version (#4995 )
* [release] update version
* [hotfix] fix ci
* [gemini] gemini support tp
[gemini] gemini support tp
[gemini] gemini support tp
[gemini] gemini support tp
[gemini] gemini support tp
* fix
fix
fix
* update checkpointIO
update checkpointIO
update checkpointIO
update checkpointIO
update checkpointIO
update checkpointIO
update checkpointIO
update checkpointIO
update checkpointIO
* support fused layernorm
support fused layernorm
support fused layernorm
* update fusedlayernorm
update fusedlayernorm
update fusedlayernorm
* add sequence parallel to gemini
add sequence parallel to gemini
* fix
* fix comments
fix comments
fix comments
* fix
* fix t5
* clear cache
* fix
* activate ci
* activate ci
* fix
* fix
* fix
* fix
* revert
* modify tp gather method
modify tp gather method
modify tp gather method
modify tp gather method
* fix test
---------
Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
Co-authored-by: Xu Kai <xukai16@foxamil.com>
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com>
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: littsk <1214689160@qq.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
1 year ago
Jun Gao
a4489384d5
[shardformer] Fix serialization error with Tensor Parallel state saving ( #5018 )
...
* Fix serialization error with Tensor Parallel state saving
* Refactor state_dict CPU transfer using tree_map
1 year ago
Wenhao Chen
724441279b
[moe]: fix ep/tp tests, add hierarchical all2all ( #4982 )
...
* fix: add warning for EP different behavior
* fix: use shard_data in ep & tp model
* to: add used_capacity
* fix: fix router test
* feat: add create_ep_node_group
* feat: add create_ep_hierarchical_group fn
* feat: add HierarchicalAllToAll
* test: add hierarchical all2all test
* fix: fix test errors
* fix: simplify create_ep_hierarchical_group
* fix: add hierarchical_alltoall arg
* fix: fix environ typo
* revert: revert process mesh order
* to: add todo mark
* fix: skip hierarchical_comm if torch < 1.13.1
1 year ago
Yuanchen
239cd92eff
Support mtbench ( #5025 )
...
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
1 year ago
Xuanlei Zhao
f71e63b0f3
[moe] support optimizer checkpoint ( #5015 )
...
* Refactor MoE Manager setup method
* unshard optim ckpt
* optim io
* update transformer version
* update requirements
* update ckpt
* update ckpt
* update ckpt
* fix engine
* fix engine
1 year ago
Hongxin Liu
67f5331754
[misc] add code owners ( #5024 )
1 year ago
Jianghai
ef4c14a5e2
[Inference] Fix bug in ChatGLM2 Tensor Parallelism ( #5014 )
...
* fix bug
* fix
* fix multiquery
* fix multiquery
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
1 year ago
github-actions[bot]
c36e782d80
[format] applied code formatting on changed files in pull request 4926 ( #5007 )
...
Co-authored-by: github-actions <github-actions@github.com>
1 year ago
littsk
1a3315e336
[hotfix] Add layer norm gradients all-reduce for sequence parallel ( #4926 )
...
* [hotfix] Add layer norm gradients all-reduce for sequence parallel. (#4915 )
* Add layer norm gradients all-reduce for sequence parallel.
* skip pipeline inference test
* [hotfix] fixing polices of sequence parallel (#4922 )
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
---------
Co-authored-by: littsk <1214689160@qq.com>
* Hotfix/add grad all reduce for sequence parallel (#4927 )
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
* fix bug using wrong variables
---------
Co-authored-by: littsk <1214689160@qq.com>
* fix policy initialization
* fix bloom and chatglm policices
* polish code of handling layernorm
* fix moe module
* polish code of class initializing
---------
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
1 year ago
Baizhou Zhang
d99b2c961a
[hotfix] fix grad accumulation plus clipping for gemini ( #5002 )
1 year ago
Xuanlei Zhao
dc003c304c
[moe] merge moe into main ( #4978 )
...
* update moe module
* support openmoe
1 year ago
Hongxin Liu
8993c8a817
[release] update version ( #4995 )
...
* [release] update version
* [hotfix] fix ci
1 year ago
Bin Jia
b6696beb04
[Pipeline Inference] Merge pp with tp ( #4993 )
...
* refactor pipeline into new CaiInferEngine
* updata llama modeling forward
* merge tp with pp
* update docstring
* optimize test workflow and example
* fix typo
* add assert and todo
1 year ago
ppt0011
335cb105e2
[doc] add supported feature diagram for hybrid parallel plugin ( #4996 )
1 year ago
Baizhou Zhang
c040d70aa0
[hotfix] fix the bug of repeatedly storing param group ( #4951 )
1 year ago
littsk
be82b5d4ca
[hotfix] Fix the bug where process groups were not being properly released. ( #4940 )
...
* Fix the bug where process groups were not being properly released.
* test
* Revert "test"
This reverts commit 479900c139
.
1 year ago
Cuiqing Li (李崔卿)
4f0234f236
[doc]Update doc for colossal-inference ( #4989 )
...
* update doc
* Update README.md
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
1 year ago
Yuanchen
abe071b663
fix ColossalEval ( #4992 )
...
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
1 year ago
Cuiqing Li
459a88c806
[Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention ( #4965 )
...
* adding flash-decoding
* clean
* adding kernel
* adding flash-decoding
* add integration
* add
* adding kernel
* adding kernel
* adding triton 2.1.0 features for inference
* update bloom triton kernel
* remove useless vllm kernels
* clean codes
* fix
* adding files
* fix readme
* update llama flash-decoding
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
1 year ago
Jianghai
cf579ff46d
[Inference] Dynamic Batching Inference, online and offline ( #4953 )
...
* [inference] Dynamic Batching for Single and Multiple GPUs (#4831 )
* finish batch manager
* 1
* first
* fix
* fix dynamic batching
* llama infer
* finish test
* support different lengths generating
* del prints
* del prints
* fix
* fix bug
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
* [inference] Async dynamic batching (#4894 )
* finish input and output logic
* add generate
* test forward
* 1
* [inference]Re push async dynamic batching (#4901 )
* adapt to ray server
* finish async
* finish test
* del test
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
* Revert "[inference]Re push async dynamic batching (#4901 )" (#4905 )
This reverts commit fbf3c09e67
.
* Revert "[inference] Async dynamic batching (#4894 )"
This reverts commit fced140250
.
* Revert "[inference] Async dynamic batching (#4894 )" (#4909 )
This reverts commit fced140250
.
* Add Ray Distributed Environment Init Scripts
* support DynamicBatchManager base function
* revert _set_tokenizer version
* add driver async generate
* add async test
* fix bugs in test_ray_dist.py
* add get_tokenizer.py
* fix code style
* fix bugs about No module named 'pydantic' in ci test
* fix bugs in ci test
* fix bugs in ci test
* fix bugs in ci test
* [infer]Add Ray Distributed Environment Init Scripts (#4911 )
* Revert "[inference] Async dynamic batching (#4894 )"
This reverts commit fced140250
.
* Add Ray Distributed Environment Init Scripts
* support DynamicBatchManager base function
* revert _set_tokenizer version
* add driver async generate
* add async test
* fix bugs in test_ray_dist.py
* add get_tokenizer.py
* fix code style
* fix bugs about No module named 'pydantic' in ci test
* fix bugs in ci test
* fix bugs in ci test
* fix bugs in ci test
* support dynamic batch for bloom model and is_running function
* [Inference]Test for new Async engine (#4935 )
* infer engine
* infer engine
* test engine
* test engine
* new manager
* change step
* add
* test
* fix
* fix
* finish test
* finish test
* finish test
* finish test
* add license
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
* add assertion for config (#4947 )
* [Inference] Finish dynamic batching offline test (#4948 )
* test
* fix test
* fix quant
* add default
* fix
* fix some bugs
* fix some bugs
* fix
* fix bug
* fix bugs
* reset param
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
1 year ago
アマデウス
4e4a10c97d
updated c++17 compiler flags ( #4983 )
1 year ago
Bin Jia
1db6727678
[Pipeline inference] Combine kvcache with pipeline inference ( #4938 )
...
* merge kvcache with pipeline inference and refactor the code structure
* support ppsize > 2
* refactor pipeline code
* do pre-commit
* modify benchmark
* fix bench mark
* polish code
* add docstring and update readme
* refactor the code
* fix some logic bug of ppinfer
* polish readme
* fix typo
* skip infer test
1 year ago
Jianghai
c6cd629e7a
[Inference]ADD Bench Chatglm2 script ( #4963 )
...
* add bench chatglm
* fix bug and make utils
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
1 year ago
Xu Kai
785802e809
[inference] add reference and fix some bugs ( #4937 )
...
* add reference and fix some bugs
* update gptq init
---------
Co-authored-by: Xu Kai <xukai16@foxamil.com>
1 year ago
Hongxin Liu
b8e770c832
[test] merge old components to test to model zoo ( #4945 )
...
* [test] add custom models in model zoo
* [test] update legacy test
* [test] update model zoo
* [test] update gemini test
* [test] remove components to test
1 year ago
Cuiqing Li
3a41e8304e
[Refactor] Integrated some lightllm kernels into token-attention ( #4946 )
...
* add some req for inference
* clean codes
* add codes
* add some lightllm deps
* clean codes
* hello
* delete rms files
* add some comments
* add comments
* add doc
* add lightllm deps
* add lightllm cahtglm2 kernels
* add lightllm cahtglm2 kernels
* replace rotary embedding with lightllm kernel
* add some commnets
* add some comments
* add some comments
* add
* replace fwd kernel att1
* fix a arg
* add
* add
* fix token attention
* add some comments
* clean codes
* modify comments
* fix readme
* fix bug
* fix bug
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
1 year ago
digger yu
11009103be
[nfc] fix some typo with colossalai/ docs/ etc. ( #4920 )
1 year ago
github-actions[bot]
486d06a2d5
[format] applied code formatting on changed files in pull request 4820 ( #4886 )
...
Co-authored-by: github-actions <github-actions@github.com>
1 year ago
Zhongkai Zhao
c7aa319ba0
[test] add no master test for low level zero plugin ( #4934 )
1 year ago
Hongxin Liu
1f5d2e8062
[hotfix] fix torch 2.0 compatibility ( #4936 )
...
* [hotfix] fix launch
* [test] fix test gemini optim
* [shardformer] fix vit
1 year ago
Baizhou Zhang
21ba89cab6
[gemini] support gradient accumulation ( #4869 )
...
* add test
* fix no_sync bug in low level zero plugin
* fix test
* add argument for grad accum
* add grad accum in backward hook for gemini
* finish implementation, rewrite tests
* fix test
* skip stuck model in low level zero test
* update doc
* optimize communication & fix gradient checkpoint
* modify doc
* cleaning codes
* update cpu adam fp16 case
1 year ago
github-actions[bot]
a41cf88e9b
[format] applied code formatting on changed files in pull request 4908 ( #4918 )
...
Co-authored-by: github-actions <github-actions@github.com>
1 year ago
Hongxin Liu
4f68b3f10c
[kernel] support pure fp16 for cpu adam and update gemini optim tests ( #4921 )
...
* [kernel] support pure fp16 for cpu adam (#4896 )
* [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919 )
* [kernel] fix cpu adam
* [test] update gemini optim test
1 year ago
Zian(Andy) Zheng
7768afbad0
Update flash_attention_patch.py
...
To be compatible with the new change in the Transformers library, where a new argument 'padding_mask' was added to forward function of attention layer.
https://github.com/huggingface/transformers/pull/25598
1 year ago
Xu Kai
611a5a80ca
[inference] Add smmoothquant for llama ( #4904 )
...
* [inference] add int8 rotary embedding kernel for smoothquant (#4843 )
* [inference] add smoothquant llama attention (#4850 )
* add smoothquant llama attention
* remove uselss code
* remove useless code
* fix import error
* rename file name
* [inference] add silu linear fusion for smoothquant llama mlp (#4853 )
* add silu linear
* update skip condition
* catch smoothquant cuda lib exception
* prcocess exception for tests
* [inference] add llama mlp for smoothquant (#4854 )
* add llama mlp for smoothquant
* fix down out scale
* remove duplicate lines
* add llama mlp check
* delete useless code
* [inference] add smoothquant llama (#4861 )
* add smoothquant llama
* fix attention accuracy
* fix accuracy
* add kv cache and save pretrained
* refactor example
* delete smooth
* refactor code
* [inference] add smooth function and delete useless code for smoothquant (#4895 )
* add smooth function and delete useless code
* update datasets
* remove duplicate import
* delete useless file
* refactor codes (#4902 )
* rafactor code
* add license
* add torch-int and smoothquant license
1 year ago
Zhongkai Zhao
a0684e7bd6
[feature] support no master weights option for low level zero plugin ( #4816 )
...
* [feature] support no master weights for low level zero plugin
* [feature] support no master weights for low level zero plugin, remove data copy when no master weights
* remove data copy and typecasting when no master weights
* not load weights to cpu when using no master weights
* fix grad: use fp16 grad when no master weights
* only do not update working param when no master weights
* fix: only do not update working param when no master weights
* fix: passing params in dict format in hybrid plugin
* fix: remove extra params (tp_process_group) in hybrid_parallel_plugin
1 year ago
Xu Kai
77a9328304
[inference] add llama2 support ( #4898 )
...
* add llama2 support
* fix multi group bug
1 year ago
Baizhou Zhang
39f2582e98
[hotfix] fix lr scheduler bug in torch 2.0 ( #4864 )
1 year ago
littsk
83b52c56cd
[feature] Add clip_grad_norm for hybrid_parallel_plugin ( #4837 )
...
* Add clip_grad_norm for hibrid_parallel_plugin
* polish code
* add unittests
* Move tp to a higher-level optimizer interface.
* bug fix
* polish code
1 year ago
Hongxin Liu
df63564184
[gemini] support amp o3 for gemini ( #4872 )
...
* [gemini] support no reuse fp16 chunk
* [gemini] support no master weight for optim
* [gemini] support no master weight for gemini ddp
* [test] update gemini tests
* [test] update gemini tests
* [plugin] update gemini plugin
* [test] fix gemini checkpointio test
* [test] fix gemini checkpoint io
1 year ago
ppt0011
c1fab951e7
Merge pull request #4889 from ppt0011/main
...
[doc] add reminder for issue encountered with hybrid adam
1 year ago
littsk
ffd9a3cbc9
[hotfix] fix bug in sequence parallel test ( #4887 )
1 year ago
ppt0011
1dcaf249bd
[doc] add reminder for issue encountered with hybrid adam
1 year ago
Xu Kai
fdec650bb4
fix test llama ( #4884 )
1 year ago