Yuanheng Zhao
5d4c1fe8f5
[Fix/Inference] Fix GQA Triton and Support Llama3 ( #5624 )
...
* [fix] GQA calling of flash decoding triton
* fix kv cache alloc shape
* fix rotary triton - GQA
* fix sequence max length assigning
* Sequence max length logic
* fix scheduling and spec-dec
* skip without import error
* fix pytest - skip without ImportError
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
Steve Luo
ccf72797e3
feat baichuan2 rmsnorm whose hidden size equals to 5120 ( #5611 )
7 months ago
Runyu Lu
e37ee2fb65
[Feat]Tensor Model Parallel Support For Inference ( #5563 )
...
* tensor parallel support naive source
* [fix]precision, model load and refactor the framework
* add tp unit test
* docstring
* fix do_sample
7 months ago
Steve Luo
be396ad6cc
[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block ( #5531 )
...
* feat flash decoding for paged attention
* refactor flashdecodingattention
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
flybird11111
a0ad587c24
[shardformer] refactor embedding resize ( #5603 )
...
* [branch rebase] rebase main to Feature/resize_embedding (#5554 )
* fix
* [release] update version (#5411 )
* [hotfix] fix typo s/keywrods/keywords etc. (#5429 )
* [devops] fix compatibility (#5444 )
* [devops] fix compatibility
* [hotfix] update compatibility test on pr
* [devops] fix compatibility
* [devops] record duration during comp test
* [test] decrease test duration
* fix falcon
* [shardformer] fix gathering output when using tensor parallelism (#5431 )
* fix
* padding vocab_size when using pipeline parallellism
padding vocab_size when using pipeline parallellism
fix
fix
* fix
* fix
fix
fix
* fix gather output
* fix
* fix
* fix
fix resize embedding
fix resize embedding
* fix resize embedding
fix
* revert
* revert
* revert
* [doc] release Open-Sora 1.0 with model weights (#5468 )
* [doc] release Open-Sora 1.0 with model weights
* [doc] release Open-Sora 1.0 with model weights
* [doc] release Open-Sora 1.0 with model weights
* [doc] update open-sora demo (#5479 )
* [doc] update open-sora demo
* [doc] update open-sora demo
* [doc] update open-sora demo
* [example] add grok-1 inference (#5485 )
* [misc] add submodule
* remove submodule
* [example] support grok-1 tp inference
* [example] add grok-1 inference script
* [example] refactor code
* [example] add grok-1 readme
* [exmaple] add test ci
* [exmaple] update readme
---------
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
* [CI] run pre-commit (#5577 )
* fix
* [release] update version (#5411 )
* [hotfix] fix typo s/keywrods/keywords etc. (#5429 )
* [devops] fix compatibility (#5444 )
* [devops] fix compatibility
* [hotfix] update compatibility test on pr
* [devops] fix compatibility
* [devops] record duration during comp test
* [test] decrease test duration
* fix falcon
* [shardformer] fix gathering output when using tensor parallelism (#5431 )
* fix
* padding vocab_size when using pipeline parallellism
padding vocab_size when using pipeline parallellism
fix
fix
* fix
* fix
fix
fix
* fix gather output
* fix
* fix
* fix
fix resize embedding
fix resize embedding
* fix resize embedding
fix
* revert
* revert
* revert
* [doc] release Open-Sora 1.0 with model weights (#5468 )
* [doc] release Open-Sora 1.0 with model weights
* [doc] release Open-Sora 1.0 with model weights
* [doc] release Open-Sora 1.0 with model weights
* [doc] update open-sora demo (#5479 )
* [doc] update open-sora demo
* [doc] update open-sora demo
* [doc] update open-sora demo
* [example] add grok-1 inference (#5485 )
* [misc] add submodule
* remove submodule
* [example] support grok-1 tp inference
* [example] add grok-1 inference script
* [example] refactor code
* [example] add grok-1 readme
* [exmaple] add test ci
* [exmaple] update readme
* run pre-commit
---------
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
* [rebase] rebase main to resize-embedding (#5581 )
* [release] grok-1 314b inference (#5490 )
* [release] grok-1 inference
* [release] grok-1 inference
* [release] grok-1 inference
* [example] update Grok-1 inference (#5495 )
* revise grok-1 example
* remove unused arg in scripts
* prevent re-installing torch
* update readme
* revert modifying colossalai requirements
* add perf
* trivial
* add tokenizer url
* [hotfix] set return_outputs=False in examples and polish code (#5404 )
* fix: simplify merge_batch
* fix: use return_outputs=False to eliminate extra memory consumption
* feat: add return_outputs warning
* style: remove `return_outputs=False` as it is the default value
* [release] grok-1 inference benchmark (#5500 )
* [release] grok-1 inference benchmark
* [release] grok-1 inference benchmark
* [release] grok-1 inference benchmark
* [release] grok-1 inference benchmark
* [release] grok-1 inference benchmark
* [shardformer]Fix lm parallel. (#5480 )
* fix
* padding vocab_size when using pipeline parallellism
padding vocab_size when using pipeline parallellism
fix
fix
* fix
* fix
fix
fix
* fix gather output
* fix
* fix
* fix
fix resize embedding
fix resize embedding
* fix resize embedding
fix
* revert
* revert
* revert
* fix lm forward distribution
* fix
* test ci
* fix
* [fix] fix grok-1 example typo (#5506 )
* [devops] fix example test ci (#5504 )
* Fix ColoTensorSpec for py11 (#5440 )
* fixed layout converter caching and updated tester
* Empty-Commit
* [shardformer] update colo attention to support custom mask (#5510 )
* [feature] refactor colo attention (#5462 )
* [extension] update api
* [feature] add colo attention
* [feature] update sdpa
* [feature] update npu attention
* [feature] update flash-attn
* [test] add flash attn test
* [test] update flash attn test
* [shardformer] update modeling to fit colo attention (#5465 )
* [misc] refactor folder structure
* [shardformer] update llama flash-attn
* [shardformer] fix llama policy
* [devops] update tensornvme install
* [test] update llama test
* [shardformer] update colo attn kernel dispatch
* [shardformer] update blip2
* [shardformer] update chatglm
* [shardformer] update gpt2
* [shardformer] update gptj
* [shardformer] update opt
* [shardformer] update vit
* [shardformer] update colo attention mask prep
* [shardformer] update whisper
* [test] fix shardformer tests (#5514 )
* [test] fix shardformer tests
* [test] fix shardformer tests
* [format] applied code formatting on changed files in pull request 5510 (#5517 )
Co-authored-by: github-actions <github-actions@github.com>
* [shardformer] fix pipeline forward error if custom layer distribution is used (#5189 )
* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution
* Change static methods for t5 layer distribution to member functions
* Change static methods for whisper layer distribution to member functions
* Replace whisper policy usage with self one
* Fix test case to use non-static layer distribution methods
* fix: fix typo
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* [Fix] Grok-1 use tokenizer from the same pretrained path (#5532 )
* [fix] use tokenizer from the same pretrained path
* trust remote code
* [ColossalChat] Update RLHF V2 (#5286 )
* Add dpo. Fix sft, ppo, lora. Refactor all
* fix and tested ppo
* 2 nd round refactor
* add ci tests
* fix ci
* fix ci
* fix readme, style
* fix readme style
* fix style, fix benchmark
* reproduce benchmark result, remove useless files
* rename to ColossalChat
* use new image
* fix ci workflow
* fix ci
* use local model/tokenizer for ci tests
* fix ci
* fix ci
* fix ci
* fix ci timeout
* fix rm progress bar. fix ci timeout
* fix ci
* fix ci typo
* remove 3d plugin from ci temporary
* test environment
* cannot save optimizer
* support chat template
* fix readme
* fix path
* test ci locally
* restore build_or_pr
* fix ci data path
* fix benchmark
* fix ci, move ci tests to 3080, disable fast tokenizer
* move ci to 85
* support flash attention 2
* add all-in-one data preparation script. Fix colossal-llama2-chat chat template
* add hardware requirements
* move ci test data
* fix save_model, add unwrap
* fix missing bos
* fix missing bos; support grad accumulation with gemini
* fix ci
* fix ci
* fix ci
* fix llama2 chat template config
* debug sft
* debug sft
* fix colossalai version requirement
* fix ci
* add sanity check to prevent NaN loss
* fix requirements
* add dummy data generation script
* add dummy data generation script
* add dummy data generation script
* add dummy data generation script
* update readme
* update readme
* update readme and ignore
* fix logger bug
* support parallel_output
* modify data preparation logic
* fix tokenization
* update lr
* fix inference
* run pre-commit
---------
Co-authored-by: Tong Li <tong.li352711588@gmail.com>
* [shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508 )
* feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig`
* feat: apply `GradientCheckpointConfig` to policy and llama_forward
* feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager
* fix: add optional args for `distribute_layer` and `get_stage_index`
* fix: fix changed API calls
* test: update llama tests
* style: polish `GradientCheckpointConfig`
* fix: fix pipeline utils tests
* fix incorrect sharding without zero (#5545 )
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
* [shardformer] Sequence Parallelism Optimization (#5533 )
* sequence parallel optimization
* validate sequence parallel in llama (code to be polished)
* shardformer api writing
* integrate sequence parallel in ShardFormer
* fix pp bugs and sp bugs for LlaMa model
* integrating ring-based sequence parallelism into ShardFormer
* [sequence parallelism]: Add fused megatron function
* integrating ring-based sequence parallelism into ShardFormer
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
* fix bugs when useing sp and flashattention together
* fix operation function name
* support flash attention for ulysses-style sp
* clarify sp process group
* fix compatibility bugs in moe plugin
* fix fused linear bugs
* fix linear layer test
* support gpt model all-to-all sp
* modify shard data dimension (meant to be dim=-1)
* support megtron-style sp and distributed attn for llama model
* [shardformer] add megatron sp to llama
* support llama7B 128k with distributed attention
* [shardformer] robustness enhancement
* add block attn
* sp mode 1: keep input as a complete sequence
* fix sp compatability
* finish sp mode 3 support for gpt
* using all_to_all_single when batch size is 1
* support mode 2 sp in gpt2 (#5 )
* [shardformer] add megatron sp to llama
* support llama7B 128k with distributed attention
* [shardformer] robustness enhancement
* add block attn
* sp mode 1: keep input as a complete sequence
* fix sp compatability
* refactor ring implementation
* support mode 2 sp in gpt2
* polish code
* enable distributed attn mask when using sp mode 2 and 3 in llama
* automatically enable flash attn when using sp mode 2 and 3 in llama
* inplace attn mask
* add zero2 support for sequence parallel
* polish code
* fix bugs
* fix gemini checkpoint io
* loose tensor checking atol and rtol
* add comment
* fix llama layernorm grad
* fix zero grad
* fix zero grad
* fix conflict
* update split and gather auto grad func
* sequence parallel: inside text split (#6 )
* polish code (part 1)
* polish code (part 2)
* polish code (part 2.5)
* polish code (part 3)
* sequence parallel: inside text split
* miscellaneous minor fixes
* polish code
* fix ulysses style ZeRO
* sequence parallel: inside text split
* miscellaneous minor fixes
* disaggregate sp group and dp group for sp
* fix llama and gpt sp
* polish code
* move ulysses grad sync to ddp (#9 )
* remove zero_stage and unbind the grad sync for alltoall sp
* add 2d group creation test
* move ulysses grad sync to ddp
* add 2d group creation test
* remove useless code
* change shard config not to enable sp when enable_all_optimizations
* add sp warnings for several model
* remove useless code
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
* [hotfix] quick fixes to make legacy tutorials runnable (#5559 )
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
* [fix] fix typo s/muiti-node /multi-node etc. (#5448 )
* [hotfix] fix typo s/get_defualt_parser /get_default_parser (#5548 )
* [devops] remove post commit ci (#5566 )
* [devops] remove post commit ci
* [misc] run pre-commit on all files
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
---------
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: Rocky Duan <dementrock@users.noreply.github.com>
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Insu Jang <insujang@umich.edu>
Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com>
Co-authored-by: Tong Li <tong.li352711588@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [shardformer]enable padding vocabulary size. (#5489 )
* padding vocab_size when using pipeline parallellism
padding vocab_size when using pipeline parallellism
fix
fix
* fix
* fix
fix
fix
* fix gather output
* fix
* fix
* fix
fix resize embedding
fix resize embedding
* fix resize embedding
fix
* revert
* revert
* revert
* padding vocab
* padding vocabe
* fix
* fix
* fxi
* test ci
* fix
fix
fix
fix
* fix
fix
* fix
* fix
* Update hybrid_parallel_plugin.py
fix
fix
fix
* fix
fix
* fix
fix
* fix
* resolve super init
resolve super init
resolve super init
resolve super init
* resolve comments
* fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* vocab checkpointio
* padding vocab_size when using pipeline parallellism
padding vocab_size when using pipeline parallellism
fix
fix
* fix
fix
fix
* fix
* fix
fix resize embedding
fix resize embedding
* fix resize embedding
fix
* revert
* revert
* padding vocab
* fix
* fix
fix
* fix
fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix ci
* fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
* cherry-pick
* revert moe modify
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
fix
fix
fix
fix
fix
fix
fix
* resolve comments
resolve comments
resolve comments
resolve comments
resolve comments
* ptensor
ptensor
resolve comments
fix
fix
fix
fix
fix
resolve comments
resolve comments
resolve comments
resolve comments
resolve comments
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix rebase
* fix rebase
---------
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: Rocky Duan <dementrock@users.noreply.github.com>
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Insu Jang <insujang@umich.edu>
Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com>
Co-authored-by: Tong Li <tong.li352711588@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
yuehuayingxueluo
56b222eff8
[inference/model]Adapted to the baichuan2-7B model ( #5591 )
...
* Adapted to the baichuan2-7B model
* modified according to the review comments.
* Modified the method of obtaining random weights.
* modified according to the review comments.
* change mlp layewr 'NOTE'
7 months ago
Yuanheng Zhao
e60d430cf5
[Fix] resolve conflicts of rebasing feat/speculative-decoding ( #5557 )
...
- resolve conflicts of rebasing feat/speculative-decoding
8 months ago
Yuanheng Zhao
d85d91435a
[Inference/SpecDec] Support GLIDE Drafter Model ( #5455 )
...
* add glide-llama policy and modeling
* update glide modeling, compitable with transformers 4.36.2
* revise glide llama modeling/usage
* fix issues of glimpsing large kv
* revise the way re-loading params for glide drafter
* fix drafter and engine tests
* enable convert to glide strict=False
* revise glide llama modeling
* revise vicuna prompt template
* revise drafter and tests
* apply usage of glide model in engine
8 months ago
Yuanheng Zhao
a37f82629d
[Inference/SpecDec] Add Speculative Decoding Implementation ( #5423 )
...
* fix flash decoding mask during verification
* add spec-dec
* add test for spec-dec
* revise drafter init
* remove drafter sampling
* retire past kv in drafter
* (trivial) rename attrs
* (trivial) rename arg
* revise how we enable/disable spec-dec
8 months ago
Yuanheng Zhao
5a9b05f7b2
[Inference/SpecDec] Add Basic Drafter Model Container ( #5405 )
...
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399 )
fix dependency in pytest
* add drafter model container (basic ver)
8 months ago
Yuanheng Zhao
d63c469f45
[Infer] Revise and Adapt Triton Kernels for Spec-Dec ( #5401 )
...
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399 )
fix dependency in pytest
* resolve conflicts for revising flash-attn
* adapt kv cache copy kernel for spec-dec
* fix seqlen-n kvcache copy kernel/tests
* test kvcache copy - use torch.equal
* add assertions
* (trivial) comment out
8 months ago
Yuanheng
ed5ebd1735
[Fix] resolve conflicts of merging main
8 months ago
Hongxin Liu
641b1ee71a
[devops] remove post commit ci ( #5566 )
...
* [devops] remove post commit ci
* [misc] run pre-commit on all files
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
8 months ago
Zhongkai Zhao
8e412a548e
[shardformer] Sequence Parallelism Optimization ( #5533 )
...
* sequence parallel optimization
* validate sequence parallel in llama (code to be polished)
* shardformer api writing
* integrate sequence parallel in ShardFormer
* fix pp bugs and sp bugs for LlaMa model
* integrating ring-based sequence parallelism into ShardFormer
* [sequence parallelism]: Add fused megatron function
* integrating ring-based sequence parallelism into ShardFormer
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
* fix bugs when useing sp and flashattention together
* fix operation function name
* support flash attention for ulysses-style sp
* clarify sp process group
* fix compatibility bugs in moe plugin
* fix fused linear bugs
* fix linear layer test
* support gpt model all-to-all sp
* modify shard data dimension (meant to be dim=-1)
* support megtron-style sp and distributed attn for llama model
* [shardformer] add megatron sp to llama
* support llama7B 128k with distributed attention
* [shardformer] robustness enhancement
* add block attn
* sp mode 1: keep input as a complete sequence
* fix sp compatability
* finish sp mode 3 support for gpt
* using all_to_all_single when batch size is 1
* support mode 2 sp in gpt2 (#5 )
* [shardformer] add megatron sp to llama
* support llama7B 128k with distributed attention
* [shardformer] robustness enhancement
* add block attn
* sp mode 1: keep input as a complete sequence
* fix sp compatability
* refactor ring implementation
* support mode 2 sp in gpt2
* polish code
* enable distributed attn mask when using sp mode 2 and 3 in llama
* automatically enable flash attn when using sp mode 2 and 3 in llama
* inplace attn mask
* add zero2 support for sequence parallel
* polish code
* fix bugs
* fix gemini checkpoint io
* loose tensor checking atol and rtol
* add comment
* fix llama layernorm grad
* fix zero grad
* fix zero grad
* fix conflict
* update split and gather auto grad func
* sequence parallel: inside text split (#6 )
* polish code (part 1)
* polish code (part 2)
* polish code (part 2.5)
* polish code (part 3)
* sequence parallel: inside text split
* miscellaneous minor fixes
* polish code
* fix ulysses style ZeRO
* sequence parallel: inside text split
* miscellaneous minor fixes
* disaggregate sp group and dp group for sp
* fix llama and gpt sp
* polish code
* move ulysses grad sync to ddp (#9 )
* remove zero_stage and unbind the grad sync for alltoall sp
* add 2d group creation test
* move ulysses grad sync to ddp
* add 2d group creation test
* remove useless code
* change shard config not to enable sp when enable_all_optimizations
* add sp warnings for several model
* remove useless code
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
8 months ago
yuehuayingxueluo
04aca9e55b
[Inference/Kernel]Add get_cos_and_sin Kernel ( #5528 )
...
* Add get_cos_and_sin kernel
* fix code comments
* fix code typos
* merge common codes of get_cos_and_sin kernel.
* Fixed a typo
* Changed 'asset allclose' to 'assert equal'.
8 months ago
Wenhao Chen
e614aa34f3
[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama ( #5508 )
...
* feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig`
* feat: apply `GradientCheckpointConfig` to policy and llama_forward
* feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager
* fix: add optional args for `distribute_layer` and `get_stage_index`
* fix: fix changed API calls
* test: update llama tests
* style: polish `GradientCheckpointConfig`
* fix: fix pipeline utils tests
8 months ago
Insu Jang
00525f7772
[shardformer] fix pipeline forward error if custom layer distribution is used ( #5189 )
...
* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution
* Change static methods for t5 layer distribution to member functions
* Change static methods for whisper layer distribution to member functions
* Replace whisper policy usage with self one
* Fix test case to use non-static layer distribution methods
* fix: fix typo
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
8 months ago
Hongxin Liu
19e1a5cf16
[shardformer] update colo attention to support custom mask ( #5510 )
...
* [feature] refactor colo attention (#5462 )
* [extension] update api
* [feature] add colo attention
* [feature] update sdpa
* [feature] update npu attention
* [feature] update flash-attn
* [test] add flash attn test
* [test] update flash attn test
* [shardformer] update modeling to fit colo attention (#5465 )
* [misc] refactor folder structure
* [shardformer] update llama flash-attn
* [shardformer] fix llama policy
* [devops] update tensornvme install
* [test] update llama test
* [shardformer] update colo attn kernel dispatch
* [shardformer] update blip2
* [shardformer] update chatglm
* [shardformer] update gpt2
* [shardformer] update gptj
* [shardformer] update opt
* [shardformer] update vit
* [shardformer] update colo attention mask prep
* [shardformer] update whisper
* [test] fix shardformer tests (#5514 )
* [test] fix shardformer tests
* [test] fix shardformer tests
8 months ago
Edenzzzz
61da3fbc52
fixed layout converter caching and updated tester
8 months ago
flybird11111
0688d92e2d
[shardformer]Fix lm parallel. ( #5480 )
...
* fix
* padding vocab_size when using pipeline parallellism
padding vocab_size when using pipeline parallellism
fix
fix
* fix
* fix
fix
fix
* fix gather output
* fix
* fix
* fix
fix resize embedding
fix resize embedding
* fix resize embedding
fix
* revert
* revert
* revert
* fix lm forward distribution
* fix
* test ci
* fix
8 months ago
Runyu Lu
68e9396bc0
[fix] merge conflicts
8 months ago
yuehuayingxueluo
87079cffe8
[Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding ( #5461 )
...
* Support FP16/BF16 Flash Attention 2
* fix bugs in test_kv_cache_memcpy.py
* add context_kv_cache_memcpy_kernel.cu
* rm typename MT
* add tail process
* add high_precision
* add high_precision to config.py
* rm unused code
* change the comment for the high_precision parameter
* update test_rotary_embdding_unpad.py
* fix vector_copy_utils.h
* add comment for self.high_precision when using float32
8 months ago
Wenhao Chen
bb0a668fee
[hotfix] set return_outputs=False in examples and polish code ( #5404 )
...
* fix: simplify merge_batch
* fix: use return_outputs=False to eliminate extra memory consumption
* feat: add return_outputs warning
* style: remove `return_outputs=False` as it is the default value
8 months ago
Runyu Lu
9fe61b4475
[fix]
8 months ago
Runyu Lu
aabc9fb6aa
[feat] add use_cuda_kernel option
8 months ago
flybird11111
5e16bf7980
[shardformer] fix gathering output when using tensor parallelism ( #5431 )
...
* fix
* padding vocab_size when using pipeline parallellism
padding vocab_size when using pipeline parallellism
fix
fix
* fix
* fix
fix
fix
* fix gather output
* fix
* fix
* fix
fix resize embedding
fix resize embedding
* fix resize embedding
fix
* revert
* revert
* revert
8 months ago
Runyu Lu
d02e257abd
Merge branch 'feature/colossal-infer' into colossal-infer-cuda-graph
9 months ago
Runyu Lu
ae24b4f025
diverse tests
9 months ago
Runyu Lu
1821a6dab0
[fix] pytest and fix dyn grid bug
9 months ago
yuehuayingxueluo
f366a5ea1f
[Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel ( #5418 )
...
* add rotary embedding kernel
* add rotary_embedding_kernel
* add fused rotary_emb and kvcache memcopy
* add fused_rotary_emb_and_cache_kernel.cu
* add fused_rotary_emb_and_memcopy
* fix bugs in fused_rotary_emb_and_cache_kernel.cu
* fix ci bugs
* use vec memcopy and opt the gloabl memory access
* fix code style
* fix test_rotary_embdding_unpad.py
* codes revised based on the review comments
* fix bugs about include path
* rm inline
9 months ago
Steve Luo
ed431de4e4
fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test ( #5454 )
9 months ago
Hongxin Liu
f2e8b9ef9f
[devops] fix compatibility ( #5444 )
...
* [devops] fix compatibility
* [hotfix] update compatibility test on pr
* [devops] fix compatibility
* [devops] record duration during comp test
* [test] decrease test duration
* fix falcon
9 months ago
Steve Luo
f7aecc0c6b
feat rmsnorm cuda kernel and add unittest, benchmark script ( #5417 )
9 months ago
xs_courtesy
95c21498d4
add silu_and_mul for infer
9 months ago
flybird11111
29695cf70c
[example]add gpt2 benchmark example script. ( #5295 )
...
* benchmark gpt2
* fix
fix
fix
fix
* [doc] fix typo in Colossal-LLaMA-2/README.md (#5247 )
* [workflow] fixed build CI (#5240 )
* [workflow] fixed build CI
* polish
* polish
* polish
* polish
* polish
* [ci] fixed booster test (#5251 )
* [ci] fixed booster test
* [ci] fixed booster test
* [ci] fixed booster test
* [ci] fixed ddp test (#5254 )
* [ci] fixed ddp test
* polish
* fix typo in applications/ColossalEval/README.md (#5250 )
* [ci] fix shardformer tests. (#5255 )
* fix ci
fix
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* [doc] fix doc typo (#5256 )
* [doc] fix annotation display
* [doc] fix llama2 doc
* [hotfix]: add pp sanity check and fix mbs arg (#5268 )
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check
* [workflow] fixed incomplete bash command (#5272 )
* [workflow] fixed oom tests (#5275 )
* [workflow] fixed oom tests
* polish
* polish
* polish
* [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276 )
* fix ci
fix
* fix test
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
* fix
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* [shardformer] hybridparallelplugin support gradients accumulation. (#5246 )
* support gradients acc
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
* fix
fix
* fix
fix
fix
* [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230 )
* fix auto loading gpt2 tokenizer (#5279 )
* [doc] add llama2-13B disyplay (#5285 )
* Update README.md
* fix 13b typo
---------
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
* fix llama pretrain (#5287 )
* fix
* fix
* fix
fix
* fix
fix
fix
* fix
fix
* benchmark gpt2
* fix
fix
fix
fix
* [workflow] fixed build CI (#5240 )
* [workflow] fixed build CI
* polish
* polish
* polish
* polish
* polish
* [ci] fixed booster test (#5251 )
* [ci] fixed booster test
* [ci] fixed booster test
* [ci] fixed booster test
* fix
fix
* fix
fix
fix
* fix
* fix
fix
fix
fix
fix
* fix
* Update shardformer.py
---------
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com>
Co-authored-by: Desperado-Jia <502205863@qq.com>
9 months ago
FrankLeeeee
0310b76e9d
Merge branch 'main' into sync/main
9 months ago
yuehuayingxueluo
0aa27f1961
[Inference]Move benchmark-related code to the example directory. ( #5408 )
...
* move benchmark-related code to the example directory.
* fix bugs in test_fused_rotary_embedding.py
9 months ago
yuehuayingxueluo
600881a8ea
[Inference]Add CUDA KVCache Kernel ( #5406 )
...
* add cuda KVCache kernel
* annotation benchmark_kvcache_copy
* add use cuda
* fix import path
* move benchmark scripts to example/
* rm benchmark codes in test_kv_cache_memcpy.py
* rm redundancy codes
* rm redundancy codes
* pr was modified according to the review
9 months ago
QinLuo
bf34c6fef6
[fsdp] impl save/load shard model/optimizer ( #5357 )
9 months ago
Yuanheng Zhao
19061188c3
[Infer/Fix] Fix Dependency in test - RMSNorm kernel ( #5399 )
...
fix dependency in pytest
9 months ago
yuehuayingxueluo
bc1da87366
[Fix/Inference] Fix format of input prompts and input model in inference engine ( #5395 )
...
* Fix bugs in inference_engine
* fix bugs in engine.py
* rm CUDA_VISIBLE_DEVICES
* add request_ids in generate
* fix bug in engine.py
* add logger.debug for BatchBucket
9 months ago
yuehuayingxueluo
2a718c8be8
Optimized the execution interval time between cuda kernels caused by view and memcopy ( #5390 )
...
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs
9 months ago
Jianghai
730103819d
[Inference]Fused kv copy into rotary calculation ( #5383 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del
9 months ago
Yuanheng Zhao
b21aac5bae
[Inference] Optimize and Refactor Inference Batching/Scheduling ( #5367 )
...
* add kvcache manager funcs for batching
* add batch bucket for batching
* revise RunningList struct in handler
* add kvcache/batch funcs for compatibility
* use new batching methods
* fix indexing bugs
* revise abort logic
* use cpu seq lengths/block tables
* rm unused attr in Sequence
* fix type conversion/default arg
* add and revise pytests
* revise pytests, rm unused tests
* rm unused statements
* fix pop finished indexing issue
* fix: use index in batch when retrieving inputs/update seqs
* use dict instead of odict in batch struct
* arg type hinting
* fix make compress
* refine comments
* fix: pop_n_seqs to pop the first n seqs
* add check in request handler
* remove redundant conversion
* fix test for request handler
* fix pop method in batch bucket
* fix prefill adding
9 months ago
ver217
06db94fbc9
[moe] fix tests
10 months ago
Xuanlei Zhao
7d8e0338a4
[moe] init mixtral impl
10 months ago
Jianghai
1f8c7e7046
[Inference] User Experience: update the logic of default tokenizer and generation config. ( #5337 )
...
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
10 months ago
yuehuayingxueluo
6fb4bcbb24
[Inference/opt] Fused KVCahce Memcopy ( #5374 )
...
* fused kv memcopy
* add TODO in test_kvcache_copy.py
10 months ago
Frank Lee
58740b5f68
[inference] added inference template ( #5375 )
10 months ago
Frank Lee
8106ede07f
Revert "[Inference] Adapt to Fused rotary ( #5348 )" ( #5373 )
...
This reverts commit 9f4ab2eb92
.
10 months ago
Jianghai
9f4ab2eb92
[Inference] Adapt to Fused rotary ( #5348 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
10 months ago
Hongxin Liu
c53ddda88f
[lr-scheduler] fix load state dict and add test ( #5369 )
10 months ago
yuehuayingxueluo
631862f339
[Inference]Optimize generation process of inference engine ( #5356 )
...
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback tesh_inference_engine.py
10 months ago
Wenhao Chen
1c790c0877
[fix] remove unnecessary dp_size assert ( #5351 )
...
* fix: remove unnecessary assert
* test: add more 3d plugin tests
* fix: add warning
10 months ago
Frank Lee
e76acbb076
[inference] moved ops tests to test_infer ( #5354 )
10 months ago
Frank Lee
db1a763307
[inference] removed redundancy init_batch ( #5353 )
10 months ago
Hongxin Liu
ffffc32dc7
[checkpointio] fix gemini and hybrid parallel optim checkpoint ( #5347 )
...
* [checkpointio] fix hybrid parallel optim checkpoint
* [extension] fix cuda extension
* [checkpointio] fix gemini optimizer checkpoint
* polish code
10 months ago
yuehuayingxueluo
249644c23b
[Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add ( #5340 )
...
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
10 months ago
Frank Lee
f8e456d202
[inference] simplified config verification ( #5346 )
...
* [inference] simplified config verification
* polish
* polish
10 months ago
Jianghai
df0aa49585
[Inference] Kernel Fusion, fused copy kv cache into rotary embedding ( #5336 )
...
* revise rotary embedding
* remove useless print
* adapt
10 months ago
FrankLeeeee
c565519913
merge commit
10 months ago
Yuanheng Zhao
5f98a9d68a
[Infer] Optimize Blocked KVCache And Kernels Using It ( #5325 )
...
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval
10 months ago
yuehuayingxueluo
e8f0642f28
[Inference]Add Nopadding Llama Modeling ( #5327 )
...
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style
10 months ago
Jianghai
1f8a75d470
[Inference] Update rms norm kernel, benchmark with vLLM ( #5315 )
...
* add
* xi
* del
* del
* fix
10 months ago
Jianghai
7ddd8b37f0
fix ( #5311 )
10 months ago
yuehuayingxueluo
4f28cb43c0
[inference]Optimize the usage of the mid tensors space in flash attn ( #5304 )
...
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
10 months ago
Frank Lee
7cfed5f076
[feat] refactored extension module ( #5298 )
...
* [feat] refactored extension module
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
10 months ago
Jianghai
c647e00e3c
[Inference]Add fused rotary kernel and get cos cache kernel ( #5302 )
...
* add fused rotary and get cos cache func
* staged
* fix bugs
* fix bugs
10 months ago
Yuanheng Zhao
3da9993b0d
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark ( #5301 )
...
* fix decoding kernel pytest
* revise and add triton context attn benchmark
10 months ago
Jianghai
8e606ecc7e
[Inference] Benchmarking rotary embedding and add a fetch function ( #5277 )
...
* fix bugs and add a cos/sin cache fetch func
* add docstring
* fix bug
* fix
10 months ago
Hongxin Liu
d7f8db8e21
[hotfix] fix 3d plugin test ( #5292 )
10 months ago
Yuanheng Zhao
6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking ( #5274 )
...
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
10 months ago
Jianghai
9e2342bde2
[Hotfix] Fix bugs in testing continuous batching ( #5270 )
...
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
10 months ago
ver217
148469348a
Merge branch 'main' into sync/npu
10 months ago
Yaozheng Fang
5ae9099f92
[kernel] Add RMSLayerNorm triton kernel ( #5262 )
...
* add layerrmsnorm triton kernel
* add layerrmsnorm kernel
* modify the atol and rtol in test file
* Remove the logics of mean computations, and update the name of ther kernel functions and files
* add benchmark of rms norm
10 months ago
Zhongkai Zhao
5d9a0ae75b
[hotfix] Fix ShardFormer test execution path when using sequence parallelism ( #5230 )
10 months ago
flybird11111
46e091651b
[shardformer] hybridparallelplugin support gradients accumulation. ( #5246 )
...
* support gradients acc
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
* fix
fix
* fix
fix
fix
10 months ago
flybird11111
2a0558d8ec
[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py ( #5276 )
...
* fix ci
fix
* fix test
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
* fix
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
10 months ago
Frank Lee
d69cd2eb89
[workflow] fixed oom tests ( #5275 )
...
* [workflow] fixed oom tests
* polish
* polish
* polish
10 months ago
Yuanheng Zhao
0f2b46a41c
[kernel] Revise KVCache copy triton kernel API ( #5273 )
...
* [kernel/fix] revise kvcache copy kernel api
* fix benchmark
10 months ago
Yuanheng Zhao
fa85e02b3b
[kernel] Add KV cache copy kernel during decoding ( #5261 )
...
* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy
11 months ago
Wenhao Chen
ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg ( #5268 )
...
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check
11 months ago
FrankLeeeee
1ded7e81ef
[git] fixed rebased files
11 months ago
Yuanheng Zhao
1513f20f4d
[kernel] Add flash decoding triton kernel for blocked kv cache ( #5249 )
...
* add flash decoding unpad triton kernel
* rename flash decoding kernel
* add kernel testing (draft)
* revise pytest
* support kv group (GQA)
* (trivial) fix api and pytest
* (trivial) func renaming
* (trivial) func/file renaming
* refactor pytest for attention
* (trivial) format and consistent vars of context/decode attn
* (trivial) remove test redundancy
11 months ago
Jianghai
fded91d049
[Inference] Kernel: no pad rotary embedding ( #5252 )
...
* fix bugs
* comment
* use more accurate atol
* fix
11 months ago
yuehuayingxueluo
fab294c7f4
fix CI bugs
11 months ago
Jianghai
e545a871b8
[Hotfix] Fix accuracy and align attention method api with Triton kernel ( #5229 )
...
* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs
11 months ago
yuehuayingxueluo
fa4fbdbffb
adapted to pad_context_forward
11 months ago
yuehuayingxueluo
47e53eaa1c
fix bugs in attention.py and request_handler.py
11 months ago
Jianghai
bfd9b1b494
[Inference] Pytorch Attention func, pad&nopad input support ( #5219 )
...
* add attn
* add attention test
* fix attn forward
* fix decoding
11 months ago
yuehuayingxueluo
bbfebfb9fc
fix bugs in sampler
11 months ago
yuehuayingxueluo
02c1bf8b2a
add context_attention_unpadded
11 months ago
Yuanheng Zhao
07b5283b6a
[kernel] Add triton kernel for context attention (FAv2) without padding ( #5192 )
...
* add context attn unpadded triton kernel
* test compatibility
* kv cache copy (testing)
* fix k/v cache copy
* fix kv cache copy and test
* fix boundary of block ptrs
* add support for GQA/MQA and testing
* fix import statement
---------
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
11 months ago
yuehuayingxueluo
4df8876fca
Fixed a writing error
11 months ago
yuehuayingxueluo
9489dc64d8
precision alignment
11 months ago
yuehuayingxueluo
62968588d1
fix bugs in request_handler
11 months ago
yuehuayingxueluo
62fd08ee44
Fixed a bug in the inference frame
11 months ago
Jianghai
0e616462a7
[Inference] add logit processor and request handler ( #5166 )
...
* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
11 months ago
yuehuayingxueluo
8daee26989
[Inference] Add the logic of the inference engine ( #5173 )
...
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related o tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt
11 months ago
Jianghai
93aeacca34
[Inference]Update inference config and fix test ( #5178 )
...
* unify the config setting
* fix test
* fix import
* fix test
* fix
* fix
* add logger
* revise log info
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
11 months ago