Steve Luo
5cd75ce4c7
[Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… ( #5663 )
...
* refactor kvcache manager and rotary_embedding and kvcache_memcpy operator
* refactor decode_kv_cache_memcpy
* enable alibi in pagedattention
2024-04-30 15:52:23 +08:00
Yuanheng Zhao
5be590b99e
[kernel] Support new KCache Layout - Context Attention Triton Kernel ( #5658 )
...
* add context attn triton kernel - new kcache layout
* add benchmark triton
* tiny revise
* trivial - code style, comment
2024-04-26 17:51:49 +08:00
Yuanheng Zhao
f342a93871
[Fix] Remove obsolete files - inference ( #5650 )
2024-04-25 22:04:59 +08:00
Steve Luo
a8fd3b0342
[Inference/Kernel] Optimize paged attention: Refactor key cache layout ( #5643 )
...
* optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x])
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-25 14:24:02 +08:00
yuehuayingxueluo
90cd5227a3
[Fix/Inference]Fix vllm benchmark ( #5630 )
...
* Fix bugs about OOM when running vllm-0.4.0
* rm used params
* change generation_config
* change benchmark log file name
2024-04-24 14:51:36 +08:00
傅剑寒
279300dc5f
[Inference/Refactor] Refactor compilation mechanism and unified multi hw ( #5613 )
...
* refactor compilation mechanism and unified multi hw
* fix file path bug
* add init.py to make pybind a module to avoid relative path error caused by softlink
* delete duplicated micros
* fix micros bug in gcc
2024-04-24 14:17:54 +08:00
Yuanheng Zhao
04863a9b14
[example] Update Llama Inference example ( #5629 )
...
* [example] add infernece benchmark llama3
* revise inference config - arg
* remove unused args
* add llama generation demo script
* fix init rope in llama policy
* add benchmark-llama3 - cleanup
2024-04-23 22:23:07 +08:00
Steve Luo
ccf72797e3
feat baichuan2 rmsnorm whose hidden size equals to 5120 ( #5611 )
2024-04-19 15:34:53 +08:00
Steve Luo
be396ad6cc
[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block ( #5531 )
...
* feat flash decoding for paged attention
* refactor flashdecodingattention
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-18 16:45:07 +08:00
yuehuayingxueluo
56b222eff8
[inference/model]Adapted to the baichuan2-7B model ( #5591 )
...
* Adapted to the baichuan2-7B model
* modified according to the review comments.
* Modified the method of obtaining random weights.
* modified according to the review comments.
* change mlp layewr 'NOTE'
2024-04-15 16:53:02 +08:00
Yuanheng
ed5ebd1735
[Fix] resolve conflicts of merging main
2024-04-08 16:21:47 +08:00
Hongxin Liu
641b1ee71a
[devops] remove post commit ci ( #5566 )
...
* [devops] remove post commit ci
* [misc] run pre-commit on all files
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-08 15:09:40 +08:00
digger yu
341263df48
[hotfix] fix typo s/get_defualt_parser /get_default_parser ( #5548 )
2024-04-07 19:04:58 +08:00
digger yu
a799ca343b
[fix] fix typo s/muiti-node /multi-node etc. ( #5448 )
2024-04-07 18:42:15 +08:00
Edenzzzz
15055f9a36
[hotfix] quick fixes to make legacy tutorials runnable ( #5559 )
...
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
2024-04-07 12:06:27 +08:00
Wenhao Chen
e614aa34f3
[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama ( #5508 )
...
* feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig`
* feat: apply `GradientCheckpointConfig` to policy and llama_forward
* feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager
* fix: add optional args for `distribute_layer` and `get_stage_index`
* fix: fix changed API calls
* test: update llama tests
* style: polish `GradientCheckpointConfig`
* fix: fix pipeline utils tests
2024-04-01 11:34:58 +08:00
Yuanheng Zhao
36c4bb2893
[Fix] Grok-1 use tokenizer from the same pretrained path ( #5532 )
...
* [fix] use tokenizer from the same pretrained path
* trust remote code
2024-03-28 16:30:04 +08:00
yuehuayingxueluo
934e31afb2
The writing style of tail processing and the logic related to macro definitions have been optimized. ( #5519 )
2024-03-28 10:42:51 +08:00
Insu Jang
00525f7772
[shardformer] fix pipeline forward error if custom layer distribution is used ( #5189 )
...
* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution
* Change static methods for t5 layer distribution to member functions
* Change static methods for whisper layer distribution to member functions
* Replace whisper policy usage with self one
* Fix test case to use non-static layer distribution methods
* fix: fix typo
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-03-27 13:57:00 +08:00
Yuanheng Zhao
131f32a076
[fix] fix grok-1 example typo ( #5506 )
2024-03-26 10:19:42 +08:00
binmakeswell
34e909256c
[release] grok-1 inference benchmark ( #5500 )
...
* [release] grok-1 inference benchmark
* [release] grok-1 inference benchmark
* [release] grok-1 inference benchmark
* [release] grok-1 inference benchmark
* [release] grok-1 inference benchmark
2024-03-25 14:42:51 +08:00
yuehuayingxueluo
87079cffe8
[Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding ( #5461 )
...
* Support FP16/BF16 Flash Attention 2
* fix bugs in test_kv_cache_memcpy.py
* add context_kv_cache_memcpy_kernel.cu
* rm typename MT
* add tail process
* add high_precision
* add high_precision to config.py
* rm unused code
* change the comment for the high_precision parameter
* update test_rotary_embdding_unpad.py
* fix vector_copy_utils.h
* add comment for self.high_precision when using float32
2024-03-25 13:40:34 +08:00
Wenhao Chen
bb0a668fee
[hotfix] set return_outputs=False in examples and polish code ( #5404 )
...
* fix: simplify merge_batch
* fix: use return_outputs=False to eliminate extra memory consumption
* feat: add return_outputs warning
* style: remove `return_outputs=False` as it is the default value
2024-03-25 12:31:09 +08:00
Yuanheng Zhao
5fcd7795cd
[example] update Grok-1 inference ( #5495 )
...
* revise grok-1 example
* remove unused arg in scripts
* prevent re-installing torch
* update readme
* revert modifying colossalai requirements
* add perf
* trivial
* add tokenizer url
2024-03-24 20:24:11 +08:00
binmakeswell
6df844b8c4
[release] grok-1 314b inference ( #5490 )
...
* [release] grok-1 inference
* [release] grok-1 inference
* [release] grok-1 inference
2024-03-22 15:48:12 +08:00
Hongxin Liu
848a574c26
[example] add grok-1 inference ( #5485 )
...
* [misc] add submodule
* remove submodule
* [example] support grok-1 tp inference
* [example] add grok-1 inference script
* [example] refactor code
* [example] add grok-1 readme
* [exmaple] add test ci
* [exmaple] update readme
2024-03-21 18:07:22 +08:00
yuehuayingxueluo
f366a5ea1f
[Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel ( #5418 )
...
* add rotary embedding kernel
* add rotary_embedding_kernel
* add fused rotary_emb and kvcache memcopy
* add fused_rotary_emb_and_cache_kernel.cu
* add fused_rotary_emb_and_memcopy
* fix bugs in fused_rotary_emb_and_cache_kernel.cu
* fix ci bugs
* use vec memcopy and opt the gloabl memory access
* fix code style
* fix test_rotary_embdding_unpad.py
* codes revised based on the review comments
* fix bugs about include path
* rm inline
2024-03-13 17:20:03 +08:00
digger yu
385e85afd4
[hotfix] fix typo s/keywrods/keywords etc. ( #5429 )
2024-03-12 11:25:16 +08:00
Steve Luo
f7aecc0c6b
feat rmsnorm cuda kernel and add unittest, benchmark script ( #5417 )
2024-03-08 16:21:12 +08:00
Youngon
68f55a709c
[hotfix] fix stable diffusion inference bug. ( #5289 )
...
* Update train_ddp.yaml
delete "strategy" to fix DDP config loading bug in "main.py"
* Update train_ddp.yaml
fix inference with scripts/txt2img.py config file load bug.
* Update README.md
add pretrain model test code.
2024-03-05 22:03:40 +08:00
Luo Yihang
e239cf9060
[hotfix] fix typo of openmoe model source ( #5403 )
2024-03-05 21:44:38 +08:00
MickeyCHAN
e304e4db35
[hotfix] fix sd vit import error ( #5420 )
...
* fix import error
* Update dpt_depth.py
---------
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
2024-03-05 21:41:23 +08:00
Hongxin Liu
070df689e6
[devops] fix extention building ( #5427 )
2024-03-05 15:35:54 +08:00
flybird11111
29695cf70c
[example]add gpt2 benchmark example script. ( #5295 )
...
* benchmark gpt2
* fix
fix
fix
fix
* [doc] fix typo in Colossal-LLaMA-2/README.md (#5247 )
* [workflow] fixed build CI (#5240 )
* [workflow] fixed build CI
* polish
* polish
* polish
* polish
* polish
* [ci] fixed booster test (#5251 )
* [ci] fixed booster test
* [ci] fixed booster test
* [ci] fixed booster test
* [ci] fixed ddp test (#5254 )
* [ci] fixed ddp test
* polish
* fix typo in applications/ColossalEval/README.md (#5250 )
* [ci] fix shardformer tests. (#5255 )
* fix ci
fix
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* [doc] fix doc typo (#5256 )
* [doc] fix annotation display
* [doc] fix llama2 doc
* [hotfix]: add pp sanity check and fix mbs arg (#5268 )
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check
* [workflow] fixed incomplete bash command (#5272 )
* [workflow] fixed oom tests (#5275 )
* [workflow] fixed oom tests
* polish
* polish
* polish
* [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276 )
* fix ci
fix
* fix test
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
* fix
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* [shardformer] hybridparallelplugin support gradients accumulation. (#5246 )
* support gradients acc
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
* fix
fix
* fix
fix
fix
* [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230 )
* fix auto loading gpt2 tokenizer (#5279 )
* [doc] add llama2-13B disyplay (#5285 )
* Update README.md
* fix 13b typo
---------
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
* fix llama pretrain (#5287 )
* fix
* fix
* fix
fix
* fix
fix
fix
* fix
fix
* benchmark gpt2
* fix
fix
fix
fix
* [workflow] fixed build CI (#5240 )
* [workflow] fixed build CI
* polish
* polish
* polish
* polish
* polish
* [ci] fixed booster test (#5251 )
* [ci] fixed booster test
* [ci] fixed booster test
* [ci] fixed booster test
* fix
fix
* fix
fix
fix
* fix
* fix
fix
fix
fix
fix
* fix
* Update shardformer.py
---------
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com>
Co-authored-by: Desperado-Jia <502205863@qq.com>
2024-03-04 16:18:13 +08:00
FrankLeeeee
0310b76e9d
Merge branch 'main' into sync/main
2024-03-04 10:09:36 +08:00
yuehuayingxueluo
0aa27f1961
[Inference]Move benchmark-related code to the example directory. ( #5408 )
...
* move benchmark-related code to the example directory.
* fix bugs in test_fused_rotary_embedding.py
2024-02-28 16:46:03 +08:00
yuehuayingxueluo
600881a8ea
[Inference]Add CUDA KVCache Kernel ( #5406 )
...
* add cuda KVCache kernel
* annotation benchmark_kvcache_copy
* add use cuda
* fix import path
* move benchmark scripts to example/
* rm benchmark codes in test_kv_cache_memcpy.py
* rm redundancy codes
* rm redundancy codes
* pr was modified according to the review
2024-02-28 14:36:50 +08:00
Hongxin Liu
d882d18c65
[example] reuse flash attn patch ( #5400 )
2024-02-27 11:22:07 +08:00
yuehuayingxueluo
bc1da87366
[Fix/Inference] Fix format of input prompts and input model in inference engine ( #5395 )
...
* Fix bugs in inference_engine
* fix bugs in engine.py
* rm CUDA_VISIBLE_DEVICES
* add request_ids in generate
* fix bug in engine.py
* add logger.debug for BatchBucket
2024-02-23 10:51:35 +08:00
yuehuayingxueluo
2a718c8be8
Optimized the execution interval time between cuda kernels caused by view and memcopy ( #5390 )
...
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs
2024-02-21 13:23:57 +08:00
Jianghai
730103819d
[Inference]Fused kv copy into rotary calculation ( #5383 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del
2024-02-21 11:31:48 +08:00
yuehuayingxueluo
8c69debdc7
[Inference]Support vllm testing in benchmark scripts ( #5379 )
...
* add vllm benchmark scripts
* fix code style
* update run_benchmark.sh
* fix code style
2024-02-08 15:27:26 +08:00
Frank Lee
8106ede07f
Revert "[Inference] Adapt to Fused rotary ( #5348 )" ( #5373 )
...
This reverts commit 9f4ab2eb92
.
2024-02-07 14:27:04 +08:00
Jianghai
9f4ab2eb92
[Inference] Adapt to Fused rotary ( #5348 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
2024-02-07 11:36:04 +08:00
yuehuayingxueluo
631862f339
[Inference]Optimize generation process of inference engine ( #5356 )
...
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback tesh_inference_engine.py
2024-02-02 15:38:21 +08:00
yuehuayingxueluo
21ad4a27f9
[Inference/opt]Optimize the mid tensor of RMS Norm ( #5350 )
...
* opt rms_norm
* fix bugs in rms_layernorm
2024-02-02 15:06:01 +08:00
yuehuayingxueluo
249644c23b
[Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add ( #5340 )
...
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
2024-02-01 15:49:39 +08:00
FrankLeeeee
c565519913
merge commit
2024-01-31 10:41:47 +08:00
digger yu
71321a07cf
fix typo change dosen't to doesn't ( #5308 )
2024-01-30 09:57:38 +08:00
Frank Lee
8823cc4831
Merge pull request #5310 from hpcaitech/feature/npu
...
Feature/npu
2024-01-29 13:49:39 +08:00