yuehuayingxueluo
f79963199c
[inference]Add alibi to flash attn function ( #5678 )
...
* add alibi to flash attn function
* rm redundant modifications
7 months ago
Steve Luo
5cd75ce4c7
[Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… ( #5663 )
...
* refactor kvcache manager and rotary_embedding and kvcache_memcpy operator
* refactor decode_kv_cache_memcpy
* enable alibi in pagedattention
7 months ago
yuehuayingxueluo
5f00002e43
[Inference] Adapt Baichuan2-13B TP ( #5659 )
...
* adapt to baichuan2 13B
* add baichuan2 13B TP
* update baichuan tp logic
* rm unused code
* Fix TP logic
* fix alibi slopes tp logic
* rm nn.Module
* Polished the code.
* change BAICHUAN_MODEL_NAME_OR_PATH
* Modified the logic for loading Baichuan weights.
* fix typos
7 months ago
yuehuayingxueluo
3c91e3f176
[Inference]Adapt to baichuan2 13B ( #5614 )
...
* adapt to baichuan2 13B
* adapt to baichuan2 13B
* change BAICHUAN_MODEL_NAME_OR_PATH
* fix test_decoding_attn.py
* Modifications based on review comments.
* change BAICHUAN_MODEL_NAME_OR_PATH
* mv attn mask processes to test flash decoding
* mv get_alibi_slopes baichuan modeling
* fix bugs in test_baichuan.py
7 months ago
Steve Luo
a8fd3b0342
[Inference/Kernel] Optimize paged attention: Refactor key cache layout ( #5643 )
...
* optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x])
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
Yuanheng Zhao
04863a9b14
[example] Update Llama Inference example ( #5629 )
...
* [example] add infernece benchmark llama3
* revise inference config - arg
* remove unused args
* add llama generation demo script
* fix init rope in llama policy
* add benchmark-llama3 - cleanup
7 months ago
Yuanheng Zhao
5d4c1fe8f5
[Fix/Inference] Fix GQA Triton and Support Llama3 ( #5624 )
...
* [fix] GQA calling of flash decoding triton
* fix kv cache alloc shape
* fix rotary triton - GQA
* fix sequence max length assigning
* Sequence max length logic
* fix scheduling and spec-dec
* skip without import error
* fix pytest - skip without ImportError
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
Runyu Lu
e37ee2fb65
[Feat]Tensor Model Parallel Support For Inference ( #5563 )
...
* tensor parallel support naive source
* [fix]precision, model load and refactor the framework
* add tp unit test
* docstring
* fix do_sample
7 months ago
Steve Luo
be396ad6cc
[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block ( #5531 )
...
* feat flash decoding for paged attention
* refactor flashdecodingattention
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
yuehuayingxueluo
56b222eff8
[inference/model]Adapted to the baichuan2-7B model ( #5591 )
...
* Adapted to the baichuan2-7B model
* modified according to the review comments.
* Modified the method of obtaining random weights.
* modified according to the review comments.
* change mlp layewr 'NOTE'
8 months ago
Yuanheng
f8598e3ec5
[Fix] Llama Modeling Control with Spec-Dec ( #5580 )
...
- fix ref before asgmt
- fall back to use triton kernels when using spec-dec
8 months ago
Yuanheng Zhao
e60d430cf5
[Fix] resolve conflicts of rebasing feat/speculative-decoding ( #5557 )
...
- resolve conflicts of rebasing feat/speculative-decoding
8 months ago
Yuanheng Zhao
e1acb58423
[doc] Add inference/speculative-decoding README ( #5552 )
...
* add README for spec-dec
* update roadmap
8 months ago
Yuanheng Zhao
d85d91435a
[Inference/SpecDec] Support GLIDE Drafter Model ( #5455 )
...
* add glide-llama policy and modeling
* update glide modeling, compitable with transformers 4.36.2
* revise glide llama modeling/usage
* fix issues of glimpsing large kv
* revise the way re-loading params for glide drafter
* fix drafter and engine tests
* enable convert to glide strict=False
* revise glide llama modeling
* revise vicuna prompt template
* revise drafter and tests
* apply usage of glide model in engine
8 months ago
Yuanheng Zhao
912e24b2aa
[SpecDec] Fix inputs for speculation and revise past KV trimming ( #5449 )
...
* fix drafter pastkv and usage of batch bucket
8 months ago
Yuanheng Zhao
a37f82629d
[Inference/SpecDec] Add Speculative Decoding Implementation ( #5423 )
...
* fix flash decoding mask during verification
* add spec-dec
* add test for spec-dec
* revise drafter init
* remove drafter sampling
* retire past kv in drafter
* (trivial) rename attrs
* (trivial) rename arg
* revise how we enable/disable spec-dec
8 months ago
Yuanheng Zhao
5a9b05f7b2
[Inference/SpecDec] Add Basic Drafter Model Container ( #5405 )
...
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399 )
fix dependency in pytest
* add drafter model container (basic ver)
8 months ago
Yuanheng Zhao
4bb5d8923a
[Fix/Inference] Remove unused and non-functional functions ( #5543 )
...
* [fix] remove unused func
* rm non-functional partial
8 months ago
yuehuayingxueluo
04aca9e55b
[Inference/Kernel]Add get_cos_and_sin Kernel ( #5528 )
...
* Add get_cos_and_sin kernel
* fix code comments
* fix code typos
* merge common codes of get_cos_and_sin kernel.
* Fixed a typo
* Changed 'asset allclose' to 'assert equal'.
8 months ago
傅剑寒
e6496dd371
[Inference] Optimize request handler of llama ( #5512 )
...
* optimize request_handler
* fix ways of writing
8 months ago
Runyu Lu
6251d68dc9
[fix] PR #5354 ( #5501 )
...
* [fix]
* [fix]
* Update config.py docstring
* [fix] docstring align
* [fix] docstring align
* [fix] docstring align
8 months ago
Runyu Lu
68e9396bc0
[fix] merge conflicts
8 months ago
yuehuayingxueluo
87079cffe8
[Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding ( #5461 )
...
* Support FP16/BF16 Flash Attention 2
* fix bugs in test_kv_cache_memcpy.py
* add context_kv_cache_memcpy_kernel.cu
* rm typename MT
* add tail process
* add high_precision
* add high_precision to config.py
* rm unused code
* change the comment for the high_precision parameter
* update test_rotary_embdding_unpad.py
* fix vector_copy_utils.h
* add comment for self.high_precision when using float32
8 months ago
Runyu Lu
ff4998c6f3
[fix] remove unused comment
8 months ago
Runyu Lu
5b017d6324
[fix]
8 months ago
Runyu Lu
4eafe0c814
[fix] unused option
8 months ago
Runyu Lu
aabc9fb6aa
[feat] add use_cuda_kernel option
8 months ago
Runyu Lu
6e30248683
[fix] tmp for test
9 months ago
Runyu Lu
d02e257abd
Merge branch 'feature/colossal-infer' into colossal-infer-cuda-graph
9 months ago
Runyu Lu
ae24b4f025
diverse tests
9 months ago
Runyu Lu
1821a6dab0
[fix] pytest and fix dyn grid bug
9 months ago
yuehuayingxueluo
f366a5ea1f
[Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel ( #5418 )
...
* add rotary embedding kernel
* add rotary_embedding_kernel
* add fused rotary_emb and kvcache memcopy
* add fused_rotary_emb_and_cache_kernel.cu
* add fused_rotary_emb_and_memcopy
* fix bugs in fused_rotary_emb_and_cache_kernel.cu
* fix ci bugs
* use vec memcopy and opt the gloabl memory access
* fix code style
* fix test_rotary_embdding_unpad.py
* codes revised based on the review comments
* fix bugs about include path
* rm inline
9 months ago
Runyu Lu
633e95b301
[doc] add doc
9 months ago
Runyu Lu
9dec66fad6
[fix] multi graphs capture error
9 months ago
Runyu Lu
b2c0d9ff2b
[fix] multi graphs capture error
9 months ago
Steve Luo
f7aecc0c6b
feat rmsnorm cuda kernel and add unittest, benchmark script ( #5417 )
9 months ago
Runyu Lu
cefaeb5fdd
[feat] cuda graph support and refactor non-functional api
9 months ago
yuehuayingxueluo
600881a8ea
[Inference]Add CUDA KVCache Kernel ( #5406 )
...
* add cuda KVCache kernel
* annotation benchmark_kvcache_copy
* add use cuda
* fix import path
* move benchmark scripts to example/
* rm benchmark codes in test_kv_cache_memcpy.py
* rm redundancy codes
* rm redundancy codes
* pr was modified according to the review
9 months ago
yuehuayingxueluo
bc1da87366
[Fix/Inference] Fix format of input prompts and input model in inference engine ( #5395 )
...
* Fix bugs in inference_engine
* fix bugs in engine.py
* rm CUDA_VISIBLE_DEVICES
* add request_ids in generate
* fix bug in engine.py
* add logger.debug for BatchBucket
9 months ago
yuehuayingxueluo
2a718c8be8
Optimized the execution interval time between cuda kernels caused by view and memcopy ( #5390 )
...
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs
9 months ago
Jianghai
730103819d
[Inference]Fused kv copy into rotary calculation ( #5383 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del
9 months ago
Yuanheng Zhao
b21aac5bae
[Inference] Optimize and Refactor Inference Batching/Scheduling ( #5367 )
...
* add kvcache manager funcs for batching
* add batch bucket for batching
* revise RunningList struct in handler
* add kvcache/batch funcs for compatibility
* use new batching methods
* fix indexing bugs
* revise abort logic
* use cpu seq lengths/block tables
* rm unused attr in Sequence
* fix type conversion/default arg
* add and revise pytests
* revise pytests, rm unused tests
* rm unused statements
* fix pop finished indexing issue
* fix: use index in batch when retrieving inputs/update seqs
* use dict instead of odict in batch struct
* arg type hinting
* fix make compress
* refine comments
* fix: pop_n_seqs to pop the first n seqs
* add check in request handler
* remove redundant conversion
* fix test for request handler
* fix pop method in batch bucket
* fix prefill adding
9 months ago
yuehuayingxueluo
8c69debdc7
[Inference]Support vllm testing in benchmark scripts ( #5379 )
...
* add vllm benchmark scripts
* fix code style
* update run_benchmark.sh
* fix code style
10 months ago
Frank Lee
9afa52061f
[inference] refactored config ( #5376 )
10 months ago
Jianghai
1f8c7e7046
[Inference] User Experience: update the logic of default tokenizer and generation config. ( #5337 )
...
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
10 months ago
yuehuayingxueluo
6fb4bcbb24
[Inference/opt] Fused KVCahce Memcopy ( #5374 )
...
* fused kv memcopy
* add TODO in test_kvcache_copy.py
10 months ago
Frank Lee
58740b5f68
[inference] added inference template ( #5375 )
10 months ago
Frank Lee
8106ede07f
Revert "[Inference] Adapt to Fused rotary ( #5348 )" ( #5373 )
...
This reverts commit 9f4ab2eb92
.
10 months ago
Jianghai
9f4ab2eb92
[Inference] Adapt to Fused rotary ( #5348 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
10 months ago
yuehuayingxueluo
35382a7fbf
[Inference]Fused the gate and up proj in mlp,and optimized the autograd process. ( #5365 )
...
* fused the gate and up proj in mlp
* fix code styles
* opt auto_grad
* rollback test_inference_engine.py
* modifications based on the review feedback.
* fix bugs in flash attn
* Change reshape to view
* fix test_rmsnorm_triton.py
10 months ago