yuehuayingxueluo
2a718c8be8
Optimized the execution interval time between cuda kernels caused by view and memcopy ( #5390 )
...
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs
9 months ago
Jianghai
730103819d
[Inference]Fused kv copy into rotary calculation ( #5383 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del
9 months ago
Yuanheng Zhao
b21aac5bae
[Inference] Optimize and Refactor Inference Batching/Scheduling ( #5367 )
...
* add kvcache manager funcs for batching
* add batch bucket for batching
* revise RunningList struct in handler
* add kvcache/batch funcs for compatibility
* use new batching methods
* fix indexing bugs
* revise abort logic
* use cpu seq lengths/block tables
* rm unused attr in Sequence
* fix type conversion/default arg
* add and revise pytests
* revise pytests, rm unused tests
* rm unused statements
* fix pop finished indexing issue
* fix: use index in batch when retrieving inputs/update seqs
* use dict instead of odict in batch struct
* arg type hinting
* fix make compress
* refine comments
* fix: pop_n_seqs to pop the first n seqs
* add check in request handler
* remove redundant conversion
* fix test for request handler
* fix pop method in batch bucket
* fix prefill adding
9 months ago
yuehuayingxueluo
8c69debdc7
[Inference]Support vllm testing in benchmark scripts ( #5379 )
...
* add vllm benchmark scripts
* fix code style
* update run_benchmark.sh
* fix code style
10 months ago
Frank Lee
9afa52061f
[inference] refactored config ( #5376 )
10 months ago
Jianghai
1f8c7e7046
[Inference] User Experience: update the logic of default tokenizer and generation config. ( #5337 )
...
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
10 months ago
yuehuayingxueluo
6fb4bcbb24
[Inference/opt] Fused KVCahce Memcopy ( #5374 )
...
* fused kv memcopy
* add TODO in test_kvcache_copy.py
10 months ago
Frank Lee
58740b5f68
[inference] added inference template ( #5375 )
10 months ago
Frank Lee
8106ede07f
Revert "[Inference] Adapt to Fused rotary ( #5348 )" ( #5373 )
...
This reverts commit 9f4ab2eb92
.
10 months ago
Jianghai
9f4ab2eb92
[Inference] Adapt to Fused rotary ( #5348 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
10 months ago
yuehuayingxueluo
35382a7fbf
[Inference]Fused the gate and up proj in mlp,and optimized the autograd process. ( #5365 )
...
* fused the gate and up proj in mlp
* fix code styles
* opt auto_grad
* rollback test_inference_engine.py
* modifications based on the review feedback.
* fix bugs in flash attn
* Change reshape to view
* fix test_rmsnorm_triton.py
10 months ago
Yuanheng Zhao
1dedb57747
[Fix/Infer] Remove unused deps and revise requirements ( #5341 )
...
* remove flash-attn dep
* rm padding llama
* revise infer requirements
* move requirements out of module
10 months ago
yuehuayingxueluo
631862f339
[Inference]Optimize generation process of inference engine ( #5356 )
...
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback tesh_inference_engine.py
10 months ago
yuehuayingxueluo
21ad4a27f9
[Inference/opt]Optimize the mid tensor of RMS Norm ( #5350 )
...
* opt rms_norm
* fix bugs in rms_layernorm
10 months ago
Frank Lee
027aa1043f
[doc] updated inference readme ( #5343 )
10 months ago
Frank Lee
db1a763307
[inference] removed redundancy init_batch ( #5353 )
10 months ago
yuehuayingxueluo
249644c23b
[Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add ( #5340 )
...
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
10 months ago
Frank Lee
f8e456d202
[inference] simplified config verification ( #5346 )
...
* [inference] simplified config verification
* polish
* polish
10 months ago
Yuanheng Zhao
5f98a9d68a
[Infer] Optimize Blocked KVCache And Kernels Using It ( #5325 )
...
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval
10 months ago
yuehuayingxueluo
e8f0642f28
[Inference]Add Nopadding Llama Modeling ( #5327 )
...
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style
10 months ago
Jianghai
c7c104cb7c
[DOC] Update inference readme ( #5280 )
...
* add readme
* add readme
* 1
* update engine
* finish readme
* add readme
10 months ago
yuehuayingxueluo
4f28cb43c0
[inference]Optimize the usage of the mid tensors space in flash attn ( #5304 )
...
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
10 months ago
Yuanheng Zhao
3da9993b0d
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark ( #5301 )
...
* fix decoding kernel pytest
* revise and add triton context attn benchmark
10 months ago
yuehuayingxueluo
cea9c86e45
add utils.py
10 months ago
yuehuayingxueluo
bfff9254ac
[inference] Adapted to Rotary Embedding and RMS Norm ( #5283 )
...
* adapted to rotary_embedding
* adapted to nopad rms norm
* fix bugs in benchmark
* fix flash_decoding.py
10 months ago
Yuanheng Zhao
6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking ( #5274 )
...
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
10 months ago
Jianghai
9e2342bde2
[Hotfix] Fix bugs in testing continuous batching ( #5270 )
...
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
10 months ago
yuehuayingxueluo
86b63f720c
[Inference]Adapted to the triton attn kernels ( #5264 )
...
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print
10 months ago
Jianghai
d8db500efc
[Inference] Fix request handler and add recycle logic ( #5260 )
...
* fix request handler
* fix comment
11 months ago
Frank Lee
c597678da4
[doc] updated inference readme ( #5269 )
11 months ago
Yuanheng Zhao
fa85e02b3b
[kernel] Add KV cache copy kernel during decoding ( #5261 )
...
* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy
11 months ago
FrankLeeeee
1ded7e81ef
[git] fixed rebased files
11 months ago
yuehuayingxueluo
d40eb26029
fix bugs in request_handler.py and engine.py
11 months ago
yuehuayingxueluo
10e3c9f923
rm torch.cuda.synchronize
11 months ago
yuehuayingxueluo
fab294c7f4
fix CI bugs
11 months ago
yuehuayingxueluo
2a73e828eb
fix bugs related to processing padding mask
11 months ago
Jianghai
e545a871b8
[Hotfix] Fix accuracy and align attention method api with Triton kernel ( #5229 )
...
* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs
11 months ago
yuehuayingxueluo
fa4fbdbffb
adapted to pad_context_forward
11 months ago
yuehuayingxueluo
47e53eaa1c
fix bugs in attention.py and request_handler.py
11 months ago
Jianghai
bfd9b1b494
[Inference] Pytorch Attention func, pad&nopad input support ( #5219 )
...
* add attn
* add attention test
* fix attn forward
* fix decoding
11 months ago
yuehuayingxueluo
3ad1f3b78b
fix beam_width
11 months ago
yuehuayingxueluo
b2eb9cd186
Fixed a typo
11 months ago
yuehuayingxueluo
bbfebfb9fc
fix bugs in sampler
11 months ago
yuehuayingxueluo
02c1bf8b2a
add context_attention_unpadded
11 months ago
yuehuayingxueluo
9489dc64d8
precision alignment
11 months ago
yuehuayingxueluo
62968588d1
fix bugs in request_handler
11 months ago
yuehuayingxueluo
62fd08ee44
Fixed a bug in the inference frame
11 months ago
yuehuayingxueluo
86853a37d5
Add padding llama model
11 months ago
Jianghai
0e616462a7
[Inference] add logit processor and request handler ( #5166 )
...
* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
11 months ago
yuehuayingxueluo
8daee26989
[Inference] Add the logic of the inference engine ( #5173 )
...
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related o tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt
11 months ago