yuehuayingxueluo
2a718c8be8
Optimized the execution interval time between cuda kernels caused by view and memcopy ( #5390 )
...
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs
2024-02-21 13:23:57 +08:00
Jianghai
730103819d
[Inference]Fused kv copy into rotary calculation ( #5383 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del
2024-02-21 11:31:48 +08:00
Yuanheng Zhao
b21aac5bae
[Inference] Optimize and Refactor Inference Batching/Scheduling ( #5367 )
...
* add kvcache manager funcs for batching
* add batch bucket for batching
* revise RunningList struct in handler
* add kvcache/batch funcs for compatibility
* use new batching methods
* fix indexing bugs
* revise abort logic
* use cpu seq lengths/block tables
* rm unused attr in Sequence
* fix type conversion/default arg
* add and revise pytests
* revise pytests, rm unused tests
* rm unused statements
* fix pop finished indexing issue
* fix: use index in batch when retrieving inputs/update seqs
* use dict instead of odict in batch struct
* arg type hinting
* fix make compress
* refine comments
* fix: pop_n_seqs to pop the first n seqs
* add check in request handler
* remove redundant conversion
* fix test for request handler
* fix pop method in batch bucket
* fix prefill adding
2024-02-19 17:18:20 +08:00
yuehuayingxueluo
8c69debdc7
[Inference]Support vllm testing in benchmark scripts ( #5379 )
...
* add vllm benchmark scripts
* fix code style
* update run_benchmark.sh
* fix code style
2024-02-08 15:27:26 +08:00
Frank Lee
9afa52061f
[inference] refactored config ( #5376 )
2024-02-08 14:04:14 +08:00
Jianghai
1f8c7e7046
[Inference] User Experience: update the logic of default tokenizer and generation config. ( #5337 )
...
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
2024-02-07 17:55:48 +08:00
yuehuayingxueluo
6fb4bcbb24
[Inference/opt] Fused KVCahce Memcopy ( #5374 )
...
* fused kv memcopy
* add TODO in test_kvcache_copy.py
2024-02-07 17:15:42 +08:00
Frank Lee
58740b5f68
[inference] added inference template ( #5375 )
2024-02-07 17:11:43 +08:00
Frank Lee
8106ede07f
Revert "[Inference] Adapt to Fused rotary ( #5348 )" ( #5373 )
...
This reverts commit 9f4ab2eb92
.
2024-02-07 14:27:04 +08:00
Jianghai
9f4ab2eb92
[Inference] Adapt to Fused rotary ( #5348 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
2024-02-07 11:36:04 +08:00
yuehuayingxueluo
35382a7fbf
[Inference]Fused the gate and up proj in mlp,and optimized the autograd process. ( #5365 )
...
* fused the gate and up proj in mlp
* fix code styles
* opt auto_grad
* rollback test_inference_engine.py
* modifications based on the review feedback.
* fix bugs in flash attn
* Change reshape to view
* fix test_rmsnorm_triton.py
2024-02-06 19:38:25 +08:00
Yuanheng Zhao
1dedb57747
[Fix/Infer] Remove unused deps and revise requirements ( #5341 )
...
* remove flash-attn dep
* rm padding llama
* revise infer requirements
* move requirements out of module
2024-02-06 17:27:45 +08:00
yuehuayingxueluo
631862f339
[Inference]Optimize generation process of inference engine ( #5356 )
...
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback tesh_inference_engine.py
2024-02-02 15:38:21 +08:00
yuehuayingxueluo
21ad4a27f9
[Inference/opt]Optimize the mid tensor of RMS Norm ( #5350 )
...
* opt rms_norm
* fix bugs in rms_layernorm
2024-02-02 15:06:01 +08:00
Frank Lee
027aa1043f
[doc] updated inference readme ( #5343 )
2024-02-02 14:31:10 +08:00
Frank Lee
db1a763307
[inference] removed redundancy init_batch ( #5353 )
2024-02-02 11:44:15 +08:00
yuehuayingxueluo
249644c23b
[Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add ( #5340 )
...
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
2024-02-01 15:49:39 +08:00
Frank Lee
f8e456d202
[inference] simplified config verification ( #5346 )
...
* [inference] simplified config verification
* polish
* polish
2024-02-01 15:31:01 +08:00
Yuanheng Zhao
5f98a9d68a
[Infer] Optimize Blocked KVCache And Kernels Using It ( #5325 )
...
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval
2024-01-30 16:06:09 +08:00
yuehuayingxueluo
e8f0642f28
[Inference]Add Nopadding Llama Modeling ( #5327 )
...
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style
2024-01-30 10:31:46 +08:00
Jianghai
c7c104cb7c
[DOC] Update inference readme ( #5280 )
...
* add readme
* add readme
* 1
* update engine
* finish readme
* add readme
2024-01-29 16:21:06 +08:00
yuehuayingxueluo
4f28cb43c0
[inference]Optimize the usage of the mid tensors space in flash attn ( #5304 )
...
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
2024-01-26 14:00:10 +08:00
Yuanheng Zhao
3da9993b0d
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark ( #5301 )
...
* fix decoding kernel pytest
* revise and add triton context attn benchmark
2024-01-23 17:16:02 +08:00
yuehuayingxueluo
cea9c86e45
add utils.py
2024-01-22 16:06:27 +08:00
yuehuayingxueluo
bfff9254ac
[inference] Adapted to Rotary Embedding and RMS Norm ( #5283 )
...
* adapted to rotary_embedding
* adapted to nopad rms norm
* fix bugs in benchmark
* fix flash_decoding.py
2024-01-22 10:55:34 +08:00
Yuanheng Zhao
6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking ( #5274 )
...
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
2024-01-19 15:47:16 +08:00
Jianghai
9e2342bde2
[Hotfix] Fix bugs in testing continuous batching ( #5270 )
...
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
2024-01-18 16:31:14 +08:00
yuehuayingxueluo
86b63f720c
[Inference]Adapted to the triton attn kernels ( #5264 )
...
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print
2024-01-17 16:03:10 +08:00
Jianghai
d8db500efc
[Inference] Fix request handler and add recycle logic ( #5260 )
...
* fix request handler
* fix comment
2024-01-15 17:50:46 +08:00
Frank Lee
c597678da4
[doc] updated inference readme ( #5269 )
2024-01-15 17:37:41 +08:00
Yuanheng Zhao
fa85e02b3b
[kernel] Add KV cache copy kernel during decoding ( #5261 )
...
* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy
2024-01-15 17:37:20 +08:00
FrankLeeeee
1ded7e81ef
[git] fixed rebased files
2024-01-11 13:50:45 +00:00
yuehuayingxueluo
d40eb26029
fix bugs in request_handler.py and engine.py
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
10e3c9f923
rm torch.cuda.synchronize
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
fab294c7f4
fix CI bugs
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
2a73e828eb
fix bugs related to processing padding mask
2024-01-11 13:46:14 +00:00
Jianghai
e545a871b8
[Hotfix] Fix accuracy and align attention method api with Triton kernel ( #5229 )
...
* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
fa4fbdbffb
adapted to pad_context_forward
2024-01-11 13:44:06 +00:00
yuehuayingxueluo
47e53eaa1c
fix bugs in attention.py and request_handler.py
2024-01-11 13:44:06 +00:00
Jianghai
bfd9b1b494
[Inference] Pytorch Attention func, pad&nopad input support ( #5219 )
...
* add attn
* add attention test
* fix attn forward
* fix decoding
2024-01-11 13:44:06 +00:00
yuehuayingxueluo
3ad1f3b78b
fix beam_width
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
b2eb9cd186
Fixed a typo
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
bbfebfb9fc
fix bugs in sampler
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
02c1bf8b2a
add context_attention_unpadded
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
9489dc64d8
precision alignment
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
62968588d1
fix bugs in request_handler
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
62fd08ee44
Fixed a bug in the inference frame
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
86853a37d5
Add padding llama model
2024-01-11 13:39:56 +00:00
Jianghai
0e616462a7
[Inference] add logit processor and request handler ( #5166 )
...
* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
8daee26989
[Inference] Add the logic of the inference engine ( #5173 )
...
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related o tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt
2024-01-11 13:39:56 +00:00