Frank Lee
9afa52061f
[inference] refactored config ( #5376 )
2024-02-08 14:04:14 +08:00
Jianghai
1f8c7e7046
[Inference] User Experience: update the logic of default tokenizer and generation config. ( #5337 )
...
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
2024-02-07 17:55:48 +08:00
yuehuayingxueluo
6fb4bcbb24
[Inference/opt] Fused KVCache Memcopy ( #5374 )
...
* fused kv memcopy
* add TODO in test_kvcache_copy.py
2024-02-07 17:15:42 +08:00
Frank Lee
58740b5f68
[inference] added inference template ( #5375 )
2024-02-07 17:11:43 +08:00
Frank Lee
8106ede07f
Revert "[Inference] Adapt to Fused rotary ( #5348 )" ( #5373 )
...
This reverts commit 9f4ab2eb92.
2024-02-07 14:27:04 +08:00
Jianghai
9f4ab2eb92
[Inference] Adapt to Fused rotary ( #5348 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
2024-02-07 11:36:04 +08:00
yuehuayingxueluo
35382a7fbf
[Inference]Fused the gate and up proj in mlp, and optimized the autograd process. ( #5365 )
...
* fused the gate and up proj in mlp
* fix code styles
* opt auto_grad
* rollback test_inference_engine.py
* modifications based on the review feedback.
* fix bugs in flash attn
* Change reshape to view
* fix test_rmsnorm_triton.py
2024-02-06 19:38:25 +08:00
Yuanheng Zhao
1dedb57747
[Fix/Infer] Remove unused deps and revise requirements ( #5341 )
...
* remove flash-attn dep
* rm padding llama
* revise infer requirements
* move requirements out of module
2024-02-06 17:27:45 +08:00
yuehuayingxueluo
631862f339
[Inference]Optimize generation process of inference engine ( #5356 )
...
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback test_inference_engine.py
2024-02-02 15:38:21 +08:00
yuehuayingxueluo
21ad4a27f9
[Inference/opt]Optimize the mid tensor of RMS Norm ( #5350 )
...
* opt rms_norm
* fix bugs in rms_layernorm
2024-02-02 15:06:01 +08:00
Frank Lee
027aa1043f
[doc] updated inference readme ( #5343 )
2024-02-02 14:31:10 +08:00
Frank Lee
db1a763307
[inference] removed redundancy init_batch ( #5353 )
2024-02-02 11:44:15 +08:00
yuehuayingxueluo
249644c23b
[Inference]Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add ( #5340 )
...
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
2024-02-01 15:49:39 +08:00
Frank Lee
f8e456d202
[inference] simplified config verification ( #5346 )
...
* [inference] simplified config verification
* polish
* polish
2024-02-01 15:31:01 +08:00
Jianghai
df0aa49585
[Inference] Kernel Fusion, fused copy kv cache into rotary embedding ( #5336 )
...
* revise rotary embedding
* remove useless print
* adapt
2024-01-31 16:31:29 +08:00
FrankLeeeee
c565519913
merge commit
2024-01-31 10:41:47 +08:00
Yuanheng Zhao
5f98a9d68a
[Infer] Optimize Blocked KVCache And Kernels Using It ( #5325 )
...
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval
2024-01-30 16:06:09 +08:00
yuehuayingxueluo
e8f0642f28
[Inference]Add Nopadding Llama Modeling ( #5327 )
...
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style
2024-01-30 10:31:46 +08:00
digger yu
71321a07cf
fix typo change dosen't to doesn't ( #5308 )
2024-01-30 09:57:38 +08:00
flybird11111
388179f966
[tests] fix t5 test. ( #5322 )
...
* [ci] fix shardformer tests. (#5255 )
* fix ci
fix
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* fix t5 test
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-29 17:38:46 +08:00
Jianghai
c7c104cb7c
[DOC] Update inference readme ( #5280 )
...
* add readme
* add readme
* 1
* update engine
* finish readme
* add readme
2024-01-29 16:21:06 +08:00
FrankLeeeee
087d0cb1fc
[accelerator] fixed npu api
2024-01-29 14:27:52 +08:00
Frank Lee
8823cc4831
Merge pull request #5310 from hpcaitech/feature/npu
...
Feature/npu
2024-01-29 13:49:39 +08:00
Jianghai
1f8a75d470
[Inference] Update rms norm kernel, benchmark with vLLM ( #5315 )
...
* add
* xi
* del
* del
* fix
2024-01-29 10:22:33 +08:00
Jianghai
7ddd8b37f0
fix ( #5311 )
2024-01-26 15:02:12 +08:00
yuehuayingxueluo
4f28cb43c0
[inference]Optimize the usage of the mid tensors space in flash attn ( #5304 )
...
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
2024-01-26 14:00:10 +08:00
Frank Lee
7cfed5f076
[feat] refactored extension module ( #5298 )
...
* [feat] refactored extension module
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
2024-01-25 17:01:48 +08:00
digger yu
bce9499ed3
fix some typo ( #5307 )
2024-01-25 13:56:27 +08:00
Yuanheng Zhao
af8359c430
[hotfix] fix boundary check in batch ( #5306 )
2024-01-25 10:23:12 +08:00
Jianghai
c647e00e3c
[Inference]Add fused rotary kernel and get cos cache kernel ( #5302 )
...
* add fused rotary and get cos cache func
* staged
* fix bugs
* fix bugs
2024-01-24 16:20:42 +08:00
Yuanheng Zhao
3da9993b0d
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark ( #5301 )
...
* fix decoding kernel pytest
* revise and add triton context attn benchmark
2024-01-23 17:16:02 +08:00
yuehuayingxueluo
cea9c86e45
add utils.py
2024-01-22 16:06:27 +08:00
yuehuayingxueluo
bfff9254ac
[inference] Adapted to Rotary Embedding and RMS Norm ( #5283 )
...
* adapted to rotary_embedding
* adapted to nopad rms norm
* fix bugs in benchmark
* fix flash_decoding.py
2024-01-22 10:55:34 +08:00
Yuanheng Zhao
6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking ( #5274 )
...
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
2024-01-19 15:47:16 +08:00
Jianghai
9e2342bde2
[Hotfix] Fix bugs in testing continuous batching ( #5270 )
...
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
2024-01-18 16:31:14 +08:00
ver217
148469348a
Merge branch 'main' into sync/npu
2024-01-18 12:05:21 +08:00
Yaozheng Fang
5ae9099f92
[kernel] Add RMSLayerNorm triton kernel ( #5262 )
...
* add layerrmsnorm triton kernel
* add layerrmsnorm kernel
* modify the atol and rtol in test file
* Remove the logic of mean computations, and update the names of the kernel functions and files
* add benchmark of rms norm
2024-01-18 10:21:03 +08:00
yuehuayingxueluo
86b63f720c
[Inference]Adapted to the triton attn kernels ( #5264 )
...
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print
2024-01-17 16:03:10 +08:00
flybird11111
46e091651b
[shardformer] hybridparallelplugin support gradients accumulation. ( #5246 )
...
* support gradients acc
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
* fix
fix
* fix
fix
fix
2024-01-17 15:22:33 +08:00
Yuanheng Zhao
0f2b46a41c
[kernel] Revise KVCache copy triton kernel API ( #5273 )
...
* [kernel/fix] revise kvcache copy kernel api
* fix benchmark
2024-01-16 14:41:02 +08:00
Jianghai
d8db500efc
[Inference] Fix request handler and add recycle logic ( #5260 )
...
* fix request handler
* fix comment
2024-01-15 17:50:46 +08:00
Frank Lee
c597678da4
[doc] updated inference readme ( #5269 )
2024-01-15 17:37:41 +08:00
Yuanheng Zhao
fa85e02b3b
[kernel] Add KV cache copy kernel during decoding ( #5261 )
...
* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy
2024-01-15 17:37:20 +08:00
Wenhao Chen
ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg ( #5268 )
...
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check
2024-01-15 15:57:40 +08:00
FrankLeeeee
1ded7e81ef
[git] fixed rebased files
2024-01-11 13:50:45 +00:00
Yuanheng Zhao
1513f20f4d
[kernel] Add flash decoding triton kernel for blocked kv cache ( #5249 )
...
* add flash decoding unpad triton kernel
* rename flash decoding kernel
* add kernel testing (draft)
* revise pytest
* support kv group (GQA)
* (trivial) fix api and pytest
* (trivial) func renaming
* (trivial) func/file renaming
* refactor pytest for attention
* (trivial) format and consistent vars of context/decode attn
* (trivial) remove test redundancy
2024-01-11 13:46:14 +00:00
Jianghai
fded91d049
[Inference] Kernel: no pad rotary embedding ( #5252 )
...
* fix bugs
* comment
* use more accurate atol
* fix
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
d40eb26029
fix bugs in request_handler.py and engine.py
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
10e3c9f923
rm torch.cuda.synchronize
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
fab294c7f4
fix CI bugs
2024-01-11 13:46:14 +00:00