Frank Lee
db1a763307
[inference] removed redundancy init_batch ( #5353 )
10 months ago
Hongxin Liu
ffffc32dc7
[checkpointio] fix gemini and hybrid parallel optim checkpoint ( #5347 )
...
* [checkpointio] fix hybrid parallel optim checkpoint
* [extension] fix cuda extension
* [checkpointio] fix gemini optimizer checkpoint
* polish code
10 months ago
yuehuayingxueluo
249644c23b
[Inference] Replace Attention layer and MLP layer with shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add ( #5340 )
...
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
10 months ago
Frank Lee
f8e456d202
[inference] simplified config verification ( #5346 )
...
* [inference] simplified config verification
* polish
* polish
10 months ago
Jianghai
df0aa49585
[Inference] Kernel Fusion, fused copy kv cache into rotary embedding ( #5336 )
...
* revise rotary embedding
* remove useless print
* adapt
10 months ago
Yuanheng Zhao
5f98a9d68a
[Infer] Optimize Blocked KVCache And Kernels Using It ( #5325 )
...
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval
10 months ago
yuehuayingxueluo
e8f0642f28
[Inference] Add Nopadding Llama Modeling ( #5327 )
...
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style
10 months ago
digger yu
71321a07cf
fix typo: change dosen't to doesn't ( #5308 )
10 months ago
flybird11111
388179f966
[tests] fix t5 test. ( #5322 )
...
* [ci] fix shardformer tests. (#5255 )
* fix ci
fix
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* fix t5 test
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
10 months ago
Jianghai
c7c104cb7c
[DOC] Update inference readme ( #5280 )
...
* add readme
* add readme
* 1
* update engine
* finish readme
* add readme
10 months ago
FrankLeeeee
087d0cb1fc
[accelerator] fixed npu api
10 months ago
Jianghai
1f8a75d470
[Inference] Update rms norm kernel, benchmark with vLLM ( #5315 )
...
* add
* xi
* del
* del
* fix
10 months ago
Jianghai
7ddd8b37f0
fix ( #5311 )
10 months ago
yuehuayingxueluo
4f28cb43c0
[inference] Optimize the use of intermediate tensor space in flash attn ( #5304 )
...
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
10 months ago
Frank Lee
7cfed5f076
[feat] refactored extension module ( #5298 )
...
* [feat] refactored extension module
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
10 months ago
digger yu
bce9499ed3
fix some typos ( #5307 )
10 months ago
Yuanheng Zhao
af8359c430
[hotfix] fix boundary check in batch ( #5306 )
10 months ago
Jianghai
c647e00e3c
[Inference] Add fused rotary kernel and get cos cache kernel ( #5302 )
...
* add fused rotary and get cos cache func
* staged
* fix bugs
* fix bugs
10 months ago
Yuanheng Zhao
3da9993b0d
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark ( #5301 )
...
* fix decoding kernel pytest
* revise and add triton context attn benchmark
10 months ago
yuehuayingxueluo
cea9c86e45
add utils.py
10 months ago
yuehuayingxueluo
bfff9254ac
[inference] Adapted to Rotary Embedding and RMS Norm ( #5283 )
...
* adapted to rotary_embedding
* adapted to nopad rms norm
* fix bugs in benchmark
* fix flash_decoding.py
10 months ago
Yuanheng Zhao
6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking ( #5274 )
...
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
10 months ago
Jianghai
9e2342bde2
[Hotfix] Fix bugs in testing continuous batching ( #5270 )
...
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
10 months ago
Yaozheng Fang
5ae9099f92
[kernel] Add RMSLayerNorm triton kernel ( #5262 )
...
* add layerrmsnorm triton kernel
* add layerrmsnorm kernel
* modify the atol and rtol in test file
* Remove the logic of mean computations, and update the names of the kernel functions and files
* add benchmark of rms norm
10 months ago
yuehuayingxueluo
86b63f720c
[Inference] Adapted to the triton attn kernels ( #5264 )
...
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print
10 months ago
flybird11111
46e091651b
[shardformer] hybridparallelplugin support gradients accumulation. ( #5246 )
...
* support gradients acc
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
* fix
fix
* fix
fix
fix
10 months ago
Yuanheng Zhao
0f2b46a41c
[kernel] Revise KVCache copy triton kernel API ( #5273 )
...
* [kernel/fix] revise kvcache copy kernel api
* fix benchmark
10 months ago
Jianghai
d8db500efc
[Inference] Fix request handler and add recycle logic ( #5260 )
...
* fix request handler
* fix comment
10 months ago
Frank Lee
c597678da4
[doc] updated inference readme ( #5269 )
10 months ago
Yuanheng Zhao
fa85e02b3b
[kernel] Add KV cache copy kernel during decoding ( #5261 )
...
* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy
10 months ago
Wenhao Chen
ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg ( #5268 )
...
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check
10 months ago
FrankLeeeee
1ded7e81ef
[git] fixed rebased files
11 months ago
Yuanheng Zhao
1513f20f4d
[kernel] Add flash decoding triton kernel for blocked kv cache ( #5249 )
...
* add flash decoding unpad triton kernel
* rename flash decoding kernel
* add kernel testing (draft)
* revise pytest
* support kv group (GQA)
* (trivial) fix api and pytest
* (trivial) func renaming
* (trivial) func/file renaming
* refactor pytest for attention
* (trivial) format and consistent vars of context/decode attn
* (trivial) remove test redundancy
11 months ago
Jianghai
fded91d049
[Inference] Kernel: no pad rotary embedding ( #5252 )
...
* fix bugs
* comment
* use more accurate atol
* fix
11 months ago
yuehuayingxueluo
d40eb26029
fix bugs in request_handler.py and engine.py
11 months ago
yuehuayingxueluo
10e3c9f923
rm torch.cuda.synchronize
11 months ago
yuehuayingxueluo
fab294c7f4
fix CI bugs
11 months ago
yuehuayingxueluo
2a73e828eb
fix bugs related to processing padding mask
11 months ago
Jianghai
e545a871b8
[Hotfix] Fix accuracy and align attention method api with Triton kernel ( #5229 )
...
* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs
11 months ago
yuehuayingxueluo
fa4fbdbffb
adapted to pad_context_forward
11 months ago
yuehuayingxueluo
47e53eaa1c
fix bugs in attention.py and request_handler.py
11 months ago
Jianghai
bfd9b1b494
[Inference] Pytorch Attention func, pad&nopad input support ( #5219 )
...
* add attn
* add attention test
* fix attn forward
* fix decoding
11 months ago
yuehuayingxueluo
3ad1f3b78b
fix beam_width
11 months ago
yuehuayingxueluo
b2eb9cd186
Fixed a typo
11 months ago
yuehuayingxueluo
bbfebfb9fc
fix bugs in sampler
11 months ago
yuehuayingxueluo
02c1bf8b2a
add context_attention_unpadded
11 months ago
Yuanheng Zhao
07b5283b6a
[kernel] Add triton kernel for context attention (FAv2) without padding ( #5192 )
...
* add context attn unpadded triton kernel
* test compatibility
* kv cache copy (testing)
* fix k/v cache copy
* fix kv cache copy and test
* fix boundary of block ptrs
* add support for GQA/MQA and testing
* fix import statement
---------
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
11 months ago
yuehuayingxueluo
9489dc64d8
precision alignment
11 months ago
yuehuayingxueluo
62968588d1
fix bugs in request_handler
11 months ago
yuehuayingxueluo
62fd08ee44
Fixed a bug in the inference frame
11 months ago