yuehuayingxueluo
b7853196a0
Merge pull request #5297 from yuehuayingxueluo/fix_rotary_embedding
...
[Inference/fix]Add utils.py for Rotary Embedding
10 months ago
yuehuayingxueluo
cea9c86e45
add utils.py
10 months ago
Hongxin Liu
d7f8db8e21
[hotfix] fix 3d plugin test ( #5292 )
10 months ago
yuehuayingxueluo
bfff9254ac
[inference] Adapted to Rotary Embedding and RMS Norm ( #5283 )
...
* adapted to rotary_embedding
* adapted to nopad rms norm
* fix bugs in benchmark
* fix flash_decoding.py
10 months ago
flybird11111
f7e3f82a7e
fix llama pretrain ( #5287 )
10 months ago
Desperado-Jia
6a56967855
[doc] add llama2-13B disyplay ( #5285 )
...
* Update README.md
* fix 13b typo
---------
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
10 months ago
Yuanheng Zhao
6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking ( #5274 )
...
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
10 months ago
Jianghai
9e2342bde2
[Hotfix] Fix bugs in testing continuous batching ( #5270 )
...
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
10 months ago
Michelle
32cb74493a
fix auto loading gpt2 tokenizer ( #5279 )
10 months ago
Frank Lee
d66e6988bc
Merge pull request #5278 from ver217/sync/npu
...
[sync] sync npu branch with main
10 months ago
ver217
148469348a
Merge branch 'main' into sync/npu
10 months ago
Yaozheng Fang
5ae9099f92
[kernel] Add RMSLayerNorm triton kernel ( #5262 )
...
* add layerrmsnorm triton kernel
* add layerrmsnorm kernel
* modify the atol and rtol in test file
* Remove the logics of mean computations, and update the name of ther kernel functions and files
* add benchmark of rms norm
10 months ago
Zhongkai Zhao
5d9a0ae75b
[hotfix] Fix ShardFormer test execution path when using sequence parallelism ( #5230 )
10 months ago
yuehuayingxueluo
86b63f720c
[Inference]Adapted to the triton attn kernels ( #5264 )
...
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print
10 months ago
flybird11111
46e091651b
[shardformer] hybridparallelplugin support gradients accumulation. ( #5246 )
...
* support gradients acc
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
* fix
fix
* fix
fix
fix
10 months ago
flybird11111
2a0558d8ec
[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py ( #5276 )
...
* fix ci
fix
* fix test
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
* fix
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
10 months ago
Frank Lee
d69cd2eb89
[workflow] fixed oom tests ( #5275 )
...
* [workflow] fixed oom tests
* polish
* polish
* polish
10 months ago
Yuanheng Zhao
0f2b46a41c
[kernel] Revise KVCache copy triton kernel API ( #5273 )
...
* [kernel/fix] revise kvcache copy kernel api
* fix benchmark
10 months ago
Frank Lee
04244aaaf1
[workflow] fixed incomplete bash command ( #5272 )
10 months ago
Jianghai
d8db500efc
[Inference] Fix request handler and add recycle logic ( #5260 )
...
* fix request handler
* fix comment
10 months ago
Frank Lee
c597678da4
[doc] updated inference readme ( #5269 )
10 months ago
Yuanheng Zhao
fa85e02b3b
[kernel] Add KV cache copy kernel during decoding ( #5261 )
...
* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy
10 months ago
Wenhao Chen
ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg ( #5268 )
...
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check
10 months ago
FrankLeeeee
1ded7e81ef
[git] fixed rebased files
11 months ago
Yuanheng Zhao
1513f20f4d
[kernel] Add flash decoding triton kernel for blocked kv cache ( #5249 )
...
* add flash decoding unpad triton kernel
* rename flash decoding kernel
* add kernel testing (draft)
* revise pytest
* support kv group (GQA)
* (trivial) fix api and pytest
* (trivial) func renaming
* (trivial) func/file renaming
* refactor pytest for attention
* (trivial) format and consistent vars of context/decode attn
* (trivial) remove test redundancy
11 months ago
Jianghai
fded91d049
[Inference] Kernel: no pad rotary embedding ( #5252 )
...
* fix bugs
* comment
* use more accurate atol
* fix
11 months ago
yuehuayingxueluo
d40eb26029
fix bugs in request_handler.py and engine.py
11 months ago
yuehuayingxueluo
10e3c9f923
rm torch.cuda.synchronize
11 months ago
yuehuayingxueluo
fab294c7f4
fix CI bugs
11 months ago
yuehuayingxueluo
2a73e828eb
fix bugs related to processing padding mask
11 months ago
Jianghai
e545a871b8
[Hotfix] Fix accuracy and align attention method api with Triton kernel ( #5229 )
...
* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs
11 months ago
yuehuayingxueluo
fa4fbdbffb
adapted to pad_context_forward
11 months ago
yuehuayingxueluo
47e53eaa1c
fix bugs in attention.py and request_handler.py
11 months ago
Jianghai
bfd9b1b494
[Inference] Pytorch Attention func, pad&nopad input support ( #5219 )
...
* add attn
* add attention test
* fix attn forward
* fix decoding
11 months ago
yuehuayingxueluo
3ad1f3b78b
fix beam_width
11 months ago
yuehuayingxueluo
b2eb9cd186
Fixed a typo
11 months ago
yuehuayingxueluo
bbfebfb9fc
fix bugs in sampler
11 months ago
yuehuayingxueluo
02c1bf8b2a
add context_attention_unpadded
11 months ago
Yuanheng Zhao
07b5283b6a
[kernel] Add triton kernel for context attention (FAv2) without padding ( #5192 )
...
* add context attn unpadded triton kernel
* test compatibility
* kv cache copy (testing)
* fix k/v cache copy
* fix kv cache copy and test
* fix boundary of block ptrs
* add support for GQA/MQA and testing
* fix import statement
---------
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
11 months ago
yuehuayingxueluo
4df8876fca
Fixed a writing error
11 months ago
yuehuayingxueluo
9489dc64d8
precision alignment
11 months ago
yuehuayingxueluo
62968588d1
fix bugs in request_handler
11 months ago
yuehuayingxueluo
62fd08ee44
Fixed a bug in the inference frame
11 months ago
yuehuayingxueluo
86853a37d5
Add padding llama model
11 months ago
Jianghai
0e616462a7
[Inference] add logit processor and request handler ( #5166 )
...
* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
11 months ago
yuehuayingxueluo
8daee26989
[Inference] Add the logic of the inference engine ( #5173 )
...
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related o tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt
11 months ago
Jianghai
93aeacca34
[Inference]Update inference config and fix test ( #5178 )
...
* unify the config setting
* fix test
* fix import
* fix test
* fix
* fix
* add logger
* revise log info
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
11 months ago
Yuanheng Zhao
3de2e62299
[Inference] Add CacheBlock and KV-Cache Manager ( #5156 )
...
* [Inference] Add KVCache Manager
* function refactored
* add test for KVCache Manager
* add attr beam width
* Revise alloc func in CacheManager
* Fix docs and pytests
* add tp slicing for head number
* optimize shapes of tensors used as physical cache
* Apply using InferenceConfig on KVCacheManager
* rm duplicate config file
* Optimize cache allocation: use contiguous cache
* Fix config in pytest (and config)
11 months ago
yuehuayingxueluo
fab9b931d9
[Inference]Add BatchInferState, Sequence and InferConfig ( #5149 )
...
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
11 months ago
Yuanheng Zhao
2bb92243d4
[Inference/NFC] Clean outdated inference tests and deprecated kernels ( #5159 )
...
* [inference/nfc] remove outdated inference tests
* remove outdated kernel tests
* remove deprecated triton kernels
* remove imports from deprecated kernels
11 months ago