54 Commits (2a718c8be89918ec70b88f1f059148a7294dbccb)

Author SHA1 Message Date
yuehuayingxueluo 2a718c8be8 Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390) 9 months ago
Jianghai 730103819d [Inference]Fused kv copy into rotary calculation (#5383) 9 months ago
Yuanheng Zhao b21aac5bae [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367) 9 months ago
Jianghai 1f8c7e7046 [Inference] User Experience: update the logic of default tokenizer and generation config. (#5337) 10 months ago
yuehuayingxueluo 6fb4bcbb24 [Inference/opt] Fused KVCahce Memcopy (#5374) 10 months ago
Frank Lee 58740b5f68 [inference] added inference template (#5375) 10 months ago
Frank Lee 8106ede07f Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373) 10 months ago
Jianghai 9f4ab2eb92 [Inference] Adapt to Fused rotary (#5348) 10 months ago
yuehuayingxueluo 631862f339 [Inference]Optimize generation process of inference engine (#5356) 10 months ago
Frank Lee e76acbb076 [inference] moved ops tests to test_infer (#5354) 10 months ago
Frank Lee db1a763307 [inference] removed redundancy init_batch (#5353) 10 months ago
Frank Lee f8e456d202 [inference] simplified config verification (#5346) 10 months ago
Yuanheng Zhao 5f98a9d68a [Infer] Optimize Blocked KVCache And Kernels Using It (#5325) 10 months ago
yuehuayingxueluo 4f28cb43c0 [inference]Optimize the usage of the mid tensors space in flash attn (#5304) 10 months ago
Jianghai 9e2342bde2 [Hotfix] Fix bugs in testing continuous batching (#5270) 10 months ago
FrankLeeeee 1ded7e81ef [git] fixed rebased files 11 months ago
yuehuayingxueluo fab294c7f4 fix CI bugs 11 months ago
Jianghai e545a871b8 [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229) 11 months ago
yuehuayingxueluo fa4fbdbffb adapted to pad_context_forward 11 months ago
yuehuayingxueluo 47e53eaa1c fix bugs in attention.py and request_handler.py 11 months ago
Jianghai bfd9b1b494 [Inference] Pytorch Attention func, pad&nopad input support (#5219) 11 months ago
yuehuayingxueluo bbfebfb9fc fix bugs in sampler 11 months ago
yuehuayingxueluo 02c1bf8b2a add context_attention_unpadded 11 months ago
yuehuayingxueluo 4df8876fca Fixed a writing error 11 months ago
yuehuayingxueluo 9489dc64d8 precision alignment 11 months ago
yuehuayingxueluo 62968588d1 fix bugs in request_handler 11 months ago
yuehuayingxueluo 62fd08ee44 Fixed a bug in the inference frame 11 months ago
Jianghai 0e616462a7 [Inference] add logit processor and request handler (#5166) 11 months ago
yuehuayingxueluo 8daee26989 [Inference] Add the logic of the inference engine (#5173) 11 months ago
Jianghai 93aeacca34 [Inference]Update inference config and fix test (#5178) 11 months ago
Yuanheng Zhao 3de2e62299 [Inference] Add CacheBlock and KV-Cache Manager (#5156) 11 months ago
yuehuayingxueluo fab9b931d9 [Inference]Add BatchInferState, Sequence and InferConfig (#5149) 11 months ago
Yuanheng Zhao 2bb92243d4 [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159) 11 months ago
Zhongkai Zhao 75af66cd81 [Hotfix] Fix model policy matching strategy in ShardFormer (#5064) 1 year ago
Xu Kai fb103cfd6e [inference] update examples and engine (#5073) 1 year ago
Bin Jia 0c7d8bebd5 [hotfix/hybridengine] fix bug when tp*pp size = 1 (#5069) 1 year ago
Xu Kai fd6482ad8c [inference] Refactor inference architecture (#5057) 1 year ago
Zhongkai Zhao 70885d707d [hotfix] Suport extra_kwargs in ShardConfig (#5031) 1 year ago
Jianghai ef4c14a5e2 [Inference] Fix bug in ChatGLM2 Tensor Parallelism (#5014) 1 year ago
github-actions[bot] c36e782d80 [format] applied code formatting on changed files in pull request 4926 (#5007) 1 year ago
littsk 1a3315e336 [hotfix] Add layer norm gradients all-reduce for sequence parallel (#4926) 1 year ago
Bin Jia b6696beb04 [Pipeline Inference] Merge pp with tp (#4993) 1 year ago
Cuiqing Li 459a88c806 [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965) 1 year ago
Jianghai cf579ff46d [Inference] Dynamic Batching Inference, online and offline (#4953) 1 year ago
Bin Jia 1db6727678 [Pipeline inference] Combine kvcache with pipeline inference (#4938) 1 year ago
github-actions[bot] 486d06a2d5 [format] applied code formatting on changed files in pull request 4820 (#4886) 1 year ago
Xu Kai 77a9328304 [inference] add llama2 support (#4898) 1 year ago
Xu Kai fdec650bb4 fix test llama (#4884) 1 year ago
Bin Jia 08a9f76b2f [Pipeline Inference] Sync pipeline inference branch to main (#4820) 1 year ago
Xu Kai d1fcc0fa4d [infer] fix test bug (#4838) 1 year ago