Commit Graph

148 Commits (58ad76d4665032bbe548d066116d1c572ce98979)

Author  SHA1  Message  Date

Runyu Lu  6e30248683  [fix] tmp for test  (9 months ago)
Runyu Lu  d02e257abd  Merge branch 'feature/colossal-infer' into colossal-infer-cuda-graph  (9 months ago)
Runyu Lu  ae24b4f025  diverse tests  (9 months ago)
Runyu Lu  1821a6dab0  [fix] pytest and fix dyn grid bug  (9 months ago)
yuehuayingxueluo  f366a5ea1f  [Inference/kernel] Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418)  (9 months ago)
Runyu Lu  633e95b301  [doc] add doc  (9 months ago)
Runyu Lu  9dec66fad6  [fix] multi graphs capture error  (9 months ago)
Runyu Lu  b2c0d9ff2b  [fix] multi graphs capture error  (9 months ago)
Steve Luo  f7aecc0c6b  feat rmsnorm cuda kernel and add unittest, benchmark script (#5417)  (9 months ago)
Runyu Lu  cefaeb5fdd  [feat] cuda graph support and refactor non-functional api  (9 months ago)
yuehuayingxueluo  600881a8ea  [Inference] Add CUDA KVCache Kernel (#5406)  (9 months ago)
yuehuayingxueluo  bc1da87366  [Fix/Inference] Fix format of input prompts and input model in inference engine (#5395)  (9 months ago)
yuehuayingxueluo  2a718c8be8  Optimized the execution interval between CUDA kernels caused by view and memcopy (#5390)  (9 months ago)
Jianghai  730103819d  [Inference] Fused kv copy into rotary calculation (#5383)  (9 months ago)
Yuanheng Zhao  b21aac5bae  [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)  (9 months ago)
yuehuayingxueluo  8c69debdc7  [Inference] Support vllm testing in benchmark scripts (#5379)  (10 months ago)
Frank Lee  9afa52061f  [inference] refactored config (#5376)  (10 months ago)
Jianghai  1f8c7e7046  [Inference] User Experience: update the logic of default tokenizer and generation config. (#5337)  (10 months ago)
yuehuayingxueluo  6fb4bcbb24  [Inference/opt] Fused KVCache Memcopy (#5374)  (10 months ago)
Frank Lee  58740b5f68  [inference] added inference template (#5375)  (10 months ago)
Frank Lee  8106ede07f  Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)  (10 months ago)
Jianghai  9f4ab2eb92  [Inference] Adapt to Fused rotary (#5348)  (10 months ago)
yuehuayingxueluo  35382a7fbf  [Inference] Fused the gate and up proj in mlp, and optimized the autograd process. (#5365)  (10 months ago)
Yuanheng Zhao  1dedb57747  [Fix/Infer] Remove unused deps and revise requirements (#5341)  (10 months ago)
yuehuayingxueluo  631862f339  [Inference] Optimize generation process of inference engine (#5356)  (10 months ago)
yuehuayingxueluo  21ad4a27f9  [Inference/opt] Optimize the mid tensor of RMS Norm (#5350)  (10 months ago)
Frank Lee  027aa1043f  [doc] updated inference readme (#5343)  (10 months ago)
Frank Lee  db1a763307  [inference] removed redundant init_batch (#5353)  (10 months ago)
yuehuayingxueluo  249644c23b  [Inference] Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340)  (10 months ago)
Frank Lee  f8e456d202  [inference] simplified config verification (#5346)  (10 months ago)
Yuanheng Zhao  5f98a9d68a  [Infer] Optimize Blocked KVCache And Kernels Using It (#5325)  (10 months ago)
yuehuayingxueluo  e8f0642f28  [Inference] Add Nopadding Llama Modeling (#5327)  (10 months ago)
Jianghai  c7c104cb7c  [DOC] Update inference readme (#5280)  (10 months ago)
yuehuayingxueluo  4f28cb43c0  [inference] Optimize the usage of the mid tensors space in flash attn (#5304)  (10 months ago)
Yuanheng Zhao  3da9993b0d  [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)  (10 months ago)
yuehuayingxueluo  cea9c86e45  add utils.py  (10 months ago)
yuehuayingxueluo  bfff9254ac  [inference] Adapted to Rotary Embedding and RMS Norm (#5283)  (10 months ago)
Yuanheng Zhao  6e487e7d3c  [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)  (10 months ago)
Jianghai  9e2342bde2  [Hotfix] Fix bugs in testing continuous batching (#5270)  (11 months ago)
yuehuayingxueluo  86b63f720c  [Inference] Adapted to the triton attn kernels (#5264)  (11 months ago)
Jianghai  d8db500efc  [Inference] Fix request handler and add recycle logic (#5260)  (11 months ago)
Frank Lee  c597678da4  [doc] updated inference readme (#5269)  (11 months ago)
Yuanheng Zhao  fa85e02b3b  [kernel] Add KV cache copy kernel during decoding (#5261)  (11 months ago)
FrankLeeeee  1ded7e81ef  [git] fixed rebased files  (11 months ago)
yuehuayingxueluo  d40eb26029  fix bugs in request_handler.py and engine.py  (11 months ago)
yuehuayingxueluo  10e3c9f923  rm torch.cuda.synchronize  (11 months ago)
yuehuayingxueluo  fab294c7f4  fix CI bugs  (11 months ago)
yuehuayingxueluo  2a73e828eb  fix bugs related to processing padding mask  (11 months ago)
Jianghai  e545a871b8  [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)  (11 months ago)
yuehuayingxueluo  fa4fbdbffb  adapted to pad_context_forward  (11 months ago)
yuehuayingxueluo  47e53eaa1c  fix bugs in attention.py and request_handler.py  (11 months ago)
Jianghai  bfd9b1b494  [Inference] Pytorch Attention func, pad & nopad input support (#5219)  (11 months ago)
yuehuayingxueluo  3ad1f3b78b  fix beam_width  (11 months ago)
yuehuayingxueluo  b2eb9cd186  Fixed a typo  (11 months ago)
yuehuayingxueluo  bbfebfb9fc  fix bugs in sampler  (11 months ago)
yuehuayingxueluo  02c1bf8b2a  add context_attention_unpadded  (11 months ago)
yuehuayingxueluo  9489dc64d8  precision alignment  (11 months ago)
yuehuayingxueluo  62968588d1  fix bugs in request_handler  (11 months ago)
yuehuayingxueluo  62fd08ee44  Fixed a bug in the inference framework  (11 months ago)
yuehuayingxueluo  86853a37d5  Add padding llama model  (11 months ago)
Jianghai  0e616462a7  [Inference] add logit processor and request handler (#5166)  (11 months ago)
yuehuayingxueluo  8daee26989  [Inference] Add the logic of the inference engine (#5173)  (11 months ago)
Jianghai  93aeacca34  [Inference] Update inference config and fix test (#5178)  (11 months ago)
Yuanheng Zhao  3de2e62299  [Inference] Add CacheBlock and KV-Cache Manager (#5156)  (11 months ago)
yuehuayingxueluo  fab9b931d9  [Inference] Add BatchInferState, Sequence and InferConfig (#5149)  (11 months ago)
Jianghai  56e75eeb06  [Inference] Add readme (roadmap) and fulfill request handler (#5147)  (11 months ago)
Jianghai  4cf4682e70  [Inference] First PR for rebuilding colossal-infer (#5143)  (11 months ago)
Zhongkai Zhao  75af66cd81  [Hotfix] Fix model policy matching strategy in ShardFormer (#5064)  (1 year ago)
Hongxin Liu  1cd7efc520  [inference] refactor examples and fix schedule (#5077)  (1 year ago)
Xu Kai  fb103cfd6e  [inference] update examples and engine (#5073)  (1 year ago)
Bin Jia  0c7d8bebd5  [hotfix/hybridengine] fix bug when tp*pp size = 1 (#5069)  (1 year ago)
Cuiqing Li (李崔卿)  bce919708f  [Kernels] added flash-decoding of triton (#5063)  (1 year ago)
Xu Kai  fd6482ad8c  [inference] Refactor inference architecture (#5057)  (1 year ago)
Cuiqing Li (李崔卿)  28052a71fb  [Kernels] Update triton kernels to 2.1.0 (#5046)  (1 year ago)
Zhongkai Zhao  70885d707d  [hotfix] Support extra_kwargs in ShardConfig (#5031)  (1 year ago)
Xuanlei Zhao  f71e63b0f3  [moe] support optimizer checkpoint (#5015)  (1 year ago)
Jianghai  ef4c14a5e2  [Inference] Fix bug in ChatGLM2 Tensor Parallelism (#5014)  (1 year ago)
github-actions[bot]  c36e782d80  [format] applied code formatting on changed files in pull request 4926 (#5007)  (1 year ago)
littsk  1a3315e336  [hotfix] Add layer norm gradients all-reduce for sequence parallel (#4926)  (1 year ago)
Bin Jia  b6696beb04  [Pipeline Inference] Merge pp with tp (#4993)  (1 year ago)
Cuiqing Li (李崔卿)  4f0234f236  [doc] Update doc for colossal-inference (#4989)  (1 year ago)
Cuiqing Li  459a88c806  [Kernels] Updated Triton kernels to 2.1.0 and added flash-decoding for llama token attention (#4965)  (1 year ago)
Jianghai  cf579ff46d  [Inference] Dynamic Batching Inference, online and offline (#4953)  (1 year ago)
Bin Jia  1db6727678  [Pipeline inference] Combine kvcache with pipeline inference (#4938)  (1 year ago)
Xu Kai  785802e809  [inference] add reference and fix some bugs (#4937)  (1 year ago)
Cuiqing Li  3a41e8304e  [Refactor] Integrated some lightllm kernels into token-attention (#4946)  (1 year ago)
digger yu  11009103be  [nfc] fix some typos in colossalai/, docs/, etc. (#4920)  (1 year ago)
github-actions[bot]  486d06a2d5  [format] applied code formatting on changed files in pull request 4820 (#4886)  (1 year ago)
Xu Kai  611a5a80ca  [inference] Add smoothquant for llama (#4904)  (1 year ago)
Xu Kai  77a9328304  [inference] add llama2 support (#4898)  (1 year ago)
Bin Jia  08a9f76b2f  [Pipeline Inference] Sync pipeline inference branch to main (#4820)  (1 year ago)
Michelle  07ed155e86  [NFC] polish colossalai/inference/quant/gptq/cai_gptq/__init__.py code style (#4792)  (1 year ago)
Jianghai  013a4bedf0  [inference] fix import bug and delete useless init (#4830)  (1 year ago)
Jianghai  ce7ade3882  [inference] chatglm2 infer demo (#4724)  (1 year ago)
Xu Kai  946ab56c48  [feature] add gptq for inference (#4754)  (1 year ago)
Hongxin Liu  079bf3cb26  [misc] update pre-commit and run all files (#4752)  (1 year ago)
Yuanheng Zhao  e2c0e7f92a  [hotfix] Fix import error: colossal.kernel without triton installed (#4722)  (1 year ago)
Cuiqing Li  bce0f16702  [Feature] The first PR to add TP inference engine, kv-cache manager and related kernels for our inference system (#4577)  (1 year ago)