Commit Graph

1121 Commits (f9afe0addd89303de4819debd93efe97d5618238)

Author SHA1 Message Date
Hongxin Liu f2e8b9ef9f
[devops] fix compatibility (#5444)
9 months ago
Steve Luo f7aecc0c6b
feat rmsnorm cuda kernel and add unittest, benchmark script (#5417)
9 months ago
xs_courtesy 95c21498d4
add silu_and_mul for infer
9 months ago
flybird11111 29695cf70c
[example]add gpt2 benchmark example script. (#5295)
9 months ago
FrankLeeeee 0310b76e9d
Merge branch 'main' into sync/main
9 months ago
yuehuayingxueluo 0aa27f1961
[Inference]Move benchmark-related code to the example directory. (#5408)
9 months ago
yuehuayingxueluo 600881a8ea
[Inference]Add CUDA KVCache Kernel (#5406)
9 months ago
QinLuo bf34c6fef6
[fsdp] impl save/load shard model/optimizer (#5357)
9 months ago
Yuanheng Zhao 19061188c3
[Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
9 months ago
yuehuayingxueluo bc1da87366
[Fix/Inference] Fix format of input prompts and input model in inference engine (#5395)
9 months ago
yuehuayingxueluo 2a718c8be8
Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390)
9 months ago
Jianghai 730103819d
[Inference]Fused kv copy into rotary calculation (#5383)
9 months ago
Yuanheng Zhao b21aac5bae
[Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)
9 months ago
ver217 06db94fbc9
[moe] fix tests
10 months ago
Xuanlei Zhao 7d8e0338a4
[moe] init mixtral impl
10 months ago
Jianghai 1f8c7e7046
[Inference] User Experience: update the logic of default tokenizer and generation config. (#5337)
10 months ago
yuehuayingxueluo 6fb4bcbb24
[Inference/opt] Fused KVCahce Memcopy (#5374)
10 months ago
Frank Lee 58740b5f68
[inference] added inference template (#5375)
10 months ago
Frank Lee 8106ede07f
Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)
10 months ago
Jianghai 9f4ab2eb92
[Inference] Adapt to Fused rotary (#5348)
10 months ago
Hongxin Liu c53ddda88f
[lr-scheduler] fix load state dict and add test (#5369)
10 months ago
yuehuayingxueluo 631862f339
[Inference]Optimize generation process of inference engine (#5356)
10 months ago
Wenhao Chen 1c790c0877
[fix] remove unnecessary dp_size assert (#5351)
10 months ago
Frank Lee e76acbb076
[inference] moved ops tests to test_infer (#5354)
10 months ago
Frank Lee db1a763307
[inference] removed redundancy init_batch (#5353)
10 months ago
Hongxin Liu ffffc32dc7
[checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347)
10 months ago
yuehuayingxueluo 249644c23b
[Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add (#5340)
10 months ago
Frank Lee f8e456d202
[inference] simplified config verification (#5346)
10 months ago
Jianghai df0aa49585
[Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336)
10 months ago
FrankLeeeee c565519913
merge commit
10 months ago
Yuanheng Zhao 5f98a9d68a
[Infer] Optimize Blocked KVCache And Kernels Using It (#5325)
10 months ago
yuehuayingxueluo e8f0642f28
[Inference]Add Nopadding Llama Modeling (#5327)
10 months ago
Jianghai 1f8a75d470
[Inference] Update rms norm kernel, benchmark with vLLM (#5315)
10 months ago
Jianghai 7ddd8b37f0
fix (#5311)
10 months ago
yuehuayingxueluo 4f28cb43c0
[inference]Optimize the usage of the mid tensors space in flash attn (#5304)
10 months ago
Frank Lee 7cfed5f076
[feat] refactored extension module (#5298)
10 months ago
Jianghai c647e00e3c
[Inference]Add fused rotary kernel and get cos cache kernel (#5302)
10 months ago
Yuanheng Zhao 3da9993b0d
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)
10 months ago
Jianghai 8e606ecc7e
[Inference] Benchmarking rotary embedding and add a fetch function (#5277)
10 months ago
Hongxin Liu d7f8db8e21
[hotfix] fix 3d plugin test (#5292)
10 months ago
Yuanheng Zhao 6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)
10 months ago
Jianghai 9e2342bde2
[Hotfix] Fix bugs in testing continuous batching (#5270)
11 months ago
ver217 148469348a
Merge branch 'main' into sync/npu
11 months ago
Yaozheng Fang 5ae9099f92
[kernel] Add RMSLayerNorm triton kernel (#5262)
11 months ago
Zhongkai Zhao 5d9a0ae75b
[hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230)
11 months ago
flybird11111 46e091651b
[shardformer] hybridparallelplugin support gradients accumulation. (#5246)
11 months ago
flybird11111 2a0558d8ec
[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)
11 months ago
Frank Lee d69cd2eb89
[workflow] fixed oom tests (#5275)
11 months ago
Yuanheng Zhao 0f2b46a41c
[kernel] Revise KVCache copy triton kernel API (#5273)
11 months ago
Yuanheng Zhao fa85e02b3b
[kernel] Add KV cache copy kernel during decoding (#5261)
11 months ago
Wenhao Chen ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg (#5268)
11 months ago
FrankLeeeee 1ded7e81ef
[git] fixed rebased files
11 months ago
Yuanheng Zhao 1513f20f4d
[kernel] Add flash decoding triton kernel for blocked kv cache (#5249)
11 months ago
Jianghai fded91d049
[Inference] Kernel: no pad rotary embedding (#5252)
11 months ago
yuehuayingxueluo fab294c7f4
fix CI bugs
11 months ago
Jianghai e545a871b8
[Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)
11 months ago
yuehuayingxueluo fa4fbdbffb
adapted to pad_context_forward
11 months ago
yuehuayingxueluo 47e53eaa1c
fix bugs in attention.py and request_handler.py
11 months ago
Jianghai bfd9b1b494
[Inference] Pytorch Attention func, pad&nopad input support (#5219)
11 months ago
yuehuayingxueluo bbfebfb9fc
fix bugs in sampler
11 months ago
yuehuayingxueluo 02c1bf8b2a
add context_attention_unpadded
11 months ago
Yuanheng Zhao 07b5283b6a
[kernel] Add triton kernel for context attention (FAv2) without padding (#5192)
11 months ago
yuehuayingxueluo 4df8876fca
Fixed a writing error
11 months ago
yuehuayingxueluo 9489dc64d8
precision alignment
11 months ago
yuehuayingxueluo 62968588d1
fix bugs in request_handler
11 months ago
yuehuayingxueluo 62fd08ee44
Fixed a bug in the inference frame
11 months ago
Jianghai 0e616462a7
[Inference] add logit processor and request handler (#5166)
11 months ago
yuehuayingxueluo 8daee26989
[Inference] Add the logic of the inference engine (#5173)
11 months ago
Jianghai 93aeacca34
[Inference]Update inference config and fix test (#5178)
11 months ago
Yuanheng Zhao 3de2e62299
[Inference] Add CacheBlock and KV-Cache Manager (#5156)
11 months ago
yuehuayingxueluo fab9b931d9
[Inference]Add BatchInferState, Sequence and InferConfig (#5149)
11 months ago
Yuanheng Zhao 2bb92243d4
[Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159)
11 months ago
flybird11111 e830ef917d
[ci] fix shardformer tests. (#5255)
11 months ago
Frank Lee 2b83418719
[ci] fixed ddp test (#5254)
11 months ago
Frank Lee d5eeeb1416
[ci] fixed booster test (#5251)
11 months ago
Frank Lee edf94a35c3
[workflow] fixed build CI (#5240)
11 months ago
Hongxin Liu d202cc28c0
[npu] change device to accelerator api (#5239)
11 months ago
Elsa Granger d565df3821
[pipeline] A more general _communicate in p2p (#5062)
11 months ago
Xuanlei Zhao dd2c28a323
[npu] use extension for op builder (#5172)
11 months ago
Wenhao Chen d799a3088f
[pipeline]: add p2p fallback order and fix interleaved pp deadlock (#5214)
11 months ago
Wenhao Chen 4fa689fca1
[pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134)
11 months ago
flybird11111 79718fae04
[shardformer] llama support DistCrossEntropy (#5176)
12 months ago
flybird11111 21aa5de00b
[gemini] hotfix NaN loss while using Gemini + tensor_parallel (#5150)
12 months ago
flybird11111 2a2ec49aa7
[plugin]fix 3d checkpoint load when booster boost without optimizer. (#5135)
1 year ago
github-actions[bot] d10ee42f68
[format] applied code formatting on changed files in pull request 5088 (#5127)
1 year ago
Wenhao Chen 7172459e74
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088)
1 year ago
Zhongkai Zhao 75af66cd81
[Hotfix] Fix model policy matching strategy in ShardFormer (#5064)
1 year ago
Xu Kai fb103cfd6e
[inference] update examples and engine (#5073)
1 year ago
Bin Jia 0c7d8bebd5
[hotfix/hybridengine] fix bug when tp*pp size = 1 (#5069)
1 year ago
Hongxin Liu e5ce4c8ea6
[npu] add npu support for gemini and zero (#5067)
1 year ago
Xu Kai fd6482ad8c
[inference] Refactor inference architecture (#5057)
1 year ago
Wenhao Chen 3c08f17348
[hotfix]: modify create_ep_hierarchical_group and add test (#5032)
1 year ago
flybird11111 3e02154710
[gemini] gemini support extra-dp (#5043)
1 year ago
Cuiqing Li (李崔卿) 28052a71fb
[Kernels]Update triton kernels into 2.1.0 (#5046)
1 year ago
Zhongkai Zhao 70885d707d
[hotfix] Suport extra_kwargs in ShardConfig (#5031)
1 year ago
flybird11111 576a2f7b10
[gemini] gemini support tensor parallelism. (#4942)
1 year ago
Wenhao Chen 724441279b
[moe]: fix ep/tp tests, add hierarchical all2all (#4982)
1 year ago
Xuanlei Zhao f71e63b0f3
[moe] support optimizer checkpoint (#5015)
1 year ago
Jianghai ef4c14a5e2
[Inference] Fix bug in ChatGLM2 Tensor Parallelism (#5014)
1 year ago
github-actions[bot] c36e782d80
[format] applied code formatting on changed files in pull request 4926 (#5007)
1 year ago