388 Commits (4bb5d8923a6e85a0f89a483f15933698635a9f9c)

Author SHA1 Message Date
yuehuayingxueluo 934e31afb2
The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519) 8 months ago
yuehuayingxueluo 87079cffe8
[Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461) 8 months ago
yuehuayingxueluo f366a5ea1f
[Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418) 8 months ago
Steve Luo f7aecc0c6b
feat rmsnorm cuda kernel and add unittest, benchmark script (#5417) 9 months ago
yuehuayingxueluo 0aa27f1961
[Inference]Move benchmark-related code to the example directory. (#5408) 9 months ago
yuehuayingxueluo 600881a8ea
[Inference]Add CUDA KVCache Kernel (#5406) 9 months ago
Hongxin Liu d882d18c65
[example] reuse flash attn patch (#5400) 9 months ago
yuehuayingxueluo bc1da87366
[Fix/Inference] Fix format of input prompts and input model in inference engine (#5395) 9 months ago
yuehuayingxueluo 2a718c8be8
Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390) 9 months ago
Jianghai 730103819d
[Inference]Fused kv copy into rotary calculation (#5383) 9 months ago
yuehuayingxueluo 8c69debdc7
[Inference]Support vllm testing in benchmark scripts (#5379) 10 months ago
Frank Lee 8106ede07f
Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373) 10 months ago
Jianghai 9f4ab2eb92
[Inference] Adapt to Fused rotary (#5348) 10 months ago
yuehuayingxueluo 631862f339
[Inference]Optimize generation process of inference engine (#5356) 10 months ago
yuehuayingxueluo 21ad4a27f9
[Inference/opt]Optimize the mid tensor of RMS Norm (#5350) 10 months ago
yuehuayingxueluo 249644c23b
[Inference]Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340) 10 months ago
digger yu 71321a07cf
fix typo change dosen't to doesn't (#5308) 10 months ago
yuehuayingxueluo 4f28cb43c0
[inference]Optimize the usage of the mid tensors space in flash attn (#5304) 10 months ago
Frank Lee 7cfed5f076
[feat] refactored extension module (#5298) 10 months ago
digger yu bce9499ed3
fix some typo (#5307) 10 months ago
yuehuayingxueluo bfff9254ac
[inference] Adapted to Rotary Embedding and RMS Norm (#5283) 10 months ago
flybird11111 f7e3f82a7e
fix llama pretrain (#5287) 10 months ago
Jianghai 9e2342bde2
[Hotfix] Fix bugs in testing continuous batching (#5270) 10 months ago
yuehuayingxueluo 86b63f720c
[Inference]Adapted to the triton attn kernels (#5264) 10 months ago
Wenhao Chen ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg (#5268) 10 months ago
binmakeswell c174c4fc5f
[doc] fix doc typo (#5256) 11 months ago
Hongxin Liu d202cc28c0
[npu] change device to accelerator api (#5239) 11 months ago
Xuanlei Zhao dd2c28a323
[npu] use extension for op builder (#5172) 11 months ago
Wenhao Chen 3c0d82b19b
[pipeline]: support arbitrary batch size in forward_only mode (#5201) 11 months ago
Wenhao Chen 4fa689fca1
[pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134) 11 months ago
flybird11111 21aa5de00b
[gemini] hotfix NaN loss while using Gemini + tensor_parallel (#5150) 12 months ago
binmakeswell 177c79f2d1
[doc] add moe news (#5128) 1 year ago
Wenhao Chen 7172459e74
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088) 1 year ago
digger yu d5661f0f25
[nfc] fix typo change directoty to directory (#5111) 1 year ago
Xuanlei Zhao 3acbf6d496
[npu] add npu support for hybrid plugin and llama (#5090) 1 year ago
flybird11111 aae496631c
[shardformer]fix flash attention, when mask is causal, just don't unpad it (#5084) 1 year ago
Hongxin Liu 1cd7efc520
[inference] refactor examples and fix schedule (#5077) 1 year ago
Bin Jia 4e3959d316
[hotfix/hybridengine] Fix init model with random parameters in benchmark (#5074) 1 year ago
github-actions[bot] 8921a73c90
[format] applied code formatting on changed files in pull request 5067 (#5072) 1 year ago
Xu Kai fb103cfd6e
[inference] update examples and engine (#5073) 1 year ago
Hongxin Liu e5ce4c8ea6
[npu] add npu support for gemini and zero (#5067) 1 year ago
Cuiqing Li (李崔卿) bce919708f
[Kernels]added flash-decoding of triton (#5063) 1 year ago
Xu Kai fd6482ad8c
[inference] Refactor inference architecture (#5057) 1 year ago
flybird11111 bc09b95f50
[example] fix llama example's loss error when using gemini plugin (#5060) 1 year ago
Elsa Granger b2ad0d9e8f
[pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark (#5017) 1 year ago
Cuiqing Li (李崔卿) 28052a71fb
[Kernels]Update triton kernels into 2.1.0 (#5046) 1 year ago
Zhongkai Zhao 70885d707d
[hotfix] Support extra_kwargs in ShardConfig (#5031) 1 year ago
Wenhao Chen 724441279b
[moe]: fix ep/tp tests, add hierarchical all2all (#4982) 1 year ago
Xuanlei Zhao f71e63b0f3
[moe] support optimizer checkpoint (#5015) 1 year ago
Xuanlei Zhao dc003c304c
[moe] merge moe into main (#4978) 1 year ago