1201 Commits (main)

Author SHA1 Message Date
yuehuayingxueluo b45000f839
[Inference]Add Streaming LLM (#5745) 6 months ago
Haze188 e22b82755d
[CI/tests] simplify some test case to reduce testing time (#5755) 6 months ago
duanjunwen 1b76564e16
[test] Fix/fix testcase (#5770) 6 months ago
Hongxin Liu 68359ed1e1
[release] update version (#5752) 6 months ago
Yuanheng Zhao b96c6390f4
[inference] Fix running time of test_continuous_batching (#5750) 6 months ago
Edenzzzz 5f8c0a0ac3
[Feature] auto-cast optimizers to distributed version (#5746) 6 months ago
hxwang ca674549e0
[chore] remove unnecessary test & changes 6 months ago
botbw 2fc85abf43
[gemini] async grad chunk reduce (all-reduce&reduce-scatter) (#5713) 6 months ago
botbw 13c06d36a3
[bug] fix early return (#5740) 6 months ago
genghaozhe a280517dd9
remove unrelated file 6 months ago
genghaozhe 1ec92d29af
remove perf log, unrelated file and so on 6 months ago
genghaozhe 5c6c5d6be3
remove comments 6 months ago
genghaozhe df63db7e63
remote comments 6 months ago
genghaozhe 5470e5f94e
a commit for fake push test 6 months ago
Edenzzzz 43995ee436
[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694) 6 months ago
Steve Luo 7806842f2d
add paged-attetionv2: support seq length split across thread block (#5707) 6 months ago
Runyu Lu 18d67d0e8e
[Feat]Inference RPC Server Support (#5705) 6 months ago
傅剑寒 50104ab340
[Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706) 7 months ago
Wang Binluo a3cc68ca93
[Shardformer] Support the Qwen2 model (#5699) 7 months ago
flybird11111 d4c5ef441e
[gemini]remove registered gradients hooks (#5696) 7 months ago
CjhHa1 bc9063adf1
resolve rebase conflicts on Branch feat/online-serving 7 months ago
Jianghai 61a1b2e798
[Inference] Fix bugs and docs for feat/online-server (#5598) 7 months ago
Jianghai c064032865
[Online Server] Chat Api for streaming and not streaming response (#5470) 7 months ago
Jianghai de378cd2ab
[Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432) 7 months ago
Jianghai 69cd7e069d
[Inference] ADD async and sync Api server using FastAPI (#5396) 7 months ago
yuehuayingxueluo 9c2fe7935f
[Inference]Adapt temperature processing logic (#5689) 7 months ago
Yuanheng Zhao 55cc7f3df7
[Fix] Fix Inference Example, Tests, and Requirements (#5688) 7 months ago
flybird11111 77ec773388
[zero]remove registered gradients hooks (#5687) 7 months ago
Yuanheng Zhao 8754abae24
[Fix] Fix & Update Inference Tests (compatibility w/ main) 7 months ago
Yuanheng Zhao 537a3cbc4d
[kernel] Support New KCache Layout - Triton Kernel (#5677) 7 months ago
Steve Luo 5cd75ce4c7
[Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663) 7 months ago
yuehuayingxueluo 5f00002e43
[Inference] Adapt Baichuan2-13B TP (#5659) 7 months ago
Hongxin Liu 7f8b16635b
[misc] refactor launch API and tensor constructor (#5666) 7 months ago
linsj20 91fa553775
[Feature] qlora support (#5586) 7 months ago
flybird11111 8954a0c2e2
[LowLevelZero] low level zero support lora (#5153) 7 months ago
Baizhou Zhang 14b0d4c7e5
[lora] add lora APIs for booster, support lora for TorchDDP (#4981) 7 months ago
Yuanheng Zhao 5be590b99e
[kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658) 7 months ago
Hongxin Liu 2082852f3f
[lazyinit] skip whisper test (#5653) 7 months ago
yuehuayingxueluo 3c91e3f176
[Inference]Adapt to baichuan2 13B (#5614) 7 months ago
Yuanheng Zhao f342a93871
[Fix] Remove obsolete files - inference (#5650) 7 months ago
Hongxin Liu 1b387ca9fe
[shardformer] refactor pipeline grad ckpt config (#5646) 7 months ago
Hongxin Liu bbb2c21f16
[shardformer] fix chatglm implementation (#5644) 7 months ago
Steve Luo a8fd3b0342
[Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643) 7 months ago
flybird11111 148506c828
[coloattention]modify coloattention (#5627) 7 months ago
Wang Binluo 0d0a582033
[shardformer] update transformers (#5583) 7 months ago
yuehuayingxueluo 12f10d5b0b
[Fix/Inference]Fix CUDA Rotary Rmbedding GQA (#5623) 7 months ago
Yuanheng Zhao 5d4c1fe8f5
[Fix/Inference] Fix GQA Triton and Support Llama3 (#5624) 7 months ago
Steve Luo ccf72797e3
feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611) 7 months ago
Runyu Lu e37ee2fb65
[Feat]Tensor Model Parallel Support For Inference (#5563) 7 months ago
Steve Luo be396ad6cc
[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531) 7 months ago