146 Commits (bd38fe6b912379080673a43d77fd3bdf0e5c852e)

Author SHA1 Message Date
Yuanheng Zhao d8b1ea4ac9
[doc] Update Inference Readme (#5736) 6 months ago
Yuanheng Zhao bdf9a001d6
[Fix/Inference] Add unsupported auto-policy error message (#5730) 6 months ago
Yuanheng Zhao 283c407a19
[Inference] Fix Inference Generation Config and Sampling (#5710) 6 months ago
Yuanheng Zhao 8bcfe360fd
[example] Update Inference Example (#5725) 6 months ago
Jianghai f47f2fbb24
[Inference] Fix API server, test and example (#5712) 6 months ago
Runyu Lu 74c47921fa
[Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717) 6 months ago
Steve Luo 7806842f2d
add paged-attetionv2: support seq length split across thread block (#5707) 6 months ago
Runyu Lu 18d67d0e8e
[Feat]Inference RPC Server Support (#5705) 6 months ago
yuehuayingxueluo de4bf3dedf
[Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708) 7 months ago
傅剑寒 bfad39357b
[Inference/Feat] Add quant kvcache interface (#5700) 7 months ago
CjhHa1 bc9063adf1 resolve rebase conflicts on Branch feat/online-serving 7 months ago
Jianghai 61a1b2e798 [Inference] Fix bugs and docs for feat/online-server (#5598) 7 months ago
CjhHa1 7bbb28e48b [Inference] resolve rebase conflicts 7 months ago
Jianghai c064032865 [Online Server] Chat Api for streaming and not streaming response (#5470) 7 months ago
Jianghai de378cd2ab [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432) 7 months ago
Jianghai 69cd7e069d [Inference] ADD async and sync Api server using FastAPI (#5396) 7 months ago
yuehuayingxueluo d482922035
[Inference] Support the logic related to ignoring EOS token (#5693) 7 months ago
yuehuayingxueluo 9c2fe7935f
[Inference]Adapt temperature processing logic (#5689) 7 months ago
Yuanheng Zhao 55cc7f3df7
[Fix] Fix Inference Example, Tests, and Requirements (#5688) 7 months ago
Yuanheng Zhao f9afe0addd
[hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695) 7 months ago
Yuanheng Zhao 8754abae24 [Fix] Fix & Update Inference Tests (compatibility w/ main) 7 months ago
yuehuayingxueluo f79963199c
[inference]Add alibi to flash attn function (#5678) 7 months ago
Steve Luo 5cd75ce4c7
[Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663) 7 months ago
yuehuayingxueluo 5f00002e43
[Inference] Adapt Baichuan2-13B TP (#5659) 7 months ago
Hongxin Liu 7f8b16635b
[misc] refactor launch API and tensor constructor (#5666) 7 months ago
linsj20 91fa553775 [Feature] qlora support (#5586) 7 months ago
yuehuayingxueluo 3c91e3f176
[Inference]Adapt to baichuan2 13B (#5614) 7 months ago
Steve Luo a8fd3b0342
[Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643) 7 months ago
Yuanheng Zhao 04863a9b14
[example] Update Llama Inference example (#5629) 7 months ago
Yuanheng Zhao 5d4c1fe8f5
[Fix/Inference] Fix GQA Triton and Support Llama3 (#5624) 7 months ago
Runyu Lu e37ee2fb65
[Feat]Tensor Model Parallel Support For Inference (#5563) 7 months ago
Steve Luo be396ad6cc
[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531) 7 months ago
yuehuayingxueluo 56b222eff8
[inference/model]Adapted to the baichuan2-7B model (#5591) 7 months ago
Yuanheng f8598e3ec5 [Fix] Llama Modeling Control with Spec-Dec (#5580) 8 months ago
Yuanheng Zhao e60d430cf5 [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557) 8 months ago
Yuanheng Zhao e1acb58423 [doc] Add inference/speculative-decoding README (#5552) 8 months ago
Yuanheng Zhao d85d91435a [Inference/SpecDec] Support GLIDE Drafter Model (#5455) 8 months ago
Yuanheng Zhao 912e24b2aa [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449) 8 months ago
Yuanheng Zhao a37f82629d [Inference/SpecDec] Add Speculative Decoding Implementation (#5423) 8 months ago
Yuanheng Zhao 5a9b05f7b2 [Inference/SpecDec] Add Basic Drafter Model Container (#5405) 8 months ago
Hongxin Liu 641b1ee71a
[devops] remove post commit ci (#5566) 8 months ago
Yuanheng Zhao 4bb5d8923a
[Fix/Inference] Remove unused and non-functional functions (#5543) 8 months ago
yuehuayingxueluo 04aca9e55b
[Inference/Kernel]Add get_cos_and_sin Kernel (#5528) 8 months ago
Wenhao Chen e614aa34f3
[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508) 8 months ago
傅剑寒 e6496dd371
[Inference] Optimize request handler of llama (#5512) 8 months ago
Runyu Lu 6251d68dc9
[fix] PR #5354 (#5501) 8 months ago
yuehuayingxueluo 87079cffe8
[Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461) 8 months ago
Runyu Lu ff4998c6f3 [fix] remove unused comment 8 months ago
Runyu Lu 5b017d6324 [fix] 8 months ago
Runyu Lu 4eafe0c814 [fix] unused option 8 months ago