161 Commits (8241c0c054b38a109ed3ce7be1052a1e600b8471)

Author SHA1 Message Date
flybird11111 0c10afd372
[FP8] rebase main (#5963) 4 months ago
pre-commit-ci[bot] 7c2f79fa98
[pre-commit.ci] pre-commit autoupdate (#5572) 5 months ago
Runyu Lu 3c7cda0c9a
[Inference]Lazy Init Support (#5785) 5 months ago
Yuanheng Zhao 7b249c76e5
[Fix] Fix spec-dec Glide LlamaModel for compatibility with transformers (#5837) 5 months ago
flybird11111 2ddf624a86
[shardformer] upgrade transformers to 4.39.3 (#5815) 5 months ago
Li Xingjian 8554585a5f
[Inference] Fix flash-attn import and add model test (#5794) 5 months ago
Runyu Lu c0948aff97
[Inference]refactor baichuan (#5791) 5 months ago
char-1ee f5981e808e Remove flash attention backend 6 months ago
char-1ee 5f398fc000 Pass inference model shard configs for module init 6 months ago
char-1ee eec77e5702 Fix tests and naming 6 months ago
char-1ee 04386d9eff Refactor modeling by adding attention backend 6 months ago
yuehuayingxueluo b45000f839
[Inference]Add Streaming LLM (#5745) 6 months ago
Yuanheng Zhao 406443200f
[Hotfix] Add missing init file in inference.executor (#5774) 6 months ago
Jianghai 85946d4236
[Inference]Fix readme and example for API server (#5742) 6 months ago
binmakeswell 4647ec28c8
[inference] release (#5747) 6 months ago
Yuanheng Zhao d8b1ea4ac9
[doc] Update Inference Readme (#5736) 6 months ago
Yuanheng Zhao bdf9a001d6
[Fix/Inference] Add unsupported auto-policy error message (#5730) 6 months ago
Yuanheng Zhao 283c407a19
[Inference] Fix Inference Generation Config and Sampling (#5710) 6 months ago
Yuanheng Zhao 8bcfe360fd
[example] Update Inference Example (#5725) 6 months ago
Jianghai f47f2fbb24
[Inference] Fix API server, test and example (#5712) 6 months ago
Runyu Lu 74c47921fa
[Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717) 6 months ago
Steve Luo 7806842f2d
add paged-attetionv2: support seq length split across thread block (#5707) 6 months ago
Runyu Lu 18d67d0e8e
[Feat]Inference RPC Server Support (#5705) 6 months ago
yuehuayingxueluo de4bf3dedf
[Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708) 7 months ago
傅剑寒 bfad39357b
[Inference/Feat] Add quant kvcache interface (#5700) 7 months ago
CjhHa1 bc9063adf1 resolve rebase conflicts on Branch feat/online-serving 7 months ago
Jianghai 61a1b2e798 [Inference] Fix bugs and docs for feat/online-server (#5598) 7 months ago
CjhHa1 7bbb28e48b [Inference] resolve rebase conflicts 7 months ago
Jianghai c064032865 [Online Server] Chat Api for streaming and not streaming response (#5470) 7 months ago
Jianghai de378cd2ab [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432) 7 months ago
Jianghai 69cd7e069d [Inference] ADD async and sync Api server using FastAPI (#5396) 7 months ago
yuehuayingxueluo d482922035
[Inference] Support the logic related to ignoring EOS token (#5693) 7 months ago
yuehuayingxueluo 9c2fe7935f
[Inference]Adapt temperature processing logic (#5689) 7 months ago
Yuanheng Zhao 55cc7f3df7
[Fix] Fix Inference Example, Tests, and Requirements (#5688) 7 months ago
Yuanheng Zhao f9afe0addd
[hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695) 7 months ago
Yuanheng Zhao 8754abae24 [Fix] Fix & Update Inference Tests (compatibility w/ main) 7 months ago
yuehuayingxueluo f79963199c
[inference]Add alibi to flash attn function (#5678) 7 months ago
Steve Luo 5cd75ce4c7
[Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663) 7 months ago
yuehuayingxueluo 5f00002e43
[Inference] Adapt Baichuan2-13B TP (#5659) 7 months ago
Hongxin Liu 7f8b16635b
[misc] refactor launch API and tensor constructor (#5666) 7 months ago
linsj20 91fa553775 [Feature] qlora support (#5586) 7 months ago
yuehuayingxueluo 3c91e3f176
[Inference]Adapt to baichuan2 13B (#5614) 7 months ago
Steve Luo a8fd3b0342
[Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643) 7 months ago
Yuanheng Zhao 04863a9b14
[example] Update Llama Inference example (#5629) 7 months ago
Yuanheng Zhao 5d4c1fe8f5
[Fix/Inference] Fix GQA Triton and Support Llama3 (#5624) 7 months ago
Runyu Lu e37ee2fb65
[Feat]Tensor Model Parallel Support For Inference (#5563) 7 months ago
Steve Luo be396ad6cc
[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531) 7 months ago
yuehuayingxueluo 56b222eff8
[inference/model]Adapted to the baichuan2-7B model (#5591) 7 months ago
Yuanheng f8598e3ec5 [Fix] Llama Modeling Control with Spec-Dec (#5580) 8 months ago
Yuanheng Zhao e60d430cf5 [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557) 8 months ago