Jianghai
85946d4236
[Inference] Fix readme and example for API server (#5742)
* fix chatapi readme and example
* updating doc
* add an api and change the doc
* remove
* add credits and del 'API' heading
* readme
* readme
6 months ago
binmakeswell
4647ec28c8
[inference] release (#5747)
* [inference] release
* [inference] release
* [inference] release
* [inference] release
* [inference] release
* [inference] release
* [inference] release
6 months ago
Yuanheng Zhao
d8b1ea4ac9
[doc] Update Inference Readme (#5736)
* [doc] update inference readme
* add contents
* trivial
6 months ago
Yuanheng Zhao
bdf9a001d6
[Fix/Inference] Add unsupported auto-policy error message (#5730)
* [fix] auto policy error message
* trivial
6 months ago
Yuanheng Zhao
283c407a19
[Inference] Fix Inference Generation Config and Sampling (#5710)
* refactor and add
* config default values
* fix gen config passing
* fix rpc generation config
6 months ago
Yuanheng Zhao
8bcfe360fd
[example] Update Inference Example (#5725)
* [example] update inference example
7 months ago
Jianghai
f47f2fbb24
[Inference] Fix API server, test and example (#5712)
* fix api server
* fix generation config
* fix api server
* fix comments
* fix infer hanging bug
* resolve comments, change backend to free port
7 months ago
Runyu Lu
74c47921fa
[Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717)
* Fix Llama3 Load error
* Omit Checkpoint IO Temporarily
7 months ago
Steve Luo
7806842f2d
add paged-attention v2: support seq length split across thread block (#5707)
7 months ago
Runyu Lu
18d67d0e8e
[Feat] Inference RPC Server Support (#5705)
* rpc support source
* kv cache logical/physical disaggregation
* sampler refactor
* colossalai launch built in
* Unit test
* RPyC support
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
yuehuayingxueluo
de4bf3dedf
[Inference] Adapt repetition_penalty and no_repeat_ngram_size (#5708)
* Adapt repetition_penalty and no_repeat_ngram_size
* fix no_repeat_ngram_size_logit_process
* remove batch_updated
* fix annotation
* modified codes based on the review feedback.
* rm get_batch_token_ids
7 months ago
傅剑寒
bfad39357b
[Inference/Feat] Add quant kvcache interface (#5700)
* add quant kvcache interface
* delete unused output
* complete args comments
7 months ago
CjhHa1
bc9063adf1
resolve rebase conflicts on branch feat/online-serving
7 months ago
Jianghai
61a1b2e798
[Inference] Fix bugs and docs for feat/online-server (#5598)
* fix test bugs
* add do sample test
* del useless lines
* fix comments
* fix tests
* delete version tag
* delete version tag
* add
* del test server
* fix test
* fix
* Revert "add"
This reverts commit b9305fb024.
7 months ago
CjhHa1
7bbb28e48b
[Inference] resolve rebase conflicts
fix
7 months ago
Jianghai
c064032865
[Online Server] Chat API for streaming and not streaming response (#5470)
* fix bugs
* fix bugs
* fix api server
* fix api server
* add chat api and test
* del request.n
7 months ago
Jianghai
de378cd2ab
[Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432)
* finish online test and add examples
* fix test_contionus_batching
* fix some bugs
* fix bash
* fix
* fix inference
* finish revision
* fix typos
* revision
7 months ago
Jianghai
69cd7e069d
[Inference] Add async and sync API server using FastAPI (#5396)
* add api server
* fix
* add
* add completion service and fix bug
* add generation config
* revise shardformer
* fix bugs
* add docstrings and fix some bugs
* fix bugs and add choices for prompt template
7 months ago
yuehuayingxueluo
d482922035
[Inference] Support the logic related to ignoring EOS token (#5693)
* Adapt temperature processing logic
* add ValueError for top_p and top_k
* add GQA Test
* fix except_msg
* support ignore EOS token
* change variable's name
* fix annotation
7 months ago
yuehuayingxueluo
9c2fe7935f
[Inference] Adapt temperature processing logic (#5689)
* Adapt temperature processing logic
* add ValueError for top_p and top_k
* add GQA Test
* fix except_msg
7 months ago
Yuanheng Zhao
55cc7f3df7
[Fix] Fix Inference Example, Tests, and Requirements (#5688)
* clean requirements
* modify example inference struct
* add test ci scripts
* mark test_infer as submodule
* rm deprecated cls & deps
* import of HAS_FLASH_ATTN
* prune inference tests to be run
* prune triton kernel tests
* increment pytest timeout mins
* revert import path in openmoe
7 months ago
Yuanheng Zhao
f9afe0addd
[hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695)
- Fix key value number assignment in KVCacheManager, as well as method of accessing
7 months ago
Yuanheng Zhao
8754abae24
[Fix] Fix & Update Inference Tests (compatibility w/ main)
7 months ago
yuehuayingxueluo
f79963199c
[inference] Add alibi to flash attn function (#5678)
* add alibi to flash attn function
* rm redundant modifications
7 months ago
Steve Luo
5cd75ce4c7
[Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663)
* refactor kvcache manager and rotary_embedding and kvcache_memcpy operator
* refactor decode_kv_cache_memcpy
* enable alibi in pagedattention
7 months ago
yuehuayingxueluo
5f00002e43
[Inference] Adapt Baichuan2-13B TP (#5659)
* adapt to baichuan2 13B
* add baichuan2 13B TP
* update baichuan tp logic
* rm unused code
* Fix TP logic
* fix alibi slopes tp logic
* rm nn.Module
* Polished the code.
* change BAICHUAN_MODEL_NAME_OR_PATH
* Modified the logic for loading Baichuan weights.
* fix typos
7 months ago
yuehuayingxueluo
3c91e3f176
[Inference] Adapt to baichuan2 13B (#5614)
* adapt to baichuan2 13B
* adapt to baichuan2 13B
* change BAICHUAN_MODEL_NAME_OR_PATH
* fix test_decoding_attn.py
* Modifications based on review comments.
* change BAICHUAN_MODEL_NAME_OR_PATH
* mv attn mask processes to test flash decoding
* mv get_alibi_slopes baichuan modeling
* fix bugs in test_baichuan.py
7 months ago
Steve Luo
a8fd3b0342
[Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643)
* optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x])
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
Yuanheng Zhao
04863a9b14
[example] Update Llama Inference example (#5629)
* [example] add inference benchmark llama3
* revise inference config - arg
* remove unused args
* add llama generation demo script
* fix init rope in llama policy
* add benchmark-llama3 - cleanup
7 months ago
Yuanheng Zhao
5d4c1fe8f5
[Fix/Inference] Fix GQA Triton and Support Llama3 (#5624)
* [fix] GQA calling of flash decoding triton
* fix kv cache alloc shape
* fix rotary triton - GQA
* fix sequence max length assigning
* Sequence max length logic
* fix scheduling and spec-dec
* skip without import error
* fix pytest - skip without ImportError
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
Runyu Lu
e37ee2fb65
[Feat] Tensor Model Parallel Support For Inference (#5563)
* tensor parallel support naive source
* [fix]precision, model load and refactor the framework
* add tp unit test
* docstring
* fix do_sample
7 months ago
Steve Luo
be396ad6cc
[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531)
* feat flash decoding for paged attention
* refactor flashdecodingattention
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
yuehuayingxueluo
56b222eff8
[inference/model] Adapted to the baichuan2-7B model (#5591)
* Adapted to the baichuan2-7B model
* modified according to the review comments.
* Modified the method of obtaining random weights.
* modified according to the review comments.
* change mlp layer 'NOTE'
8 months ago
Yuanheng
f8598e3ec5
[Fix] Llama Modeling Control with Spec-Dec (#5580)
- fix reference before assignment
- fall back to use triton kernels when using spec-dec
8 months ago
Yuanheng Zhao
e60d430cf5
[Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557)
- resolve conflicts of rebasing feat/speculative-decoding
8 months ago
Yuanheng Zhao
e1acb58423
[doc] Add inference/speculative-decoding README (#5552)
* add README for spec-dec
* update roadmap
8 months ago
Yuanheng Zhao
d85d91435a
[Inference/SpecDec] Support GLIDE Drafter Model (#5455)
* add glide-llama policy and modeling
* update glide modeling, compatible with transformers 4.36.2
* revise glide llama modeling/usage
* fix issues of glimpsing large kv
* revise the way re-loading params for glide drafter
* fix drafter and engine tests
* enable convert to glide strict=False
* revise glide llama modeling
* revise vicuna prompt template
* revise drafter and tests
* apply usage of glide model in engine
8 months ago
Yuanheng Zhao
912e24b2aa
[SpecDec] Fix inputs for speculation and revise past KV trimming (#5449)
* fix drafter pastkv and usage of batch bucket
8 months ago
Yuanheng Zhao
a37f82629d
[Inference/SpecDec] Add Speculative Decoding Implementation (#5423)
* fix flash decoding mask during verification
* add spec-dec
* add test for spec-dec
* revise drafter init
* remove drafter sampling
* retire past kv in drafter
* (trivial) rename attrs
* (trivial) rename arg
* revise how we enable/disable spec-dec
8 months ago
Yuanheng Zhao
5a9b05f7b2
[Inference/SpecDec] Add Basic Drafter Model Container (#5405)
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
fix dependency in pytest
* add drafter model container (basic ver)
8 months ago
Yuanheng Zhao
4bb5d8923a
[Fix/Inference] Remove unused and non-functional functions (#5543)
* [fix] remove unused func
* rm non-functional partial
8 months ago
yuehuayingxueluo
04aca9e55b
[Inference/Kernel] Add get_cos_and_sin Kernel (#5528)
* Add get_cos_and_sin kernel
* fix code comments
* fix code typos
* merge common codes of get_cos_and_sin kernel.
* Fixed a typo
* Changed 'asset allclose' to 'assert equal'.
8 months ago
傅剑寒
e6496dd371
[Inference] Optimize request handler of llama (#5512)
* optimize request_handler
* fix ways of writing
8 months ago
Runyu Lu
6251d68dc9
[fix] PR #5354 (#5501)
* [fix]
* [fix]
* Update config.py docstring
* [fix] docstring align
* [fix] docstring align
* [fix] docstring align
8 months ago
Runyu Lu
68e9396bc0
[fix] merge conflicts
8 months ago
yuehuayingxueluo
87079cffe8
[Inference] Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461)
* Support FP16/BF16 Flash Attention 2
* fix bugs in test_kv_cache_memcpy.py
* add context_kv_cache_memcpy_kernel.cu
* rm typename MT
* add tail process
* add high_precision
* add high_precision to config.py
* rm unused code
* change the comment for the high_precision parameter
* update test_rotary_embdding_unpad.py
* fix vector_copy_utils.h
* add comment for self.high_precision when using float32
8 months ago
Runyu Lu
ff4998c6f3
[fix] remove unused comment
8 months ago
Runyu Lu
5b017d6324
[fix]
8 months ago
Runyu Lu
4eafe0c814
[fix] unused option
8 months ago
Runyu Lu
aabc9fb6aa
[feat] add use_cuda_kernel option
8 months ago