yuehuayingxueluo
de4bf3dedf
[Inference]Adapt repetition_penalty and no_repeat_ngram_size ( #5708 )
...
* Adapt repetition_penalty and no_repeat_ngram_size
* fix no_repeat_ngram_size_logit_process
* remove batch_updated
* fix annotation
* modified codes based on the review feedback.
* rm get_batch_token_ids
7 months ago
傅剑寒
bfad39357b
[Inference/Feat] Add quant kvcache interface ( #5700 )
...
* add quant kvcache interface
* delete unused output
* complete args comments
7 months ago
CjhHa1
bc9063adf1
resolve rebase conflicts on Branch feat/online-serving
7 months ago
Jianghai
61a1b2e798
[Inference] Fix bugs and docs for feat/online-server ( #5598 )
...
* fix test bugs
* add do sample test
* del useless lines
* fix comments
* fix tests
* delete version tag
* delete version tag
* add
* del test server
* fix test
* fix
* Revert "add"
This reverts commit b9305fb024.
7 months ago
CjhHa1
7bbb28e48b
[Inference] resolve rebase conflicts
...
fix
7 months ago
Jianghai
c064032865
[Online Server] Chat Api for streaming and not streaming response ( #5470 )
...
* fix bugs
* fix bugs
* fix api server
* fix api server
* add chat api and test
* del request.n
7 months ago
Jianghai
de378cd2ab
[Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example ( #5432 )
...
* finish online test and add examples
* fix test_contionus_batching
* fix some bugs
* fix bash
* fix
* fix inference
* finish revision
* fix typos
* revision
7 months ago
Jianghai
69cd7e069d
[Inference] ADD async and sync Api server using FastAPI ( #5396 )
...
* add api server
* fix
* add
* add completion service and fix bug
* add generation config
* revise shardformer
* fix bugs
* add docstrings and fix some bugs
* fix bugs and add choices for prompt template
7 months ago
yuehuayingxueluo
d482922035
[Inference] Support the logic related to ignoring EOS token ( #5693 )
...
* Adapt temperature processing logic
* add ValueError for top_p and top_k
* add GQA Test
* fix except_msg
* support ignore EOS token
* change variable's name
* fix annotation
7 months ago
yuehuayingxueluo
9c2fe7935f
[Inference]Adapt temperature processing logic ( #5689 )
...
* Adapt temperature processing logic
* add ValueError for top_p and top_k
* add GQA Test
* fix except_msg
7 months ago
Yuanheng Zhao
55cc7f3df7
[Fix] Fix Inference Example, Tests, and Requirements ( #5688 )
...
* clean requirements
* modify example inference struct
* add test ci scripts
* mark test_infer as submodule
* rm deprecated cls & deps
* import of HAS_FLASH_ATTN
* prune inference tests to be run
* prune triton kernel tests
* increment pytest timeout mins
* revert import path in openmoe
7 months ago
Yuanheng Zhao
f9afe0addd
[hotfix] Fix KV Heads Number Assignment in KVCacheManager ( #5695 )
...
- Fix key value number assignment in KVCacheManager, as well as method of accessing
7 months ago
Yuanheng Zhao
8754abae24
[Fix] Fix & Update Inference Tests (compatibility w/ main)
7 months ago
yuehuayingxueluo
f79963199c
[inference]Add alibi to flash attn function ( #5678 )
...
* add alibi to flash attn function
* rm redundant modifications
7 months ago
Steve Luo
5cd75ce4c7
[Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… ( #5663 )
...
* refactor kvcache manager and rotary_embedding and kvcache_memcpy operator
* refactor decode_kv_cache_memcpy
* enable alibi in pagedattention
7 months ago
yuehuayingxueluo
5f00002e43
[Inference] Adapt Baichuan2-13B TP ( #5659 )
...
* adapt to baichuan2 13B
* add baichuan2 13B TP
* update baichuan tp logic
* rm unused code
* Fix TP logic
* fix alibi slopes tp logic
* rm nn.Module
* Polished the code.
* change BAICHUAN_MODEL_NAME_OR_PATH
* Modified the logic for loading Baichuan weights.
* fix typos
7 months ago
yuehuayingxueluo
3c91e3f176
[Inference]Adapt to baichuan2 13B ( #5614 )
...
* adapt to baichuan2 13B
* adapt to baichuan2 13B
* change BAICHUAN_MODEL_NAME_OR_PATH
* fix test_decoding_attn.py
* Modifications based on review comments.
* change BAICHUAN_MODEL_NAME_OR_PATH
* mv attn mask processes to test flash decoding
* mv get_alibi_slopes baichuan modeling
* fix bugs in test_baichuan.py
7 months ago
Steve Luo
a8fd3b0342
[Inference/Kernel] Optimize paged attention: Refactor key cache layout ( #5643 )
...
* optimize flashdecodingattention: refactor code with different key cache layout (from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x])
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
Yuanheng Zhao
04863a9b14
[example] Update Llama Inference example ( #5629 )
...
* [example] add inference benchmark llama3
* revise inference config - arg
* remove unused args
* add llama generation demo script
* fix init rope in llama policy
* add benchmark-llama3 - cleanup
7 months ago
Yuanheng Zhao
5d4c1fe8f5
[Fix/Inference] Fix GQA Triton and Support Llama3 ( #5624 )
...
* [fix] GQA calling of flash decoding triton
* fix kv cache alloc shape
* fix rotary triton - GQA
* fix sequence max length assigning
* Sequence max length logic
* fix scheduling and spec-dec
* skip without import error
* fix pytest - skip without ImportError
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
Runyu Lu
e37ee2fb65
[Feat]Tensor Model Parallel Support For Inference ( #5563 )
...
* tensor parallel support naive source
* [fix]precision, model load and refactor the framework
* add tp unit test
* docstring
* fix do_sample
7 months ago
Steve Luo
be396ad6cc
[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block ( #5531 )
...
* feat flash decoding for paged attention
* refactor flashdecodingattention
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
yuehuayingxueluo
56b222eff8
[inference/model]Adapted to the baichuan2-7B model ( #5591 )
...
* Adapted to the baichuan2-7B model
* modified according to the review comments.
* Modified the method of obtaining random weights.
* modified according to the review comments.
* change mlp layer 'NOTE'
8 months ago
Yuanheng
f8598e3ec5
[Fix] Llama Modeling Control with Spec-Dec ( #5580 )
...
- fix reference before assignment
- fall back to use triton kernels when using spec-dec
8 months ago
Yuanheng Zhao
e60d430cf5
[Fix] resolve conflicts of rebasing feat/speculative-decoding ( #5557 )
...
- resolve conflicts of rebasing feat/speculative-decoding
8 months ago
Yuanheng Zhao
e1acb58423
[doc] Add inference/speculative-decoding README ( #5552 )
...
* add README for spec-dec
* update roadmap
8 months ago
Yuanheng Zhao
d85d91435a
[Inference/SpecDec] Support GLIDE Drafter Model ( #5455 )
...
* add glide-llama policy and modeling
* update glide modeling, compatible with transformers 4.36.2
* revise glide llama modeling/usage
* fix issues of glimpsing large kv
* revise the way re-loading params for glide drafter
* fix drafter and engine tests
* enable convert to glide strict=False
* revise glide llama modeling
* revise vicuna prompt template
* revise drafter and tests
* apply usage of glide model in engine
8 months ago
Yuanheng Zhao
912e24b2aa
[SpecDec] Fix inputs for speculation and revise past KV trimming ( #5449 )
...
* fix drafter pastkv and usage of batch bucket
8 months ago
Yuanheng Zhao
a37f82629d
[Inference/SpecDec] Add Speculative Decoding Implementation ( #5423 )
...
* fix flash decoding mask during verification
* add spec-dec
* add test for spec-dec
* revise drafter init
* remove drafter sampling
* retire past kv in drafter
* (trivial) rename attrs
* (trivial) rename arg
* revise how we enable/disable spec-dec
8 months ago
Yuanheng Zhao
5a9b05f7b2
[Inference/SpecDec] Add Basic Drafter Model Container ( #5405 )
...
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399 )
fix dependency in pytest
* add drafter model container (basic ver)
8 months ago
Yuanheng Zhao
4bb5d8923a
[Fix/Inference] Remove unused and non-functional functions ( #5543 )
...
* [fix] remove unused func
* rm non-functional partial
8 months ago
yuehuayingxueluo
04aca9e55b
[Inference/Kernel]Add get_cos_and_sin Kernel ( #5528 )
...
* Add get_cos_and_sin kernel
* fix code comments
* fix code typos
* merge common codes of get_cos_and_sin kernel.
* Fixed a typo
* Changed 'assert allclose' to 'assert equal'.
8 months ago
傅剑寒
e6496dd371
[Inference] Optimize request handler of llama ( #5512 )
...
* optimize request_handler
* fix ways of writing
8 months ago
Runyu Lu
6251d68dc9
[fix] PR #5354 ( #5501 )
...
* [fix]
* [fix]
* Update config.py docstring
* [fix] docstring align
* [fix] docstring align
* [fix] docstring align
8 months ago
Runyu Lu
68e9396bc0
[fix] merge conflicts
8 months ago
yuehuayingxueluo
87079cffe8
[Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding ( #5461 )
...
* Support FP16/BF16 Flash Attention 2
* fix bugs in test_kv_cache_memcpy.py
* add context_kv_cache_memcpy_kernel.cu
* rm typename MT
* add tail process
* add high_precision
* add high_precision to config.py
* rm unused code
* change the comment for the high_precision parameter
* update test_rotary_embdding_unpad.py
* fix vector_copy_utils.h
* add comment for self.high_precision when using float32
8 months ago
Runyu Lu
ff4998c6f3
[fix] remove unused comment
8 months ago
Runyu Lu
5b017d6324
[fix]
8 months ago
Runyu Lu
4eafe0c814
[fix] unused option
8 months ago
Runyu Lu
aabc9fb6aa
[feat] add use_cuda_kernel option
8 months ago
Runyu Lu
6e30248683
[fix] tmp for test
9 months ago
Runyu Lu
d02e257abd
Merge branch 'feature/colossal-infer' into colossal-infer-cuda-graph
9 months ago
Runyu Lu
ae24b4f025
diverse tests
9 months ago
Runyu Lu
1821a6dab0
[fix] pytest and fix dyn grid bug
9 months ago
yuehuayingxueluo
f366a5ea1f
[Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel ( #5418 )
...
* add rotary embedding kernel
* add rotary_embedding_kernel
* add fused rotary_emb and kvcache memcopy
* add fused_rotary_emb_and_cache_kernel.cu
* add fused_rotary_emb_and_memcopy
* fix bugs in fused_rotary_emb_and_cache_kernel.cu
* fix ci bugs
* use vec memcopy and opt the global memory access
* fix code style
* fix test_rotary_embdding_unpad.py
* codes revised based on the review comments
* fix bugs about include path
* rm inline
9 months ago
Runyu Lu
633e95b301
[doc] add doc
9 months ago
Runyu Lu
9dec66fad6
[fix] multi graphs capture error
9 months ago
Runyu Lu
b2c0d9ff2b
[fix] multi graphs capture error
9 months ago
Steve Luo
f7aecc0c6b
feat rmsnorm cuda kernel and add unittest, benchmark script ( #5417 )
9 months ago
Runyu Lu
cefaeb5fdd
[feat] cuda graph support and refactor non-functional api
9 months ago