Jianghai
61a1b2e798
[Inference] Fix bugs and docs for feat/online-server ( #5598 )
...
* fix test bugs
* add do sample test
* del useless lines
* fix comments
* fix tests
* delete version tag
* delete version tag
* add
* del test sever
* fix test
* fix
* Revert "add"
This reverts commit b9305fb024
.
7 months ago
yuehuayingxueluo
9c2fe7935f
[Inference]Adapt temperature processing logic ( #5689 )
...
* Adapt temperature processing logic
* add ValueError for top_p and top_k
* add GQA Test
* fix except_msg
7 months ago
Yuanheng Zhao
55cc7f3df7
[Fix] Fix Inference Example, Tests, and Requirements ( #5688 )
...
* clean requirements
* modify example inference struct
* add test ci scripts
* mark test_infer as submodule
* rm deprecated cls & deps
* import of HAS_FLASH_ATTN
* prune inference tests to be run
* prune triton kernel tests
* increment pytest timeout mins
* revert import path in openmoe
7 months ago
Yuanheng Zhao
8754abae24
[Fix] Fix & Update Inference Tests (compatibility w/ main)
7 months ago
Yuanheng Zhao
5d4c1fe8f5
[Fix/Inference] Fix GQA Triton and Support Llama3 ( #5624 )
...
* [fix] GQA calling of flash decoding triton
* fix kv cache alloc shape
* fix rotary triton - GQA
* fix sequence max length assigning
* Sequence max length logic
* fix scheduling and spec-dec
* skip without import error
* fix pytest - skip without ImportError
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
7 months ago
Runyu Lu
e37ee2fb65
[Feat]Tensor Model Parallel Support For Inference ( #5563 )
...
* tensor parallel support naive source
* [fix]precision, model load and refactor the framework
* add tp unit test
* docstring
* fix do_sample
7 months ago
Yuanheng Zhao
d85d91435a
[Inference/SpecDec] Support GLIDE Drafter Model ( #5455 )
...
* add glide-llama policy and modeling
* update glide modeling, compitable with transformers 4.36.2
* revise glide llama modeling/usage
* fix issues of glimpsing large kv
* revise the way re-loading params for glide drafter
* fix drafter and engine tests
* enable convert to glide strict=False
* revise glide llama modeling
* revise vicuna prompt template
* revise drafter and tests
* apply usage of glide model in engine
8 months ago
yuehuayingxueluo
f366a5ea1f
[Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel ( #5418 )
...
* add rotary embedding kernel
* add rotary_embedding_kernel
* add fused rotary_emb and kvcache memcopy
* add fused_rotary_emb_and_cache_kernel.cu
* add fused_rotary_emb_and_memcopy
* fix bugs in fused_rotary_emb_and_cache_kernel.cu
* fix ci bugs
* use vec memcopy and opt the gloabl memory access
* fix code style
* fix test_rotary_embdding_unpad.py
* codes revised based on the review comments
* fix bugs about include path
* rm inline
8 months ago
Steve Luo
ed431de4e4
fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test ( #5454 )
8 months ago
Steve Luo
f7aecc0c6b
feat rmsnorm cuda kernel and add unittest, benchmark script ( #5417 )
9 months ago
Jianghai
1f8c7e7046
[Inference] User Experience: update the logic of default tokenizer and generation config. ( #5337 )
...
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
10 months ago
Frank Lee
58740b5f68
[inference] added inference template ( #5375 )
10 months ago
yuehuayingxueluo
631862f339
[Inference]Optimize generation process of inference engine ( #5356 )
...
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback tesh_inference_engine.py
10 months ago
Frank Lee
f8e456d202
[inference] simplified config verification ( #5346 )
...
* [inference] simplified config verification
* polish
* polish
10 months ago
yuehuayingxueluo
4f28cb43c0
[inference]Optimize the usage of the mid tensors space in flash attn ( #5304 )
...
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
10 months ago
FrankLeeeee
1ded7e81ef
[git] fixed rebased files
11 months ago
yuehuayingxueluo
fab294c7f4
fix CI bugs
11 months ago
Jianghai
e545a871b8
[Hotfix] Fix accuracy and align attention method api with Triton kernel ( #5229 )
...
* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs
11 months ago
yuehuayingxueluo
fa4fbdbffb
adapted to pad_context_forward
11 months ago
yuehuayingxueluo
47e53eaa1c
fix bugs in attention.py and request_handler.py
11 months ago
yuehuayingxueluo
bbfebfb9fc
fix bugs in sampler
11 months ago
yuehuayingxueluo
02c1bf8b2a
add context_attention_unpadded
11 months ago
yuehuayingxueluo
4df8876fca
Fixed a writing error
11 months ago
yuehuayingxueluo
9489dc64d8
precision alignment
11 months ago
yuehuayingxueluo
62968588d1
fix bugs in request_handler
11 months ago
yuehuayingxueluo
62fd08ee44
Fixed a bug in the inference frame
11 months ago
Jianghai
0e616462a7
[Inference] add logit processor and request handler ( #5166 )
...
* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
11 months ago
yuehuayingxueluo
8daee26989
[Inference] Add the logic of the inference engine ( #5173 )
...
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related o tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt
11 months ago