yuehuayingxueluo
631862f339
[Inference] Optimize generation process of inference engine (#5356)
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* roll back test_inference_engine.py
2024-02-02 15:38:21 +08:00
Frank Lee
027aa1043f
[doc] updated inference readme (#5343)
2024-02-02 14:31:10 +08:00
Frank Lee
db1a763307
[inference] removed redundant init_batch (#5353)
2024-02-02 11:44:15 +08:00
yuehuayingxueluo
e8f0642f28
[Inference] Add Nopadding Llama Modeling (#5327)
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style
2024-01-30 10:31:46 +08:00
Jianghai
c7c104cb7c
[DOC] Update inference readme (#5280)
* add readme
* add readme
* 1
* update engine
* finish readme
* add readme
2024-01-29 16:21:06 +08:00
yuehuayingxueluo
4f28cb43c0
[inference] Optimize the usage of the mid tensors space in flash attn (#5304)
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
2024-01-26 14:00:10 +08:00
Jianghai
9e2342bde2
[Hotfix] Fix bugs in testing continuous batching (#5270)
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
2024-01-18 16:31:14 +08:00
yuehuayingxueluo
86b63f720c
[Inference] Adapted to the triton attn kernels (#5264)
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print
2024-01-17 16:03:10 +08:00
Jianghai
d8db500efc
[Inference] Fix request handler and add recycle logic (#5260)
* fix request handler
* fix comment
2024-01-15 17:50:46 +08:00
FrankLeeeee
1ded7e81ef
[git] fixed rebased files
2024-01-11 13:50:45 +00:00
yuehuayingxueluo
d40eb26029
fix bugs in request_handler.py and engine.py
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
10e3c9f923
rm torch.cuda.synchronize
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
fab294c7f4
fix CI bugs
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
fa4fbdbffb
adapted to pad_context_forward
2024-01-11 13:44:06 +00:00
yuehuayingxueluo
47e53eaa1c
fix bugs in attention.py and request_handler.py
2024-01-11 13:44:06 +00:00
yuehuayingxueluo
bbfebfb9fc
fix bugs in sampler
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
02c1bf8b2a
add context_attention_unpadded
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
9489dc64d8
precision alignment
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
62968588d1
fix bugs in request_handler
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
62fd08ee44
Fixed a bug in the inference framework
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
86853a37d5
Add padding llama model
2024-01-11 13:39:56 +00:00
Jianghai
0e616462a7
[Inference] add logit processor and request handler (#5166)
* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
8daee26989
[Inference] Add the logic of the inference engine (#5173)
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related to tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt
2024-01-11 13:39:56 +00:00
Jianghai
93aeacca34
[Inference] Update inference config and fix test (#5178)
* unify the config setting
* fix test
* fix import
* fix test
* fix
* fix
* add logger
* revise log info
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
2024-01-11 13:39:29 +00:00
Yuanheng Zhao
3de2e62299
[Inference] Add CacheBlock and KV-Cache Manager (#5156)
* [Inference] Add KVCache Manager
* function refactored
* add test for KVCache Manager
* add attr beam width
* Revise alloc func in CacheManager
* Fix docs and pytests
* add tp slicing for head number
* optimize shapes of tensors used as physical cache
* Apply using InferenceConfig on KVCacheManager
* rm duplicate config file
* Optimize cache allocation: use contiguous cache
* Fix config in pytest (and config)
2024-01-11 13:39:29 +00:00
yuehuayingxueluo
fab9b931d9
[Inference] Add BatchInferState, Sequence and InferConfig (#5149)
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
2024-01-11 13:39:29 +00:00
Jianghai
56e75eeb06
[Inference] Add readme (roadmap) and fulfill request handler (#5147)
* request handler
* add readme
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
2024-01-11 13:39:29 +00:00
Jianghai
4cf4682e70
[Inference] First PR for rebuild colossal-infer (#5143)
* add engine and scheduler
* add dirs
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
2024-01-11 13:39:29 +00:00