Yuanheng Zhao
912e24b2aa
[SpecDec] Fix inputs for speculation and revise past KV trimming ( #5449 )
...
* fix drafter pastkv and usage of batch bucket
2024-04-10 11:07:52 +08:00
Yuanheng Zhao
a37f82629d
[Inference/SpecDec] Add Speculative Decoding Implementation ( #5423 )
...
* fix flash decoding mask during verification
* add spec-dec
* add test for spec-dec
* revise drafter init
* remove drafter sampling
* retire past kv in drafter
* (trivial) rename attrs
* (trivial) rename arg
* revise how we enable/disable spec-dec
2024-04-10 11:07:52 +08:00
Yuanheng Zhao
5a9b05f7b2
[Inference/SpecDec] Add Basic Drafter Model Container ( #5405 )
...
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399 )
fix dependency in pytest
* add drafter model container (basic ver)
2024-04-10 11:07:51 +08:00
Yuanheng Zhao
4bb5d8923a
[Fix/Inference] Remove unused and non-functional functions ( #5543 )
...
* [fix] remove unused func
* rm non-functional partial
2024-04-02 14:16:59 +08:00
yuehuayingxueluo
04aca9e55b
[Inference/Kernel]Add get_cos_and_sin Kernel ( #5528 )
...
* Add get_cos_and_sin kernel
* fix code comments
* fix code typos
* merge common codes of get_cos_and_sin kernel.
* Fixed a typo
* Changed 'asset allclose' to 'assert equal'.
2024-04-01 13:47:14 +08:00
傅剑寒
e6496dd371
[Inference] Optimize request handler of llama ( #5512 )
...
* optimize request_handler
* fix ways of writing
2024-03-26 16:37:14 +08:00
Runyu Lu
6251d68dc9
[fix] PR #5354 ( #5501 )
...
* [fix]
* [fix]
* Update config.py docstring
* [fix] docstring align
* [fix] docstring align
* [fix] docstring align
2024-03-25 15:24:17 +08:00
Runyu Lu
68e9396bc0
[fix] merge conflicts
2024-03-25 14:48:28 +08:00
yuehuayingxueluo
87079cffe8
[Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding ( #5461 )
...
* Support FP16/BF16 Flash Attention 2
* fix bugs in test_kv_cache_memcpy.py
* add context_kv_cache_memcpy_kernel.cu
* rm typename MT
* add tail process
* add high_precision
* add high_precision to config.py
* rm unused code
* change the comment for the high_precision parameter
* update test_rotary_embdding_unpad.py
* fix vector_copy_utils.h
* add comment for self.high_precision when using float32
2024-03-25 13:40:34 +08:00
Runyu Lu
ff4998c6f3
[fix] remove unused comment
2024-03-25 12:00:57 +08:00
Runyu Lu
5b017d6324
[fix]
2024-03-21 15:55:25 +08:00
Runyu Lu
4eafe0c814
[fix] unused option
2024-03-21 11:28:42 +08:00
Runyu Lu
aabc9fb6aa
[feat] add use_cuda_kernel option
2024-03-19 13:24:25 +08:00
Runyu Lu
6e30248683
[fix] tmp for test
2024-03-14 16:13:00 +08:00
Runyu Lu
d02e257abd
Merge branch 'feature/colossal-infer' into colossal-infer-cuda-graph
2024-03-14 10:37:05 +08:00
Runyu Lu
ae24b4f025
diverse tests
2024-03-14 10:35:08 +08:00
Runyu Lu
1821a6dab0
[fix] pytest and fix dyn grid bug
2024-03-13 17:28:32 +08:00
yuehuayingxueluo
f366a5ea1f
[Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel ( #5418 )
...
* add rotary embedding kernel
* add rotary_embedding_kernel
* add fused rotary_emb and kvcache memcopy
* add fused_rotary_emb_and_cache_kernel.cu
* add fused_rotary_emb_and_memcopy
* fix bugs in fused_rotary_emb_and_cache_kernel.cu
* fix ci bugs
* use vec memcopy and opt the gloabl memory access
* fix code style
* fix test_rotary_embdding_unpad.py
* codes revised based on the review comments
* fix bugs about include path
* rm inline
2024-03-13 17:20:03 +08:00
Runyu Lu
633e95b301
[doc] add doc
2024-03-11 10:56:51 +08:00
Runyu Lu
9dec66fad6
[fix] multi graphs capture error
2024-03-11 10:51:16 +08:00
Runyu Lu
b2c0d9ff2b
[fix] multi graphs capture error
2024-03-11 10:49:31 +08:00
Steve Luo
f7aecc0c6b
feat rmsnorm cuda kernel and add unittest, benchmark script ( #5417 )
2024-03-08 16:21:12 +08:00
Runyu Lu
cefaeb5fdd
[feat] cuda graph support and refactor non-functional api
2024-03-08 14:19:35 +08:00
yuehuayingxueluo
600881a8ea
[Inference]Add CUDA KVCache Kernel ( #5406 )
...
* add cuda KVCache kernel
* annotation benchmark_kvcache_copy
* add use cuda
* fix import path
* move benchmark scripts to example/
* rm benchmark codes in test_kv_cache_memcpy.py
* rm redundancy codes
* rm redundancy codes
* pr was modified according to the review
2024-02-28 14:36:50 +08:00
yuehuayingxueluo
bc1da87366
[Fix/Inference] Fix format of input prompts and input model in inference engine ( #5395 )
...
* Fix bugs in inference_engine
* fix bugs in engine.py
* rm CUDA_VISIBLE_DEVICES
* add request_ids in generate
* fix bug in engine.py
* add logger.debug for BatchBucket
2024-02-23 10:51:35 +08:00
yuehuayingxueluo
2a718c8be8
Optimized the execution interval time between cuda kernels caused by view and memcopy ( #5390 )
...
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs
2024-02-21 13:23:57 +08:00
Jianghai
730103819d
[Inference]Fused kv copy into rotary calculation ( #5383 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del
2024-02-21 11:31:48 +08:00
Yuanheng Zhao
b21aac5bae
[Inference] Optimize and Refactor Inference Batching/Scheduling ( #5367 )
...
* add kvcache manager funcs for batching
* add batch bucket for batching
* revise RunningList struct in handler
* add kvcache/batch funcs for compatibility
* use new batching methods
* fix indexing bugs
* revise abort logic
* use cpu seq lengths/block tables
* rm unused attr in Sequence
* fix type conversion/default arg
* add and revise pytests
* revise pytests, rm unused tests
* rm unused statements
* fix pop finished indexing issue
* fix: use index in batch when retrieving inputs/update seqs
* use dict instead of odict in batch struct
* arg type hinting
* fix make compress
* refine comments
* fix: pop_n_seqs to pop the first n seqs
* add check in request handler
* remove redundant conversion
* fix test for request handler
* fix pop method in batch bucket
* fix prefill adding
2024-02-19 17:18:20 +08:00
yuehuayingxueluo
8c69debdc7
[Inference]Support vllm testing in benchmark scripts ( #5379 )
...
* add vllm benchmark scripts
* fix code style
* update run_benchmark.sh
* fix code style
2024-02-08 15:27:26 +08:00
Frank Lee
9afa52061f
[inference] refactored config ( #5376 )
2024-02-08 14:04:14 +08:00
Jianghai
1f8c7e7046
[Inference] User Experience: update the logic of default tokenizer and generation config. ( #5337 )
...
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
2024-02-07 17:55:48 +08:00
yuehuayingxueluo
6fb4bcbb24
[Inference/opt] Fused KVCahce Memcopy ( #5374 )
...
* fused kv memcopy
* add TODO in test_kvcache_copy.py
2024-02-07 17:15:42 +08:00
Frank Lee
58740b5f68
[inference] added inference template ( #5375 )
2024-02-07 17:11:43 +08:00
Frank Lee
8106ede07f
Revert "[Inference] Adapt to Fused rotary ( #5348 )" ( #5373 )
...
This reverts commit 9f4ab2eb92
.
2024-02-07 14:27:04 +08:00
Jianghai
9f4ab2eb92
[Inference] Adapt to Fused rotary ( #5348 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
2024-02-07 11:36:04 +08:00
yuehuayingxueluo
35382a7fbf
[Inference]Fused the gate and up proj in mlp,and optimized the autograd process. ( #5365 )
...
* fused the gate and up proj in mlp
* fix code styles
* opt auto_grad
* rollback test_inference_engine.py
* modifications based on the review feedback.
* fix bugs in flash attn
* Change reshape to view
* fix test_rmsnorm_triton.py
2024-02-06 19:38:25 +08:00
Yuanheng Zhao
1dedb57747
[Fix/Infer] Remove unused deps and revise requirements ( #5341 )
...
* remove flash-attn dep
* rm padding llama
* revise infer requirements
* move requirements out of module
2024-02-06 17:27:45 +08:00
yuehuayingxueluo
631862f339
[Inference]Optimize generation process of inference engine ( #5356 )
...
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback tesh_inference_engine.py
2024-02-02 15:38:21 +08:00
yuehuayingxueluo
21ad4a27f9
[Inference/opt]Optimize the mid tensor of RMS Norm ( #5350 )
...
* opt rms_norm
* fix bugs in rms_layernorm
2024-02-02 15:06:01 +08:00
Frank Lee
027aa1043f
[doc] updated inference readme ( #5343 )
2024-02-02 14:31:10 +08:00
Frank Lee
db1a763307
[inference] removed redundancy init_batch ( #5353 )
2024-02-02 11:44:15 +08:00
yuehuayingxueluo
249644c23b
[Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add ( #5340 )
...
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
2024-02-01 15:49:39 +08:00
Frank Lee
f8e456d202
[inference] simplified config verification ( #5346 )
...
* [inference] simplified config verification
* polish
* polish
2024-02-01 15:31:01 +08:00
Yuanheng Zhao
5f98a9d68a
[Infer] Optimize Blocked KVCache And Kernels Using It ( #5325 )
...
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval
2024-01-30 16:06:09 +08:00
yuehuayingxueluo
e8f0642f28
[Inference]Add Nopadding Llama Modeling ( #5327 )
...
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style
2024-01-30 10:31:46 +08:00
Jianghai
c7c104cb7c
[DOC] Update inference readme ( #5280 )
...
* add readme
* add readme
* 1
* update engine
* finish readme
* add readme
2024-01-29 16:21:06 +08:00
yuehuayingxueluo
4f28cb43c0
[inference]Optimize the usage of the mid tensors space in flash attn ( #5304 )
...
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
2024-01-26 14:00:10 +08:00
Yuanheng Zhao
3da9993b0d
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark ( #5301 )
...
* fix decoding kernel pytest
* revise and add triton context attn benchmark
2024-01-23 17:16:02 +08:00
yuehuayingxueluo
cea9c86e45
add utils.py
2024-01-22 16:06:27 +08:00
yuehuayingxueluo
bfff9254ac
[inference] Adapted to Rotary Embedding and RMS Norm ( #5283 )
...
* adapted to rotary_embedding
* adapted to nopad rms norm
* fix bugs in benchmark
* fix flash_decoding.py
2024-01-22 10:55:34 +08:00
Yuanheng Zhao
6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking ( #5274 )
...
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
2024-01-19 15:47:16 +08:00
Jianghai
9e2342bde2
[Hotfix] Fix bugs in testing continuous batching ( #5270 )
...
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
2024-01-18 16:31:14 +08:00
yuehuayingxueluo
86b63f720c
[Inference]Adapted to the triton attn kernels ( #5264 )
...
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print
2024-01-17 16:03:10 +08:00
Jianghai
d8db500efc
[Inference] Fix request handler and add recycle logic ( #5260 )
...
* fix request handler
* fix comment
2024-01-15 17:50:46 +08:00
Frank Lee
c597678da4
[doc] updated inference readme ( #5269 )
2024-01-15 17:37:41 +08:00
Yuanheng Zhao
fa85e02b3b
[kernel] Add KV cache copy kernel during decoding ( #5261 )
...
* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy
2024-01-15 17:37:20 +08:00
FrankLeeeee
1ded7e81ef
[git] fixed rebased files
2024-01-11 13:50:45 +00:00
yuehuayingxueluo
d40eb26029
fix bugs in request_handler.py and engine.py
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
10e3c9f923
rm torch.cuda.synchronize
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
fab294c7f4
fix CI bugs
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
2a73e828eb
fix bugs related to processing padding mask
2024-01-11 13:46:14 +00:00
Jianghai
e545a871b8
[Hotfix] Fix accuracy and align attention method api with Triton kernel ( #5229 )
...
* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
fa4fbdbffb
adapted to pad_context_forward
2024-01-11 13:44:06 +00:00
yuehuayingxueluo
47e53eaa1c
fix bugs in attention.py and request_handler.py
2024-01-11 13:44:06 +00:00
Jianghai
bfd9b1b494
[Inference] Pytorch Attention func, pad&nopad input support ( #5219 )
...
* add attn
* add attention test
* fix attn forward
* fix decoding
2024-01-11 13:44:06 +00:00
yuehuayingxueluo
3ad1f3b78b
fix beam_width
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
b2eb9cd186
Fixed a typo
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
bbfebfb9fc
fix bugs in sampler
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
02c1bf8b2a
add context_attention_unpadded
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
9489dc64d8
precision alignment
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
62968588d1
fix bugs in request_handler
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
62fd08ee44
Fixed a bug in the inference frame
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
86853a37d5
Add padding llama model
2024-01-11 13:39:56 +00:00
Jianghai
0e616462a7
[Inference] add logit processor and request handler ( #5166 )
...
* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
8daee26989
[Inference] Add the logic of the inference engine ( #5173 )
...
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related o tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt
2024-01-11 13:39:56 +00:00
Jianghai
93aeacca34
[Inference]Update inference config and fix test ( #5178 )
...
* unify the config setting
* fix test
* fix import
* fix test
* fix
* fix
* add logger
* revise log info
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:29 +00:00
Yuanheng Zhao
3de2e62299
[Inference] Add CacheBlock and KV-Cache Manager ( #5156 )
...
* [Inference] Add KVCache Manager
* function refactored
* add test for KVCache Manager
* add attr beam width
* Revise alloc func in CacheManager
* Fix docs and pytests
* add tp slicing for head number
* optimize shapes of tensors used as physical cache
* Apply using InferenceConfig on KVCacheManager
* rm duplicate config file
* Optimize cache allocation: use contiguous cache
* Fix config in pytest (and config)
2024-01-11 13:39:29 +00:00
yuehuayingxueluo
fab9b931d9
[Inference]Add BatchInferState, Sequence and InferConfig ( #5149 )
...
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
2024-01-11 13:39:29 +00:00
Jianghai
56e75eeb06
[Inference] Add readme (roadmap) and fulfill request handler ( #5147 )
...
* request handler
* add readme
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:29 +00:00
Jianghai
4cf4682e70
[Inference] First PR for rebuild colossal-infer ( #5143 )
...
* add engine and scheduler
* add dirs
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:29 +00:00
Zhongkai Zhao
75af66cd81
[Hotfix] Fix model policy matching strategy in ShardFormer ( #5064 )
...
* hotfix/Fix get model policy strategy in ShardFormer
* fix bug in auto policy
2023-11-22 11:19:39 +08:00
Hongxin Liu
1cd7efc520
[inference] refactor examples and fix schedule ( #5077 )
...
* [setup] refactor infer setup
* [hotfix] fix infenrece behavior on 1 1 gpu
* [exmaple] refactor inference examples
2023-11-21 10:46:03 +08:00
Xu Kai
fb103cfd6e
[inference] update examples and engine ( #5073 )
...
* update examples and engine
* fix choices
* update example
2023-11-20 19:44:52 +08:00
Bin Jia
0c7d8bebd5
[hotfix/hybridengine] fix bug when tp*pp size = 1 ( #5069 )
2023-11-20 17:15:37 +08:00
Cuiqing Li (李崔卿)
bce919708f
[Kernels]added flash-decoidng of triton ( #5063 )
...
* added flash-decoidng of triton based on lightllm kernel
* add req
* clean
* clean
* delete build.sh
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
2023-11-20 13:58:29 +08:00
Xu Kai
fd6482ad8c
[inference] Refactor inference architecture ( #5057 )
...
* [inference] support only TP (#4998 )
* support only tp
* enable tp
* add support for bloom (#5008 )
* [refactor] refactor gptq and smoothquant llama (#5012 )
* refactor gptq and smoothquant llama
* fix import error
* fix linear import torch-int
* fix smoothquant llama import error
* fix import accelerate error
* fix bug
* fix import smooth cuda
* fix smoothcuda
* [Inference Refactor] Merge chatglm2 with pp and tp (#5023 )
merge chatglm with pp and tp
* [Refactor] remove useless inference code (#5022 )
* remove useless code
* fix quant model
* fix test import bug
* mv original inference legacy
* fix chatglm2
* [Refactor] refactor policy search and quant type controlling in inference (#5035 )
* [Refactor] refactor policy search and quant type controling in inference
* [inference] update readme (#5051 )
* update readme
* update readme
* fix architecture
* fix table
* fix table
* [inference] udpate example (#5053 )
* udpate example
* fix run.sh
* fix rebase bug
* fix some errors
* update readme
* add some features
* update interface
* update readme
* update benchmark
* add requirements-infer
---------
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
2023-11-19 21:05:05 +08:00
Cuiqing Li (李崔卿)
28052a71fb
[Kernels]Update triton kernels into 2.1.0 ( #5046 )
...
* update flash-context-attention
* adding kernels
* fix
* reset
* add build script
* add building process
* add llama2 exmaple
* add colossal-llama2 test
* clean
* fall back test setting
* fix test file
* clean
* clean
* clean
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
2023-11-16 16:43:15 +08:00
Zhongkai Zhao
70885d707d
[hotfix] Suport extra_kwargs in ShardConfig ( #5031 )
...
* [refactor]: replace inference args with extra_kwargs in ShardConfig
* modify shardconfig
* polish code
* fix policy bug in llama
* fix bug in auto policy
* remove setattr in ShardConfig
2023-11-10 10:49:50 +08:00
Xuanlei Zhao
f71e63b0f3
[moe] support optimizer checkpoint ( #5015 )
...
* Refactor MoE Manager setup method
* unshard optim ckpt
* optim io
* update transformer version
* update requirements
* update ckpt
* update ckpt
* update ckpt
* fix engine
* fix engine
2023-11-08 15:07:03 +00:00
Jianghai
ef4c14a5e2
[Inference] Fix bug in ChatGLM2 Tensor Parallelism ( #5014 )
...
* fix bug
* fix
* fix multiquery
* fix multiquery
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2023-11-07 15:01:50 +08:00
github-actions[bot]
c36e782d80
[format] applied code formatting on changed files in pull request 4926 ( #5007 )
...
Co-authored-by: github-actions <github-actions@github.com>
2023-11-06 17:08:12 +08:00
littsk
1a3315e336
[hotfix] Add layer norm gradients all-reduce for sequence parallel ( #4926 )
...
* [hotfix] Add layer norm gradients all-reduce for sequence parallel. (#4915 )
* Add layer norm gradients all-reduce for sequence parallel.
* skip pipeline inference test
* [hotfix] fixing polices of sequence parallel (#4922 )
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
---------
Co-authored-by: littsk <1214689160@qq.com>
* Hotfix/add grad all reduce for sequence parallel (#4927 )
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
* fix bug using wrong variables
---------
Co-authored-by: littsk <1214689160@qq.com>
* fix policy initialization
* fix bloom and chatglm policices
* polish code of handling layernorm
* fix moe module
* polish code of class initializing
---------
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
2023-11-03 13:32:43 +08:00
Bin Jia
b6696beb04
[Pipeline Inference] Merge pp with tp ( #4993 )
...
* refactor pipeline into new CaiInferEngine
* updata llama modeling forward
* merge tp with pp
* update docstring
* optimize test workflow and example
* fix typo
* add assert and todo
2023-11-01 12:46:21 +08:00
Cuiqing Li (李崔卿)
4f0234f236
[doc]Update doc for colossal-inference ( #4989 )
...
* update doc
* Update README.md
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
2023-10-31 10:48:07 +08:00
Cuiqing Li
459a88c806
[Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention ( #4965 )
...
* adding flash-decoding
* clean
* adding kernel
* adding flash-decoding
* add integration
* add
* adding kernel
* adding kernel
* adding triton 2.1.0 features for inference
* update bloom triton kernel
* remove useless vllm kernels
* clean codes
* fix
* adding files
* fix readme
* update llama flash-decoding
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
2023-10-30 14:04:37 +08:00
Jianghai
cf579ff46d
[Inference] Dynamic Batching Inference, online and offline ( #4953 )
...
* [inference] Dynamic Batching for Single and Multiple GPUs (#4831 )
* finish batch manager
* 1
* first
* fix
* fix dynamic batching
* llama infer
* finish test
* support different lengths generating
* del prints
* del prints
* fix
* fix bug
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
* [inference] Async dynamic batching (#4894 )
* finish input and output logic
* add generate
* test forward
* 1
* [inference]Re push async dynamic batching (#4901 )
* adapt to ray server
* finish async
* finish test
* del test
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
* Revert "[inference]Re push async dynamic batching (#4901 )" (#4905 )
This reverts commit fbf3c09e67
.
* Revert "[inference] Async dynamic batching (#4894 )"
This reverts commit fced140250
.
* Revert "[inference] Async dynamic batching (#4894 )" (#4909 )
This reverts commit fced140250
.
* Add Ray Distributed Environment Init Scripts
* support DynamicBatchManager base function
* revert _set_tokenizer version
* add driver async generate
* add async test
* fix bugs in test_ray_dist.py
* add get_tokenizer.py
* fix code style
* fix bugs about No module named 'pydantic' in ci test
* fix bugs in ci test
* fix bugs in ci test
* fix bugs in ci test
* [infer]Add Ray Distributed Environment Init Scripts (#4911 )
* Revert "[inference] Async dynamic batching (#4894 )"
This reverts commit fced140250
.
* Add Ray Distributed Environment Init Scripts
* support DynamicBatchManager base function
* revert _set_tokenizer version
* add driver async generate
* add async test
* fix bugs in test_ray_dist.py
* add get_tokenizer.py
* fix code style
* fix bugs about No module named 'pydantic' in ci test
* fix bugs in ci test
* fix bugs in ci test
* fix bugs in ci test
* support dynamic batch for bloom model and is_running function
* [Inference]Test for new Async engine (#4935 )
* infer engine
* infer engine
* test engine
* test engine
* new manager
* change step
* add
* test
* fix
* fix
* finish test
* finish test
* finish test
* finish test
* add license
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
* add assertion for config (#4947 )
* [Inference] Finish dynamic batching offline test (#4948 )
* test
* fix test
* fix quant
* add default
* fix
* fix some bugs
* fix some bugs
* fix
* fix bug
* fix bugs
* reset param
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2023-10-30 10:52:19 +08:00
Bin Jia
1db6727678
[Pipeline inference] Combine kvcache with pipeline inference ( #4938 )
...
* merge kvcache with pipeline inference and refactor the code structure
* support ppsize > 2
* refactor pipeline code
* do pre-commit
* modify benchmark
* fix bench mark
* polish code
* add docstring and update readme
* refactor the code
* fix some logic bug of ppinfer
* polish readme
* fix typo
* skip infer test
2023-10-27 16:19:54 +08:00
Xu Kai
785802e809
[inference] add reference and fix some bugs ( #4937 )
...
* add reference and fix some bugs
* update gptq init
---------
Co-authored-by: Xu Kai <xukai16@foxamil.com>
2023-10-20 13:39:34 +08:00
Cuiqing Li
3a41e8304e
[Refactor] Integrated some lightllm kernels into token-attention ( #4946 )
...
* add some req for inference
* clean codes
* add codes
* add some lightllm deps
* clean codes
* hello
* delete rms files
* add some comments
* add comments
* add doc
* add lightllm deps
* add lightllm cahtglm2 kernels
* add lightllm cahtglm2 kernels
* replace rotary embedding with lightllm kernel
* add some commnets
* add some comments
* add some comments
* add
* replace fwd kernel att1
* fix a arg
* add
* add
* fix token attention
* add some comments
* clean codes
* modify comments
* fix readme
* fix bug
* fix bug
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
2023-10-19 22:22:47 +08:00
digger yu
11009103be
[nfc] fix some typo with colossalai/ docs/ etc. ( #4920 )
2023-10-18 15:44:04 +08:00