Hongxin Liu
19e1a5cf16
[shardformer] update colo attention to support custom mask ( #5510 )
...
* [feature] refactor colo attention (#5462 )
* [extension] update api
* [feature] add colo attention
* [feature] update sdpa
* [feature] update npu attention
* [feature] update flash-attn
* [test] add flash attn test
* [test] update flash attn test
* [shardformer] update modeling to fit colo attention (#5465 )
* [misc] refactor folder structure
* [shardformer] update llama flash-attn
* [shardformer] fix llama policy
* [devops] update tensornvme install
* [test] update llama test
* [shardformer] update colo attn kernel dispatch
* [shardformer] update blip2
* [shardformer] update chatglm
* [shardformer] update gpt2
* [shardformer] update gptj
* [shardformer] update opt
* [shardformer] update vit
* [shardformer] update colo attention mask prep
* [shardformer] update whisper
* [test] fix shardformer tests (#5514 )
* [test] fix shardformer tests
* [test] fix shardformer tests
2024-03-27 11:19:32 +08:00
Edenzzzz
61da3fbc52
fixed layout converter caching and updated tester
2024-03-26 17:22:27 +08:00
flybird11111
0688d92e2d
[shardformer]Fix lm parallel. ( #5480 )
...
* fix
* padding vocab_size when using pipeline parallelism
padding vocab_size when using pipeline parallelism
fix
fix
* fix
* fix
fix
fix
* fix gather output
* fix
* fix
* fix
fix resize embedding
fix resize embedding
* fix resize embedding
fix
* revert
* revert
* revert
* fix lm forward distribution
* fix
* test ci
* fix
2024-03-25 17:21:51 +08:00
Runyu Lu
68e9396bc0
[fix] merge conflicts
2024-03-25 14:48:28 +08:00
yuehuayingxueluo
87079cffe8
[Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding ( #5461 )
...
* Support FP16/BF16 Flash Attention 2
* fix bugs in test_kv_cache_memcpy.py
* add context_kv_cache_memcpy_kernel.cu
* rm typename MT
* add tail process
* add high_precision
* add high_precision to config.py
* rm unused code
* change the comment for the high_precision parameter
* update test_rotary_embdding_unpad.py
* fix vector_copy_utils.h
* add comment for self.high_precision when using float32
2024-03-25 13:40:34 +08:00
Wenhao Chen
bb0a668fee
[hotfix] set return_outputs=False in examples and polish code ( #5404 )
...
* fix: simplify merge_batch
* fix: use return_outputs=False to eliminate extra memory consumption
* feat: add return_outputs warning
* style: remove `return_outputs=False` as it is the default value
2024-03-25 12:31:09 +08:00
Runyu Lu
9fe61b4475
[fix]
2024-03-25 11:37:58 +08:00
Runyu Lu
aabc9fb6aa
[feat] add use_cuda_kernel option
2024-03-19 13:24:25 +08:00
flybird11111
5e16bf7980
[shardformer] fix gathering output when using tensor parallelism ( #5431 )
...
* fix
* padding vocab_size when using pipeline parallelism
padding vocab_size when using pipeline parallelism
fix
fix
* fix
* fix
fix
fix
* fix gather output
* fix
* fix
* fix
fix resize embedding
fix resize embedding
* fix resize embedding
fix
* revert
* revert
* revert
2024-03-18 15:55:11 +08:00
Runyu Lu
d02e257abd
Merge branch 'feature/colossal-infer' into colossal-infer-cuda-graph
2024-03-14 10:37:05 +08:00
Runyu Lu
ae24b4f025
diverse tests
2024-03-14 10:35:08 +08:00
Runyu Lu
1821a6dab0
[fix] pytest and fix dyn grid bug
2024-03-13 17:28:32 +08:00
yuehuayingxueluo
f366a5ea1f
[Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel ( #5418 )
...
* add rotary embedding kernel
* add rotary_embedding_kernel
* add fused rotary_emb and kvcache memcopy
* add fused_rotary_emb_and_cache_kernel.cu
* add fused_rotary_emb_and_memcopy
* fix bugs in fused_rotary_emb_and_cache_kernel.cu
* fix ci bugs
* use vec memcopy and opt the global memory access
* fix code style
* fix test_rotary_embdding_unpad.py
* codes revised based on the review comments
* fix bugs about include path
* rm inline
2024-03-13 17:20:03 +08:00
Steve Luo
ed431de4e4
fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test ( #5454 )
2024-03-13 16:00:55 +08:00
Hongxin Liu
f2e8b9ef9f
[devops] fix compatibility ( #5444 )
...
* [devops] fix compatibility
* [hotfix] update compatibility test on pr
* [devops] fix compatibility
* [devops] record duration during comp test
* [test] decrease test duration
* fix falcon
2024-03-13 15:24:13 +08:00
Steve Luo
f7aecc0c6b
feat rmsnorm cuda kernel and add unittest, benchmark script ( #5417 )
2024-03-08 16:21:12 +08:00
xs_courtesy
95c21498d4
add silu_and_mul for infer
2024-03-07 16:57:49 +08:00
flybird11111
29695cf70c
[example]add gpt2 benchmark example script. ( #5295 )
...
* benchmark gpt2
* fix
fix
fix
fix
* [doc] fix typo in Colossal-LLaMA-2/README.md (#5247 )
* [workflow] fixed build CI (#5240 )
* [workflow] fixed build CI
* polish
* polish
* polish
* polish
* polish
* [ci] fixed booster test (#5251 )
* [ci] fixed booster test
* [ci] fixed booster test
* [ci] fixed booster test
* [ci] fixed ddp test (#5254 )
* [ci] fixed ddp test
* polish
* fix typo in applications/ColossalEval/README.md (#5250 )
* [ci] fix shardformer tests. (#5255 )
* fix ci
fix
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* [doc] fix doc typo (#5256 )
* [doc] fix annotation display
* [doc] fix llama2 doc
* [hotfix]: add pp sanity check and fix mbs arg (#5268 )
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check
* [workflow] fixed incomplete bash command (#5272 )
* [workflow] fixed oom tests (#5275 )
* [workflow] fixed oom tests
* polish
* polish
* polish
* [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276 )
* fix ci
fix
* fix test
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
* fix
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* [shardformer] hybridparallelplugin support gradients accumulation. (#5246 )
* support gradients acc
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
* fix
fix
* fix
fix
fix
* [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230 )
* fix auto loading gpt2 tokenizer (#5279 )
* [doc] add llama2-13B display (#5285 )
* Update README.md
* fix 13b typo
---------
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
* fix llama pretrain (#5287 )
* fix
* fix
* fix
fix
* fix
fix
fix
* fix
fix
* benchmark gpt2
* fix
fix
fix
fix
* [workflow] fixed build CI (#5240 )
* [workflow] fixed build CI
* polish
* polish
* polish
* polish
* polish
* [ci] fixed booster test (#5251 )
* [ci] fixed booster test
* [ci] fixed booster test
* [ci] fixed booster test
* fix
fix
* fix
fix
fix
* fix
* fix
fix
fix
fix
fix
* fix
* Update shardformer.py
---------
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com>
Co-authored-by: Desperado-Jia <502205863@qq.com>
2024-03-04 16:18:13 +08:00
FrankLeeeee
0310b76e9d
Merge branch 'main' into sync/main
2024-03-04 10:09:36 +08:00
yuehuayingxueluo
0aa27f1961
[Inference]Move benchmark-related code to the example directory. ( #5408 )
...
* move benchmark-related code to the example directory.
* fix bugs in test_fused_rotary_embedding.py
2024-02-28 16:46:03 +08:00
yuehuayingxueluo
600881a8ea
[Inference]Add CUDA KVCache Kernel ( #5406 )
...
* add cuda KVCache kernel
* annotation benchmark_kvcache_copy
* add use cuda
* fix import path
* move benchmark scripts to example/
* rm benchmark codes in test_kv_cache_memcpy.py
* rm redundancy codes
* rm redundancy codes
* pr was modified according to the review
2024-02-28 14:36:50 +08:00
QinLuo
bf34c6fef6
[fsdp] impl save/load shard model/optimizer ( #5357 )
2024-02-27 13:51:14 +08:00
Yuanheng Zhao
19061188c3
[Infer/Fix] Fix Dependency in test - RMSNorm kernel ( #5399 )
...
fix dependency in pytest
2024-02-26 16:17:47 +08:00
yuehuayingxueluo
bc1da87366
[Fix/Inference] Fix format of input prompts and input model in inference engine ( #5395 )
...
* Fix bugs in inference_engine
* fix bugs in engine.py
* rm CUDA_VISIBLE_DEVICES
* add request_ids in generate
* fix bug in engine.py
* add logger.debug for BatchBucket
2024-02-23 10:51:35 +08:00
yuehuayingxueluo
2a718c8be8
Optimized the execution interval time between cuda kernels caused by view and memcopy ( #5390 )
...
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs
2024-02-21 13:23:57 +08:00
Jianghai
730103819d
[Inference]Fused kv copy into rotary calculation ( #5383 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del
2024-02-21 11:31:48 +08:00
Yuanheng Zhao
b21aac5bae
[Inference] Optimize and Refactor Inference Batching/Scheduling ( #5367 )
...
* add kvcache manager funcs for batching
* add batch bucket for batching
* revise RunningList struct in handler
* add kvcache/batch funcs for compatibility
* use new batching methods
* fix indexing bugs
* revise abort logic
* use cpu seq lengths/block tables
* rm unused attr in Sequence
* fix type conversion/default arg
* add and revise pytests
* revise pytests, rm unused tests
* rm unused statements
* fix pop finished indexing issue
* fix: use index in batch when retrieving inputs/update seqs
* use dict instead of odict in batch struct
* arg type hinting
* fix make compress
* refine comments
* fix: pop_n_seqs to pop the first n seqs
* add check in request handler
* remove redundant conversion
* fix test for request handler
* fix pop method in batch bucket
* fix prefill adding
2024-02-19 17:18:20 +08:00
ver217
06db94fbc9
[moe] fix tests
2024-02-08 12:46:37 +08:00
Xuanlei Zhao
7d8e0338a4
[moe] init mixtral impl
2024-02-07 19:21:02 +08:00
Jianghai
1f8c7e7046
[Inference] User Experience: update the logic of default tokenizer and generation config. ( #5337 )
...
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
2024-02-07 17:55:48 +08:00
yuehuayingxueluo
6fb4bcbb24
[Inference/opt] Fused KVCache Memcopy ( #5374 )
...
* fused kv memcopy
* add TODO in test_kvcache_copy.py
2024-02-07 17:15:42 +08:00
Frank Lee
58740b5f68
[inference] added inference template ( #5375 )
2024-02-07 17:11:43 +08:00
Frank Lee
8106ede07f
Revert "[Inference] Adapt to Fused rotary ( #5348 )" ( #5373 )
...
This reverts commit 9f4ab2eb92.
2024-02-07 14:27:04 +08:00
Jianghai
9f4ab2eb92
[Inference] Adapt to Fused rotary ( #5348 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
2024-02-07 11:36:04 +08:00
Hongxin Liu
c53ddda88f
[lr-scheduler] fix load state dict and add test ( #5369 )
2024-02-06 14:23:32 +08:00
yuehuayingxueluo
631862f339
[Inference]Optimize generation process of inference engine ( #5356 )
...
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback tesh_inference_engine.py
2024-02-02 15:38:21 +08:00
Wenhao Chen
1c790c0877
[fix] remove unnecessary dp_size assert ( #5351 )
...
* fix: remove unnecessary assert
* test: add more 3d plugin tests
* fix: add warning
2024-02-02 14:40:20 +08:00
Frank Lee
e76acbb076
[inference] moved ops tests to test_infer ( #5354 )
2024-02-02 13:51:22 +08:00
Frank Lee
db1a763307
[inference] removed redundancy init_batch ( #5353 )
2024-02-02 11:44:15 +08:00
Hongxin Liu
ffffc32dc7
[checkpointio] fix gemini and hybrid parallel optim checkpoint ( #5347 )
...
* [checkpointio] fix hybrid parallel optim checkpoint
* [extension] fix cuda extension
* [checkpointio] fix gemini optimizer checkpoint
* polish code
2024-02-01 16:13:06 +08:00
yuehuayingxueluo
249644c23b
[Inference]Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add ( #5340 )
...
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
2024-02-01 15:49:39 +08:00
Frank Lee
f8e456d202
[inference] simplified config verification ( #5346 )
...
* [inference] simplified config verification
* polish
* polish
2024-02-01 15:31:01 +08:00
Jianghai
df0aa49585
[Inference] Kernel Fusion, fused copy kv cache into rotary embedding ( #5336 )
...
* revise rotary embedding
* remove useless print
* adapt
2024-01-31 16:31:29 +08:00
FrankLeeeee
c565519913
merge commit
2024-01-31 10:41:47 +08:00
Yuanheng Zhao
5f98a9d68a
[Infer] Optimize Blocked KVCache And Kernels Using It ( #5325 )
...
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval
2024-01-30 16:06:09 +08:00
yuehuayingxueluo
e8f0642f28
[Inference]Add Nopadding Llama Modeling ( #5327 )
...
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style
2024-01-30 10:31:46 +08:00
Jianghai
1f8a75d470
[Inference] Update rms norm kernel, benchmark with vLLM ( #5315 )
...
* add
* xi
* del
* del
* fix
2024-01-29 10:22:33 +08:00
Jianghai
7ddd8b37f0
fix ( #5311 )
2024-01-26 15:02:12 +08:00
yuehuayingxueluo
4f28cb43c0
[inference]Optimize the usage of the mid tensors space in flash attn ( #5304 )
...
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
2024-01-26 14:00:10 +08:00
Frank Lee
7cfed5f076
[feat] refactored extension module ( #5298 )
...
* [feat] refactored extension module
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
2024-01-25 17:01:48 +08:00
Jianghai
c647e00e3c
[Inference]Add fused rotary kernel and get cos cache kernel ( #5302 )
...
* add fused rotary and get cos cache func
* staged
* fix bugs
* fix bugs
2024-01-24 16:20:42 +08:00
Yuanheng Zhao
3da9993b0d
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark ( #5301 )
...
* fix decoding kernel pytest
* revise and add triton context attn benchmark
2024-01-23 17:16:02 +08:00
Jianghai
8e606ecc7e
[Inference] Benchmarking rotary embedding and add a fetch function ( #5277 )
...
* fix bugs and add a cos/sin cache fetch func
* add docstring
* fix bug
* fix
2024-01-23 12:11:53 +08:00
Hongxin Liu
d7f8db8e21
[hotfix] fix 3d plugin test ( #5292 )
2024-01-22 15:19:04 +08:00
Yuanheng Zhao
6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking ( #5274 )
...
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
2024-01-19 15:47:16 +08:00
Jianghai
9e2342bde2
[Hotfix] Fix bugs in testing continuous batching ( #5270 )
...
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
2024-01-18 16:31:14 +08:00
ver217
148469348a
Merge branch 'main' into sync/npu
2024-01-18 12:05:21 +08:00
Yaozheng Fang
5ae9099f92
[kernel] Add RMSLayerNorm triton kernel ( #5262 )
...
* add layerrmsnorm triton kernel
* add layerrmsnorm kernel
* modify the atol and rtol in test file
* Remove the mean computation logic, and update the names of the kernel functions and files
* add benchmark of rms norm
2024-01-18 10:21:03 +08:00
Zhongkai Zhao
5d9a0ae75b
[hotfix] Fix ShardFormer test execution path when using sequence parallelism ( #5230 )
2024-01-17 17:42:29 +08:00
flybird11111
46e091651b
[shardformer] hybridparallelplugin support gradients accumulation. ( #5246 )
...
* support gradients acc
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
* fix
fix
* fix
fix
fix
2024-01-17 15:22:33 +08:00
flybird11111
2a0558d8ec
[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py ( #5276 )
...
* fix ci
fix
* fix test
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
* fix
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-17 13:38:55 +08:00
Frank Lee
d69cd2eb89
[workflow] fixed oom tests ( #5275 )
...
* [workflow] fixed oom tests
* polish
* polish
* polish
2024-01-16 18:55:13 +08:00
Yuanheng Zhao
0f2b46a41c
[kernel] Revise KVCache copy triton kernel API ( #5273 )
...
* [kernel/fix] revise kvcache copy kernel api
* fix benchmark
2024-01-16 14:41:02 +08:00
Yuanheng Zhao
fa85e02b3b
[kernel] Add KV cache copy kernel during decoding ( #5261 )
...
* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy
2024-01-15 17:37:20 +08:00
Wenhao Chen
ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg ( #5268 )
...
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check
2024-01-15 15:57:40 +08:00
FrankLeeeee
1ded7e81ef
[git] fixed rebased files
2024-01-11 13:50:45 +00:00
Yuanheng Zhao
1513f20f4d
[kernel] Add flash decoding triton kernel for blocked kv cache ( #5249 )
...
* add flash decoding unpad triton kernel
* rename flash decoding kernel
* add kernel testing (draft)
* revise pytest
* support kv group (GQA)
* (trivial) fix api and pytest
* (trivial) func renaming
* (trivial) func/file renaming
* refactor pytest for attention
* (trivial) format and consistent vars of context/decode attn
* (trivial) remove test redundancy
2024-01-11 13:46:14 +00:00
Jianghai
fded91d049
[Inference] Kernel: no pad rotary embedding ( #5252 )
...
* fix bugs
* comment
* use more accurate atol
* fix
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
fab294c7f4
fix CI bugs
2024-01-11 13:46:14 +00:00
Jianghai
e545a871b8
[Hotfix] Fix accuracy and align attention method api with Triton kernel ( #5229 )
...
* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs
2024-01-11 13:46:14 +00:00
yuehuayingxueluo
fa4fbdbffb
adapted to pad_context_forward
2024-01-11 13:44:06 +00:00
yuehuayingxueluo
47e53eaa1c
fix bugs in attention.py and request_handler.py
2024-01-11 13:44:06 +00:00
Jianghai
bfd9b1b494
[Inference] Pytorch Attention func, pad&nopad input support ( #5219 )
...
* add attn
* add attention test
* fix attn forward
* fix decoding
2024-01-11 13:44:06 +00:00
yuehuayingxueluo
bbfebfb9fc
fix bugs in sampler
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
02c1bf8b2a
add context_attention_unpadded
2024-01-11 13:39:56 +00:00
Yuanheng Zhao
07b5283b6a
[kernel] Add triton kernel for context attention (FAv2) without padding ( #5192 )
...
* add context attn unpadded triton kernel
* test compatibility
* kv cache copy (testing)
* fix k/v cache copy
* fix kv cache copy and test
* fix boundary of block ptrs
* add support for GQA/MQA and testing
* fix import statement
---------
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
4df8876fca
Fixed a writing error
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
9489dc64d8
precision alignment
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
62968588d1
fix bugs in request_handler
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
62fd08ee44
Fixed a bug in the inference frame
2024-01-11 13:39:56 +00:00
Jianghai
0e616462a7
[Inference] add logit processor and request handler ( #5166 )
...
* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:56 +00:00
yuehuayingxueluo
8daee26989
[Inference] Add the logic of the inference engine ( #5173 )
...
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related to tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt
2024-01-11 13:39:56 +00:00
Jianghai
93aeacca34
[Inference]Update inference config and fix test ( #5178 )
...
* unify the config setting
* fix test
* fix import
* fix test
* fix
* fix
* add logger
* revise log info
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:29 +00:00
Yuanheng Zhao
3de2e62299
[Inference] Add CacheBlock and KV-Cache Manager ( #5156 )
...
* [Inference] Add KVCache Manager
* function refactored
* add test for KVCache Manager
* add attr beam width
* Revise alloc func in CacheManager
* Fix docs and pytests
* add tp slicing for head number
* optimize shapes of tensors used as physical cache
* Apply using InferenceConfig on KVCacheManager
* rm duplicate config file
* Optimize cache allocation: use contiguous cache
* Fix config in pytest (and config)
2024-01-11 13:39:29 +00:00
yuehuayingxueluo
fab9b931d9
[Inference]Add BatchInferState, Sequence and InferConfig ( #5149 )
...
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
2024-01-11 13:39:29 +00:00
Yuanheng Zhao
2bb92243d4
[Inference/NFC] Clean outdated inference tests and deprecated kernels ( #5159 )
...
* [inference/nfc] remove outdated inference tests
* remove outdated kernel tests
* remove deprecated triton kernels
* remove imports from deprecated kernels
2024-01-11 13:39:29 +00:00
flybird11111
e830ef917d
[ci] fix shardformer tests. ( #5255 )
...
* fix ci
fix
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-11 19:07:45 +08:00
Frank Lee
2b83418719
[ci] fixed ddp test ( #5254 )
...
* [ci] fixed ddp test
* polish
2024-01-11 17:16:32 +08:00
Frank Lee
d5eeeb1416
[ci] fixed booster test ( #5251 )
...
* [ci] fixed booster test
* [ci] fixed booster test
* [ci] fixed booster test
2024-01-11 16:04:45 +08:00
Frank Lee
edf94a35c3
[workflow] fixed build CI ( #5240 )
...
* [workflow] fixed build CI
* polish
* polish
* polish
* polish
* polish
2024-01-10 22:34:16 +08:00
Hongxin Liu
d202cc28c0
[npu] change device to accelerator api ( #5239 )
...
* update accelerator
* fix timer
* fix amp
* update
* fix
* update bug
* add error raise
* fix autocast
* fix set device
* remove doc accelerator
* update doc
* update doc
* update doc
* use nullcontext
* update cpu
* update null context
* change time limit for example
* update
* update
* update
* update
* [npu] polish accelerator code
---------
Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>
2024-01-09 10:20:05 +08:00
Elsa Granger
d565df3821
[pipeline] A more general _communicate in p2p ( #5062 )
...
* A more general _communicate
* feat: finish tree_flatten version p2p
* fix: update p2p api calls
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-08 15:37:27 +08:00
Xuanlei Zhao
dd2c28a323
[npu] use extension for op builder ( #5172 )
...
* update extension
* update cpu adam
* update is
* add doc for cpu adam
* update kernel
* update commit
* update flash
* update memory efficient
* update flash attn
* update flash attention loader
* update api
* fix
* update doc
* update example time limit
* reverse change
* fix doc
* remove useless kernel
* fix
* not use warning
* update
* update
2024-01-08 11:39:16 +08:00
Wenhao Chen
d799a3088f
[pipeline]: add p2p fallback order and fix interleaved pp deadlock ( #5214 )
...
* fix: add fallback order option and update 1f1b
* fix: fix deadlock comm in interleaved pp
* test: modify p2p test
2024-01-03 11:34:49 +08:00
Wenhao Chen
4fa689fca1
[pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp ( #5134 )
...
* test: add more p2p tests
* fix: remove send_forward_recv_forward as p2p op list need to use the same group
* fix: make send and receive atomic
* feat: update P2PComm fn
* feat: add metadata cache in 1f1b
* feat: add metadata cache in interleaved pp
* feat: modify is_xx_stage fn
* revert: add _broadcast_object_list
* feat: add interleaved pp in llama policy
* feat: set NCCL_BUFFSIZE in HybridParallelPlugin
2023-12-22 10:44:00 +08:00
flybird11111
79718fae04
[shardformer] llama support DistCrossEntropy ( #5176 )
...
* fix
aaa
fix
fix
fix
* fix
* fix
* test ci
* fix ci
fix
* llama support dist-cross
fix
fix
fix
fix
fix
fix
fix
fix
* fix
* fix
* fix
fix
* test ci
* test ci
* fix
* [Colossal-Llama-2] Add finetuning Colossal-Llama-2 example (#4878 )
* Add finetuning Colossal-Llama-2 example
* Add finetuning Colossal-Llama-2 example 2
* Add finetuning Colossal-Llama-2 example and support NEFTuning
* Add inference example and refine neftune
* Modify readme file
* update the imports
---------
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>
* llama support dist-cross
fix
fix
fix
fix
fix
fix
fix
fix
* fix
* fix
* fix
fix
* test ci
* test ci
* fix
* fix ci
* fix ci
---------
Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com>
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>
2023-12-13 01:39:14 +08:00
flybird11111
21aa5de00b
[gemini] hotfix NaN loss while using Gemini + tensor_parallel ( #5150 )
...
* fix
aaa
fix
fix
fix
* fix
* fix
* test ci
* fix ci
fix
2023-12-08 11:10:51 +08:00
flybird11111
2a2ec49aa7
[plugin]fix 3d checkpoint load when booster boost without optimizer. ( #5135 )
...
* fix 3d checkpoint load when booster boost without optimizer
fix 3d checkpoint load when booster boost without optimizer
* test ci
* revert ci
* fix
fix
2023-11-30 18:37:47 +08:00
github-actions[bot]
d10ee42f68
[format] applied code formatting on changed files in pull request 5088 ( #5127 )
...
Co-authored-by: github-actions <github-actions@github.com>
2023-11-29 13:38:37 +08:00
Wenhao Chen
7172459e74
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert ( #5088 )
...
* [shardformer] implement policy for all GPT-J models and test
* [shardformer] support interleaved pipeline parallel for bert finetune
* [shardformer] shardformer support falcon (#4883 )
* [shardformer]: fix interleaved pipeline for bert model (#5048 )
* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093 )
* Add Mistral support for Shardformer (#5103 )
* [shardformer] add tests to mistral (#5105 )
---------
Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>
2023-11-28 16:54:42 +08:00
Zhongkai Zhao
75af66cd81
[Hotfix] Fix model policy matching strategy in ShardFormer ( #5064 )
...
* hotfix/Fix get model policy strategy in ShardFormer
* fix bug in auto policy
2023-11-22 11:19:39 +08:00
Xu Kai
fb103cfd6e
[inference] update examples and engine ( #5073 )
...
* update examples and engine
* fix choices
* update example
2023-11-20 19:44:52 +08:00
Bin Jia
0c7d8bebd5
[hotfix/hybridengine] fix bug when tp*pp size = 1 ( #5069 )
2023-11-20 17:15:37 +08:00
Hongxin Liu
e5ce4c8ea6
[npu] add npu support for gemini and zero ( #5067 )
...
* [npu] setup device utils (#5047 )
* [npu] add npu device support
* [npu] support low level zero
* [test] update npu zero plugin test
* [hotfix] fix import
* [test] recover tests
* [npu] gemini support npu (#5052 )
* [npu] refactor device utils
* [gemini] support npu
* [example] llama2+gemini support npu
* [kernel] add arm cpu adam kernel (#5065 )
* [kernel] add arm cpu adam
* [optim] update adam optimizer
* [kernel] arm cpu adam remove bf16 support
2023-11-20 16:12:41 +08:00
Xu Kai
fd6482ad8c
[inference] Refactor inference architecture ( #5057 )
...
* [inference] support only TP (#4998 )
* support only tp
* enable tp
* add support for bloom (#5008 )
* [refactor] refactor gptq and smoothquant llama (#5012 )
* refactor gptq and smoothquant llama
* fix import error
* fix linear import torch-int
* fix smoothquant llama import error
* fix import accelerate error
* fix bug
* fix import smooth cuda
* fix smoothcuda
* [Inference Refactor] Merge chatglm2 with pp and tp (#5023 )
merge chatglm with pp and tp
* [Refactor] remove useless inference code (#5022 )
* remove useless code
* fix quant model
* fix test import bug
* mv original inference legacy
* fix chatglm2
* [Refactor] refactor policy search and quant type controlling in inference (#5035 )
* [Refactor] refactor policy search and quant type controlling in inference
* [inference] update readme (#5051 )
* update readme
* update readme
* fix architecture
* fix table
* fix table
* [inference] update example (#5053 )
* update example
* fix run.sh
* fix rebase bug
* fix some errors
* update readme
* add some features
* update interface
* update readme
* update benchmark
* add requirements-infer
---------
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
2023-11-19 21:05:05 +08:00
Wenhao Chen
3c08f17348
[hotfix]: modify create_ep_hierarchical_group and add test ( #5032 )
...
* feat: modify create_ep_hierarchical_group args
* test: add ep tests
* fix: remove get_process_group_ranks
* fix: fix src_rank
2023-11-17 10:53:00 +08:00
flybird11111
3e02154710
[gemini] gemini support extra-dp ( #5043 )
...
* support ddp
* fix
* fix
* fix
fix
* support ddp
* fix
* fix
* fix
fix
* simplify tests
* fix
* fix
* fix
fix
fix
* fix
2023-11-16 21:03:04 +08:00
Cuiqing Li (李崔卿)
28052a71fb
[Kernels]Update triton kernels into 2.1.0 ( #5046 )
...
* update flash-context-attention
* adding kernels
* fix
* reset
* add build script
* add building process
* add llama2 example
* add colossal-llama2 test
* clean
* fall back test setting
* fix test file
* clean
* clean
* clean
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
2023-11-16 16:43:15 +08:00
Zhongkai Zhao
70885d707d
[hotfix] Support extra_kwargs in ShardConfig ( #5031 )
...
* [refactor]: replace inference args with extra_kwargs in ShardConfig
* modify shardconfig
* polish code
* fix policy bug in llama
* fix bug in auto policy
* remove setattr in ShardConfig
2023-11-10 10:49:50 +08:00
flybird11111
576a2f7b10
[gemini] gemini support tensor parallelism. ( #4942 )
...
* [colossalai]fix typo
* [inference] Add smoothquant for llama (#4904 )
* [inference] add int8 rotary embedding kernel for smoothquant (#4843 )
* [inference] add smoothquant llama attention (#4850 )
* add smoothquant llama attention
* remove useless code
* remove useless code
* fix import error
* rename file name
* [inference] add silu linear fusion for smoothquant llama mlp (#4853 )
* add silu linear
* update skip condition
* catch smoothquant cuda lib exception
* process exception for tests
* [inference] add llama mlp for smoothquant (#4854 )
* add llama mlp for smoothquant
* fix down out scale
* remove duplicate lines
* add llama mlp check
* delete useless code
* [inference] add smoothquant llama (#4861 )
* add smoothquant llama
* fix attention accuracy
* fix accuracy
* add kv cache and save pretrained
* refactor example
* delete smooth
* refactor code
* [inference] add smooth function and delete useless code for smoothquant (#4895 )
* add smooth function and delete useless code
* update datasets
* remove duplicate import
* delete useless file
* refactor codes (#4902 )
* refactor code
* add license
* add torch-int and smoothquant license
* Update flash_attention_patch.py
To be compatible with the new change in the Transformers library, where a new argument 'padding_mask' was added to forward function of attention layer.
https://github.com/huggingface/transformers/pull/25598
* [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921 )
* [kernel] support pure fp16 for cpu adam (#4896 )
* [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919 )
* [kernel] fix cpu adam
* [test] update gemini optim test
* [format] applied code formatting on changed files in pull request 4908 (#4918 )
Co-authored-by: github-actions <github-actions@github.com>
* [gemini] support gradient accumulation (#4869 )
* add test
* fix no_sync bug in low level zero plugin
* fix test
* add argument for grad accum
* add grad accum in backward hook for gemini
* finish implementation, rewrite tests
* fix test
* skip stuck model in low level zero test
* update doc
* optimize communication & fix gradient checkpoint
* modify doc
* cleaning codes
* update cpu adam fp16 case
* [hotfix] fix torch 2.0 compatibility (#4936 )
* [hotfix] fix launch
* [test] fix test gemini optim
* [shardformer] fix vit
* [test] add no master test for low level zero plugin (#4934 )
* [format] applied code formatting on changed files in pull request 4820 (#4886 )
Co-authored-by: github-actions <github-actions@github.com>
* [nfc] fix some typo with colossalai/ docs/ etc. (#4920 )
* [Refactor] Integrated some lightllm kernels into token-attention (#4946 )
* add some req for inference
* clean codes
* add codes
* add some lightllm deps
* clean codes
* hello
* delete rms files
* add some comments
* add comments
* add doc
* add lightllm deps
* add lightllm chatglm2 kernels
* add lightllm chatglm2 kernels
* replace rotary embedding with lightllm kernel
* add some comments
* add some comments
* add some comments
* add
* replace fwd kernel att1
* fix a arg
* add
* add
* fix token attention
* add some comments
* clean codes
* modify comments
* fix readme
* fix bug
* fix bug
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [test] merge old components to test to model zoo (#4945 )
* [test] add custom models in model zoo
* [test] update legacy test
* [test] update model zoo
* [test] update gemini test
* [test] remove components to test
* [inference] add reference and fix some bugs (#4937 )
* add reference and fix some bugs
* update gptq init
---------
Co-authored-by: Xu Kai <xukai16@foxamil.com>
* [Inference]ADD Bench Chatglm2 script (#4963 )
* add bench chatglm
* fix bug and make utils
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
* [Pipeline inference] Combine kvcache with pipeline inference (#4938 )
* merge kvcache with pipeline inference and refactor the code structure
* support ppsize > 2
* refactor pipeline code
* do pre-commit
* modify benchmark
* fix bench mark
* polish code
* add docstring and update readme
* refactor the code
* fix some logic bug of ppinfer
* polish readme
* fix typo
* skip infer test
* updated c++17 compiler flags (#4983 )
* [Inference] Dynamic Batching Inference, online and offline (#4953 )
* [inference] Dynamic Batching for Single and Multiple GPUs (#4831 )
* finish batch manager
* 1
* first
* fix
* fix dynamic batching
* llama infer
* finish test
* support different lengths generating
* del prints
* del prints
* fix
* fix bug
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
* [inference] Async dynamic batching (#4894 )
* finish input and output logic
* add generate
* test forward
* 1
* [inference]Re push async dynamic batching (#4901 )
* adapt to ray server
* finish async
* finish test
* del test
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
* Revert "[inference]Re push async dynamic batching (#4901 )" (#4905 )
This reverts commit fbf3c09e67.
* Revert "[inference] Async dynamic batching (#4894 )"
This reverts commit fced140250.
* Revert "[inference] Async dynamic batching (#4894 )" (#4909 )
This reverts commit fced140250.
* Add Ray Distributed Environment Init Scripts
* support DynamicBatchManager base function
* revert _set_tokenizer version
* add driver async generate
* add async test
* fix bugs in test_ray_dist.py
* add get_tokenizer.py
* fix code style
* fix bugs about No module named 'pydantic' in ci test
* fix bugs in ci test
* fix bugs in ci test
* fix bugs in ci test
* [infer]Add Ray Distributed Environment Init Scripts (#4911 )
* Revert "[inference] Async dynamic batching (#4894 )"
This reverts commit fced140250.
* Add Ray Distributed Environment Init Scripts
* support DynamicBatchManager base function
* revert _set_tokenizer version
* add driver async generate
* add async test
* fix bugs in test_ray_dist.py
* add get_tokenizer.py
* fix code style
* fix bugs about No module named 'pydantic' in ci test
* fix bugs in ci test
* fix bugs in ci test
* fix bugs in ci test
* support dynamic batch for bloom model and is_running function
* [Inference]Test for new Async engine (#4935 )
* infer engine
* infer engine
* test engine
* test engine
* new manager
* change step
* add
* test
* fix
* fix
* finish test
* finish test
* finish test
* finish test
* add license
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
* add assertion for config (#4947 )
* [Inference] Finish dynamic batching offline test (#4948 )
* test
* fix test
* fix quant
* add default
* fix
* fix some bugs
* fix some bugs
* fix
* fix bug
* fix bugs
* reset param
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
* [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965 )
* adding flash-decoding
* clean
* adding kernel
* adding flash-decoding
* add integration
* add
* adding kernel
* adding kernel
* adding triton 2.1.0 features for inference
* update bloom triton kernel
* remove useless vllm kernels
* clean codes
* fix
* adding files
* fix readme
* update llama flash-decoding
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
* fix ColossalEval (#4992 )
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
* [doc]Update doc for colossal-inference (#4989 )
* update doc
* Update README.md
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
* [hotfix] Fix the bug where process groups were not being properly released. (#4940 )
* Fix the bug where process groups were not being properly released.
* test
* Revert "test"
This reverts commit 479900c139.
* [hotfix] fix the bug of repeatedly storing param group (#4951 )
* [doc] add supported feature diagram for hybrid parallel plugin (#4996 )
* [Pipeline Inference] Merge pp with tp (#4993 )
* refactor pipeline into new CaiInferEngine
* update llama modeling forward
* merge tp with pp
* update docstring
* optimize test workflow and example
* fix typo
* add assert and todo
* [release] update version (#4995 )
* [release] update version
* [hotfix] fix ci
* [gemini] gemini support tp
* fix
* update checkpointIO
* support fused layernorm
* update fusedlayernorm
* add sequence parallel to gemini
* fix
* fix comments
* fix
* fix t5
* clear cache
* fix
* activate ci
* activate ci
* fix
* fix
* fix
* fix
* revert
* modify tp gather method
* fix test
---------
Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
Co-authored-by: Xu Kai <xukai16@foxamil.com>
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com>
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: littsk <1214689160@qq.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
2023-11-10 10:15:16 +08:00
Wenhao Chen
724441279b
[moe]: fix ep/tp tests, add hierarchical all2all ( #4982 )
...
* fix: add warning for EP different behavior
* fix: use shard_data in ep & tp model
* to: add used_capacity
* fix: fix router test
* feat: add create_ep_node_group
* feat: add create_ep_hierarchical_group fn
* feat: add HierarchicalAllToAll
* test: add hierarchical all2all test
* fix: fix test errors
* fix: simplify create_ep_hierarchical_group
* fix: add hierarchical_alltoall arg
* fix: fix environ typo
* revert: revert process mesh order
* to: add todo mark
* fix: skip hierarchical_comm if torch < 1.13.1
2023-11-09 06:31:00 +00:00
Xuanlei Zhao
f71e63b0f3
[moe] support optimizer checkpoint ( #5015 )
...
* Refactor MoE Manager setup method
* unshard optim ckpt
* optim io
* update transformer version
* update requirements
* update ckpt
* update ckpt
* update ckpt
* fix engine
* fix engine
2023-11-08 15:07:03 +00:00
Jianghai
ef4c14a5e2
[Inference] Fix bug in ChatGLM2 Tensor Parallelism ( #5014 )
...
* fix bug
* fix
* fix multiquery
* fix multiquery
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2023-11-07 15:01:50 +08:00
github-actions[bot]
c36e782d80
[format] applied code formatting on changed files in pull request 4926 ( #5007 )
...
Co-authored-by: github-actions <github-actions@github.com>
2023-11-06 17:08:12 +08:00
littsk
1a3315e336
[hotfix] Add layer norm gradients all-reduce for sequence parallel ( #4926 )
...
* [hotfix] Add layer norm gradients all-reduce for sequence parallel. (#4915 )
* Add layer norm gradients all-reduce for sequence parallel.
* skip pipeline inference test
* [hotfix] fixing policies of sequence parallel (#4922 )
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
---------
Co-authored-by: littsk <1214689160@qq.com>
* Hotfix/add grad all reduce for sequence parallel (#4927 )
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
* fix bug using wrong variables
---------
Co-authored-by: littsk <1214689160@qq.com>
* fix policy initialization
* fix bloom and chatglm policies
* polish code of handling layernorm
* fix moe module
* polish code of class initializing
---------
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
2023-11-03 13:32:43 +08:00
Baizhou Zhang
d99b2c961a
[hotfix] fix grad accumulation plus clipping for gemini ( #5002 )
2023-11-02 17:59:10 +08:00
Xuanlei Zhao
dc003c304c
[moe] merge moe into main ( #4978 )
...
* update moe module
* support openmoe
2023-11-02 02:21:24 +00:00
Bin Jia
b6696beb04
[Pipeline Inference] Merge pp with tp ( #4993 )
...
* refactor pipeline into new CaiInferEngine
* update llama modeling forward
* merge tp with pp
* update docstring
* optimize test workflow and example
* fix typo
* add assert and todo
2023-11-01 12:46:21 +08:00
Cuiqing Li
459a88c806
[Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention ( #4965 )
...
* adding flash-decoding
* clean
* adding kernel
* adding flash-decoding
* add integration
* add
* adding kernel
* adding kernel
* adding triton 2.1.0 features for inference
* update bloom triton kernel
* remove useless vllm kernels
* clean codes
* fix
* adding files
* fix readme
* update llama flash-decoding
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
2023-10-30 14:04:37 +08:00
Jianghai
cf579ff46d
[Inference] Dynamic Batching Inference, online and offline ( #4953 )
...
* [inference] Dynamic Batching for Single and Multiple GPUs (#4831 )
* finish batch manager
* 1
* first
* fix
* fix dynamic batching
* llama infer
* finish test
* support different lengths generating
* del prints
* del prints
* fix
* fix bug
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
* [inference] Async dynamic batching (#4894 )
* finish input and output logic
* add generate
* test forward
* 1
* [inference]Re push async dynamic batching (#4901 )
* adapt to ray server
* finish async
* finish test
* del test
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
* Revert "[inference]Re push async dynamic batching (#4901 )" (#4905 )
This reverts commit fbf3c09e67.
* Revert "[inference] Async dynamic batching (#4894 )"
This reverts commit fced140250.
* Revert "[inference] Async dynamic batching (#4894 )" (#4909 )
This reverts commit fced140250.
* Add Ray Distributed Environment Init Scripts
* support DynamicBatchManager base function
* revert _set_tokenizer version
* add driver async generate
* add async test
* fix bugs in test_ray_dist.py
* add get_tokenizer.py
* fix code style
* fix bugs about No module named 'pydantic' in ci test
* fix bugs in ci test
* fix bugs in ci test
* fix bugs in ci test
* [infer]Add Ray Distributed Environment Init Scripts (#4911 )
* Revert "[inference] Async dynamic batching (#4894 )"
This reverts commit fced140250.
* Add Ray Distributed Environment Init Scripts
* support DynamicBatchManager base function
* revert _set_tokenizer version
* add driver async generate
* add async test
* fix bugs in test_ray_dist.py
* add get_tokenizer.py
* fix code style
* fix bugs about No module named 'pydantic' in ci test
* fix bugs in ci test
* fix bugs in ci test
* fix bugs in ci test
* support dynamic batch for bloom model and is_running function
* [Inference]Test for new Async engine (#4935 )
* infer engine
* infer engine
* test engine
* test engine
* new manager
* change step
* add
* test
* fix
* fix
* finish test
* finish test
* finish test
* finish test
* add license
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
* add assertion for config (#4947 )
* [Inference] Finish dynamic batching offline test (#4948 )
* test
* fix test
* fix quant
* add default
* fix
* fix some bugs
* fix some bugs
* fix
* fix bug
* fix bugs
* reset param
---------
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2023-10-30 10:52:19 +08:00
Bin Jia
1db6727678
[Pipeline inference] Combine kvcache with pipeline inference ( #4938 )
...
* merge kvcache with pipeline inference and refactor the code structure
* support ppsize > 2
* refactor pipeline code
* do pre-commit
* modify benchmark
* fix bench mark
* polish code
* add docstring and update readme
* refactor the code
* fix some logic bug of ppinfer
* polish readme
* fix typo
* skip infer test
2023-10-27 16:19:54 +08:00
Hongxin Liu
b8e770c832
[test] merge old components to test to model zoo ( #4945 )
...
* [test] add custom models in model zoo
* [test] update legacy test
* [test] update model zoo
* [test] update gemini test
* [test] remove components to test
2023-10-20 10:35:08 +08:00
Cuiqing Li
3a41e8304e
[Refactor] Integrated some lightllm kernels into token-attention ( #4946 )
...
* add some req for inference
* clean codes
* add codes
* add some lightllm deps
* clean codes
* hello
* delete rms files
* add some comments
* add comments
* add doc
* add lightllm deps
* add lightllm chatglm2 kernels
* add lightllm chatglm2 kernels
* replace rotary embedding with lightllm kernel
* add some comments
* add some comments
* add some comments
* add
* replace fwd kernel att1
* fix a arg
* add
* add
* fix token attention
* add some comments
* clean codes
* modify comments
* fix readme
* fix bug
* fix bug
---------
Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
2023-10-19 22:22:47 +08:00
github-actions[bot]
486d06a2d5
[format] applied code formatting on changed files in pull request 4820 ( #4886 )
...
Co-authored-by: github-actions <github-actions@github.com>
2023-10-18 11:46:37 +08:00
Zhongkai Zhao
c7aa319ba0
[test] add no master test for low level zero plugin ( #4934 )
2023-10-18 11:41:23 +08:00
Hongxin Liu
1f5d2e8062
[hotfix] fix torch 2.0 compatibility ( #4936 )
...
* [hotfix] fix launch
* [test] fix test gemini optim
* [shardformer] fix vit
2023-10-18 11:05:25 +08:00
Baizhou Zhang
21ba89cab6
[gemini] support gradient accumulation ( #4869 )
...
* add test
* fix no_sync bug in low level zero plugin
* fix test
* add argument for grad accum
* add grad accum in backward hook for gemini
* finish implementation, rewrite tests
* fix test
* skip stuck model in low level zero test
* update doc
* optimize communication & fix gradient checkpoint
* modify doc
* cleaning codes
* update cpu adam fp16 case
2023-10-17 14:07:21 +08:00
Hongxin Liu
4f68b3f10c
[kernel] support pure fp16 for cpu adam and update gemini optim tests ( #4921 )
...
* [kernel] support pure fp16 for cpu adam (#4896 )
* [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919 )
* [kernel] fix cpu adam
* [test] update gemini optim test
2023-10-16 21:56:53 +08:00
Xu Kai
611a5a80ca
[inference] Add smoothquant for llama ( #4904 )
...
* [inference] add int8 rotary embedding kernel for smoothquant (#4843 )
* [inference] add smoothquant llama attention (#4850 )
* add smoothquant llama attention
* remove useless code
* remove useless code
* fix import error
* rename file name
* [inference] add silu linear fusion for smoothquant llama mlp (#4853 )
* add silu linear
* update skip condition
* catch smoothquant cuda lib exception
* process exception for tests
* [inference] add llama mlp for smoothquant (#4854 )
* add llama mlp for smoothquant
* fix down out scale
* remove duplicate lines
* add llama mlp check
* delete useless code
* [inference] add smoothquant llama (#4861 )
* add smoothquant llama
* fix attention accuracy
* fix accuracy
* add kv cache and save pretrained
* refactor example
* delete smooth
* refactor code
* [inference] add smooth function and delete useless code for smoothquant (#4895 )
* add smooth function and delete useless code
* update datasets
* remove duplicate import
* delete useless file
* refactor codes (#4902 )
* refactor code
* add license
* add torch-int and smoothquant license
2023-10-16 11:28:44 +08:00
Xu Kai
77a9328304
[inference] add llama2 support ( #4898 )
...
* add llama2 support
* fix multi group bug
2023-10-13 13:09:23 +08:00
Baizhou Zhang
39f2582e98
[hotfix] fix lr scheduler bug in torch 2.0 ( #4864 )
2023-10-12 14:04:24 +08:00
littsk
83b52c56cd
[feature] Add clip_grad_norm for hybrid_parallel_plugin ( #4837 )
...
* Add clip_grad_norm for hybrid_parallel_plugin
* polish code
* add unittests
* Move tp to a higher-level optimizer interface.
* bug fix
* polish code
2023-10-12 11:32:37 +08:00
Hongxin Liu
df63564184
[gemini] support amp o3 for gemini ( #4872 )
...
* [gemini] support no reuse fp16 chunk
* [gemini] support no master weight for optim
* [gemini] support no master weight for gemini ddp
* [test] update gemini tests
* [test] update gemini tests
* [plugin] update gemini plugin
* [test] fix gemini checkpointio test
* [test] fix gemini checkpoint io
2023-10-12 10:39:08 +08:00
littsk
ffd9a3cbc9
[hotfix] fix bug in sequence parallel test ( #4887 )
2023-10-11 19:30:41 +08:00
Xu Kai
fdec650bb4
fix test llama ( #4884 )
2023-10-11 17:43:01 +08:00
Bin Jia
08a9f76b2f
[Pipeline Inference] Sync pipeline inference branch to main ( #4820 )
...
* [pipeline inference] pipeline inference (#4492 )
* add pp stage manager as circle stage
* fix a bug when create process group
* add ppinfer basic framework
* add micro batch manager and support kvcache-pp gpt2 fwd
* add generate schedule
* use mb size to control mb number
* support generate with kv cache
* add output, remove unused code
* add test
* reuse shardformer to build model
* refactor some code and use the same attribute name of hf
* fix review and add test for generation
* remove unused file
* fix CI
* add cache clear
* fix code error
* fix typo
* [Pipeline inference] Modify to tieweight (#4599 )
* add pp stage manager as circle stage
* fix a bug when create process group
* add ppinfer basic framework
* add micro batch manager and support kvcache-pp gpt2 fwd
* add generate schedule
* use mb size to control mb number
* support generate with kv cache
* add output, remove unused code
* add test
* reuse shardformer to build model
* refactor some code and use the same attribute name of hf
* fix review and add test for generation
* remove unused file
* modify the way of saving newtokens
* modify to tieweight
* modify test
* remove unused file
* solve review
* add docstring
* [Pipeline inference] support llama pipeline inference (#4647 )
* support llama pipeline inference
* remove tie weight operation
* [pipeline inference] Fix the blocking of communication when ppsize is 2 (#4708 )
* add benchmark verbose
* fix export tokens
* fix benchmark verbose
* add P2POp style to do p2p communication
* modify schedule as p2p type when ppsize is 2
* remove unused code and add docstring
* [Pipeline inference] Refactor code, add docstring, fix bug (#4790 )
* add benchmark script
* update argparse
* fix fp16 load
* refactor code style
* add docstring
* polish code
* fix test bug
* [Pipeline inference] Add pipeline inference docs (#4817 )
* add readme doc
* add a ico
* Add performance
* update table of contents
* refactor code (#4873 )
2023-10-11 11:40:06 +08:00
Hongxin Liu
cb3a25a062
[checkpointio] hotfix torch 2.0 compatibility ( #4824 )
2023-10-07 10:45:52 +08:00
Zhongkai Zhao
db40e086c8
[test] modify model supporting part of low_level_zero plugin (including corresponding docs)
2023-10-05 15:10:31 +08:00
Xu Kai
d1fcc0fa4d
[infer] fix test bug ( #4838 )
...
* fix test bug
* delete useless code
* fix typo
2023-10-04 10:01:03 +08:00
Jianghai
013a4bedf0
[inference]fix import bug and delete down useless init ( #4830 )
...
* fix import bug and release useless init
* fix
* fix
* fix
2023-10-04 09:18:45 +08:00
Hongxin Liu
4965c0dabd
[lazy] support from_pretrained ( #4801 )
...
* [lazy] patch from pretrained
* [lazy] fix from pretrained and add tests
* [devops] update ci
2023-09-26 11:04:11 +08:00
Baizhou Zhang
64a08b2dc3
[checkpointio] support unsharded checkpointIO for hybrid parallel ( #4774 )
...
* support unsharded saving/loading for model
* support optimizer unsharded saving
* update doc
* support unsharded loading for optimizer
* small fix
2023-09-26 10:58:03 +08:00
Jianghai
ce7ade3882
[inference] chatglm2 infer demo ( #4724 )
...
* add chatglm2
* add
* gather needed kernels
* fix some bugs
* finish context forward
* finish context stage
* fix
* add
* pause
* add
* fix bugs
* finish chatglm
* fix bug
* change some logic
* fix bugs
* change some logics
* add
* add
* add
* fix
* fix tests
* fix
2023-09-22 11:12:50 +08:00
Xu Kai
946ab56c48
[feature] add gptq for inference ( #4754 )
...
* [gptq] add gptq kernel (#4416 )
* add gptq
* refactor code
* fix tests
* replace auto-gptq
* rename inference/quant
* refactor test
* add auto-gptq as an option
* reset requirements
* change assert and check auto-gptq
* add import warnings
* change test flash attn version
* remove example
* change requirements of flash_attn
* modify tests
* [skip ci] change requirements-test
* [gptq] faster gptq cuda kernel (#4494 )
* [skip ci] add cuda kernels
* add license
* [skip ci] fix max_input_len
* format files & change test size
* [skip ci]
* [gptq] add gptq tensor parallel (#4538 )
* add gptq tensor parallel
* add gptq tp
* delete print
* add test gptq check
* add test auto gptq check
* [gptq] combine gptq and kv cache manager (#4706 )
* combine gptq and kv cache manager
* add init bits
* delete useless code
* add model path
* delete useless print and update test
* delete useless import
* move option gptq to shard config
* change replace linear to shardformer
* update bloom policy
* delete useless code
* fix import bug and delete useless code
* change colossalai/gptq to colossalai/quant/gptq
* update import linear for tests
* delete useless code and mv gptq_kernel to kernel directory
* fix triton kernel
* add triton import
2023-09-22 11:02:50 +08:00
Hongxin Liu
3e05c07bb8
[lazy] support torch 2.0 ( #4763 )
...
* [lazy] support _like methods and clamp
* [lazy] pass transformers models
* [lazy] fix device move and requires grad
* [lazy] fix requires grad and refactor api
* [lazy] fix requires grad
2023-09-21 16:30:23 +08:00
Baizhou Zhang
c0a033700c
[shardformer] fix master param sync for hybrid plugin/rewrite unwrapping logic ( #4758 )
...
* fix master param sync for hybrid plugin
* rewrite unwrap for ddp/fsdp
* rewrite unwrap for zero/gemini
* rewrite unwrap for hybrid plugin
* fix gemini unwrap
* fix bugs
2023-09-20 18:29:37 +08:00
Hongxin Liu
079bf3cb26
[misc] update pre-commit and run all files ( #4752 )
...
* [misc] update pre-commit
* [misc] run pre-commit
* [misc] remove useless configuration files
* [misc] ignore cuda for clang-format
2023-09-19 14:20:26 +08:00
Hongxin Liu
b5f9e37c70
[legacy] clean up legacy code ( #4743 )
...
* [legacy] remove outdated codes of pipeline (#4692 )
* [legacy] remove cli of benchmark and update optim (#4690 )
* [legacy] remove cli of benchmark and update optim
* [doc] fix cli doc test
* [legacy] fix engine clip grad norm
* [legacy] remove outdated colo tensor (#4694 )
* [legacy] remove outdated colo tensor
* [test] fix test import
* [legacy] move outdated zero to legacy (#4696 )
* [legacy] clean up utils (#4700 )
* [legacy] clean up utils
* [example] update examples
* [legacy] clean up amp
* [legacy] fix amp module
* [legacy] clean up gpc (#4742 )
* [legacy] clean up context
* [legacy] clean core, constants and global vars
* [legacy] refactor initialize
* [example] fix examples ci
* [example] fix examples ci
* [legacy] fix tests
* [example] fix gpt example
* [example] fix examples ci
* [devops] fix ci installation
* [example] fix examples ci
2023-09-18 16:31:06 +08:00
Pengtai Xu
cd4e61d149
[legacy] remove deterministic data loader test
2023-09-15 15:52:18 +08:00
digger yu
9c2feb2f0b
fix some typo with colossalai/device colossalai/tensor/ etc. ( #4171 )
...
Co-authored-by: flybird11111 <1829166702@qq.com>
2023-09-12 17:41:52 +08:00