Steve Luo
f7aecc0c6b
feat rmsnorm cuda kernel and add unittest, benchmark script ( #5417 )
2024-03-08 16:21:12 +08:00
xs_courtesy
95c21498d4
add silu_and_mul for infer
2024-03-07 16:57:49 +08:00
FrankLeeeee
0310b76e9d
Merge branch 'main' into sync/main
2024-03-04 10:09:36 +08:00
yuehuayingxueluo
0aa27f1961
[Inference]Move benchmark-related code to the example directory. ( #5408 )
...
* move benchmark-related code to the example directory.
* fix bugs in test_fused_rotary_embedding.py
2024-02-28 16:46:03 +08:00
yuehuayingxueluo
600881a8ea
[Inference]Add CUDA KVCache Kernel ( #5406 )
...
* add cuda KVCache kernel
* annotation benchmark_kvcache_copy
* add use cuda
* fix import path
* move benchmark scripts to example/
* rm benchmark codes in test_kv_cache_memcpy.py
* rm redundancy codes
* rm redundancy codes
* pr was modified according to the review
2024-02-28 14:36:50 +08:00
QinLuo
bf34c6fef6
[fsdp] impl save/load shard model/optimizer ( #5357 )
2024-02-27 13:51:14 +08:00
Yuanheng Zhao
19061188c3
[Infer/Fix] Fix Dependency in test - RMSNorm kernel ( #5399 )
...
fix dependency in pytest
2024-02-26 16:17:47 +08:00
yuehuayingxueluo
bc1da87366
[Fix/Inference] Fix format of input prompts and input model in inference engine ( #5395 )
...
* Fix bugs in inference_engine
* fix bugs in engine.py
* rm CUDA_VISIBLE_DEVICES
* add request_ids in generate
* fix bug in engine.py
* add logger.debug for BatchBucket
2024-02-23 10:51:35 +08:00
yuehuayingxueluo
2a718c8be8
Optimized the execution interval time between cuda kernels caused by view and memcopy ( #5390 )
...
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs
2024-02-21 13:23:57 +08:00
Jianghai
730103819d
[Inference]Fused kv copy into rotary calculation ( #5383 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del
2024-02-21 11:31:48 +08:00
Yuanheng Zhao
b21aac5bae
[Inference] Optimize and Refactor Inference Batching/Scheduling ( #5367 )
...
* add kvcache manager funcs for batching
* add batch bucket for batching
* revise RunningList struct in handler
* add kvcache/batch funcs for compatibility
* use new batching methods
* fix indexing bugs
* revise abort logic
* use cpu seq lengths/block tables
* rm unused attr in Sequence
* fix type conversion/default arg
* add and revise pytests
* revise pytests, rm unused tests
* rm unused statements
* fix pop finished indexing issue
* fix: use index in batch when retrieving inputs/update seqs
* use dict instead of odict in batch struct
* arg type hinting
* fix make compress
* refine comments
* fix: pop_n_seqs to pop the first n seqs
* add check in request handler
* remove redundant conversion
* fix test for request handler
* fix pop method in batch bucket
* fix prefill adding
2024-02-19 17:18:20 +08:00
ver217
06db94fbc9
[moe] fix tests
2024-02-08 12:46:37 +08:00
Xuanlei Zhao
7d8e0338a4
[moe] init mixtral impl
2024-02-07 19:21:02 +08:00
Jianghai
1f8c7e7046
[Inference] User Experience: update the logic of default tokenizer and generation config. ( #5337 )
...
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
2024-02-07 17:55:48 +08:00
yuehuayingxueluo
6fb4bcbb24
[Inference/opt] Fused KVCahce Memcopy ( #5374 )
...
* fused kv memcopy
* add TODO in test_kvcache_copy.py
2024-02-07 17:15:42 +08:00
Frank Lee
58740b5f68
[inference] added inference template ( #5375 )
2024-02-07 17:11:43 +08:00
Frank Lee
8106ede07f
Revert "[Inference] Adapt to Fused rotary ( #5348 )" ( #5373 )
...
This reverts commit 9f4ab2eb92
.
2024-02-07 14:27:04 +08:00
Jianghai
9f4ab2eb92
[Inference] Adapt to Fused rotary ( #5348 )
...
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
2024-02-07 11:36:04 +08:00
Hongxin Liu
c53ddda88f
[lr-scheduler] fix load state dict and add test ( #5369 )
2024-02-06 14:23:32 +08:00
yuehuayingxueluo
631862f339
[Inference]Optimize generation process of inference engine ( #5356 )
...
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback tesh_inference_engine.py
2024-02-02 15:38:21 +08:00
Wenhao Chen
1c790c0877
[fix] remove unnecessary dp_size assert ( #5351 )
...
* fix: remove unnecessary assert
* test: add more 3d plugin tests
* fix: add warning
2024-02-02 14:40:20 +08:00
Frank Lee
e76acbb076
[inference] moved ops tests to test_infer ( #5354 )
2024-02-02 13:51:22 +08:00
Frank Lee
db1a763307
[inference] removed redundancy init_batch ( #5353 )
2024-02-02 11:44:15 +08:00
Hongxin Liu
ffffc32dc7
[checkpointio] fix gemini and hybrid parallel optim checkpoint ( #5347 )
...
* [checkpointio] fix hybrid parallel optim checkpoint
* [extension] fix cuda extension
* [checkpointio] fix gemini optimizer checkpoint
* polish code
2024-02-01 16:13:06 +08:00
yuehuayingxueluo
249644c23b
[Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add ( #5340 )
...
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
2024-02-01 15:49:39 +08:00
Frank Lee
f8e456d202
[inference] simplified config verification ( #5346 )
...
* [inference] simplified config verification
* polish
* polish
2024-02-01 15:31:01 +08:00
Jianghai
df0aa49585
[Inference] Kernel Fusion, fused copy kv cache into rotary embedding ( #5336 )
...
* revise rotary embedding
* remove useless print
* adapt
2024-01-31 16:31:29 +08:00
FrankLeeeee
c565519913
merge commit
2024-01-31 10:41:47 +08:00
Yuanheng Zhao
5f98a9d68a
[Infer] Optimize Blocked KVCache And Kernels Using It ( #5325 )
...
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval
2024-01-30 16:06:09 +08:00
yuehuayingxueluo
e8f0642f28
[Inference]Add Nopadding Llama Modeling ( #5327 )
...
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style
2024-01-30 10:31:46 +08:00
Jianghai
1f8a75d470
[Inference] Update rms norm kernel, benchmark with vLLM ( #5315 )
...
* add
* xi
* del
* del
* fix
2024-01-29 10:22:33 +08:00
Jianghai
7ddd8b37f0
fix ( #5311 )
2024-01-26 15:02:12 +08:00
yuehuayingxueluo
4f28cb43c0
[inference]Optimize the usage of the mid tensors space in flash attn ( #5304 )
...
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
2024-01-26 14:00:10 +08:00
Frank Lee
7cfed5f076
[feat] refactored extension module ( #5298 )
...
* [feat] refactored extension module
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
2024-01-25 17:01:48 +08:00
Jianghai
c647e00e3c
[Inference]Add fused rotary kernel and get cos cache kernel ( #5302 )
...
* add fused rotary and get cos cache func
* staged
* fix bugs
* fix bugs
2024-01-24 16:20:42 +08:00
Yuanheng Zhao
3da9993b0d
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark ( #5301 )
...
* fix decoding kernel pytest
* revise and add triton context attn benchmark
2024-01-23 17:16:02 +08:00
Jianghai
8e606ecc7e
[Inference] Benchmarking rotary embedding and add a fetch function ( #5277 )
...
* fix bugs and add a cos/sin cache fetch func
* add docstring
* fix bug
* fix
2024-01-23 12:11:53 +08:00
Hongxin Liu
d7f8db8e21
[hotfix] fix 3d plugin test ( #5292 )
2024-01-22 15:19:04 +08:00
Yuanheng Zhao
6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking ( #5274 )
...
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
2024-01-19 15:47:16 +08:00
Jianghai
9e2342bde2
[Hotfix] Fix bugs in testing continuous batching ( #5270 )
...
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
2024-01-18 16:31:14 +08:00
ver217
148469348a
Merge branch 'main' into sync/npu
2024-01-18 12:05:21 +08:00
Yaozheng Fang
5ae9099f92
[kernel] Add RMSLayerNorm triton kernel ( #5262 )
...
* add layerrmsnorm triton kernel
* add layerrmsnorm kernel
* modify the atol and rtol in test file
* Remove the logics of mean computations, and update the name of ther kernel functions and files
* add benchmark of rms norm
2024-01-18 10:21:03 +08:00
Zhongkai Zhao
5d9a0ae75b
[hotfix] Fix ShardFormer test execution path when using sequence parallelism ( #5230 )
2024-01-17 17:42:29 +08:00
flybird11111
46e091651b
[shardformer] hybridparallelplugin support gradients accumulation. ( #5246 )
...
* support gradients acc
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
* fix
fix
* fix
fix
fix
2024-01-17 15:22:33 +08:00
flybird11111
2a0558d8ec
[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py ( #5276 )
...
* fix ci
fix
* fix test
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
* fix
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-17 13:38:55 +08:00
Frank Lee
d69cd2eb89
[workflow] fixed oom tests ( #5275 )
...
* [workflow] fixed oom tests
* polish
* polish
* polish
2024-01-16 18:55:13 +08:00
Yuanheng Zhao
0f2b46a41c
[kernel] Revise KVCache copy triton kernel API ( #5273 )
...
* [kernel/fix] revise kvcache copy kernel api
* fix benchmark
2024-01-16 14:41:02 +08:00
Yuanheng Zhao
fa85e02b3b
[kernel] Add KV cache copy kernel during decoding ( #5261 )
...
* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy
2024-01-15 17:37:20 +08:00
Wenhao Chen
ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg ( #5268 )
...
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check
2024-01-15 15:57:40 +08:00
FrankLeeeee
1ded7e81ef
[git] fixed rebased files
2024-01-11 13:50:45 +00:00