Commit Graph

1069 Commits (f7aecc0c6bac001d10c1dd00274e0152e4c86df6)

Author SHA1 Message Date
Steve Luo f7aecc0c6b
feat: add RMSNorm CUDA kernel, unit test, and benchmark script (#5417) 2024-03-08 16:21:12 +08:00
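The RMSNorm operation implemented by the kernel above normalizes by the root mean square of the last dimension, with no mean subtraction. A minimal PyTorch reference of the computation (a sketch for orientation, not the CUDA kernel's actual API):

```python
import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Scale each vector by the reciprocal RMS of its last dimension, then apply
    # the learned per-channel gain (no mean subtraction, unlike LayerNorm).
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight
```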
xs_courtesy 95c21498d4 add silu_and_mul for infer 2024-03-07 16:57:49 +08:00
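silu_and_mul is the gated-MLP activation used by Llama-style models: the input packs the gate and up projections along the last dimension, and the fused kernel computes silu(gate) * up in one pass. A PyTorch reference sketch (the function name mirrors the commit message; the real implementation is a fused CUDA kernel):

```python
import torch
import torch.nn.functional as F

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # x concatenates [gate, up] on the last dim; fusing avoids
    # materializing silu(gate) as a separate intermediate tensor.
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up
```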
FrankLeeeee 0310b76e9d Merge branch 'main' into sync/main 2024-03-04 10:09:36 +08:00
yuehuayingxueluo 0aa27f1961
[Inference] Move benchmark-related code to the example directory. (#5408)
* move benchmark-related code to the example directory.

* fix bugs in test_fused_rotary_embedding.py
2024-02-28 16:46:03 +08:00
yuehuayingxueluo 600881a8ea
[Inference] Add CUDA KVCache Kernel (#5406)
* add cuda KVCache kernel

* comment out benchmark_kvcache_copy

* add use cuda

* fix import path

* move benchmark scripts to example/

* rm benchmark code in test_kv_cache_memcpy.py

* rm redundant code

* revise PR according to review
2024-02-28 14:36:50 +08:00
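The KVCache copy kernel above scatters each newly decoded token's key/value vectors into its slot of a blocked (paged) cache. A plain PyTorch sketch of the indexing involved, assuming a (num_blocks, num_kv_heads, block_size, head_dim) cache layout — the layout and argument names are assumptions, not the exact ColossalAI signature:

```python
import torch

def copy_kv_to_blocked_cache_ref(
    k: torch.Tensor,            # (bsz, num_kv_heads, head_dim): new key per sequence
    k_cache: torch.Tensor,      # (num_blocks, num_kv_heads, block_size, head_dim)
    seq_lengths: torch.Tensor,  # (bsz,) lengths including the new token
    block_tables: torch.Tensor, # (bsz, max_blocks_per_seq) logical -> physical block ids
) -> None:
    block_size = k_cache.size(2)
    for i in range(k.size(0)):
        pos = int(seq_lengths[i]) - 1                       # slot of the new token
        block_id = int(block_tables[i, pos // block_size])  # which physical block
        k_cache[block_id, :, pos % block_size, :] = k[i]    # offset within the block
```

The CUDA kernel performs the same scatter for both k and v in parallel over the batch, without the Python loop.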
QinLuo bf34c6fef6
[fsdp] impl save/load shard model/optimizer (#5357) 2024-02-27 13:51:14 +08:00
Yuanheng Zhao 19061188c3
[Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
fix dependency in pytest
2024-02-26 16:17:47 +08:00
yuehuayingxueluo bc1da87366
[Fix/Inference] Fix format of input prompts and input model in inference engine (#5395)
* Fix bugs in inference_engine

* fix bugs in engine.py

* rm CUDA_VISIBLE_DEVICES

* add request_ids in generate

* fix bug in engine.py

* add logger.debug for BatchBucket
2024-02-23 10:51:35 +08:00
yuehuayingxueluo 2a718c8be8
Optimize the execution gap between CUDA kernels caused by view and memcopy (#5390)
* opt_view_and_memcopy

* fix bugs in ci

* fix ci bugs

* update benchmark scripts

* fix ci bugs
2024-02-21 13:23:57 +08:00
Jianghai 730103819d
[Inference] Fused KV copy into rotary calculation (#5383)
* revise rotary embedding

* remove useless print

* adapt

* fix

* add

* fix

* modeling

* fix

* fix

* fix

* fused kv copy

* fused copy

* colossalai/kernel/triton/no_pad_rotary_embedding.py

* del padding llama

* del
2024-02-21 11:31:48 +08:00
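Fusing the KV copy into the rotary kernel saves a round trip to global memory: rotated keys go straight into the blocked cache instead of being materialized first. An unfused reference of the two steps the Triton kernel combines (GPT-NeoX-style half rotation; same hypothetical cache layout as the KV-copy sketch above):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotary_then_copy_ref(q, k, cos, sin, k_cache, seq_lengths, block_tables):
    # q, k: (bsz, num_heads, head_dim); cos, sin: (bsz, 1, head_dim)
    q = q * cos + rotate_half(q) * sin
    k = k * cos + rotate_half(k) * sin
    # Scatter the rotated keys into the paged cache; the fused kernel does this
    # without writing k back to HBM between the two steps.
    block_size = k_cache.size(2)
    for i in range(k.size(0)):
        pos = int(seq_lengths[i]) - 1
        block_id = int(block_tables[i, pos // block_size])
        k_cache[block_id, :, pos % block_size, :] = k[i]
    return q
```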
Yuanheng Zhao b21aac5bae
[Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)
* add kvcache manager funcs for batching

* add batch bucket for batching

* revise RunningList struct in handler

* add kvcache/batch funcs for compatibility

* use new batching methods

* fix indexing bugs

* revise abort logic

* use cpu seq lengths/block tables

* rm unused attr in Sequence

* fix type conversion/default arg

* add and revise pytests

* revise pytests, rm unused tests

* rm unused statements

* fix pop finished indexing issue

* fix: use index in batch when retrieving inputs/update seqs

* use dict instead of odict in batch struct

* arg type hinting

* fix make compress

* refine comments

* fix: pop_n_seqs to pop the first n seqs

* add check in request handler

* remove redundant conversion

* fix test for request handler

* fix pop method in batch bucket

* fix prefill adding
2024-02-19 17:18:20 +08:00
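The batch-bucket refactor above keeps a fixed-capacity decoding batch whose per-slot metadata (sequence lengths, block tables) lives in preallocated CPU tensors, and tracks sequences in a plain dict rather than an OrderedDict, per the commit notes. A toy sketch of the structure (all names hypothetical):

```python
import torch

class BatchBucketSketch:
    """Toy fixed-capacity batch: each live sequence occupies one slot."""

    def __init__(self, max_batch_size: int, max_blocks_per_seq: int):
        self.seqs = {}  # uid -> (sequence, slot); plain dict, not OrderedDict
        self.seq_lengths = torch.zeros(max_batch_size, dtype=torch.int32)
        self.block_tables = torch.full(
            (max_batch_size, max_blocks_per_seq), -1, dtype=torch.int32
        )
        self.free_slots = list(range(max_batch_size))

    def add_seq(self, uid: int, seq, prompt_len: int) -> int:
        slot = self.free_slots.pop(0)        # take the first free slot
        self.seqs[uid] = (seq, slot)
        self.seq_lengths[slot] = prompt_len
        return slot

    def pop_seq(self, uid: int):
        seq, slot = self.seqs.pop(uid)       # release the slot for reuse
        self.seq_lengths[slot] = 0
        self.block_tables[slot].fill_(-1)
        self.free_slots.append(slot)
        return seq
```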
ver217 06db94fbc9 [moe] fix tests 2024-02-08 12:46:37 +08:00
Xuanlei Zhao 7d8e0338a4 [moe] init mixtral impl 2024-02-07 19:21:02 +08:00
Jianghai 1f8c7e7046
[Inference] User Experience: update the logic of default tokenizer and generation config. (#5337)
* add

* fix

* fix

* pause

* fix

* fix pytest

* align

* fix

* license

* fix

* fix

* fix readme

* fix some bugs

* remove tokenizer config
2024-02-07 17:55:48 +08:00
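The user-experience change above is about sensible fallbacks: when the caller supplies no tokenizer or generation config, derive them from the model itself. A hedged sketch of that logic using Hugging Face transformers (the actual ColossalAI code may resolve defaults differently):

```python
from transformers import AutoTokenizer, GenerationConfig

def resolve_defaults(model_name_or_path: str, tokenizer=None, generation_config=None):
    # No tokenizer given: load the one shipped with the model.
    if tokenizer is None:
        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    # No generation config given: prefer the model's own, else library defaults.
    if generation_config is None:
        try:
            generation_config = GenerationConfig.from_pretrained(model_name_or_path)
        except OSError:
            generation_config = GenerationConfig()
    return tokenizer, generation_config
```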
yuehuayingxueluo 6fb4bcbb24
[Inference/opt] Fused KVCache Memcopy (#5374)
* fused kv memcopy

* add TODO in test_kvcache_copy.py
2024-02-07 17:15:42 +08:00
Frank Lee 58740b5f68
[inference] added inference template (#5375) 2024-02-07 17:11:43 +08:00
Frank Lee 8106ede07f
Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)
This reverts commit 9f4ab2eb92.
2024-02-07 14:27:04 +08:00
Jianghai 9f4ab2eb92
[Inference] Adapt to Fused rotary (#5348)
* revise rotary embedding

* remove useless print

* adapt

* fix

* add

* fix

* modeling

* fix

* fix

* fix
2024-02-07 11:36:04 +08:00
Hongxin Liu c53ddda88f
[lr-scheduler] fix load state dict and add test (#5369) 2024-02-06 14:23:32 +08:00
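The fix above concerns round-tripping an LR scheduler through its state dict. A minimal PyTorch illustration of the pattern the new test presumably covers (model and optimizer here are stand-ins):

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)

for _ in range(3):
    opt.step()
    sched.step()

state = sched.state_dict()  # a plain dict, safe to checkpoint with torch.save
fresh = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)
fresh.load_state_dict(state)
assert fresh.last_epoch == sched.last_epoch  # resumes at the same step
```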
yuehuayingxueluo 631862f339
[Inference] Optimize generation process of inference engine (#5356)
* opt inference engine

* fix run_benchmark.sh

* fix generate in engine.py

* rollback test_inference_engine.py
2024-02-02 15:38:21 +08:00
Wenhao Chen 1c790c0877
[fix] remove unnecessary dp_size assert (#5351)
* fix: remove unnecessary assert

* test: add more 3d plugin tests

* fix: add warning
2024-02-02 14:40:20 +08:00
Frank Lee e76acbb076
[inference] moved ops tests to test_infer (#5354) 2024-02-02 13:51:22 +08:00
Frank Lee db1a763307
[inference] removed redundancy init_batch (#5353) 2024-02-02 11:44:15 +08:00
Hongxin Liu ffffc32dc7
[checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347)
* [checkpointio] fix hybrid parallel optim checkpoint

* [extension] fix cuda extension

* [checkpointio] fix gemini optimizer checkpoint

* polish code
2024-02-01 16:13:06 +08:00
yuehuayingxueluo 249644c23b
[Inference] Replace attention and MLP layers with shardformer to optimize the weight transpose operation; add fused_qkv and fused linear_add (#5340)
* add fused qkv

* replace attn and mlp by shardformer

* fix bugs in mlp

* add docstrings

* fix test_inference_engine.py

* add optimize unbind

* add fused_addmm

* rm squeeze(1)

* refactor codes

* fix ci bugs

* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention

* Removed the dependency on LlamaFlashAttention2

* rollback test_inference_engine.py
2024-02-01 15:49:39 +08:00
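The fused_qkv part of the commit above merges the three attention projections into a single GEMM, cutting kernel launches and the per-projection weight handling the subject line mentions. A simplified sketch (assumes equal Q/K/V widths; models with grouped KV heads split unevenly):

```python
import torch.nn as nn

class FusedQKV(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # One matmul yields Q, K and V instead of three separate Linears.
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)

    def forward(self, x):
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        return q, k, v
```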
Frank Lee f8e456d202
[inference] simplified config verification (#5346)
* [inference] simplified config verification

* polish

* polish
2024-02-01 15:31:01 +08:00
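Simplified config verification typically means one validation hook run at construction rather than checks scattered across the engine. A hypothetical, trimmed-down illustration of the pattern (fields invented for the example):

```python
from dataclasses import dataclass

@dataclass
class InferenceConfigSketch:
    max_batch_size: int = 8
    max_input_len: int = 256
    dtype: str = "fp16"

    def __post_init__(self):
        self._verify()

    def _verify(self):
        # All sanity checks live in one place, run once on construction.
        assert self.max_batch_size > 0, "max_batch_size must be positive"
        assert self.max_input_len > 0, "max_input_len must be positive"
        assert self.dtype in ("fp16", "bf16", "fp32"), f"unsupported dtype: {self.dtype}"
```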
Jianghai df0aa49585
[Inference] Kernel fusion: fuse KV-cache copy into rotary embedding (#5336)
* revise rotary embedding

* remove useless print

* adapt
2024-01-31 16:31:29 +08:00
FrankLeeeee c565519913 merge commit 2024-01-31 10:41:47 +08:00
Yuanheng Zhao 5f98a9d68a
[Infer] Optimize Blocked KVCache And Kernels Using It (#5325)
* revise shape of kvcache (context attn kernel)

* revise shape of kvcache (flash decoding kernel)

* revise shape of kvcache (kvcache copy) and attn func

* init of kvcache in kvcache manager

* revise llama modeling

* revise block size retrieval

* use torch for rms_norm benchmarking

* revise block size retrieval
2024-01-30 16:06:09 +08:00
yuehuayingxueluo e8f0642f28
[Inference] Add Nopadding Llama Modeling (#5327)
* add nopadding llama modeling

* add nopadding_llama.py

* rm unused codes

* fix bugs in test_xine_copy.py

* fix code style
2024-01-30 10:31:46 +08:00
Jianghai 1f8a75d470
[Inference] Update rms norm kernel, benchmark with vLLM (#5315)
* add

* xi

* del

* del

* fix
2024-01-29 10:22:33 +08:00
Jianghai 7ddd8b37f0
fix (#5311) 2024-01-26 15:02:12 +08:00
yuehuayingxueluo 4f28cb43c0
[inference] Optimize the usage of intermediate tensor space in flash attention (#5304)
* opt flash attn

* opt tmp tensor

* fix benchmark_llama

* fix code style

* fix None logic for output tensor

* fix adaptation to get_xine_cache

* add comment

* fix ci bugs

* fix some codes

* rm duplicated codes

* rm duplicated codes

* fix code style

* add _get_dtype in config.py
2024-01-26 14:00:10 +08:00
Frank Lee 7cfed5f076
[feat] refactored extension module (#5298)
* [feat] refactored extension module

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish
2024-01-25 17:01:48 +08:00
Jianghai c647e00e3c
[Inference] Add fused rotary kernel and get-cos-cache kernel (#5302)
* add fused rotary and get cos cache func

* staged

* fix bugs

* fix bugs
2024-01-24 16:20:42 +08:00
Yuanheng Zhao 3da9993b0d
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)
* fix decoding kernel pytest

* revise and add triton context attn benchmark
2024-01-23 17:16:02 +08:00
Jianghai 8e606ecc7e
[Inference] Benchmark rotary embedding and add a fetch function (#5277)
* fix bugs and add a cos/sin cache fetch func

* add docstring

* fix bug

* fix
2024-01-23 12:11:53 +08:00
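The fetch function's job is to look up precomputed cos/sin rotary values by position instead of recomputing them every step. A reference sketch of both the cache build and the decoding-stage fetch (names illustrative, not the actual ColossalAI API):

```python
import torch

def build_cos_sin_cache(head_dim: int, max_positions: int, base: float = 10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = torch.outer(torch.arange(max_positions).float(), inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)   # (max_positions, head_dim)
    return emb.cos(), emb.sin()

def get_cos_sin(cos_cache, sin_cache, seq_lengths: torch.Tensor):
    # At decoding time each sequence only needs the entry for its latest position.
    pos = seq_lengths.long() - 1
    return cos_cache[pos], sin_cache[pos]
```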
Hongxin Liu d7f8db8e21
[hotfix] fix 3d plugin test (#5292) 2024-01-22 15:19:04 +08:00
Yuanheng Zhao 6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)
* prevent re-creating intermediate tensors

* add singleton class holding intermediate values

* fix triton kernel api

* add benchmark in pytest

* fix kernel api and add benchmark

* revise flash decoding triton kernel in/out shapes

* fix calling of triton kernel in modeling

* fix pytest: extract to util functions
2024-01-19 15:47:16 +08:00
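The "singleton class holding intermediate values" above refers to caching the kernel's scratch tensors across decoding steps so they are not re-created per call. A sketch of that idea (hypothetical class, not the actual ColossalAI type):

```python
import torch

class IntermTensorCache:
    _instance = None

    def __new__(cls):
        # Classic singleton: one shared cache of scratch buffers per process.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._buffers = {}
        return cls._instance

    def get(self, key: str, shape, dtype, device):
        buf = self._buffers.get(key)
        if buf is None or buf.shape != torch.Size(shape) or buf.dtype != dtype:
            buf = torch.empty(*shape, dtype=dtype, device=device)  # (re)allocate once
            self._buffers[key] = buf
        return buf
```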
Jianghai 9e2342bde2
[Hotfix] Fix bugs in testing continuous batching (#5270)
* fix bug

* fix bugs

* fix bugs

* fix bugs and add padding

* add funcs and fix bugs

* fix typos

* fix bugs

* add func
2024-01-18 16:31:14 +08:00
ver217 148469348a Merge branch 'main' into sync/npu 2024-01-18 12:05:21 +08:00
Yaozheng Fang 5ae9099f92
[kernel] Add RMSLayerNorm triton kernel (#5262)
* add layerrmsnorm triton kernel

* add layerrmsnorm kernel

* modify the atol and rtol in test file

* Remove the logic of mean computation, and update the names of the kernel functions and files

* add benchmark of rms norm
2024-01-18 10:21:03 +08:00
Zhongkai Zhao 5d9a0ae75b
[hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230) 2024-01-17 17:42:29 +08:00
flybird11111 46e091651b
[shardformer] HybridParallelPlugin supports gradient accumulation (#5246)
* support gradient accumulation

* fix (multiple follow-up fixes)
2024-01-17 15:22:33 +08:00
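Gradient accumulation reduces to: scale each micro-batch loss by 1/accum_steps, let gradients pile up in .grad, and only step/zero the optimizer at window boundaries; the plugin must additionally defer gradient sync under data parallelism. A plain-PyTorch sketch of the schedule (toy model and data):

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [torch.randn(16, 8) for _ in range(8)]

accum_steps = 4
optimizer.zero_grad()
for step, x in enumerate(batches):
    loss = model(x).pow(2).mean() / accum_steps  # scale so grads average over the window
    loss.backward()                              # grads accumulate across iterations
    if (step + 1) % accum_steps == 0:            # step only at window boundaries
        optimizer.step()
        optimizer.zero_grad()
```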
flybird11111 2a0558d8ec
[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)
* fix ci

fix

* fix test

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

* fix

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-17 13:38:55 +08:00
Frank Lee d69cd2eb89
[workflow] fixed oom tests (#5275)
* [workflow] fixed oom tests

* polish

* polish

* polish
2024-01-16 18:55:13 +08:00
Yuanheng Zhao 0f2b46a41c
[kernel] Revise KVCache copy triton kernel API (#5273)
* [kernel/fix] revise kvcache copy kernel api

* fix benchmark
2024-01-16 14:41:02 +08:00
Yuanheng Zhao fa85e02b3b
[kernel] Add KV cache copy kernel during decoding (#5261)
* add kv copy triton kernel during decoding stage

* add pytest and fix kernel

* fix test utilities

* revise kernel config

* add benchmark for kvcache copy
2024-01-15 17:37:20 +08:00
Wenhao Chen ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg (#5268)
* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check
2024-01-15 15:57:40 +08:00
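The mbs fix above concerns how num_microbatches and microbatch_size interact in pipeline parallelism. A hedged sketch of the kind of sanity check involved (argument names inferred from the commit message, not the plugin's real signature):

```python
def check_pp_args(global_batch_size: int, num_microbatches=None, microbatch_size=None) -> int:
    # Exactly one of the two knobs should be provided.
    if (num_microbatches is None) == (microbatch_size is None):
        raise ValueError("provide exactly one of num_microbatches / microbatch_size")
    if num_microbatches is None:
        if global_batch_size % microbatch_size != 0:
            raise ValueError("batch size must be divisible by microbatch_size")
        return global_batch_size // microbatch_size
    if global_batch_size % num_microbatches != 0:
        raise ValueError("batch size must be divisible by num_microbatches")
    return num_microbatches
```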
FrankLeeeee 1ded7e81ef [git] fixed rebased files 2024-01-11 13:50:45 +00:00