Commit Graph

2984 Commits (8e606ecc7e89ffed80537e89a27bb1eb6759f4bc)

Author SHA1 Message Date
Jianghai 8e606ecc7e
[Inference] Benchmarking rotary embedding and add a fetch function (#5277)
* fix bugs and add a cos/sin cache fetch func

* add docstring

* fix bug

* fix
2024-01-23 12:11:53 +08:00
yuehuayingxueluo b7853196a0
Merge pull request #5297 from yuehuayingxueluo/fix_rotary_embedding
[Inference/fix]Add utils.py for Rotary Embedding
2024-01-22 17:07:14 +08:00
yuehuayingxueluo cea9c86e45 add utils.py 2024-01-22 16:06:27 +08:00
yuehuayingxueluo bfff9254ac
[inference] Adapted to Rotary Embedding and RMS Norm (#5283)
* adapted to rotary_embedding

* adapted to nopad rms norm

* fix bugs in benchmark

* fix flash_decoding.py
2024-01-22 10:55:34 +08:00
Yuanheng Zhao 6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)
* prevent re-creating intermediate tensors

* add singleton class holding intermediate values

* fix triton kernel api

* add benchmark in pytest

* fix kernel api and add benchmark

* revise flash decoding triton kernel in/out shapes

* fix calling of triton kernel in modeling

* fix pytest: extract to util functions
2024-01-19 15:47:16 +08:00
Jianghai 9e2342bde2
[Hotfix] Fix bugs in testing continuous batching (#5270)
* fix bug

* fix bugs

* fix bugs

* fix bugs and add padding

* add funcs and fix bugs

* fix typos

* fix bugs

* add func
2024-01-18 16:31:14 +08:00
Yaozheng Fang 5ae9099f92
[kernel] Add RMSLayerNorm triton kernel (#5262)
* add layerrmsnorm triton kernel

* add layerrmsnorm kernel

* modify the atol and rtol in test file

* Remove the logics of mean computations, and update the name of ther kernel functions and files

* add benchmark of rms norm
2024-01-18 10:21:03 +08:00
yuehuayingxueluo 86b63f720c
[Inference]Adapted to the triton attn kernels (#5264)
* adapted to the triton attn kernels

* fix pad input

* adapted to copy_kv_to_blocked_cache

* fix ci test

* update kv memcpy

* remove print
2024-01-17 16:03:10 +08:00
Yuanheng Zhao 0f2b46a41c
[kernel] Revise KVCache copy triton kernel API (#5273)
* [kernel/fix] revise kvcache copy kernel api

* fix benchmark
2024-01-16 14:41:02 +08:00
Jianghai d8db500efc
[Inference] Fix request handler and add recycle logic (#5260)
* fix request handler

* fix comment
2024-01-15 17:50:46 +08:00
Frank Lee c597678da4
[doc] updated inference readme (#5269) 2024-01-15 17:37:41 +08:00
Yuanheng Zhao fa85e02b3b
[kernel] Add KV cache copy kernel during decoding (#5261)
* add kv copy triton kernel during decoding stage

* add pytest and fix kernel

* fix test utilities

* revise kernel config

* add benchmark for kvcache copy
2024-01-15 17:37:20 +08:00
FrankLeeeee 1ded7e81ef [git] fixed rebased files 2024-01-11 13:50:45 +00:00
Yuanheng Zhao 1513f20f4d [kernel] Add flash decoding triton kernel for blocked kv cache (#5249)
* add flash decoding unpad triton kernel

* rename flash decoding kernel

* add kernel testing (draft)

* revise pytest

* support kv group (GQA)

* (trivial) fix api and pytest

* (trivial) func renaming

* (trivial) func/file renaming

* refactor pytest for attention

* (trivial) format and consistent vars of context/decode attn

* (trivial) remove test redundancy
2024-01-11 13:46:14 +00:00
Jianghai fded91d049 [Inference] Kernel: no pad rotary embedding (#5252)
* fix bugs

* comment

* use more accurate atol

* fix
2024-01-11 13:46:14 +00:00
yuehuayingxueluo d40eb26029 fix bugs in request_handler.py and engine.py 2024-01-11 13:46:14 +00:00
yuehuayingxueluo 10e3c9f923 rm torch.cuda.synchronize 2024-01-11 13:46:14 +00:00
yuehuayingxueluo fab294c7f4 fix CI bugs 2024-01-11 13:46:14 +00:00
yuehuayingxueluo 2a73e828eb fix bugs related to processing padding mask 2024-01-11 13:46:14 +00:00
Jianghai e545a871b8 [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)
* fix accuracy

* alignment in attention

* fix attention

* fix

* fix bugs

* fix bugs

* fix bugs
2024-01-11 13:46:14 +00:00
yuehuayingxueluo fa4fbdbffb adapted to pad_context_forward 2024-01-11 13:44:06 +00:00
yuehuayingxueluo 47e53eaa1c fix bugs in attention.py and request_handler.py 2024-01-11 13:44:06 +00:00
Jianghai bfd9b1b494 [Inference] Pytorch Attention func, pad&nopad input support (#5219)
* add attn

* add attention test

* fix attn forward

* fix decoding
2024-01-11 13:44:06 +00:00
yuehuayingxueluo 3ad1f3b78b fix beam_width 2024-01-11 13:39:56 +00:00
yuehuayingxueluo b2eb9cd186 Fixed a typo 2024-01-11 13:39:56 +00:00
yuehuayingxueluo bbfebfb9fc fix bugs in sampler 2024-01-11 13:39:56 +00:00
yuehuayingxueluo 02c1bf8b2a add context_attention_unpadded 2024-01-11 13:39:56 +00:00
Yuanheng Zhao 07b5283b6a [kernel] Add triton kernel for context attention (FAv2) without padding (#5192)
* add context attn unpadded triton kernel

* test compatibility

* kv cache copy (testing)

* fix k/v cache copy

* fix kv cache copy and test

* fix boundary of block ptrs

* add support for GQA/MQA and testing

* fix import statement

---------

Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
2024-01-11 13:39:56 +00:00
yuehuayingxueluo 4df8876fca Fixed a writing error 2024-01-11 13:39:56 +00:00
yuehuayingxueluo 9489dc64d8 precision alignment 2024-01-11 13:39:56 +00:00
yuehuayingxueluo 62968588d1 fix bugs in request_handler 2024-01-11 13:39:56 +00:00
yuehuayingxueluo 62fd08ee44 Fixed a bug in the inference frame 2024-01-11 13:39:56 +00:00
yuehuayingxueluo 86853a37d5 Add padding llama model 2024-01-11 13:39:56 +00:00
Jianghai 0e616462a7 [Inference] add logit processor and request handler (#5166)
* add logit processor and request handler

* add

* add

* add

* fix

* add search tokens and update func

* finish request handler

* add running list test

* fix test

* fix some bug

* add

* add

* fix bugs

* fix some bugs

* fix bug

* fix

* fix

* add copy fun

* del useless attn

* fix request status

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:56 +00:00
yuehuayingxueluo 8daee26989 [Inference] Add the logic of the inference engine (#5173)
* add infer_struct and infer_config

* update codes

* change InferConfig

* Add hf_model_config to the engine

* rm _get_hf_model_config

* update codes

* made adjustments according to the feedback from the reviewer.

* update codes

* add ci test for config and struct

* Add the logic of the inference engine

* update engine and test

* Recover cache_manager.py

* add logger

* fix conflict

* update codes

* update codes

* update model and tokenizer

* fix add the logic about shardformer

* change kvcache_manager docstring

* add policy

* fix ci bug in test_kvcache_manager.py

* remove codes related o tokenizer and move model_policy

* fix  code style

* add ordered_set to requirements-infer.txt

* Delete extra empty lines

* add ordered_set to requirements-test.txt
2024-01-11 13:39:56 +00:00
Jianghai 93aeacca34 [Inference]Update inference config and fix test (#5178)
* unify the config setting

* fix test

* fix import

* fix test

* fix

* fix

* add logger

* revise log info

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:29 +00:00
Yuanheng Zhao 3de2e62299 [Inference] Add CacheBlock and KV-Cache Manager (#5156)
* [Inference] Add KVCache Manager

* function refactored

* add test for KVCache Manager

* add attr beam width

* Revise alloc func in CacheManager

* Fix docs and pytests

* add tp slicing for head number

* optimize shapes of tensors used as physical cache

* Apply using InferenceConfig on KVCacheManager

* rm duplicate config file

* Optimize cache allocation: use contiguous cache

* Fix config in pytest (and config)
2024-01-11 13:39:29 +00:00
yuehuayingxueluo fab9b931d9 [Inference]Add BatchInferState, Sequence and InferConfig (#5149)
* add infer_struct and infer_config

* update codes

* change InferConfig

* Add hf_model_config to the engine

* rm _get_hf_model_config

* update codes

* made adjustments according to the feedback from the reviewer.

* update codes

* add ci test for config and struct
2024-01-11 13:39:29 +00:00
Yuanheng Zhao 2bb92243d4 [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159)
* [inference/nfc] remove outdated inference tests

* remove outdated kernel tests

* remove deprecated triton kernels

* remove imports from deprecated kernels
2024-01-11 13:39:29 +00:00
Jianghai 56e75eeb06 [Inference] Add readme (roadmap) and fulfill request handler (#5147)
* request handler

* add readme

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:29 +00:00
Jianghai 4cf4682e70 [Inference] First PR for rebuild colossal-infer (#5143)
* add engine and scheduler

* add dirs

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:29 +00:00
binmakeswell c174c4fc5f
[doc] fix doc typo (#5256)
* [doc] fix annotation display

* [doc] fix llama2 doc
2024-01-11 21:01:11 +08:00
flybird11111 e830ef917d
[ci] fix shardformer tests. (#5255)
* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-11 19:07:45 +08:00
digger yu 756c400ad2
fix typo in applications/ColossalEval/README.md (#5250) 2024-01-11 17:58:38 +08:00
Frank Lee 2b83418719
[ci] fixed ddp test (#5254)
* [ci] fixed ddp test

* polish
2024-01-11 17:16:32 +08:00
Frank Lee d5eeeb1416
[ci] fixed booster test (#5251)
* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test
2024-01-11 16:04:45 +08:00
Frank Lee edf94a35c3
[workflow] fixed build CI (#5240)
* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish
2024-01-10 22:34:16 +08:00
digger yu 41e52c1c6e
[doc] fix typo in Colossal-LLaMA-2/README.md (#5247) 2024-01-10 19:24:56 +08:00
Elsa Granger d565df3821
[pipeline] A more general _communicate in p2p (#5062)
* A more general _communicate

* feat: finish tree_flatten version p2p

* fix: update p2p api calls

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-08 15:37:27 +08:00
binmakeswell 7bc6969ce6
[doc] SwiftInfer release (#5236)
* [doc] SwiftInfer release

* [doc] SwiftInfer release

* [doc] SwiftInfer release

* [doc] SwiftInfer release

* [doc] SwiftInfer release
2024-01-08 09:55:12 +08:00