Commit Graph

65 Commits (1f8a75d470d548bfd4db877e73102b8fad5cdfa9)

Author SHA1 Message Date
yuehuayingxueluo 4f28cb43c0
[inference]Optimize the usage of the mid tensors space in flash attn (#5304)
* opt flash attn

* opt tmp tensor

* fix benchmark_llama

* fix code style

* fix None logic for output tensor

* fix adapted to get_xine_cache

* add comment

* fix ci bugs

* fix some codes

* rm duplicated codes

* rm duplicated codes

* fix code style

* add _get_dtype in config.py
2024-01-26 14:00:10 +08:00
Yuanheng Zhao 3da9993b0d
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)
* fix decoding kernel pytest

* revise and add triton context attn benchmark
2024-01-23 17:16:02 +08:00
yuehuayingxueluo cea9c86e45 add utils.py 2024-01-22 16:06:27 +08:00
yuehuayingxueluo bfff9254ac
[inference] Adapted to Rotary Embedding and RMS Norm (#5283)
* adapted to rotary_embedding

* adapted to nopad rms norm

* fix bugs in benchmark

* fix flash_decoding.py
2024-01-22 10:55:34 +08:00
Yuanheng Zhao 6e487e7d3c
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)
* prevent re-creating intermediate tensors

* add singleton class holding intermediate values

* fix triton kernel api

* add benchmark in pytest

* fix kernel api and add benchmark

* revise flash decoding triton kernel in/out shapes

* fix calling of triton kernel in modeling

* fix pytest: extract to util functions
2024-01-19 15:47:16 +08:00
Jianghai 9e2342bde2
[Hotfix] Fix bugs in testing continuous batching (#5270)
* fix bug

* fix bugs

* fix bugs

* fix bugs and add padding

* add funcs and fix bugs

* fix typos

* fix bugs

* add func
2024-01-18 16:31:14 +08:00
yuehuayingxueluo 86b63f720c
[Inference]Adapted to the triton attn kernels (#5264)
* adapted to the triton attn kernels

* fix pad input

* adapted to copy_kv_to_blocked_cache

* fix ci test

* update kv memcpy

* remove print
2024-01-17 16:03:10 +08:00
Jianghai d8db500efc
[Inference] Fix request handler and add recycle logic (#5260)
* fix request handler

* fix comment
2024-01-15 17:50:46 +08:00
Frank Lee c597678da4
[doc] updated inference readme (#5269) 2024-01-15 17:37:41 +08:00
Yuanheng Zhao fa85e02b3b
[kernel] Add KV cache copy kernel during decoding (#5261)
* add kv copy triton kernel during decoding stage

* add pytest and fix kernel

* fix test utilities

* revise kernel config

* add benchmark for kvcache copy
2024-01-15 17:37:20 +08:00
FrankLeeeee 1ded7e81ef [git] fixed rebased files 2024-01-11 13:50:45 +00:00
yuehuayingxueluo d40eb26029 fix bugs in request_handler.py and engine.py 2024-01-11 13:46:14 +00:00
yuehuayingxueluo 10e3c9f923 rm torch.cuda.synchronize 2024-01-11 13:46:14 +00:00
yuehuayingxueluo fab294c7f4 fix CI bugs 2024-01-11 13:46:14 +00:00
yuehuayingxueluo 2a73e828eb fix bugs related to processing padding mask 2024-01-11 13:46:14 +00:00
Jianghai e545a871b8 [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)
* fix accuracy

* alignment in attention

* fix attention

* fix

* fix bugs

* fix bugs

* fix bugs
2024-01-11 13:46:14 +00:00
yuehuayingxueluo fa4fbdbffb adapted to pad_context_forward 2024-01-11 13:44:06 +00:00
yuehuayingxueluo 47e53eaa1c fix bugs in attention.py and request_handler.py 2024-01-11 13:44:06 +00:00
Jianghai bfd9b1b494 [Inference] Pytorch Attention func, pad&nopad input support (#5219)
* add attn

* add attention test

* fix attn forward

* fix decoding
2024-01-11 13:44:06 +00:00
yuehuayingxueluo 3ad1f3b78b fix beam_width 2024-01-11 13:39:56 +00:00
yuehuayingxueluo b2eb9cd186 Fixed a typo 2024-01-11 13:39:56 +00:00
yuehuayingxueluo bbfebfb9fc fix bugs in sampler 2024-01-11 13:39:56 +00:00
yuehuayingxueluo 02c1bf8b2a add context_attention_unpadded 2024-01-11 13:39:56 +00:00
yuehuayingxueluo 9489dc64d8 precision alignment 2024-01-11 13:39:56 +00:00
yuehuayingxueluo 62968588d1 fix bugs in request_handler 2024-01-11 13:39:56 +00:00
yuehuayingxueluo 62fd08ee44 Fixed a bug in the inference frame 2024-01-11 13:39:56 +00:00
yuehuayingxueluo 86853a37d5 Add padding llama model 2024-01-11 13:39:56 +00:00
Jianghai 0e616462a7 [Inference] add logit processor and request handler (#5166)
* add logit processor and request handler

* add

* add

* add

* fix

* add search tokens and update func

* finish request handler

* add running list test

* fix test

* fix some bug

* add

* add

* fix bugs

* fix some bugs

* fix bug

* fix

* fix

* add copy fun

* del useless attn

* fix request status

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:56 +00:00
yuehuayingxueluo 8daee26989 [Inference] Add the logic of the inference engine (#5173)
* add infer_struct and infer_config

* update codes

* change InferConfig

* Add hf_model_config to the engine

* rm _get_hf_model_config

* update codes

* made adjustments according to the feedback from the reviewer.

* update codes

* add ci test for config and struct

* Add the logic of the inference engine

* update engine and test

* Recover cache_manager.py

* add logger

* fix conflict

* update codes

* update codes

* update model and tokenizer

* fix add the logic about shardformer

* change kvcache_manager docstring

* add policy

* fix ci bug in test_kvcache_manager.py

* remove codes related o tokenizer and move model_policy

* fix  code style

* add ordered_set to requirements-infer.txt

* Delete extra empty lines

* add ordered_set to requirements-test.txt
2024-01-11 13:39:56 +00:00
Jianghai 93aeacca34 [Inference]Update inference config and fix test (#5178)
* unify the config setting

* fix test

* fix import

* fix test

* fix

* fix

* add logger

* revise log info

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:29 +00:00
Yuanheng Zhao 3de2e62299 [Inference] Add CacheBlock and KV-Cache Manager (#5156)
* [Inference] Add KVCache Manager

* function refactored

* add test for KVCache Manager

* add attr beam width

* Revise alloc func in CacheManager

* Fix docs and pytests

* add tp slicing for head number

* optimize shapes of tensors used as physical cache

* Apply using InferenceConfig on KVCacheManager

* rm duplicate config file

* Optimize cache allocation: use contiguous cache

* Fix config in pytest (and config)
2024-01-11 13:39:29 +00:00
yuehuayingxueluo fab9b931d9 [Inference]Add BatchInferState, Sequence and InferConfig (#5149)
* add infer_struct and infer_config

* update codes

* change InferConfig

* Add hf_model_config to the engine

* rm _get_hf_model_config

* update codes

* made adjustments according to the feedback from the reviewer.

* update codes

* add ci test for config and struct
2024-01-11 13:39:29 +00:00
Jianghai 56e75eeb06 [Inference] Add readme (roadmap) and fulfill request handler (#5147)
* request handler

* add readme

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:29 +00:00
Jianghai 4cf4682e70 [Inference] First PR for rebuild colossal-infer (#5143)
* add engine and scheduler

* add dirs

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2024-01-11 13:39:29 +00:00
Zhongkai Zhao 75af66cd81
[Hotfix] Fix model policy matching strategy in ShardFormer (#5064)
* hotfix/Fix get model policy strategy in ShardFormer

* fix bug in auto policy
2023-11-22 11:19:39 +08:00
Hongxin Liu 1cd7efc520
[inference] refactor examples and fix schedule (#5077)
* [setup] refactor infer setup

* [hotfix] fix infenrece behavior on 1 1 gpu

* [exmaple] refactor inference examples
2023-11-21 10:46:03 +08:00
Xu Kai fb103cfd6e
[inference] update examples and engine (#5073)
* update examples and engine

* fix choices

* update example
2023-11-20 19:44:52 +08:00
Bin Jia 0c7d8bebd5
[hotfix/hybridengine] fix bug when tp*pp size = 1 (#5069) 2023-11-20 17:15:37 +08:00
Cuiqing Li (李崔卿) bce919708f
[Kernels]added flash-decoidng of triton (#5063)
* added flash-decoidng of triton based on lightllm kernel

* add req

* clean

* clean

* delete build.sh

---------

Co-authored-by: cuiqing.li <lixx336@gmail.com>
2023-11-20 13:58:29 +08:00
Xu Kai fd6482ad8c
[inference] Refactor inference architecture (#5057)
* [inference] support only TP (#4998)

* support only tp

* enable tp

* add support for bloom (#5008)

* [refactor] refactor gptq and smoothquant llama (#5012)

* refactor gptq and smoothquant llama

* fix import error

* fix linear import torch-int

* fix smoothquant llama import error

* fix import accelerate error

* fix bug

* fix import smooth cuda

* fix smoothcuda

* [Inference Refactor] Merge chatglm2 with pp and tp (#5023)

merge chatglm with pp and tp

* [Refactor] remove useless inference code (#5022)

* remove useless code

* fix quant model

* fix test import bug

* mv original inference legacy

* fix chatglm2

* [Refactor] refactor policy search and quant type controlling in inference (#5035)

* [Refactor] refactor policy search and quant type controling in inference

* [inference] update readme (#5051)

* update readme

* update readme

* fix architecture

* fix table

* fix table

* [inference] udpate example (#5053)

* udpate example

* fix run.sh

* fix rebase bug

* fix some errors

* update readme

* add some features

* update interface

* update readme

* update benchmark

* add requirements-infer

---------

Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
2023-11-19 21:05:05 +08:00
Cuiqing Li (李崔卿) 28052a71fb
[Kernels]Update triton kernels into 2.1.0 (#5046)
* update flash-context-attention

* adding kernels

* fix

* reset

* add build script

* add building process

* add llama2 exmaple

* add colossal-llama2 test

* clean

* fall back test setting

* fix test file

* clean

* clean

* clean

---------

Co-authored-by: cuiqing.li <lixx336@gmail.com>
2023-11-16 16:43:15 +08:00
Zhongkai Zhao 70885d707d
[hotfix] Suport extra_kwargs in ShardConfig (#5031)
* [refactor]: replace inference args with extra_kwargs in ShardConfig

* modify shardconfig

* polish code

* fix policy bug in llama

* fix bug in auto policy

* remove setattr in ShardConfig
2023-11-10 10:49:50 +08:00
Xuanlei Zhao f71e63b0f3
[moe] support optimizer checkpoint (#5015)
* Refactor MoE Manager setup method

* unshard optim ckpt

* optim io

* update transformer version

* update requirements

* update ckpt

* update ckpt

* update ckpt

* fix engine

* fix engine
2023-11-08 15:07:03 +00:00
Jianghai ef4c14a5e2
[Inference] Fix bug in ChatGLM2 Tensor Parallelism (#5014)
* fix bug

* fix

* fix multiquery

* fix multiquery

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2023-11-07 15:01:50 +08:00
github-actions[bot] c36e782d80
[format] applied code formatting on changed files in pull request 4926 (#5007)
Co-authored-by: github-actions <github-actions@github.com>
2023-11-06 17:08:12 +08:00
littsk 1a3315e336
[hotfix] Add layer norm gradients all-reduce for sequence parallel (#4926)
* [hotfix] Add layer norm gradients all-reduce for sequence parallel. (#4915)

* Add layer norm gradients all-reduce for sequence parallel.

* skip pipeline inference test

* [hotfix] fixing polices of sequence parallel (#4922)

* Add layer norm gradients all-reduce for sequence parallel.

* fix parameter passing when calling get_autopolicy

---------

Co-authored-by: littsk <1214689160@qq.com>

* Hotfix/add grad all reduce for sequence parallel (#4927)

* Add layer norm gradients all-reduce for sequence parallel.


* fix parameter passing when calling get_autopolicy

* fix bug using wrong variables

---------

Co-authored-by: littsk <1214689160@qq.com>

* fix policy initialization

* fix bloom and chatglm policices

* polish code of handling layernorm

* fix moe module

* polish code of class initializing

---------

Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
2023-11-03 13:32:43 +08:00
Bin Jia b6696beb04
[Pipeline Inference] Merge pp with tp (#4993)
* refactor pipeline into new CaiInferEngine

* updata llama modeling forward

* merge tp with pp

* update docstring

* optimize test workflow and example

* fix typo

* add assert and todo
2023-11-01 12:46:21 +08:00
Cuiqing Li (李崔卿) 4f0234f236
[doc]Update doc for colossal-inference (#4989)
* update doc

* Update README.md

---------

Co-authored-by: cuiqing.li <lixx336@gmail.com>
2023-10-31 10:48:07 +08:00
Cuiqing Li 459a88c806
[Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965)
* adding flash-decoding

* clean

* adding kernel

* adding flash-decoding

* add integration

* add

* adding kernel

* adding kernel

* adding triton 2.1.0 features for inference

* update bloom triton kernel

* remove useless vllm kernels

* clean codes

* fix

* adding files

* fix readme

* update llama flash-decoding

---------

Co-authored-by: cuiqing.li <lixx336@gmail.com>
2023-10-30 14:04:37 +08:00
Jianghai cf579ff46d
[Inference] Dynamic Batching Inference, online and offline (#4953)
* [inference] Dynamic Batching for Single and Multiple GPUs (#4831)

* finish batch manager

* 1

* first

* fix

* fix dynamic batching

* llama infer

* finish test

* support different lengths generating

* del prints

* del prints

* fix

* fix bug

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

* [inference] Async dynamic batching  (#4894)

* finish input and output logic

* add generate

* test forward

* 1

* [inference]Re push async dynamic batching (#4901)

* adapt to ray server

* finish async

* finish test

* del test

---------

Co-authored-by: yuehuayingxueluo <867460659@qq.com>

* Revert "[inference]Re push async dynamic batching (#4901)" (#4905)

This reverts commit fbf3c09e67.

* Revert "[inference] Async dynamic batching  (#4894)"

This reverts commit fced140250.

* Revert "[inference] Async dynamic batching  (#4894)" (#4909)

This reverts commit fced140250.

* Add Ray Distributed Environment Init Scripts

* support DynamicBatchManager base function

* revert _set_tokenizer version

* add driver async generate

* add async test

* fix bugs in test_ray_dist.py

* add get_tokenizer.py

* fix code style

* fix bugs about No module named 'pydantic' in ci test

* fix bugs in ci test

* fix bugs in ci test

* fix bugs in ci test

* [infer]Add Ray Distributed Environment Init Scripts (#4911)

* Revert "[inference] Async dynamic batching  (#4894)"

This reverts commit fced140250.

* Add Ray Distributed Environment Init Scripts

* support DynamicBatchManager base function

* revert _set_tokenizer version

* add driver async generate

* add async test

* fix bugs in test_ray_dist.py

* add get_tokenizer.py

* fix code style

* fix bugs about No module named 'pydantic' in ci test

* fix bugs in ci test

* fix bugs in ci test

* fix bugs in ci test

* support dynamic batch for bloom model and is_running function

* [Inference]Test for new Async engine (#4935)

* infer engine

* infer engine

* test engine

* test engine

* new manager

* change step

* add

* test

* fix

* fix

* finish test

* finish test

* finish test

* finish test

* add license

---------

Co-authored-by: yuehuayingxueluo <867460659@qq.com>

* add assertion for config (#4947)

* [Inference] Finish dynamic batching offline test (#4948)

* test

* fix test

* fix quant

* add default

* fix

* fix some bugs

* fix some bugs

* fix

* fix bug

* fix bugs

* reset param

---------

Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
2023-10-30 10:52:19 +08:00