Commit Graph

3265 Commits (5bbab1533ae7672ab37e91b7bc9e584b3a4e9cc1)

Author SHA1 Message Date
Yuanheng Zhao 5bbab1533a
[ci] Fix example tests (#5714)
* [fix] revise timeout value on example CI

* trivial
2024-05-14 16:08:51 +08:00
傅剑寒 121d7ad629
[Inference] Delete duplicated copy_vector (#5716) 2024-05-14 14:35:33 +08:00
Steve Luo 7806842f2d
add paged-attetionv2: support seq length split across thread block (#5707) 2024-05-14 12:46:54 +08:00
Runyu Lu 18d67d0e8e
[Feat]Inference RPC Server Support (#5705)
* rpc support source
* kv cache logical/physical disaggregation
* sampler refactor
* colossalai launch built in
* Unitest
* Rpyc support

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-14 10:00:55 +08:00
yuehuayingxueluo de4bf3dedf
[Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708)
* Adapt repetition_penalty and no_repeat_ngram_size

* fix no_repeat_ngram_size_logit_process

* remove batch_updated

* fix annotation

* modified codes based on the review feedback.

* rm get_batch_token_ids
2024-05-11 15:13:25 +08:00
傅剑寒 50104ab340
[Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706)
* add convert_fp8 op for fp8 test in the future

* rerun ci
2024-05-10 18:39:54 +08:00
傅剑寒 bfad39357b
[Inference/Feat] Add quant kvcache interface (#5700)
* add quant kvcache interface

* delete unused output

* complete args comments
2024-05-09 18:03:24 +08:00
Jianghai 492520dbdb
Merge pull request #5588 from hpcaitech/feat/online-serving
[Feature]Online Serving
2024-05-09 17:19:45 +08:00
CjhHa1 5d9a49483d [Inference] Add example test_ci script 2024-05-09 05:44:05 +00:00
CjhHa1 bc9063adf1 resolve rebase conflicts on Branch feat/online-serving 2024-05-08 15:20:53 +00:00
Jianghai 61a1b2e798 [Inference] Fix bugs and docs for feat/online-server (#5598)
* fix test bugs

* add do sample test

* del useless lines

* fix comments

* fix tests

* delete version tag

* delete version tag

* add

* del test sever

* fix test

* fix

* Revert "add"

This reverts commit b9305fb024.
2024-05-08 15:20:53 +00:00
CjhHa1 7bbb28e48b [Inference] resolve rebase conflicts
fix
2024-05-08 15:20:53 +00:00
Jianghai c064032865 [Online Server] Chat Api for streaming and not streaming response (#5470)
* fix bugs

* fix bugs

* fix api server

* fix api server

* add chat api and test

* del request.n
2024-05-08 15:20:53 +00:00
Jianghai de378cd2ab [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432)
* finish online test and add examples

* fix test_contionus_batching

* fix some bugs

* fix bash

* fix

* fix inference

* finish revision

* fix typos

* revision
2024-05-08 15:20:52 +00:00
Jianghai 69cd7e069d [Inference] ADD async and sync Api server using FastAPI (#5396)
* add api server

* fix

* add

* add completion service and fix bug

* add generation config

* revise shardformer

* fix bugs

* add docstrings and fix some bugs

* fix bugs and add choices for prompt template
2024-05-08 15:18:28 +00:00
yuehuayingxueluo d482922035
[Inference] Support the logic related to ignoring EOS token (#5693)
* Adapt temperature processing logic

* add ValueError for top_p and top_k

* add GQA Test

* fix except_msg

* support ignore EOS token

* change variable's name

* fix annotation
2024-05-08 19:59:10 +08:00
yuehuayingxueluo 9c2fe7935f
[Inference]Adapt temperature processing logic (#5689)
* Adapt temperature processing logic

* add ValueError for top_p and top_k

* add GQA Test

* fix except_msg
2024-05-08 17:58:29 +08:00
Yuanheng Zhao 12e7c28d5e
[hotfix] fix OpenMOE example import path (#5697) 2024-05-08 15:48:47 +08:00
Yuanheng Zhao 55cc7f3df7
[Fix] Fix Inference Example, Tests, and Requirements (#5688)
* clean requirements

* modify example inference struct

* add test ci scripts

* mark test_infer as submodule

* rm deprecated cls & deps

* import of HAS_FLASH_ATTN

* prune inference tests to be run

* prune triton kernel tests

* increment pytest timeout mins

* revert import path in openmoe
2024-05-08 11:30:15 +08:00
Yuanheng Zhao f9afe0addd
[hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695)
- Fix key value number assignment in KVCacheManager, as well as method of accessing
2024-05-07 23:13:14 +08:00
傅剑寒 1ace1065e6
[Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (#5686) 2024-05-06 15:35:13 +08:00
Yuanheng Zhao db7b3051f4
[Sync] Update from main to feature/colossal-infer (Merge pull request #5685)
[Sync] Update from main to feature/colossal-infer

- Merge pull request #5685 from yuanheng-zhao/inference/merge/main
2024-05-06 14:43:38 +08:00
Steve Luo 725fbd2ed0
[Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679) 2024-05-06 10:55:34 +08:00
Yuanheng Zhao 8754abae24 [Fix] Fix & Update Inference Tests (compatibility w/ main) 2024-05-05 16:28:56 +00:00
Yuanheng Zhao 56ed09aba5 [sync] resolve conflicts of merging main 2024-05-05 05:14:00 +00:00
Yuanheng Zhao 537a3cbc4d
[kernel] Support New KCache Layout - Triton Kernel (#5677)
* kvmemcpy triton for new kcache layout

* revise tests for new kcache layout

* naive triton flash decoding - new kcache layout

* rotary triton kernel - new kcache layout

* remove redundancy - triton decoding

* remove redundancy - triton kvcache copy

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-05-03 17:20:45 +08:00
傅剑寒 9df016fc45
[Inference] Fix quant bits order (#5681) 2024-04-30 19:38:00 +08:00
yuehuayingxueluo f79963199c
[inference]Add alibi to flash attn function (#5678)
* add alibi to flash attn function

* rm redundant modifications
2024-04-30 19:35:05 +08:00
傅剑寒 ef8e4ffe31
[Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680) 2024-04-30 18:33:53 +08:00
Steve Luo 5cd75ce4c7
[Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663)
* refactor kvcache manager and rotary_embedding and kvcache_memcpy operator

* refactor decode_kv_cache_memcpy

* enable alibi in pagedattention
2024-04-30 15:52:23 +08:00
yuehuayingxueluo 5f00002e43
[Inference] Adapt Baichuan2-13B TP (#5659)
* adapt to baichuan2 13B

* add baichuan2 13B TP

* update baichuan tp logic

* rm unused code

* Fix TP logic

* fix alibi slopes tp logic

* rm nn.Module

* Polished the code.

* change BAICHUAN_MODEL_NAME_OR_PATH

* Modified the logic for loading Baichuan weights.

* fix typos
2024-04-30 15:47:07 +08:00
傅剑寒 808ee6e4ad
[Inference/Feat] Feat quant kvcache step2 (#5674) 2024-04-30 11:26:36 +08:00
Wang Binluo d3f34ee8cc
[Shardformer] add assert for num of attention heads divisible by tp_size (#5670)
* add assert for num of attention heads divisible by tp_size

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-29 18:47:47 +08:00
flybird11111 6af6d6fc9f
[shardformer] support bias_gelu_jit_fused for models (#5647)
* support gelu_bias_fused for gpt2

* support gelu_bias_fused for gpt2

fix

fix

fix

* fix

fix

* fix
2024-04-29 15:33:51 +08:00
Hongxin Liu 7f8b16635b
[misc] refactor launch API and tensor constructor (#5666)
* [misc] remove config arg from initialize

* [misc] remove old tensor contrusctor

* [plugin] add npu support for ddp

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [devops] fix doc test ci

* [test] fix test launch

* [doc] update launch doc

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-29 10:40:11 +08:00
linsj20 91fa553775 [Feature] qlora support (#5586)
* [feature] qlora support

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* qlora follow commit

* migrate qutization folder to colossalai/

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fixes

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-28 10:51:27 +08:00
flybird11111 8954a0c2e2 [LowLevelZero] low level zero support lora (#5153)
* low level zero support lora

low level zero support lora

* add checkpoint test

* add checkpoint test

* fix

* fix

* fix

* fix

fix

fix

fix

* fix

* fix

fix

fix

fix

fix

fix

fix

* fix

* fix

fix

fix

fix

fix

fix

fix

* fix

* test ci

* git # This is a combination of 3 commits.

Update low_level_zero_plugin.py

Update low_level_zero_plugin.py

fix

fix

fix

* fix naming

fix naming

fix naming

fix
2024-04-28 10:51:27 +08:00
Baizhou Zhang 14b0d4c7e5 [lora] add lora APIs for booster, support lora for TorchDDP (#4981)
* add apis and peft requirement

* add liscense and implement apis

* add checkpointio apis

* add torchddp fwd_bwd test

* add support_lora methods

* add checkpointio test and debug

* delete unneeded codes

* remove peft from LICENSE

* add concrete methods for enable_lora

* simplify enable_lora api

* fix requirements
2024-04-28 10:51:27 +08:00
Hongxin Liu c1594e4bad
[devops] fix release docker ci (#5665) 2024-04-27 19:11:57 +08:00
Hongxin Liu 4cfbf30a5e
[release] update version (#5654) 2024-04-27 18:59:47 +08:00
Tong Li 68ec99e946
[hotfix] add soft link to support required files (#5661) 2024-04-26 21:12:04 +08:00
傅剑寒 8ccb6714e7
[Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656) 2024-04-26 19:40:37 +08:00
Yuanheng Zhao 5be590b99e
[kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658)
* add context attn triton kernel - new kcache layout

* add benchmark triton

* tiny revise

* trivial - code style, comment
2024-04-26 17:51:49 +08:00
binmakeswell b8a711aa2d
[news] llama3 and open-sora v1.1 (#5655)
* [news] llama3 and open-sora v1.1

* [news] llama3 and open-sora v1.1
2024-04-26 15:36:37 +08:00
Hongxin Liu 2082852f3f
[lazyinit] skip whisper test (#5653) 2024-04-26 14:03:12 +08:00
flybird11111 8b7d535977
fix gptj (#5652) 2024-04-26 11:52:27 +08:00
yuehuayingxueluo 3c91e3f176
[Inference]Adapt to baichuan2 13B (#5614)
* adapt to baichuan2 13B

* adapt to baichuan2 13B

* change BAICHUAN_MODEL_NAME_OR_PATH

* fix test_decoding_attn.py

* Modifications based on review comments.

* change BAICHUAN_MODEL_NAME_OR_PATH

* mv attn mask processes to test flash decoding

* mv get_alibi_slopes baichuan modeling

* fix bugs in test_baichuan.py
2024-04-25 23:11:30 +08:00
Yuanheng Zhao f342a93871
[Fix] Remove obsolete files - inference (#5650) 2024-04-25 22:04:59 +08:00
Hongxin Liu 1b387ca9fe
[shardformer] refactor pipeline grad ckpt config (#5646)
* [shardformer] refactor pipeline grad ckpt config

* [shardformer] refactor pipeline grad ckpt config

* [pipeline] fix stage manager
2024-04-25 15:19:30 +08:00
Season 7ef91606e1
[Fix]: implement thread-safety singleton to avoid deadlock for very large-scale training scenarios (#5625)
* implement thread-safety singleton

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor singleton implementation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-25 14:45:52 +08:00