ColossalAI

Commit Graph

Author	SHA1	Message	Date
yuehuayingxueluo	2a73e828eb	fix bugs related to processing padding mask	2024-01-11 13:46:14 +00:00
Jianghai	e545a871b8	[Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229 ) * fix accuracy * alignment in attention * fix attention * fix * fix bugs * fix bugs * fix bugs	2024-01-11 13:46:14 +00:00
yuehuayingxueluo	fa4fbdbffb	adapted to pad_context_forward	2024-01-11 13:44:06 +00:00
yuehuayingxueluo	47e53eaa1c	fix bugs in attention.py and request_handler.py	2024-01-11 13:44:06 +00:00
Jianghai	bfd9b1b494	[Inference] Pytorch Attention func, pad&nopad input support (#5219 ) * add attn * add attention test * fix attn forward * fix decoding	2024-01-11 13:44:06 +00:00
yuehuayingxueluo	3ad1f3b78b	fix beam_width	2024-01-11 13:39:56 +00:00
yuehuayingxueluo	b2eb9cd186	Fixed a typo	2024-01-11 13:39:56 +00:00
yuehuayingxueluo	bbfebfb9fc	fix bugs in sampler	2024-01-11 13:39:56 +00:00
yuehuayingxueluo	02c1bf8b2a	add context_attention_unpadded	2024-01-11 13:39:56 +00:00
Yuanheng Zhao	07b5283b6a	[kernel] Add triton kernel for context attention (FAv2) without padding (#5192 ) * add context attn unpadded triton kernel * test compatibility * kv cache copy (testing) * fix k/v cache copy * fix kv cache copy and test * fix boundary of block ptrs * add support for GQA/MQA and testing * fix import statement --------- Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>	2024-01-11 13:39:56 +00:00
yuehuayingxueluo	9489dc64d8	precision alignment	2024-01-11 13:39:56 +00:00
yuehuayingxueluo	62968588d1	fix bugs in request_handler	2024-01-11 13:39:56 +00:00
yuehuayingxueluo	62fd08ee44	Fixed a bug in the inference frame	2024-01-11 13:39:56 +00:00
yuehuayingxueluo	86853a37d5	Add padding llama model	2024-01-11 13:39:56 +00:00
Jianghai	0e616462a7	[Inference] add logit processor and request handler (#5166 ) * add logit processor and request handler * add * add * add * fix * add search tokens and update func * finish request handler * add running list test * fix test * fix some bug * add * add * fix bugs * fix some bugs * fix bug * fix * fix * add copy fun * del useless attn * fix request status --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com>	2024-01-11 13:39:56 +00:00
yuehuayingxueluo	8daee26989	[Inference] Add the logic of the inference engine (#5173 ) * add infer_struct and infer_config * update codes * change InferConfig * Add hf_model_config to the engine * rm _get_hf_model_config * update codes * made adjustments according to the feedback from the reviewer. * update codes * add ci test for config and struct * Add the logic of the inference engine * update engine and test * Recover cache_manager.py * add logger * fix conflict * update codes * update codes * update model and tokenizer * fix add the logic about shardformer * change kvcache_manager docstring * add policy * fix ci bug in test_kvcache_manager.py * remove codes related o tokenizer and move model_policy * fix code style * add ordered_set to requirements-infer.txt * Delete extra empty lines * add ordered_set to requirements-test.txt	2024-01-11 13:39:56 +00:00
Jianghai	93aeacca34	[Inference]Update inference config and fix test (#5178 ) * unify the config setting * fix test * fix import * fix test * fix * fix * add logger * revise log info --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com>	2024-01-11 13:39:29 +00:00
Yuanheng Zhao	3de2e62299	[Inference] Add CacheBlock and KV-Cache Manager (#5156 ) * [Inference] Add KVCache Manager * function refactored * add test for KVCache Manager * add attr beam width * Revise alloc func in CacheManager * Fix docs and pytests * add tp slicing for head number * optimize shapes of tensors used as physical cache * Apply using InferenceConfig on KVCacheManager * rm duplicate config file * Optimize cache allocation: use contiguous cache * Fix config in pytest (and config)	2024-01-11 13:39:29 +00:00
yuehuayingxueluo	fab9b931d9	[Inference]Add BatchInferState, Sequence and InferConfig (#5149 ) * add infer_struct and infer_config * update codes * change InferConfig * Add hf_model_config to the engine * rm _get_hf_model_config * update codes * made adjustments according to the feedback from the reviewer. * update codes * add ci test for config and struct	2024-01-11 13:39:29 +00:00
Yuanheng Zhao	2bb92243d4	[Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159 ) * [inference/nfc] remove outdated inference tests * remove outdated kernel tests * remove deprecated triton kernels * remove imports from deprecated kernels	2024-01-11 13:39:29 +00:00
Jianghai	56e75eeb06	[Inference] Add readme (roadmap) and fulfill request handler (#5147 ) * request handler * add readme --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com>	2024-01-11 13:39:29 +00:00
Jianghai	4cf4682e70	[Inference] First PR for rebuild colossal-infer (#5143 ) * add engine and scheduler * add dirs --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com>	2024-01-11 13:39:29 +00:00
binmakeswell	c174c4fc5f	[doc] fix doc typo (#5256) * [doc] fix annotation display * [doc] fix llama2 doc	2024-01-11 21:01:11 +08:00
flybird11111	e830ef917d	[ci] fix shardformer tests. (#5255 ) * fix ci fix * revert: revert p2p * feat: add enable_metadata_cache option * revert: enable t5 tests --------- Co-authored-by: Wenhao Chen <cwher@outlook.com>	2024-01-11 19:07:45 +08:00
Frank Lee	9102d655ab	[hotfix] removed unused flag (#5242)	2024-01-09 14:57:07 +08:00
Hongxin Liu	d202cc28c0	[npu] change device to accelerator api (#5239 ) * update accelerator * fix timer * fix amp * update * fix * update bug * add error raise * fix autocast * fix set device * remove doc accelerator * update doc * update doc * update doc * use nullcontext * update cpu * update null context * change time limit for example * udpate * update * update * update * [npu] polish accelerator code --------- Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com> Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>	2024-01-09 10:20:05 +08:00
Elsa Granger	d565df3821	[pipeline] A more general _communicate in p2p (#5062 ) * A more general _communicate * feat: finish tree_flatten version p2p * fix: update p2p api calls --------- Co-authored-by: Wenhao Chen <cwher@outlook.com>	2024-01-08 15:37:27 +08:00
Xuanlei Zhao	dd2c28a323	[npu] use extension for op builder (#5172 ) * update extension * update cpu adam * update is * add doc for cpu adam * update kernel * update commit * update flash * update memory efficient * update flash attn * update flash attention loader * update api * fix * update doc * update example time limit * reverse change * fix doc * remove useless kernel * fix * not use warning * update * update	2024-01-08 11:39:16 +08:00
digger yu	b0b53a171c	[nfc] fix typo colossalai/shardformer/ (#5133)	2024-01-04 16:21:55 +08:00
flybird11111	451e9142b8	fix flash attn (#5209)	2024-01-03 14:39:53 +08:00
flybird11111	365671be10	fix-test (#5210) fix-test fix-test	2024-01-03 14:26:13 +08:00
Wenhao Chen	d799a3088f	[pipeline]: add p2p fallback order and fix interleaved pp deadlock (#5214 ) * fix: add fallback order option and update 1f1b * fix: fix deadlock comm in interleaved pp * test: modify p2p test	2024-01-03 11:34:49 +08:00
Wenhao Chen	3c0d82b19b	[pipeline]: support arbitrary batch size in forward_only mode (#5201 ) * fix: remove drop last in val & test dataloader * feat: add run_forward_only, support arbitrary bs * chore: modify ci script	2024-01-02 23:41:12 +08:00
flybird11111	02d2328a04	support linear accumulation fusion (#5199 ) support linear accumulation fusion support linear accumulation fusion fix	2023-12-29 18:22:42 +08:00
Wenhao Chen	4fa689fca1	[pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134 ) * test: add more p2p tests * fix: remove send_forward_recv_forward as p2p op list need to use the same group * fix: make send and receive atomic * feat: update P2PComm fn * feat: add metadata cache in 1f1b * feat: add metadata cache in interleaved pp * feat: modify is_xx_stage fn * revert: add _broadcast_object_list * feat: add interleaved pp in llama policy * feat: set NCCL_BUFFSIZE in HybridParallelPlugin	2023-12-22 10:44:00 +08:00
flybird11111	79718fae04	[shardformer] llama support DistCrossEntropy (#5176 ) * fix aaa fix fix fix * fix * fix * test ci * fix ci fix * llama support dist-cross fix fix fix fix fix fix fix fix * fix * fix * fix fix * test ci * test ci * fix * [Colossal-Llama-2] Add finetuning Colossal-Llama-2 example (#4878) * Add finetuning Colossal-Llama-2 example * Add finetuning Colossal-Llama-2 example 2 * Add finetuning Colossal-Llama-2 example and support NEFTuning * Add inference example and refine neftune * Modify readme file * update the imports --------- Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com> * llama support dist-cross fix fix fix fix fix fix fix fix * fix * fix * fix fix * test ci * test ci * fix * fix ci * fix ci --------- Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com> Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>	2023-12-13 01:39:14 +08:00
flybird11111	21aa5de00b	[gemini] hotfix NaN loss while using Gemini + tensor_parallel (#5150 ) * fix aaa fix fix fix * fix * fix * test ci * fix ci fix	2023-12-08 11:10:51 +08:00
flybird11111	3dbbf83f1c	fix (#5158) fix	2023-12-05 14:28:36 +08:00
flybird11111	2a2ec49aa7	[plugin]fix 3d checkpoint load when booster boost without optimizer. (#5135 ) * fix 3d checkpoint load when booster boost without optimizer fix 3d checkpoint load when booster boost without optimizer * test ci * revert ci * fix fix	2023-11-30 18:37:47 +08:00
Xuanlei Zhao	d6df19bae7	[npu] support triangle attention for llama (#5130 ) * update fused attn * update spda * tri attn * update triangle * import * fix * fix	2023-11-30 14:21:30 +08:00
Frank Lee	f4e72c9992	[accelerator] init the accelerator module (#5129 ) * [accelerator] init the accelerator module * polish code * polish code * polish code * polish code	2023-11-30 13:25:17 +08:00
github-actions[bot]	d10ee42f68	[format] applied code formatting on changed files in pull request 5088 (#5127 ) Co-authored-by: github-actions <github-actions@github.com>	2023-11-29 13:38:37 +08:00
Wenhao Chen	7172459e74	[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088 ) * [shardformer] implement policy for all GPT-J models and test * [shardformer] support interleaved pipeline parallel for bert finetune * [shardformer] shardformer support falcon (#4883) * [shardformer]: fix interleaved pipeline for bert model (#5048) * [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093) * Add Mistral support for Shardformer (#5103) * [shardformer] add tests to mistral (#5105) --------- Co-authored-by: Pengtai Xu <henryxu880@gmail.com> Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com> Co-authored-by: flybird11111 <1829166702@qq.com> Co-authored-by: eric8607242 <e0928021388@gmail.com>	2023-11-28 16:54:42 +08:00
アマデウス	126cf180bc	[hotfix] fixed memory usage of shardformer module replacement (#5122)	2023-11-28 15:38:26 +08:00
Xuanlei Zhao	68fcaa2225	remove duplicate import (#5100)	2023-11-23 15:15:01 +08:00
Xuanlei Zhao	3acbf6d496	[npu] add npu support for hybrid plugin and llama (#5090 ) * llama 3d * update * fix autocast	2023-11-22 19:23:21 +08:00
flybird11111	aae496631c	[shardformer]fix flash attention, when mask is casual, just don't unpad it (#5084 ) * fix flash attn * fix fix	2023-11-22 16:00:07 +08:00
Zhongkai Zhao	75af66cd81	[Hotfix] Fix model policy matching strategy in ShardFormer (#5064 ) * hotfix/Fix get model policy strategy in ShardFormer * fix bug in auto policy	2023-11-22 11:19:39 +08:00
flybird11111	4ccb9ded7d	[gemini]fix gemini optimzer, saving Shardformer in Gemini got list assignment index out of range (#5085)	2023-11-22 11:14:25 +08:00
Jun Gao	dce05da535	fix thrust-transform-reduce error (#5078)	2023-11-21 15:09:35 +08:00
Hongxin Liu	1cd7efc520	[inference] refactor examples and fix schedule (#5077 ) * [setup] refactor infer setup * [hotfix] fix infenrece behavior on 1 1 gpu * [exmaple] refactor inference examples	2023-11-21 10:46:03 +08:00
Bin Jia	4e3959d316	[hotfix/hybridengine] Fix init model with random parameters in benchmark (#5074 ) * fix init model with random parameters * fix example	2023-11-20 20:15:25 +08:00
github-actions[bot]	8921a73c90	[format] applied code formatting on changed files in pull request 5067 (#5072 ) Co-authored-by: github-actions <github-actions@github.com>	2023-11-20 19:46:43 +08:00
Xu Kai	fb103cfd6e	[inference] update examples and engine (#5073 ) * update examples and engine * fix choices * update example	2023-11-20 19:44:52 +08:00
Bin Jia	0c7d8bebd5	[hotfix/hybridengine] fix bug when tp*pp size = 1 (#5069)	2023-11-20 17:15:37 +08:00
Hongxin Liu	e5ce4c8ea6	[npu] add npu support for gemini and zero (#5067 ) * [npu] setup device utils (#5047) * [npu] add npu device support * [npu] support low level zero * [test] update npu zero plugin test * [hotfix] fix import * [test] recover tests * [npu] gemini support npu (#5052) * [npu] refactor device utils * [gemini] support npu * [example] llama2+gemini support npu * [kernel] add arm cpu adam kernel (#5065) * [kernel] add arm cpu adam * [optim] update adam optimizer * [kernel] arm cpu adam remove bf16 support	2023-11-20 16:12:41 +08:00
Cuiqing Li (李崔卿)	bce919708f	[Kernels]added flash-decoidng of triton (#5063 ) * added flash-decoidng of triton based on lightllm kernel * add req * clean * clean * delete build.sh --------- Co-authored-by: cuiqing.li <lixx336@gmail.com>	2023-11-20 13:58:29 +08:00
Xu Kai	fd6482ad8c	[inference] Refactor inference architecture (#5057 ) * [inference] support only TP (#4998) * support only tp * enable tp * add support for bloom (#5008) * [refactor] refactor gptq and smoothquant llama (#5012) * refactor gptq and smoothquant llama * fix import error * fix linear import torch-int * fix smoothquant llama import error * fix import accelerate error * fix bug * fix import smooth cuda * fix smoothcuda * [Inference Refactor] Merge chatglm2 with pp and tp (#5023) merge chatglm with pp and tp * [Refactor] remove useless inference code (#5022) * remove useless code * fix quant model * fix test import bug * mv original inference legacy * fix chatglm2 * [Refactor] refactor policy search and quant type controlling in inference (#5035) * [Refactor] refactor policy search and quant type controling in inference * [inference] update readme (#5051) * update readme * update readme * fix architecture * fix table * fix table * [inference] udpate example (#5053) * udpate example * fix run.sh * fix rebase bug * fix some errors * update readme * add some features * update interface * update readme * update benchmark * add requirements-infer --------- Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>	2023-11-19 21:05:05 +08:00
Wenhao Chen	3c08f17348	[hotfix]: modify create_ep_hierarchical_group and add test (#5032 ) * feat: modify create_ep_hierarchical_group args * test: add ep tests * fix: remove get_process_group_ranks * fix: fix src_rank	2023-11-17 10:53:00 +08:00
flybird11111	97cd0cd559	[shardformer] fix llama error when transformers upgraded. (#5055 ) * fix-llama * Update llama.py	2023-11-16 21:34:04 +08:00
flybird11111	3e02154710	[gemini] gemini support extra-dp (#5043 ) * support ddp * fix * fix * fix fix * support ddp * fix * fix * fix fix * simplify tests * fix * fix * fix fix fix * fix	2023-11-16 21:03:04 +08:00
Elsa Granger	b2ad0d9e8f	[pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark (#5017 ) * Use p2p * Cannot bidirectonal send p2p * Refactor tensor creation and serialization in P2P communication * Fix llama forward args in flash attention * Add flop estimate from megatron * Support loading weight not in weight_map when strict=False in hybrid_parallel * Use send_forward_recv_backward, etc in 1f1b * Use dataclass for metdata Remove torch.cuda.synchronize() as suggested * Add comment about the torch.cuda.synchronize for potential error * Typo * Update hybrid_parallel_checkpoint_io.py * Update p2p.py * Update one_f_one_b.py * Update p2p.py --------- Co-authored-by: flybird11111 <1829166702@qq.com>	2023-11-16 20:15:59 +08:00
Cuiqing Li (李崔卿)	28052a71fb	[Kernels]Update triton kernels into 2.1.0 (#5046 ) * update flash-context-attention * adding kernels * fix * reset * add build script * add building process * add llama2 exmaple * add colossal-llama2 test * clean * fall back test setting * fix test file * clean * clean * clean --------- Co-authored-by: cuiqing.li <lixx336@gmail.com>	2023-11-16 16:43:15 +08:00
Zhongkai Zhao	70885d707d	[hotfix] Suport extra_kwargs in ShardConfig (#5031 ) * [refactor]: replace inference args with extra_kwargs in ShardConfig * modify shardconfig * polish code * fix policy bug in llama * fix bug in auto policy * remove setattr in ShardConfig	2023-11-10 10:49:50 +08:00
flybird11111	576a2f7b10	[gemini] gemini support tensor parallelism. (#4942 ) * [colossalai]fix typo * [inference] Add smmoothquant for llama (#4904) * [inference] add int8 rotary embedding kernel for smoothquant (#4843) * [inference] add smoothquant llama attention (#4850) * add smoothquant llama attention * remove uselss code * remove useless code * fix import error * rename file name * [inference] add silu linear fusion for smoothquant llama mlp (#4853) * add silu linear * update skip condition * catch smoothquant cuda lib exception * prcocess exception for tests * [inference] add llama mlp for smoothquant (#4854) * add llama mlp for smoothquant * fix down out scale * remove duplicate lines * add llama mlp check * delete useless code * [inference] add smoothquant llama (#4861) * add smoothquant llama * fix attention accuracy * fix accuracy * add kv cache and save pretrained * refactor example * delete smooth * refactor code * [inference] add smooth function and delete useless code for smoothquant (#4895) * add smooth function and delete useless code * update datasets * remove duplicate import * delete useless file * refactor codes (#4902) * rafactor code * add license * add torch-int and smoothquant license * Update flash_attention_patch.py To be compatible with the new change in the Transformers library, where a new argument 'padding_mask' was added to forward function of attention layer. 
https://github.com/huggingface/transformers/pull/25598 * [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921) * [kernel] support pure fp16 for cpu adam (#4896) * [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919) * [kernel] fix cpu adam * [test] update gemini optim test * [format] applied code formatting on changed files in pull request 4908 (#4918) Co-authored-by: github-actions <github-actions@github.com> * [gemini] support gradient accumulation (#4869) * add test * fix no_sync bug in low level zero plugin * fix test * add argument for grad accum * add grad accum in backward hook for gemini * finish implementation, rewrite tests * fix test * skip stuck model in low level zero test * update doc * optimize communication & fix gradient checkpoint * modify doc * cleaning codes * update cpu adam fp16 case * [hotfix] fix torch 2.0 compatibility (#4936) * [hotfix] fix launch * [test] fix test gemini optim * [shardformer] fix vit * [test] add no master test for low level zero plugin (#4934) * [format] applied code formatting on changed files in pull request 4820 (#4886) Co-authored-by: github-actions <github-actions@github.com> * [nfc] fix some typo with colossalai/ docs/ etc. 
(#4920) * [Refactor] Integrated some lightllm kernels into token-attention (#4946) * add some req for inference * clean codes * add codes * add some lightllm deps * clean codes * hello * delete rms files * add some comments * add comments * add doc * add lightllm deps * add lightllm cahtglm2 kernels * add lightllm cahtglm2 kernels * replace rotary embedding with lightllm kernel * add some commnets * add some comments * add some comments * add * replace fwd kernel att1 * fix a arg * add * add * fix token attention * add some comments * clean codes * modify comments * fix readme * fix bug * fix bug --------- Co-authored-by: cuiqing.li <lixx336@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497@outlook.com> * [test] merge old components to test to model zoo (#4945) * [test] add custom models in model zoo * [test] update legacy test * [test] update model zoo * [test] update gemini test * [test] remove components to test * [inference] add reference and fix some bugs (#4937) * add reference and fix some bugs * update gptq init --------- Co-authored-by: Xu Kai <xukai16@foxamil.com> * [Inference]ADD Bench Chatglm2 script (#4963) * add bench chatglm * fix bug and make utils --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Pipeline inference] Combine kvcache with pipeline inference (#4938) * merge kvcache with pipeline inference and refactor the code structure * support ppsize > 2 * refactor pipeline code * do pre-commit * modify benchmark * fix bench mark * polish code * add docstring and update readme * refactor the code * fix some logic bug of ppinfer * polish readme * fix typo * skip infer test * updated c++17 compiler flags (#4983) * [Inference] Dynamic Batching Inference, online and offline (#4953) * [inference] Dynamic Batching for Single and Multiple GPUs (#4831) * finish batch manager * 1 * first * fix * fix dynamic batching * llama infer * finish test * support different lengths generating * del prints * del prints * fix * fix bug --------- 
Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [inference] Async dynamic batching (#4894) * finish input and output logic * add generate * test forward * 1 * [inference]Re push async dynamic batching (#4901) * adapt to ray server * finish async * finish test * del test --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> * Revert "[inference]Re push async dynamic batching (#4901)" (#4905) This reverts commit `fbf3c09e67`. * Revert "[inference] Async dynamic batching (#4894)" This reverts commit `fced140250`. * Revert "[inference] Async dynamic batching (#4894)" (#4909) This reverts commit `fced140250`. * Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * [infer]Add Ray Distributed Environment Init Scripts (#4911) * Revert "[inference] Async dynamic batching (#4894)" This reverts commit `fced140250`. 
* Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * support dynamic batch for bloom model and is_running function * [Inference]Test for new Async engine (#4935) * infer engine * infer engine * test engine * test engine * new manager * change step * add * test * fix * fix * finish test * finish test * finish test * finish test * add license --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> * add assertion for config (#4947) * [Inference] Finish dynamic batching offline test (#4948) * test * fix test * fix quant * add default * fix * fix some bugs * fix some bugs * fix * fix bug * fix bugs * reset param --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> Co-authored-by: Cuiqing Li <lixx3527@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965) * adding flash-decoding * clean * adding kernel * adding flash-decoding * add integration * add * adding kernel * adding kernel * adding triton 2.1.0 features for inference * update bloom triton kernel * remove useless vllm kernels * clean codes * fix * adding files * fix readme * update llama flash-decoding --------- Co-authored-by: cuiqing.li <lixx336@gmail.com> * fix ColossalEval (#4992) Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> * [doc]Update doc for colossal-inference (#4989) * update doc * Update README.md --------- Co-authored-by: cuiqing.li <lixx336@gmail.com> * [hotfix] Fix the bug where process groups were not being properly released. (#4940) * Fix the bug where process groups were not being properly released. * test * Revert "test" This reverts commit `479900c139`. 
* [hotfix] fix the bug of repeatedly storing param group (#4951) * [doc] add supported feature diagram for hybrid parallel plugin (#4996) * [Pipeline Inference] Merge pp with tp (#4993) * refactor pipeline into new CaiInferEngine * updata llama modeling forward * merge tp with pp * update docstring * optimize test workflow and example * fix typo * add assert and todo * [release] update version (#4995) * [release] update version * [hotfix] fix ci * [gemini] gemini support tp [gemini] gemini support tp [gemini] gemini support tp [gemini] gemini support tp [gemini] gemini support tp * fix fix fix * update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO * support fused layernorm support fused layernorm support fused layernorm * update fusedlayernorm update fusedlayernorm update fusedlayernorm * add sequence parallel to gemini add sequence parallel to gemini * fix * fix comments fix comments fix comments * fix * fix t5 * clear cache * fix * activate ci * activate ci * fix * fix * fix * fix * revert * modify tp gather method modify tp gather method modify tp gather method modify tp gather method * fix test --------- Co-authored-by: Xu Kai <xukai16@foxmail.com> Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com> Co-authored-by: Hongxin Liu <lhx0217@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn> Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com> Co-authored-by: digger yu <digger-yu@outlook.com> Co-authored-by: Cuiqing Li <lixx3527@gmail.com> Co-authored-by: cuiqing.li <lixx336@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497@outlook.com> Co-authored-by: Xu Kai <xukai16@foxamil.com> Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com> 
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com> Co-authored-by: yuehuayingxueluo <867460659@qq.com> Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com> Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> Co-authored-by: littsk <1214689160@qq.com> Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>	2023-11-10 10:15:16 +08:00
Jun Gao	a4489384d5	[shardformer] Fix serialization error with Tensor Parallel state saving (#5018 ) * Fix serialization error with Tensor Parallel state saving * Refactor state_dict CPU transfer using tree_map	2023-11-09 17:00:25 +08:00
Wenhao Chen	724441279b	[moe]: fix ep/tp tests, add hierarchical all2all (#4982 ) * fix: add warning for EP different behavior * fix: use shard_data in ep & tp model * to: add used_capacity * fix: fix router test * feat: add create_ep_node_group * feat: add create_ep_hierarchical_group fn * feat: add HierarchicalAllToAll * test: add hierarchical all2all test * fix: fix test errors * fix: simplify create_ep_hierarchical_group * fix: add hierarchical_alltoall arg * fix: fix environ typo * revert: revert process mesh order * to: add todo mark * fix: skip hierarchical_comm if torch < 1.13.1	2023-11-09 06:31:00 +00:00
Xuanlei Zhao	f71e63b0f3	[moe] support optimizer checkpoint (#5015 ) * Refactor MoE Manager setup method * unshard optim ckpt * optim io * update transformer version * update requirements * update ckpt * update ckpt * update ckpt * fix engine * fix engine	2023-11-08 15:07:03 +00:00
Jianghai	ef4c14a5e2	[Inference] Fix bug in ChatGLM2 Tensor Parallelism (#5014 ) * fix bug * fix * fix multiquery * fix multiquery --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com>	2023-11-07 15:01:50 +08:00
github-actions[bot]	c36e782d80	[format] applied code formatting on changed files in pull request 4926 (#5007 ) Co-authored-by: github-actions <github-actions@github.com>	2023-11-06 17:08:12 +08:00
littsk	1a3315e336	[hotfix] Add layer norm gradients all-reduce for sequence parallel (#4926 ) * [hotfix] Add layer norm gradients all-reduce for sequence parallel. (#4915) * Add layer norm gradients all-reduce for sequence parallel. * skip pipeline inference test * [hotfix] fixing polices of sequence parallel (#4922) * Add layer norm gradients all-reduce for sequence parallel. * fix parameter passing when calling get_autopolicy --------- Co-authored-by: littsk <1214689160@qq.com> * Hotfix/add grad all reduce for sequence parallel (#4927) * Add layer norm gradients all-reduce for sequence parallel. * fix parameter passing when calling get_autopolicy * fix bug using wrong variables --------- Co-authored-by: littsk <1214689160@qq.com> * fix policy initialization * fix bloom and chatglm policices * polish code of handling layernorm * fix moe module * polish code of class initializing --------- Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>	2023-11-03 13:32:43 +08:00
Baizhou Zhang	d99b2c961a	[hotfix] fix grad accumulation plus clipping for gemini (#5002)	2023-11-02 17:59:10 +08:00
Xuanlei Zhao	dc003c304c	[moe] merge moe into main (#4978 ) * update moe module * support openmoe	2023-11-02 02:21:24 +00:00
Bin Jia	b6696beb04	[Pipeline Inference] Merge pp with tp (#4993 ) * refactor pipeline into new CaiInferEngine * updata llama modeling forward * merge tp with pp * update docstring * optimize test workflow and example * fix typo * add assert and todo	2023-11-01 12:46:21 +08:00
Baizhou Zhang	c040d70aa0	[hotfix] fix the bug of repeatedly storing param group (#4951)	2023-10-31 14:48:01 +08:00
littsk	be82b5d4ca	[hotfix] Fix the bug where process groups were not being properly released. (#4940 ) * Fix the bug where process groups were not being properly released. * test * Revert "test" This reverts commit `479900c139`.	2023-10-31 14:47:30 +08:00
Cuiqing Li (李崔卿)	4f0234f236	[doc]Update doc for colossal-inference (#4989 ) * update doc * Update README.md --------- Co-authored-by: cuiqing.li <lixx336@gmail.com>	2023-10-31 10:48:07 +08:00
Cuiqing Li	459a88c806	[Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965 ) * adding flash-decoding * clean * adding kernel * adding flash-decoding * add integration * add * adding kernel * adding kernel * adding triton 2.1.0 features for inference * update bloom triton kernel * remove useless vllm kernels * clean codes * fix * adding files * fix readme * update llama flash-decoding --------- Co-authored-by: cuiqing.li <lixx336@gmail.com>	2023-10-30 14:04:37 +08:00
Jianghai	cf579ff46d	[Inference] Dynamic Batching Inference, online and offline (#4953 ) * [inference] Dynamic Batching for Single and Multiple GPUs (#4831) * finish batch manager * 1 * first * fix * fix dynamic batching * llama infer * finish test * support different lengths generating * del prints * del prints * fix * fix bug --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [inference] Async dynamic batching (#4894) * finish input and output logic * add generate * test forward * 1 * [inference]Re push async dynamic batching (#4901) * adapt to ray server * finish async * finish test * del test --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> * Revert "[inference]Re push async dynamic batching (#4901)" (#4905) This reverts commit `fbf3c09e67`. * Revert "[inference] Async dynamic batching (#4894)" This reverts commit `fced140250`. * Revert "[inference] Async dynamic batching (#4894)" (#4909) This reverts commit `fced140250`. * Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * [infer]Add Ray Distributed Environment Init Scripts (#4911) * Revert "[inference] Async dynamic batching (#4894)" This reverts commit `fced140250`. * Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * support dynamic batch for bloom model and is_running function * [Inference]Test for new Async engine (#4935) * infer engine * infer engine * test engine * test engine * new manager * change step * add * test * fix * fix * finish test * finish test * finish test * finish test * add license --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> * add assertion for config (#4947) * [Inference] Finish dynamic batching offline test (#4948) * test * fix test * fix quant * add default * fix * fix some bugs * fix some bugs * fix * fix bug * fix bugs * reset param --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> Co-authored-by: Cuiqing Li <lixx3527@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497outlook.com>	2023-10-30 10:52:19 +08:00
Bin Jia	1db6727678	[Pipeline inference] Combine kvcache with pipeline inference (#4938 ) * merge kvcache with pipeline inference and refactor the code structure * support ppsize > 2 * refactor pipeline code * do pre-commit * modify benchmark * fix bench mark * polish code * add docstring and update readme * refactor the code * fix some logic bug of ppinfer * polish readme * fix typo * skip infer test	2023-10-27 16:19:54 +08:00
Xu Kai	785802e809	[inference] add reference and fix some bugs (#4937 ) * add reference and fix some bugs * update gptq init --------- Co-authored-by: Xu Kai <xukai16@foxamil.com>	2023-10-20 13:39:34 +08:00
Hongxin Liu	b8e770c832	[test] merge old components to test to model zoo (#4945 ) * [test] add custom models in model zoo * [test] update legacy test * [test] update model zoo * [test] update gemini test * [test] remove components to test	2023-10-20 10:35:08 +08:00
Cuiqing Li	3a41e8304e	[Refactor] Integrated some lightllm kernels into token-attention (#4946 ) * add some req for inference * clean codes * add codes * add some lightllm deps * clean codes * hello * delete rms files * add some comments * add comments * add doc * add lightllm deps * add lightllm cahtglm2 kernels * add lightllm cahtglm2 kernels * replace rotary embedding with lightllm kernel * add some commnets * add some comments * add some comments * add * replace fwd kernel att1 * fix a arg * add * add * fix token attention * add some comments * clean codes * modify comments * fix readme * fix bug * fix bug --------- Co-authored-by: cuiqing.li <lixx336@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>	2023-10-19 22:22:47 +08:00
digger yu	11009103be	[nfc] fix some typo with colossalai/ docs/ etc. (#4920 )	2023-10-18 15:44:04 +08:00
github-actions[bot]	486d06a2d5	[format] applied code formatting on changed files in pull request 4820 (#4886 ) Co-authored-by: github-actions <github-actions@github.com>	2023-10-18 11:46:37 +08:00
Zhongkai Zhao	c7aa319ba0	[test] add no master test for low level zero plugin (#4934 )	2023-10-18 11:41:23 +08:00
Hongxin Liu	1f5d2e8062	[hotfix] fix torch 2.0 compatibility (#4936 ) * [hotfix] fix launch * [test] fix test gemini optim * [shardformer] fix vit	2023-10-18 11:05:25 +08:00
Baizhou Zhang	21ba89cab6	[gemini] support gradient accumulation (#4869 ) * add test * fix no_sync bug in low level zero plugin * fix test * add argument for grad accum * add grad accum in backward hook for gemini * finish implementation, rewrite tests * fix test * skip stuck model in low level zero test * update doc * optimize communication & fix gradient checkpoint * modify doc * cleaning codes * update cpu adam fp16 case	2023-10-17 14:07:21 +08:00
Hongxin Liu	4f68b3f10c	[kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921 ) * [kernel] support pure fp16 for cpu adam (#4896) * [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919) * [kernel] fix cpu adam * [test] update gemini optim test	2023-10-16 21:56:53 +08:00
Xu Kai	611a5a80ca	[inference] Add smmoothquant for llama (#4904 ) * [inference] add int8 rotary embedding kernel for smoothquant (#4843) * [inference] add smoothquant llama attention (#4850) * add smoothquant llama attention * remove uselss code * remove useless code * fix import error * rename file name * [inference] add silu linear fusion for smoothquant llama mlp (#4853) * add silu linear * update skip condition * catch smoothquant cuda lib exception * prcocess exception for tests * [inference] add llama mlp for smoothquant (#4854) * add llama mlp for smoothquant * fix down out scale * remove duplicate lines * add llama mlp check * delete useless code * [inference] add smoothquant llama (#4861) * add smoothquant llama * fix attention accuracy * fix accuracy * add kv cache and save pretrained * refactor example * delete smooth * refactor code * [inference] add smooth function and delete useless code for smoothquant (#4895) * add smooth function and delete useless code * update datasets * remove duplicate import * delete useless file * refactor codes (#4902) * rafactor code * add license * add torch-int and smoothquant license	2023-10-16 11:28:44 +08:00
Zhongkai Zhao	a0684e7bd6	[feature] support no master weights option for low level zero plugin (#4816 ) * [feature] support no master weights for low level zero plugin * [feature] support no master weights for low level zero plugin, remove data copy when no master weights * remove data copy and typecasting when no master weights * not load weights to cpu when using no master weights * fix grad: use fp16 grad when no master weights * only do not update working param when no master weights * fix: only do not update working param when no master weights * fix: passing params in dict format in hybrid plugin * fix: remove extra params (tp_process_group) in hybrid_parallel_plugin	2023-10-13 07:57:45 +00:00
Xu Kai	77a9328304	[inference] add llama2 support (#4898 ) * add llama2 support * fix multi group bug	2023-10-13 13:09:23 +08:00
Baizhou Zhang	39f2582e98	[hotfix] fix lr scheduler bug in torch 2.0 (#4864 )	2023-10-12 14:04:24 +08:00
littsk	83b52c56cd	[feature] Add clip_grad_norm for hybrid_parallel_plugin (#4837 ) * Add clip_grad_norm for hibrid_parallel_plugin * polish code * add unittests * Move tp to a higher-level optimizer interface. * bug fix * polish code	2023-10-12 11:32:37 +08:00
Hongxin Liu	df63564184	[gemini] support amp o3 for gemini (#4872 ) * [gemini] support no reuse fp16 chunk * [gemini] support no master weight for optim * [gemini] support no master weight for gemini ddp * [test] update gemini tests * [test] update gemini tests * [plugin] update gemini plugin * [test] fix gemini checkpointio test * [test] fix gemini checkpoint io	2023-10-12 10:39:08 +08:00
ppt0011	1dcaf249bd	[doc] add reminder for issue encountered with hybrid adam	2023-10-11 17:51:14 +08:00
Bin Jia	08a9f76b2f	[Pipeline Inference] Sync pipeline inference branch to main (#4820 ) * [pipeline inference] pipeline inference (#4492) * add pp stage manager as circle stage * fix a bug when create process group * add ppinfer basic framework * add micro batch manager and support kvcache-pp gpt2 fwd * add generate schedule * use mb size to control mb number * support generate with kv cache * add output, remove unused code * add test * reuse shardformer to build model * refactor some code and use the same attribute name of hf * fix review and add test for generation * remove unused file * fix CI * add cache clear * fix code error * fix typo * [Pipeline inference] Modify to tieweight (#4599) * add pp stage manager as circle stage * fix a bug when create process group * add ppinfer basic framework * add micro batch manager and support kvcache-pp gpt2 fwd * add generate schedule * use mb size to control mb number * support generate with kv cache * add output, remove unused code * add test * reuse shardformer to build model * refactor some code and use the same attribute name of hf * fix review and add test for generation * remove unused file * modify the way of saving newtokens * modify to tieweight * modify test * remove unused file * solve review * add docstring * [Pipeline inference] support llama pipeline inference (#4647) * support llama pipeline inference * remove tie weight operation * [pipeline inference] Fix the blocking of communication when ppsize is 2 (#4708) * add benchmark verbose * fix export tokens * fix benchmark verbose * add P2POp style to do p2p communication * modify schedule as p2p type when ppsize is 2 * remove unused code and add docstring * [Pipeline inference] Refactor code, add docsting, fix bug (#4790) * add benchmark script * update argparse * fix fp16 load * refactor code style * add docstring * polish code * fix test bug * [Pipeline inference] Add pipeline inference docs (#4817) * add readme doc * add a ico * Add performance * update table of contents * refactor code (#4873)	2023-10-11 11:40:06 +08:00
Camille Zhong	cd6a962e66	[NFC] polish code style (#4799 )	2023-10-07 13:36:52 +08:00
Michelle	07ed155e86	[NFC] polish colossalai/inference/quant/gptq/cai_gptq/__init__.py code style (#4792 )	2023-10-07 13:36:52 +08:00
littsk	eef96e0877	polish code for gptq (#4793 )	2023-10-07 13:36:52 +08:00

1792 Commits (9afa52061f89dde87a73e36f740f62781d658a01)