ColossalAI

Commit Graph

Author	SHA1	Message	Date
Yuanheng Zhao	f342a93871	[Fix] Remove obsolete files - inference (#5650 )	2024-04-25 22:04:59 +08:00
Hongxin Liu	1b387ca9fe	[shardformer] refactor pipeline grad ckpt config (#5646 ) * [shardformer] refactor pipeline grad ckpt config * [shardformer] refactor pipeline grad ckpt config * [pipeline] fix stage manager	2024-04-25 15:19:30 +08:00
Steve Luo	a8fd3b0342	[Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643 ) * optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x]) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-04-25 14:24:02 +08:00
yuehuayingxueluo	90cd5227a3	[Fix/Inference]Fix vllm benchmark (#5630 ) * Fix bugs about OOM when running vllm-0.4.0 * rm used params * change generation_config * change benchmark log file name	2024-04-24 14:51:36 +08:00
傅剑寒	279300dc5f	[Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613 ) * refactor compilation mechanism and unified multi hw * fix file path bug * add init.py to make pybind a module to avoid relative path error caused by softlink * delete duplicated micros * fix micros bug in gcc	2024-04-24 14:17:54 +08:00
Yuanheng Zhao	04863a9b14	[example] Update Llama Inference example (#5629 ) * [example] add infernece benchmark llama3 * revise inference config - arg * remove unused args * add llama generation demo script * fix init rope in llama policy * add benchmark-llama3 - cleanup	2024-04-23 22:23:07 +08:00
binmakeswell	f4c5aafe29	[example] llama3 (#5631 ) * release llama3 * [release] llama3 * [release] llama3 * [release] llama3 * [release] llama3	2024-04-23 18:48:07 +08:00
Hongxin Liu	4de4e31818	[exampe] update llama example (#5626 ) * [plugin] support dp inside for hybriad parallel * [example] update llama benchmark * [example] update llama benchmark * [example] update llama readme * [example] update llama readme	2024-04-23 14:12:20 +08:00
Steve Luo	ccf72797e3	feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611 )	2024-04-19 15:34:53 +08:00
Edenzzzz	d83c633ca6	[hotfix] Fix examples no pad token & auto parallel codegen bug; (#5606 ) * fix no pad token bug * fixed some auto parallel codegen bug, but might not run on torch 2.1 --------- Co-authored-by: Edenzzzz <wtan45@wisc.edu>	2024-04-18 18:15:50 +08:00
Steve Luo	be396ad6cc	[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531 ) * feat flash decoding for paged attention * refactor flashdecodingattention * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-04-18 16:45:07 +08:00
yuehuayingxueluo	56b222eff8	[inference/model]Adapted to the baichuan2-7B model (#5591 ) * Adapted to the baichuan2-7B model * modified according to the review comments. * Modified the method of obtaining random weights. * modified according to the review comments. * change mlp layewr 'NOTE'	2024-04-15 16:53:02 +08:00
Yuanheng	ed5ebd1735	[Fix] resolve conflicts of merging main	2024-04-08 16:21:47 +08:00
Hongxin Liu	641b1ee71a	[devops] remove post commit ci (#5566 ) * [devops] remove post commit ci * [misc] run pre-commit on all files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-04-08 15:09:40 +08:00
digger yu	341263df48	[hotfix] fix typo s/get_defualt_parser /get_default_parser (#5548 )	2024-04-07 19:04:58 +08:00
digger yu	a799ca343b	[fix] fix typo s/muiti-node /multi-node etc. (#5448 )	2024-04-07 18:42:15 +08:00
Edenzzzz	15055f9a36	[hotfix] quick fixes to make legacy tutorials runnable (#5559 ) Co-authored-by: Edenzzzz <wtan45@wisc.edu>	2024-04-07 12:06:27 +08:00
Wenhao Chen	e614aa34f3	[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508 ) * feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig` * feat: apply `GradientCheckpointConfig` to policy and llama_forward * feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager * fix: add optional args for `distribute_layer` and `get_stage_index` * fix: fix changed API calls * test: update llama tests * style: polish `GradientCheckpointConfig` * fix: fix pipeline utils tests	2024-04-01 11:34:58 +08:00
Yuanheng Zhao	36c4bb2893	[Fix] Grok-1 use tokenizer from the same pretrained path (#5532 ) * [fix] use tokenizer from the same pretrained path * trust remote code	2024-03-28 16:30:04 +08:00
yuehuayingxueluo	934e31afb2	The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519 )	2024-03-28 10:42:51 +08:00
Insu Jang	00525f7772	[shardformer] fix pipeline forward error if custom layer distribution is used (#5189 ) * Use self.[distribute_layers\|get_stage_index] to exploit custom layer distribution * Change static methods for t5 layer distribution to member functions * Change static methods for whisper layer distribution to member functions * Replace whisper policy usage with self one * Fix test case to use non-static layer distribution methods * fix: fix typo --------- Co-authored-by: Wenhao Chen <cwher@outlook.com>	2024-03-27 13:57:00 +08:00
Yuanheng Zhao	131f32a076	[fix] fix grok-1 example typo (#5506 )	2024-03-26 10:19:42 +08:00
binmakeswell	34e909256c	[release] grok-1 inference benchmark (#5500 ) * [release] grok-1 inference benchmark * [release] grok-1 inference benchmark * [release] grok-1 inference benchmark * [release] grok-1 inference benchmark * [release] grok-1 inference benchmark	2024-03-25 14:42:51 +08:00
yuehuayingxueluo	87079cffe8	[Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461 ) * Support FP16/BF16 Flash Attention 2 * fix bugs in test_kv_cache_memcpy.py * add context_kv_cache_memcpy_kernel.cu * rm typename MT * add tail process * add high_precision * add high_precision to config.py * rm unused code * change the comment for the high_precision parameter * update test_rotary_embdding_unpad.py * fix vector_copy_utils.h * add comment for self.high_precision when using float32	2024-03-25 13:40:34 +08:00
Wenhao Chen	bb0a668fee	[hotfix] set return_outputs=False in examples and polish code (#5404 ) * fix: simplify merge_batch * fix: use return_outputs=False to eliminate extra memory consumption * feat: add return_outputs warning * style: remove `return_outputs=False` as it is the default value	2024-03-25 12:31:09 +08:00
Yuanheng Zhao	5fcd7795cd	[example] update Grok-1 inference (#5495 ) * revise grok-1 example * remove unused arg in scripts * prevent re-installing torch * update readme * revert modifying colossalai requirements * add perf * trivial * add tokenizer url	2024-03-24 20:24:11 +08:00
binmakeswell	6df844b8c4	[release] grok-1 314b inference (#5490 ) * [release] grok-1 inference * [release] grok-1 inference * [release] grok-1 inference	2024-03-22 15:48:12 +08:00
Hongxin Liu	848a574c26	[example] add grok-1 inference (#5485 ) * [misc] add submodule * remove submodule * [example] support grok-1 tp inference * [example] add grok-1 inference script * [example] refactor code * [example] add grok-1 readme * [exmaple] add test ci * [exmaple] update readme	2024-03-21 18:07:22 +08:00
yuehuayingxueluo	f366a5ea1f	[Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418 ) * add rotary embedding kernel * add rotary_embedding_kernel * add fused rotary_emb and kvcache memcopy * add fused_rotary_emb_and_cache_kernel.cu * add fused_rotary_emb_and_memcopy * fix bugs in fused_rotary_emb_and_cache_kernel.cu * fix ci bugs * use vec memcopy and opt the gloabl memory access * fix code style * fix test_rotary_embdding_unpad.py * codes revised based on the review comments * fix bugs about include path * rm inline	2024-03-13 17:20:03 +08:00
digger yu	385e85afd4	[hotfix] fix typo s/keywrods/keywords etc. (#5429 )	2024-03-12 11:25:16 +08:00
Steve Luo	f7aecc0c6b	feat rmsnorm cuda kernel and add unittest, benchmark script (#5417 )	2024-03-08 16:21:12 +08:00
Youngon	68f55a709c	[hotfix] fix stable diffusion inference bug. (#5289 ) * Update train_ddp.yaml delete "strategy" to fix DDP config loading bug in "main.py" * Update train_ddp.yaml fix inference with scripts/txt2img.py config file load bug. * Update README.md add pretrain model test code.	2024-03-05 22:03:40 +08:00
Luo Yihang	e239cf9060	[hotfix] fix typo of openmoe model source (#5403 )	2024-03-05 21:44:38 +08:00
MickeyCHAN	e304e4db35	[hotfix] fix sd vit import error (#5420 ) * fix import error * Update dpt_depth.py --------- Co-authored-by: binmakeswell <binmakeswell@gmail.com>	2024-03-05 21:41:23 +08:00
Hongxin Liu	070df689e6	[devops] fix extention building (#5427 )	2024-03-05 15:35:54 +08:00
flybird11111	29695cf70c	[example]add gpt2 benchmark example script. (#5295 ) * benchmark gpt2 * fix fix fix fix * [doc] fix typo in Colossal-LLaMA-2/README.md (#5247) * [workflow] fixed build CI (#5240) * [workflow] fixed build CI * polish * polish * polish * polish * polish * [ci] fixed booster test (#5251) * [ci] fixed booster test * [ci] fixed booster test * [ci] fixed booster test * [ci] fixed ddp test (#5254) * [ci] fixed ddp test * polish * fix typo in applications/ColossalEval/README.md (#5250) * [ci] fix shardformer tests. (#5255) * fix ci fix * revert: revert p2p * feat: add enable_metadata_cache option * revert: enable t5 tests --------- Co-authored-by: Wenhao Chen <cwher@outlook.com> * [doc] fix doc typo (#5256) * [doc] fix annotation display * [doc] fix llama2 doc * [hotfix]: add pp sanity check and fix mbs arg (#5268) * fix: fix misleading mbs arg * feat: add pp sanity check * fix: fix 1f1b sanity check * [workflow] fixed incomplete bash command (#5272) * [workflow] fixed oom tests (#5275) * [workflow] fixed oom tests * polish * polish * polish * [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276) * fix ci fix * fix test * revert: revert p2p * feat: add enable_metadata_cache option * revert: enable t5 tests * fix --------- Co-authored-by: Wenhao Chen <cwher@outlook.com> * [shardformer] hybridparallelplugin support gradients accumulation. (#5246) * support gradients acc fix fix fix fix fix fix fix fix fix fix fix fix fix * fix fix * fix fix fix * [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230) * fix auto loading gpt2 tokenizer (#5279) * [doc] add llama2-13B disyplay (#5285) * Update README.md * fix 13b typo --------- Co-authored-by: binmakeswell <binmakeswell@gmail.com> * fix llama pretrain (#5287) * fix * fix * fix fix * fix fix fix * fix fix * benchmark gpt2 * fix fix fix fix * [workflow] fixed build CI (#5240) * [workflow] fixed build CI * polish * polish * polish * polish * polish * [ci] fixed booster test (#5251) * [ci] fixed booster test * [ci] fixed booster test * [ci] fixed booster test * fix fix * fix fix fix * fix * fix fix fix fix fix * fix * Update shardformer.py --------- Co-authored-by: digger yu <digger-yu@outlook.com> Co-authored-by: Frank Lee <somerlee.9@gmail.com> Co-authored-by: Wenhao Chen <cwher@outlook.com> Co-authored-by: binmakeswell <binmakeswell@gmail.com> Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com> Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com> Co-authored-by: Desperado-Jia <502205863@qq.com>	2024-03-04 16:18:13 +08:00
FrankLeeeee	0310b76e9d	Merge branch 'main' into sync/main	2024-03-04 10:09:36 +08:00
yuehuayingxueluo	0aa27f1961	[Inference]Move benchmark-related code to the example directory. (#5408 ) * move benchmark-related code to the example directory. * fix bugs in test_fused_rotary_embedding.py	2024-02-28 16:46:03 +08:00
yuehuayingxueluo	600881a8ea	[Inference]Add CUDA KVCache Kernel (#5406 ) * add cuda KVCache kernel * annotation benchmark_kvcache_copy * add use cuda * fix import path * move benchmark scripts to example/ * rm benchmark codes in test_kv_cache_memcpy.py * rm redundancy codes * rm redundancy codes * pr was modified according to the review	2024-02-28 14:36:50 +08:00
Hongxin Liu	d882d18c65	[example] reuse flash attn patch (#5400 )	2024-02-27 11:22:07 +08:00
yuehuayingxueluo	bc1da87366	[Fix/Inference] Fix format of input prompts and input model in inference engine (#5395 ) * Fix bugs in inference_engine * fix bugs in engine.py * rm CUDA_VISIBLE_DEVICES * add request_ids in generate * fix bug in engine.py * add logger.debug for BatchBucket	2024-02-23 10:51:35 +08:00
yuehuayingxueluo	2a718c8be8	Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390 ) * opt_view_and_memcopy * fix bugs in ci * fix ci bugs * update benchmark scripts * fix ci bugs	2024-02-21 13:23:57 +08:00
Jianghai	730103819d	[Inference]Fused kv copy into rotary calculation (#5383 ) * revise rotary embedding * remove useless print * adapt * fix * add * fix * modeling * fix * fix * fix * fused kv copy * fused copy * colossalai/kernel/triton/no_pad_rotary_embedding.py * del padding llama * del	2024-02-21 11:31:48 +08:00
yuehuayingxueluo	8c69debdc7	[Inference]Support vllm testing in benchmark scripts (#5379 ) * add vllm benchmark scripts * fix code style * update run_benchmark.sh * fix code style	2024-02-08 15:27:26 +08:00
Frank Lee	8106ede07f	Revert "[Inference] Adapt to Fused rotary (#5348 )" (#5373 ) This reverts commit `9f4ab2eb92`.	2024-02-07 14:27:04 +08:00
Jianghai	9f4ab2eb92	[Inference] Adapt to Fused rotary (#5348 ) * revise rotary embedding * remove useless print * adapt * fix * add * fix * modeling * fix * fix * fix	2024-02-07 11:36:04 +08:00
yuehuayingxueluo	631862f339	[Inference]Optimize generation process of inference engine (#5356 ) * opt inference engine * fix run_benchmark.sh * fix generate in engine.py * rollback tesh_inference_engine.py	2024-02-02 15:38:21 +08:00
yuehuayingxueluo	21ad4a27f9	[Inference/opt]Optimize the mid tensor of RMS Norm (#5350 ) * opt rms_norm * fix bugs in rms_layernorm	2024-02-02 15:06:01 +08:00
yuehuayingxueluo	249644c23b	[Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation，add fused_qkv and fused linear_add (#5340 ) * add fused qkv * replace attn and mlp by shardformer * fix bugs in mlp * add docstrings * fix test_inference_engine.py * add optimize unbind * add fused_addmm * rm squeeze(1) * refactor codes * fix ci bugs * rename ShardFormerLlamaMLP and ShardFormerLlamaAttention * Removed the dependency on LlamaFlashAttention2 * rollback test_inference_engine.py	2024-02-01 15:49:39 +08:00
FrankLeeeee	c565519913	merge commit	2024-01-31 10:41:47 +08:00
digger yu	71321a07cf	fix typo change dosen't to doesn't (#5308 )	2024-01-30 09:57:38 +08:00
Frank Lee	8823cc4831	Merge pull request #5310 from hpcaitech/feature/npu Feature/npu	2024-01-29 13:49:39 +08:00
yuehuayingxueluo	4f28cb43c0	[inference]Optimize the usage of the mid tensors space in flash attn (#5304 ) * opt flash attn * opt tmp tensor * fix benchmark_llama * fix code style * fix None logic for output tensor * fix adapted to get_xine_cache * add comment * fix ci bugs * fix some codes * rm duplicated codes * rm duplicated codes * fix code style * add _get_dtype in config.py	2024-01-26 14:00:10 +08:00
Frank Lee	7cfed5f076	[feat] refactored extension module (#5298 ) * [feat] refactored extension module * polish * polish * polish * polish * polish * polish * polish * polish * polish * polish	2024-01-25 17:01:48 +08:00
digger yu	bce9499ed3	fix some typo (#5307 )	2024-01-25 13:56:27 +08:00
yuehuayingxueluo	bfff9254ac	[inference] Adapted to Rotary Embedding and RMS Norm (#5283 ) * adapted to rotary_embedding * adapted to nopad rms norm * fix bugs in benchmark * fix flash_decoding.py	2024-01-22 10:55:34 +08:00
flybird11111	f7e3f82a7e	fix llama pretrain (#5287 )	2024-01-19 17:49:02 +08:00
Jianghai	9e2342bde2	[Hotfix] Fix bugs in testing continuous batching (#5270 ) * fix bug * fix bugs * fix bugs * fix bugs and add padding * add funcs and fix bugs * fix typos * fix bugs * add func	2024-01-18 16:31:14 +08:00
ver217	148469348a	Merge branch 'main' into sync/npu	2024-01-18 12:05:21 +08:00
yuehuayingxueluo	86b63f720c	[Inference]Adapted to the triton attn kernels (#5264 ) * adapted to the triton attn kernels * fix pad input * adapted to copy_kv_to_blocked_cache * fix ci test * update kv memcpy * remove print	2024-01-17 16:03:10 +08:00
Wenhao Chen	ef4f0ee854	[hotfix]: add pp sanity check and fix mbs arg (#5268 ) * fix: fix misleading mbs arg * feat: add pp sanity check * fix: fix 1f1b sanity check	2024-01-15 15:57:40 +08:00
binmakeswell	c174c4fc5f	[doc] fix doc typo (#5256 ) * [doc] fix annotation display * [doc] fix llama2 doc	2024-01-11 21:01:11 +08:00
Hongxin Liu	d202cc28c0	[npu] change device to accelerator api (#5239 ) * update accelerator * fix timer * fix amp * update * fix * update bug * add error raise * fix autocast * fix set device * remove doc accelerator * update doc * update doc * update doc * use nullcontext * update cpu * update null context * change time limit for example * udpate * update * update * update * [npu] polish accelerator code --------- Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com> Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>	2024-01-09 10:20:05 +08:00
Xuanlei Zhao	dd2c28a323	[npu] use extension for op builder (#5172 ) * update extension * update cpu adam * update is * add doc for cpu adam * update kernel * update commit * update flash * update memory efficient * update flash attn * update flash attention loader * update api * fix * update doc * update example time limit * reverse change * fix doc * remove useless kernel * fix * not use warning * update * update	2024-01-08 11:39:16 +08:00
Wenhao Chen	3c0d82b19b	[pipeline]: support arbitrary batch size in forward_only mode (#5201 ) * fix: remove drop last in val & test dataloader * feat: add run_forward_only, support arbitrary bs * chore: modify ci script	2024-01-02 23:41:12 +08:00
Wenhao Chen	4fa689fca1	[pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134 ) * test: add more p2p tests * fix: remove send_forward_recv_forward as p2p op list need to use the same group * fix: make send and receive atomic * feat: update P2PComm fn * feat: add metadata cache in 1f1b * feat: add metadata cache in interleaved pp * feat: modify is_xx_stage fn * revert: add _broadcast_object_list * feat: add interleaved pp in llama policy * feat: set NCCL_BUFFSIZE in HybridParallelPlugin	2023-12-22 10:44:00 +08:00
flybird11111	21aa5de00b	[gemini] hotfix NaN loss while using Gemini + tensor_parallel (#5150 ) * fix aaa fix fix fix * fix * fix * test ci * fix ci fix	2023-12-08 11:10:51 +08:00
binmakeswell	177c79f2d1	[doc] add moe news (#5128 ) * [doc] add moe news * [doc] add moe news * [doc] add moe news	2023-11-28 17:44:06 +08:00
Wenhao Chen	7172459e74	[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088 ) * [shardformer] implement policy for all GPT-J models and test * [shardformer] support interleaved pipeline parallel for bert finetune * [shardformer] shardformer support falcon (#4883) * [shardformer]: fix interleaved pipeline for bert model (#5048) * [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093) * Add Mistral support for Shardformer (#5103) * [shardformer] add tests to mistral (#5105) --------- Co-authored-by: Pengtai Xu <henryxu880@gmail.com> Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com> Co-authored-by: flybird11111 <1829166702@qq.com> Co-authored-by: eric8607242 <e0928021388@gmail.com>	2023-11-28 16:54:42 +08:00
digger yu	d5661f0f25	[nfc] fix typo change directoty to directory (#5111 )	2023-11-27 18:25:53 +08:00
Xuanlei Zhao	3acbf6d496	[npu] add npu support for hybrid plugin and llama (#5090 ) * llama 3d * update * fix autocast	2023-11-22 19:23:21 +08:00
flybird11111	aae496631c	[shardformer]fix flash attention, when mask is casual, just don't unpad it (#5084 ) * fix flash attn * fix fix	2023-11-22 16:00:07 +08:00
Hongxin Liu	1cd7efc520	[inference] refactor examples and fix schedule (#5077 ) * [setup] refactor infer setup * [hotfix] fix infenrece behavior on 1 1 gpu * [exmaple] refactor inference examples	2023-11-21 10:46:03 +08:00
Bin Jia	4e3959d316	[hotfix/hybridengine] Fix init model with random parameters in benchmark (#5074 ) * fix init model with random parameters * fix example	2023-11-20 20:15:25 +08:00
github-actions[bot]	8921a73c90	[format] applied code formatting on changed files in pull request 5067 (#5072 ) Co-authored-by: github-actions <github-actions@github.com>	2023-11-20 19:46:43 +08:00
Xu Kai	fb103cfd6e	[inference] update examples and engine (#5073 ) * update examples and engine * fix choices * update example	2023-11-20 19:44:52 +08:00
Hongxin Liu	e5ce4c8ea6	[npu] add npu support for gemini and zero (#5067 ) * [npu] setup device utils (#5047) * [npu] add npu device support * [npu] support low level zero * [test] update npu zero plugin test * [hotfix] fix import * [test] recover tests * [npu] gemini support npu (#5052) * [npu] refactor device utils * [gemini] support npu * [example] llama2+gemini support npu * [kernel] add arm cpu adam kernel (#5065) * [kernel] add arm cpu adam * [optim] update adam optimizer * [kernel] arm cpu adam remove bf16 support	2023-11-20 16:12:41 +08:00
Cuiqing Li (李崔卿)	bce919708f	[Kernels]added flash-decoidng of triton (#5063 ) * added flash-decoidng of triton based on lightllm kernel * add req * clean * clean * delete build.sh --------- Co-authored-by: cuiqing.li <lixx336@gmail.com>	2023-11-20 13:58:29 +08:00
Xu Kai	fd6482ad8c	[inference] Refactor inference architecture (#5057 ) * [inference] support only TP (#4998) * support only tp * enable tp * add support for bloom (#5008) * [refactor] refactor gptq and smoothquant llama (#5012) * refactor gptq and smoothquant llama * fix import error * fix linear import torch-int * fix smoothquant llama import error * fix import accelerate error * fix bug * fix import smooth cuda * fix smoothcuda * [Inference Refactor] Merge chatglm2 with pp and tp (#5023) merge chatglm with pp and tp * [Refactor] remove useless inference code (#5022) * remove useless code * fix quant model * fix test import bug * mv original inference legacy * fix chatglm2 * [Refactor] refactor policy search and quant type controlling in inference (#5035) * [Refactor] refactor policy search and quant type controling in inference * [inference] update readme (#5051) * update readme * update readme * fix architecture * fix table * fix table * [inference] udpate example (#5053) * udpate example * fix run.sh * fix rebase bug * fix some errors * update readme * add some features * update interface * update readme * update benchmark * add requirements-infer --------- Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>	2023-11-19 21:05:05 +08:00
flybird11111	bc09b95f50	[exampe] fix llama example' loss error when using gemini plugin (#5060 ) fix llama example	2023-11-18 18:41:58 +08:00
Elsa Granger	b2ad0d9e8f	[pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark (#5017 ) * Use p2p * Cannot bidirectonal send p2p * Refactor tensor creation and serialization in P2P communication * Fix llama forward args in flash attention * Add flop estimate from megatron * Support loading weight not in weight_map when strict=False in hybrid_parallel * Use send_forward_recv_backward, etc in 1f1b * Use dataclass for metdata Remove torch.cuda.synchronize() as suggested * Add comment about the torch.cuda.synchronize for potential error * Typo * Update hybrid_parallel_checkpoint_io.py * Update p2p.py * Update one_f_one_b.py * Update p2p.py --------- Co-authored-by: flybird11111 <1829166702@qq.com>	2023-11-16 20:15:59 +08:00
Cuiqing Li (李崔卿)	28052a71fb	[Kernels]Update triton kernels into 2.1.0 (#5046 ) * update flash-context-attention * adding kernels * fix * reset * add build script * add building process * add llama2 exmaple * add colossal-llama2 test * clean * fall back test setting * fix test file * clean * clean * clean --------- Co-authored-by: cuiqing.li <lixx336@gmail.com>	2023-11-16 16:43:15 +08:00
Zhongkai Zhao	70885d707d	[hotfix] Suport extra_kwargs in ShardConfig (#5031 ) * [refactor]: replace inference args with extra_kwargs in ShardConfig * modify shardconfig * polish code * fix policy bug in llama * fix bug in auto policy * remove setattr in ShardConfig	2023-11-10 10:49:50 +08:00
Wenhao Chen	724441279b	[moe]: fix ep/tp tests, add hierarchical all2all (#4982 ) * fix: add warning for EP different behavior * fix: use shard_data in ep & tp model * to: add used_capacity * fix: fix router test * feat: add create_ep_node_group * feat: add create_ep_hierarchical_group fn * feat: add HierarchicalAllToAll * test: add hierarchical all2all test * fix: fix test errors * fix: simplify create_ep_hierarchical_group * fix: add hierarchical_alltoall arg * fix: fix environ typo * revert: revert process mesh order * to: add todo mark * fix: skip hierarchical_comm if torch < 1.13.1	2023-11-09 06:31:00 +00:00
Xuanlei Zhao	f71e63b0f3	[moe] support optimizer checkpoint (#5015 ) * Refactor MoE Manager setup method * unshard optim ckpt * optim io * update transformer version * update requirements * update ckpt * update ckpt * update ckpt * fix engine * fix engine	2023-11-08 15:07:03 +00:00
Xuanlei Zhao	dc003c304c	[moe] merge moe into main (#4978 ) * update moe module * support openmoe	2023-11-02 02:21:24 +00:00
Cuiqing Li	459a88c806	[Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965 ) * adding flash-decoding * clean * adding kernel * adding flash-decoding * add integration * add * adding kernel * adding kernel * adding triton 2.1.0 features for inference * update bloom triton kernel * remove useless vllm kernels * clean codes * fix * adding files * fix readme * update llama flash-decoding --------- Co-authored-by: cuiqing.li <lixx336@gmail.com>	2023-10-30 14:04:37 +08:00
アマデウス	4e4a10c97d	updated c++17 compiler flags (#4983 )	2023-10-27 18:19:56 +08:00
Jianghai	c6cd629e7a	[Inference]ADD Bench Chatglm2 script (#4963 ) * add bench chatglm * fix bug and make utils --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com>	2023-10-24 13:11:15 +08:00
Xu Kai	785802e809	[inference] add reference and fix some bugs (#4937 ) * add reference and fix some bugs * update gptq init --------- Co-authored-by: Xu Kai <xukai16@foxamil.com>	2023-10-20 13:39:34 +08:00
Cuiqing Li	3a41e8304e	[Refactor] Integrated some lightllm kernels into token-attention (#4946 ) * add some req for inference * clean codes * add codes * add some lightllm deps * clean codes * hello * delete rms files * add some comments * add comments * add doc * add lightllm deps * add lightllm cahtglm2 kernels * add lightllm cahtglm2 kernels * replace rotary embedding with lightllm kernel * add some commnets * add some comments * add some comments * add * replace fwd kernel att1 * fix a arg * add * add * fix token attention * add some comments * clean codes * modify comments * fix readme * fix bug * fix bug --------- Co-authored-by: cuiqing.li <lixx336@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>	2023-10-19 22:22:47 +08:00
Xu Kai	611a5a80ca	[inference] Add smmoothquant for llama (#4904 ) * [inference] add int8 rotary embedding kernel for smoothquant (#4843) * [inference] add smoothquant llama attention (#4850) * add smoothquant llama attention * remove uselss code * remove useless code * fix import error * rename file name * [inference] add silu linear fusion for smoothquant llama mlp (#4853) * add silu linear * update skip condition * catch smoothquant cuda lib exception * prcocess exception for tests * [inference] add llama mlp for smoothquant (#4854) * add llama mlp for smoothquant * fix down out scale * remove duplicate lines * add llama mlp check * delete useless code * [inference] add smoothquant llama (#4861) * add smoothquant llama * fix attention accuracy * fix accuracy * add kv cache and save pretrained * refactor example * delete smooth * refactor code * [inference] add smooth function and delete useless code for smoothquant (#4895) * add smooth function and delete useless code * update datasets * remove duplicate import * delete useless file * refactor codes (#4902) * rafactor code * add license * add torch-int and smoothquant license	2023-10-16 11:28:44 +08:00
Blagoy Simandoff	8aed02b957	[nfc] fix minor typo in README (#4846 )	2023-10-07 17:51:11 +08:00
Xu Kai	d1fcc0fa4d	[infer] fix test bug (#4838 ) * fix test bug * delete useless code * fix typo	2023-10-04 10:01:03 +08:00
Jianghai	013a4bedf0	[inference]fix import bug and delete down useless init (#4830 ) * fix import bug and release useless init * fix * fix * fix	2023-10-04 09:18:45 +08:00
Yuanheng Zhao	573f270537	[Infer] Serving example w/ ray-serve (multiple GPU case) (#4841 ) * fix imports * add ray-serve with Colossal-Infer tp * trivial: send requests script * add README * fix worker port * fix readme * use app builder and autoscaling * trivial: input args * clean code; revise readme * testci (skip example test) * use auto model/tokenizer * revert imports fix (fixed in other PRs)	2023-10-02 17:48:38 +08:00
Yuanheng Zhao	3a74eb4b3a	[Infer] Colossal-Inference serving example w/ TorchServe (single GPU case) (#4771 ) * add Colossal-Inference serving example w/ TorchServe * add dockerfile * fix dockerfile * fix dockerfile: fix commit hash, install curl * refactor file structure * revise readme * trivial * trivial: dockerfile format * clean dir; revise readme * fix comments: fix imports and configs * fix formats * remove unused requirements	2023-10-02 17:42:37 +08:00
binmakeswell	822051d888	[doc] update slack link (#4823 )	2023-09-27 17:37:39 +08:00
flybird11111	26cd6d850c	[fix] fix weekly runing example (#4787 ) * [fix] fix weekly runing example * [fix] fix weekly runing example	2023-09-25 16:19:33 +08:00
Xu Kai	946ab56c48	[feature] add gptq for inference (#4754 ) * [gptq] add gptq kernel (#4416) * add gptq * refactor code * fix tests * replace auto-gptq * rname inferance/quant * refactor test * add auto-gptq as an option * reset requirements * change assert and check auto-gptq * add import warnings * change test flash attn version * remove example * change requirements of flash_attn * modify tests * [skip ci] change requirements-test * [gptq] faster gptq cuda kernel (#4494) * [skip ci] add cuda kernels * add license * [skip ci] fix max_input_len * format files & change test size * [skip ci] * [gptq] add gptq tensor parallel (#4538) * add gptq tensor parallel * add gptq tp * delete print * add test gptq check * add test auto gptq check * [gptq] combine gptq and kv cache manager (#4706) * combine gptq and kv cache manager * add init bits * delete useless code * add model path * delete usless print and update test * delete usless import * move option gptq to shard config * change replace linear to shardformer * update bloom policy * delete useless code * fix import bug and delete uselss code * change colossalai/gptq to colossalai/quant/gptq * update import linear for tests * delete useless code and mv gptq_kernel to kernel directory * fix triton kernel * add triton import	2023-09-22 11:02:50 +08:00

1 2 3 4 5 ...

470 Commits (e0c68ab6d3d64f401208d6ec66815995cee233c3)