ColossalAI

Commit Graph

Author	SHA1	Message	Date
Edenzzzz	f5c84af0b0	[Feature] Zigzag Ring attention (#5905 ) * halfway * fix cross-PP-stage position id length diff bug * fix typo * fix typo * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * unified cross entropy func for all shardformer models * remove redundant lines * add basic ring attn; debug cross entropy * fwd bwd logic complete * fwd bwd logic complete; add experimental triton rescale * precision tests passed * precision tests passed * fix typos and remove misc files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add sp_mode to benchmark; fix varlen interface * update softmax_lse shape by new interface * change tester name * remove buffer clone; support packed seq layout * add varlen tests * fix typo * all tests passed * add dkv_group; fix mask * remove debug statements --------- Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	3 months ago
pre-commit-ci[bot]	7c2f79fa98	[pre-commit.ci] pre-commit autoupdate (#5572 ) * [pre-commit.ci] pre-commit autoupdate updates: - [github.com/PyCQA/autoflake: v2.2.1 → v2.3.1](https://github.com/PyCQA/autoflake/compare/v2.2.1...v2.3.1) - [github.com/pycqa/isort: 5.12.0 → 5.13.2](https://github.com/pycqa/isort/compare/5.12.0...5.13.2) - [github.com/psf/black-pre-commit-mirror: 23.9.1 → 24.4.2](https://github.com/psf/black-pre-commit-mirror/compare/23.9.1...24.4.2) - [github.com/pre-commit/mirrors-clang-format: v13.0.1 → v18.1.7](https://github.com/pre-commit/mirrors-clang-format/compare/v13.0.1...v18.1.7) - [github.com/pre-commit/pre-commit-hooks: v4.3.0 → v4.6.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.3.0...v4.6.0) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	5 months ago
flybird11111	a1e39f4c0d	[install]fix setup (#5786 ) * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	6 months ago
Charles Coulombe	c46e09715c	Allow building cuda extension without a device. (#5535 ) Added FORCE_CUDA environment variable support, to enable building extensions where a GPU device is not present but cuda libraries are.	6 months ago
傅剑寒	121d7ad629	[Inference] Delete duplicated copy_vector (#5716 )	7 months ago
Steve Luo	7806842f2d	add paged-attetionv2: support seq length split across thread block (#5707 )	7 months ago
傅剑寒	50104ab340	[Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706 ) * add convert_fp8 op for fp8 test in the future * rerun ci	7 months ago
傅剑寒	1ace1065e6	[Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (#5686 )	7 months ago
Steve Luo	725fbd2ed0	[Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679 )	7 months ago
傅剑寒	9df016fc45	[Inference] Fix quant bits order (#5681 )	7 months ago
傅剑寒	ef8e4ffe31	[Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680 )	7 months ago
Steve Luo	5cd75ce4c7	[Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663 ) * refactor kvcache manager and rotary_embedding and kvcache_memcpy operator * refactor decode_kv_cache_memcpy * enable alibi in pagedattention	7 months ago
傅剑寒	808ee6e4ad	[Inference/Feat] Feat quant kvcache step2 (#5674 )	7 months ago
傅剑寒	8ccb6714e7	[Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656 )	7 months ago
Steve Luo	a8fd3b0342	[Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643 ) * optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x]) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	7 months ago
傅剑寒	279300dc5f	[Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613 ) * refactor compilation mechanism and unified multi hw * fix file path bug * add init.py to make pybind a module to avoid relative path error caused by softlink * delete duplicated micros * fix micros bug in gcc	7 months ago
yuehuayingxueluo	12f10d5b0b	[Fix/Inference]Fix CUDA Rotary Rmbedding GQA (#5623 ) * fix rotary embedding GQA * change test_rotary_embdding_unpad.py KH	7 months ago
Steve Luo	ccf72797e3	feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611 )	7 months ago
Steve Luo	be396ad6cc	[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531 ) * feat flash decoding for paged attention * refactor flashdecodingattention * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	7 months ago
傅剑寒	d4cb023b62	[Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593 ) * delete duplicated code and refactor vec_copy utils and reduce utils * delete unused header file	8 months ago
傅剑寒	a21912339a	refactor csrc (#5582 )	8 months ago
pre-commit-ci[bot]	d78817539e	[pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci	8 months ago
Yuanheng	ed5ebd1735	[Fix] resolve conflicts of merging main	8 months ago
Hongxin Liu	641b1ee71a	[devops] remove post commit ci (#5566 ) * [devops] remove post commit ci * [misc] run pre-commit on all files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	8 months ago
傅剑寒	7ebdf48ac5	add cast and op_functor for cuda build-in types (#5546 )	8 months ago
傅剑寒	a2878e39f4	[Inference] Add Reduce Utils (#5537 ) * add reduce utils * add using to delele namespace prefix	8 months ago
yuehuayingxueluo	04aca9e55b	[Inference/Kernel]Add get_cos_and_sin Kernel (#5528 ) * Add get_cos_and_sin kernel * fix code comments * fix code typos * merge common codes of get_cos_and_sin kernel. * Fixed a typo * Changed 'asset allclose' to 'assert equal'.	8 months ago
yuehuayingxueluo	934e31afb2	The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519 )	8 months ago
Hongxin Liu	19e1a5cf16	[shardformer] update colo attention to support custom mask (#5510 ) * [feature] refactor colo attention (#5462) * [extension] update api * [feature] add colo attention * [feature] update sdpa * [feature] update npu attention * [feature] update flash-attn * [test] add flash attn test * [test] update flash attn test * [shardformer] update modeling to fit colo attention (#5465) * [misc] refactor folder structure * [shardformer] update llama flash-attn * [shardformer] fix llama policy * [devops] update tensornvme install * [test] update llama test * [shardformer] update colo attn kernel dispatch * [shardformer] update blip2 * [shardformer] update chatglm * [shardformer] update gpt2 * [shardformer] update gptj * [shardformer] update opt * [shardformer] update vit * [shardformer] update colo attention mask prep * [shardformer] update whisper * [test] fix shardformer tests (#5514) * [test] fix shardformer tests * [test] fix shardformer tests	8 months ago
yuehuayingxueluo	87079cffe8	[Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461 ) * Support FP16/BF16 Flash Attention 2 * fix bugs in test_kv_cache_memcpy.py * add context_kv_cache_memcpy_kernel.cu * rm typename MT * add tail process * add high_precision * add high_precision to config.py * rm unused code * change the comment for the high_precision parameter * update test_rotary_embdding_unpad.py * fix vector_copy_utils.h * add comment for self.high_precision when using float32	8 months ago
傅剑寒	7ff42cc06d	add vec_type_trait implementation (#5473 )	8 months ago
xs_courtesy	48c4f29b27	refactor vector utils	8 months ago
xs_courtesy	5724b9e31e	add some comments	9 months ago
xs_courtesy	388e043930	add implementatino for GetGPULaunchConfig1D	9 months ago
yuehuayingxueluo	f366a5ea1f	[Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418 ) * add rotary embedding kernel * add rotary_embedding_kernel * add fused rotary_emb and kvcache memcopy * add fused_rotary_emb_and_cache_kernel.cu * add fused_rotary_emb_and_memcopy * fix bugs in fused_rotary_emb_and_cache_kernel.cu * fix ci bugs * use vec memcopy and opt the gloabl memory access * fix code style * fix test_rotary_embdding_unpad.py * codes revised based on the review comments * fix bugs about include path * rm inline	9 months ago
Steve Luo	ed431de4e4	fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test (#5454 )	9 months ago
xs_courtesy	c1c45e9d8e	fix include path	9 months ago
Steve Luo	b699f54007	optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441 )	9 months ago
xs_courtesy	095c070a6e	refactor code	9 months ago
傅剑寒	21e1e3645c	Merge pull request #5435 from Courtesy-Xs/add_gpu_launch_config Add query and other components	9 months ago
Steve Luo	f7aecc0c6b	feat rmsnorm cuda kernel and add unittest, benchmark script (#5417 )	9 months ago
xs_courtesy	5eb5ff1464	refactor code	9 months ago
xs_courtesy	01d289d8e5	Merge branch 'feature/colossal-infer' of https://github.com/hpcaitech/ColossalAI into add_gpu_launch_config	9 months ago
xs_courtesy	a46598ac59	add reusable utils for cuda	9 months ago
xs_courtesy	95c21498d4	add silu_and_mul for infer	9 months ago
Hongxin Liu	070df689e6	[devops] fix extention building (#5427 )	9 months ago
FrankLeeeee	0310b76e9d	Merge branch 'main' into sync/main	9 months ago
yuehuayingxueluo	600881a8ea	[Inference]Add CUDA KVCache Kernel (#5406 ) * add cuda KVCache kernel * annotation benchmark_kvcache_copy * add use cuda * fix import path * move benchmark scripts to example/ * rm benchmark codes in test_kv_cache_memcpy.py * rm redundancy codes * rm redundancy codes * pr was modified according to the review	9 months ago
Hongxin Liu	ffffc32dc7	[checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347 ) * [checkpointio] fix hybrid parallel optim checkpoint * [extension] fix cuda extension * [checkpointio] fix gemini optimizer checkpoint * polish code	10 months ago
Frank Lee	abd8e77ad8	[extension] fixed exception catch (#5342 )	10 months ago

53 Commits (bc7eeade33e33e3a7c2df26fedab707f3a62d6fe)