ColossalAI

Commit Graph

Author	SHA1	Message	Date
Jiarui Fang	294a6060d0	[tensor] ZeRO use ColoTensor as the base class. (#828 ) * [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. * [tensor] ZeRO use ColoTensor as the base class. * polish	3 years ago
Ziyue Jiang	8e6fdb4f29	[tensor]fix test_linear (#826 )	3 years ago
Ziyue Jiang	1a9e2c2dff	[tensor] fix kwargs in colo_tensor torch_funtion (#825 )	3 years ago
Jiarui Fang	eb1b89908c	[refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824 )	3 years ago
Jiarui Fang	2ecc3d7a55	[tensor] lazy init (#823 )	3 years ago
Jiarui Fang	68dcd51d41	[Tensor] update ColoTensor torch_function (#822 ) * Revert "[zero] add ZeroTensorShardStrategy (#793)" This reverts commit `88759e289e`. * [gemini] set cpu memory capacity * [log] local throughput collecting * polish * polish * polish * polish code * polish * polish code * add a new tensor structure and override linear for it * polish * polish * polish * polish * polish * polish * polish * polish * polish * polish * polish * [tensor] renaming and reorganize directory structure. * rm useless dir * polish * polish * [tensor] hander the function not wrapped * polish	3 years ago
Jiarui Fang	0ce8924ceb	[tensor] reorganize files (#820 )	3 years ago
Jiarui Fang	ab962b9735	[gemini] a new tensor structure (#818 ) * Revert "[zero] add ZeroTensorShardStrategy (#793)" This reverts commit `88759e289e`. * [gemini] set cpu memory capacity * [log] local throughput collecting * polish * polish * polish * polish code * polish * polish code * add a new tensor structure and override linear for it * polish * polish * polish * polish * polish * polish * polish * polish * polish * polish * polish	3 years ago
FrankLeeeee	70ed11d07e	[cli] added check installation cli	3 years ago
YuliangLiu0306	c7eca40f51	Merge pull request #812 from FrankLeeeee/feature/cli [cli] fixed single-node process launching	3 years ago
Jiarui Fang	3ddbd1bce1	[gemini] collect cpu-gpu moving volume in each iteration (#813 )	3 years ago
FrankLeeeee	d522cb704e	[cli] fixed single-node process launching	3 years ago
Jiarui Fang	61c20b44bc	[log] local throughput metrics (#811 ) * Revert "[zero] add ZeroTensorShardStrategy (#793)" This reverts commit `88759e289e`. * [gemini] set cpu memory capacity * [log] local throughput collecting * polish * polish * polish * polish code * polish	3 years ago
ver217	dd92b90a68	[DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808 ) * init fp16 param directly * polish code	3 years ago
Jiarui Fang	227d1cd4b3	[gemini] APIs to set cpu memory capacity (#809 )	3 years ago
FrankLeeeee	f63e91d280	[cli] fixed a bug in user args and refactored the module structure	3 years ago
Jiarui Fang	e761ad2cd7	Revert "[zero] add ZeroTensorShardStrategy (#793 )" (#806 )	3 years ago
HELSON	88759e289e	[zero] add ZeroTensorShardStrategy (#793 )	3 years ago
Jiarui Fang	681addb512	[refactor] moving grad acc logic to engine (#804 )	3 years ago
Frank Lee	05d9ae5999	[cli] add missing requirement (#805 )	3 years ago
YuliangLiu0306	de2f581d43	[cli] added micro benchmarking for tp (#789 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [CLI]add cli benchmark feature * fix CodeFactor issues. * refactor the module structure.	3 years ago
YuliangLiu0306	cfadc9df8e	[cli] added distributed launcher command (#791 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [CLI]add cli launcher feature * remove testing message used during developing * refactor the module structure.	3 years ago
Jiarui Fang	4d9332b4c5	[refactor] moving memtracer to gemini (#801 )	3 years ago
Jiarui Fang	8711c706f4	[hotfix] fix grad offload when enabling reuse_fp16_shard	3 years ago
ver217	f1fa1a675f	fix grad offload when enabling reuse_fp16_shard	3 years ago
HELSON	4c4388c46e	[hotfix] fix memory leak in zero (#781 )	3 years ago
Ziyue Jiang	4b01da24cd	[TP] change the check assert in split batch 2d (#772 )	3 years ago
ver217	846406a07a	[gemini] fix auto tensor placement policy (#775 )	3 years ago
HELSON	a65cbb7e4e	[zero] refactor shard and gather operation (#773 )	3 years ago
ver217	6e553748a7	polish sharded optim docstr and warning (#770 )	3 years ago
LuGY	80e37eec42	fix the ckpt bugs when using DDP (#769 )	3 years ago
Frank Lee	920fe31526	[compatibility] used backward-compatible API for global process group (#758 )	3 years ago
Frank Lee	4ea49cb536	[test] added a decorator for address already in use error with backward compatibility (#760 ) * [test] added a decorator for address already in use error with backward compatibility * [test] added a decorator for address already in use error with backward compatibility	3 years ago
Jiarui Fang	10ef8afdd2	[gemini] init genimi individual directory (#754 )	3 years ago
ver217	dcca614eee	[hotfix] fix test_stateful_tensor_mgr (#762 )	3 years ago
ver217	a93a7d7364	[hotfix] fix reuse_fp16_shard of sharded model (#756 ) * fix reuse_fp16_shard * disable test stm * polish code	3 years ago
ver217	8f7ce94b8e	[hotfix] fix auto tensor placement policy (#753 )	3 years ago
HELSON	84c6700b2a	[zero] refactor memstats_collector (#746 )	3 years ago
アマデウス	b8899e0905	[TP] allow layernorm without bias (#750 )	3 years ago
Jiarui Fang	3d7dc46d33	[zero] use factory pattern for tensor_placement_policy (#752 )	3 years ago
ver217	4b048a8728	fix prepare grads in sharded optim (#749 )	3 years ago
ver217	097772546e	fix initialize about zero	3 years ago
ver217	e396bb71f2	[zero] add tensor placement policies (#743 ) * add tensor placement policies * polish comments * polish comments * update moe unit tests	3 years ago
HELSON	22c4b88d56	[zero] refactor ShardedParamV2 for convenience (#742 )	3 years ago
HELSON	340e59f968	[utils] add synchronized cuda memory monitor (#740 )	3 years ago
ver217	e6212f56cd	[hotfix] fix memory leak in backward of sharded model (#741 )	3 years ago
Frank Lee	a4e91bc87f	[bug] fixed grad scaler compatibility with torch 1.8 (#735 )	3 years ago
Jiarui Fang	53cb584808	[utils] correct cpu memory used and capacity in the context of multi-process (#726 )	3 years ago
Jiarui Fang	7db3ccc79b	[hotfix] remove duplicated param register to stateful tensor manager (#728 )	3 years ago
Frank Lee	1cb7bdad3b	[util] fixed communication API depth with PyTorch 1.9 (#721 )	3 years ago

328 Commits (294a6060d003b7767da4e382f7b8dea7b83d6b6b)