ColossalAI

Commit Graph

Author	SHA1	Message	Date
Jiarui Fang	52736205d9	[checkpoint] make unitest faster (#1217 )	2 years ago
Jiarui Fang	f38006ea83	[checkpoint] checkpoint for ColoTensor Model (#1196 )	2 years ago
Jiarui Fang	ae7d3f4927	[refactor] move process group from _DistSpec to ColoTensor. (#1203 )	2 years ago
YuliangLiu0306	63d2a93878	[context]support arbitary module materialization. (#1193 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [context]support arbitary module materialization. * [test]add numerical check for lazy init context.	2 years ago
YuliangLiu0306	2053e138a2	[context]use meta tensor to init model lazily. (#1187 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [context]use meta tensor to init model lazily. * polish * make module with device kwargs bypass the normal init. * change unit test to adapt updated context.	2 years ago
YuliangLiu0306	e27645376d	[hotfix]different overflow status lead to communication stuck. (#1175 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [hotfix]fix some bugs caused by refactored schedule. * [hotfix]different overflow statu llead to communication stuck.	2 years ago
Jiarui Fang	4b9bba8116	[ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168 )	2 years ago
Frank Lee	f8eec98ff5	[tensor] fixed non-serializable colo parameter during model checkpointing (#1153 )	2 years ago
Frank Lee	73ad05fc8c	[zero] added error message to handle on-the-fly import of torch Module class (#1135 ) * [zero] added error message to handle on-the-fly import of torch Module class * polish code	2 years ago
Frank Lee	2b2dc1c86b	[pipeline] refactor the pipeline module (#1087 ) * [pipeline] refactor the pipeline module * polish code	3 years ago
Frank Lee	bad5d4c0a1	[context] support lazy init of module (#1088 ) * [context] support lazy init of module * polish code	3 years ago
Frank Lee	bfdc5ccb7b	[context] maintain the context object in with statement (#1073 )	3 years ago
Jiarui Fang	49832b2344	[refactory] add nn.parallel module (#1068 )	3 years ago
Jiarui Fang	a00644079e	reorgnize colotensor directory (#1062 ) * reorgnize colotensor directory * polish code	3 years ago
Ziyue Jiang	df9dcbbff6	[Tensor] add hybrid device demo and fix bugs (#1059 )	3 years ago
Ziyue Jiang	7c530b9de2	[Tensor] add Parameter inheritance for ColoParameter (#1041 ) * add Parameter inheritance for ColoParameter * remove tricks * remove tricks * polish * polish	3 years ago
Ziyue Jiang	6c5996a56e	[Tensor] add module check and bert test (#1031 ) * add Embedding * Add bert test * polish * add check module test * polish * polish * polish * polish	3 years ago
Ziyue Jiang	32291dd73f	[Tensor] add module handler for linear (#1021 ) * add module spec for linear * polish * polish * polish	3 years ago
ver217	007ca0df92	fix colo init context (#1026 )	3 years ago
ver217	ad536e308e	[tensor] refactor colo-tensor (#992 ) * refactor colo-tensor and update linear op * polish code * polish code * update ops and unit tests * update unit tests * polish code * rename dist_spec module * polish code * polish code * remove unneeded import * fix pipelinable	3 years ago
Ziyue Jiang	d73c2b1d79	[Tensor] fix init context (#931 ) * change torch.Parameter to ColoParameter * fix post assignment for init context * polish * polish	3 years ago
Ziyue Jiang	dfc88b85ea	[Tensor] simplify named param (#928 ) * simplify ColoModulize * simplify ColoModulize * polish * polish	3 years ago
YuliangLiu0306	32a45cd7ef	[pipelinable]use pipelinable to support GPT model. (#903 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [pipelinable]use pipelinable to support GPT model. * fix a bug caused by ShardedModel * polish * fix front func list	3 years ago
Ziyue Jiang	c195d2814c	[Tensor] add from_pretrained support and bert pretrained test (#921 ) * add from_pretrained support and test * polish * polish * polish * polish	3 years ago
Jiarui Fang	ab95ec9aea	[Tensor] init ColoParameter (#914 )	3 years ago
Jiarui Fang	d16671da75	[Tensor] initialize the ColoOptimizer (#898 ) * [Tensor] activation is an attr of ColoTensor * [Tensor] add optimizer * only detach parameters in context * polish code	3 years ago
Jiarui Fang	676f191532	[Tensor] activation is an attr of ColoTensor (#897 )	3 years ago
Jiarui Fang	26c49639d8	[Tensor] overriding paramters() for Module using ColoTensor (#889 )	3 years ago
ver217	4df6471f5d	fix import error (#880 )	3 years ago
Jiarui Fang	d01d3b8cb0	colo init context add device attr. (#866 )	3 years ago
YuliangLiu0306	c6930d8ddf	[pipelinable]use ColoTensor to replace dummy tensor. (#853 )	3 years ago
ver217	232142f402	[utils] refactor profiler (#837 ) * add model data profiler * add a subclass of torch.profiler.profile * refactor folder structure * remove redundant codes * polish code * use GeminiMemoryManager * fix import path * fix stm profiler ext * polish comments * remove useless file	3 years ago
Jiarui Fang	62f059251b	[Tensor] init a tp network training unittest (#849 )	3 years ago
ver217	0dea140760	[hotfix] add deconstructor for stateful tensor (#848 ) * add deconstructor for stateful tensor * fix colo init context	3 years ago
YuliangLiu0306	35ea6e1023	[pipelinable]use pipelinable context to initialize non-pipeline model (#816 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [pipeline]add module lazy init feature to support large model initization. * [pipeline]add to_layer_list and partition method to support arbitrary non-pp model * refactor the module structure * polish * [pipelinable]add unit test for pipelinable * polish * polish * Fix CodeFactor issues.	3 years ago
Jiarui Fang	8789850eea	Init Conext supports lazy allocate model memory (#842 )	3 years ago
Jiarui Fang	eb1b89908c	[refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824 )	3 years ago
Jiarui Fang	227d1cd4b3	[gemini] APIs to set cpu memory capacity (#809 )	3 years ago
Jiarui Fang	681addb512	[refactor] moving grad acc logic to engine (#804 )	3 years ago
Jiarui Fang	4d9332b4c5	[refactor] moving memtracer to gemini (#801 )	3 years ago
HELSON	84c6700b2a	[zero] refactor memstats_collector (#746 )	3 years ago
HELSON	340e59f968	[utils] add synchronized cuda memory monitor (#740 )	3 years ago
Jiarui Fang	53cb584808	[utils] correct cpu memory used and capacity in the context of multi-process (#726 )	3 years ago
Frank Lee	2412429d54	[util] fixed activation checkpointing on torch 1.9 (#719 )	3 years ago
Jiarui Fang	193dc8dacb	[refactor] refactor the memory utils (#715 )	3 years ago
LuGY	140263a394	[hotfix]fixed bugs of assigning grad states to non leaf nodes (#711 ) * fixed bugs of assigning grad states to non leaf nodes * use detach()	3 years ago
ver217	ab8c6b4a0e	[zero] refactor memstats collector (#706 ) * refactor memstats collector * fix disposable * polish code	3 years ago
ver217	3c9cd5bb5e	[zero] stateful tensor manager (#687 ) * [WIP] stateful tensor manager * add eviction strategy * polish code * polish code * polish comment * add unit test * fix sampler bug * polish code * fix max sampling cnt resetting bug * fix sampler bug * polish code * fix bug * fix unit test Co-authored-by: jiaruifang <fangjiarui123@gmail.com>	3 years ago
Jiarui Fang	59bf2dc590	[zero] initialize a stateful tensor manager (#614 )	3 years ago
Jiarui Fang	0aab52301e	[hotfix] fix a bug in model data stats tracing (#655 )	3 years ago

124 Commits (11973d892d0273cc4719e30997cfaeafe4bc506c)