ColossalAI

Commit Graph

Author	SHA1	Message	Date
Jiarui Fang	9f4fb3f28a	[ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937 )	2 years ago
Frank Lee	e6ec99d389	[utils] fixed lazy init context (#1867 )	2 years ago
Jiarui Fang	3ce4463fe6	[utils] remove lazy_memory_allocate from ColoInitContext (#1844 )	2 years ago
ver217	99870726b1	[CheckpointIO] a uniform checkpoint I/O module (#1689 )	2 years ago
HELSON	1468e4bcfc	[zero] add constant placement policy (#1705 ) * fixes memory leak when paramter is in fp16 in ZeroDDP init. * bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release. * adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.	2 years ago
Kirigaya Kazuto	3b2a59b0ba	[pipeline/rank_recorder] fix bug when process data before backward | add a tool for multiple ranks debug (#1681 ) * [pipeline/tuning] improve dispatch performance both time and space cost * [pipeline/converge] add interface for testing convergence * [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style * Update PipelineBase.py * [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera * [pipeline/chimera] test chimera | fix bug of initializing * [pipeline/pytree] add pytree to process args and kwargs | provide to process args and kwargs after forward	2 years ago
CsRic	2ac46f7be4	[NFC] polish utils/tensor_detector/__init__.py code style (#1573 ) Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>	2 years ago
LuGY	c7d4932956	[NFC] polish colossalai/utils/tensor_detector/tensor_detector.py code style (#1566 )	2 years ago
Kirigaya Kazuto	318fbf1145	[NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style (#1559 )	2 years ago
ver217	ae71036cd2	[utils] refactor parallel layers checkpoint and bcast model on loading checkpoint (#1548 ) * refactor parallel layer * broadcast rank0 model after load ckpt	2 years ago
ver217	2bed096848	[utils] optimize partition_tensor_parallel_state_dict (#1546 )	2 years ago
ver217	a203b709d5	[hotfix] fix init context (#1543 ) * fix init context * fix lazy init ctx	2 years ago
Boyuan Yao	47fd8e4a02	[utils] Add use_reetrant=False in utils.activation_checkpoint (#1460 ) * [utils] Add use_reetrant=False into colossalai checkpoint * [utils] add some annotation in utils.activaion_checkpoint * [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py * [test] modify test_activation_checkpoint.py * [test] modify test for reentrant=False	2 years ago
Frank Lee	5a52e21fe3	[test] fixed the activation codegen test (#1447 ) * [test] fixed the activation codegen test * polish code	2 years ago
ver217	821c6172e2	[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442 )	2 years ago
HELSON	527758b2ae	[hotfix] fix a running error in test_colo_checkpoint.py (#1387 )	2 years ago
HELSON	b6fd165f66	[checkpoint] add kwargs for load_state_dict (#1374 )	2 years ago
Frank Lee	0c1a16ea5b	[util] standard checkpoint function naming (#1377 )	2 years ago
Super Daniel	be229217ce	[fx] add torchaudio test (#1369 ) * [fx]add torchaudio test * [fx]add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test and test patches * Delete ~ * [fx] add patches and patches test * [fx] add patches and patches test * [fx] fix patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] merge upstream * [fx] fix import errors	2 years ago
HELSON	8463290642	[checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368 )	2 years ago
HELSON	87775a0682	[colotensor] use cpu memory to store state_dict (#1367 )	2 years ago
HELSON	943a96323e	[hotfix] fix no optimizer in save/load (#1363 )	2 years ago
HELSON	7a8702c06d	[colotensor] add Tensor.view op and its unit test (#1343 ) [colotensor] add megatron initialization for gpt2	2 years ago
Frank Lee	2cc1175c76	[fx] tested the complete workflow for auto-parallel (#1336 ) * [fx] tested the complete workflow for auto-parallel * polish code * polish code * polish code	2 years ago
HELSON	f92c100ddd	[checkpoint] use gather_tensor in checkpoint and update its unit test (#1339 )	2 years ago
Frank Lee	250be4d31e	[utils] integrated colotensor with lazy init context (#1324 ) * [utils] integrated colotensor with lazy init context * polish code * polish code * polish code	2 years ago
Jiarui Fang	9e4c6449b0	[checkpoint] add ColoOptimizer checkpointing (#1316 )	2 years ago
Jiarui Fang	3ef3791a3b	[checkpoint] add test for bert and hotfix save bugs (#1297 )	2 years ago
Jiarui Fang	4165eabb1e	[hotfix] remove potiential circle import (#1307 ) * make it faster * [hotfix] remove circle import	2 years ago
Jiarui Fang	c92f84fcdb	[tensor] distributed checkpointing for parameters (#1240 )	2 years ago
Jiarui Fang	9bcd2fd4af	[tensor] a shorter shard and replicate spec (#1245 )	2 years ago
Jiarui Fang	20da6e48c8	[checkpoint] save sharded optimizer states (#1237 )	2 years ago
Jiarui Fang	3b500984b1	[tensor] fix some unittests (#1234 )	2 years ago
ver217	a45ddf2d5f	[hotfix] fix sharded optim step and clip_grad_norm (#1226 )	2 years ago
Yi Zhao	04537bf83e	[checkpoint]support generalized scheduler (#1222 )	2 years ago
Jiarui Fang	52736205d9	[checkpoint] make unitest faster (#1217 )	2 years ago
Jiarui Fang	f38006ea83	[checkpoint] checkpoint for ColoTensor Model (#1196 )	2 years ago
Jiarui Fang	ae7d3f4927	[refactor] move process group from _DistSpec to ColoTensor. (#1203 )	2 years ago
YuliangLiu0306	63d2a93878	[context]support arbitary module materialization. (#1193 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [context]support arbitary module materialization. * [test]add numerical check for lazy init context.	2 years ago
YuliangLiu0306	2053e138a2	[context]use meta tensor to init model lazily. (#1187 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [context]use meta tensor to init model lazily. * polish * make module with device kwargs bypass the normal init. * change unit test to adapt updated context.	2 years ago
YuliangLiu0306	e27645376d	[hotfix]different overflow status lead to communication stuck. (#1175 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [hotfix]fix some bugs caused by refactored schedule. * [hotfix]different overflow statu llead to communication stuck.	2 years ago
Jiarui Fang	4b9bba8116	[ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168 )	2 years ago
Frank Lee	f8eec98ff5	[tensor] fixed non-serializable colo parameter during model checkpointing (#1153 )	2 years ago
Frank Lee	73ad05fc8c	[zero] added error message to handle on-the-fly import of torch Module class (#1135 ) * [zero] added error message to handle on-the-fly import of torch Module class * polish code	2 years ago
Frank Lee	2b2dc1c86b	[pipeline] refactor the pipeline module (#1087 ) * [pipeline] refactor the pipeline module * polish code	3 years ago
Frank Lee	bad5d4c0a1	[context] support lazy init of module (#1088 ) * [context] support lazy init of module * polish code	3 years ago
Frank Lee	bfdc5ccb7b	[context] maintain the context object in with statement (#1073 )	3 years ago
Jiarui Fang	49832b2344	[refactory] add nn.parallel module (#1068 )	3 years ago
Jiarui Fang	a00644079e	reorgnize colotensor directory (#1062 ) * reorgnize colotensor directory * polish code	3 years ago
Ziyue Jiang	df9dcbbff6	[Tensor] add hybrid device demo and fix bugs (#1059 )	3 years ago

159 Commits (cf68cc92accd5f0a2538b24e03f1f4f857b69fb9)