ColossalAI

Commit Graph

Author	SHA1	Message	Date
CsRic	ea961d8fd1	[NFC] polish colossalai/zero/sharded_param/__init__.py code style (#1717 ) Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>	2 years ago
HELSON	1468e4bcfc	[zero] add constant placement policy (#1705 ) * fixes memory leak when paramter is in fp16 in ZeroDDP init. * bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release. * adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.	2 years ago
HELSON	b28991dd0a	[feature] A new ZeRO implementation (#1644 )	2 years ago
Jiarui Fang	c5d39215f6	Revert "[feature] new zero implementation (#1623 )" (#1643 ) This reverts commit `5be118f405`.	2 years ago
HELSON	5be118f405	[feature] new zero implementation (#1623 )	2 years ago
HELSON	f7f2248771	[moe] fix MoE bugs (#1628 ) * remove forced FP32 modules * correct no_shard-contexts' positions	2 years ago
ver217	c9e8ce67b8	fix move fp32 shards (#1604 )	2 years ago
Fazzie-Maqianli	06dccdde44	[NFC] polish colossalai/zero/sharded_model/reduce_scatter.py code style (#1554 )	2 years ago
ver217	821c6172e2	[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442 )	2 years ago
ver217	6df3e19be9	[hotfix] zero optim prevents calling inner optim.zero_grad (#1422 )	2 years ago
ver217	8dced41ad0	[zero] zero optim state_dict takes only_rank_0 (#1384 ) * zero optim state_dict takes only_rank_0 * fix unit test	2 years ago
ver217	828b9e5e0d	[hotfix] fix zero optim save/load state dict (#1381 )	2 years ago
ver217	6b43c789fd	fix zero optim backward_by_grad and save/load (#1353 )	2 years ago
ver217	d068af81a3	[doc] update rst and docstring (#1351 ) * update rst * add zero docstr * fix docstr * remove fx.tracer.meta_patch * fix docstr * fix docstr * update fx rst * fix fx docstr * remove useless rst	2 years ago
ver217	ce470ba37e	[checkpoint] sharded optim save/load grad scaler (#1350 )	2 years ago
ver217	7a05367101	[hotfix] shared model returns cpu state_dict (#1328 )	2 years ago
Jiarui Fang	4165eabb1e	[hotfix] remove potiential circle import (#1307 ) * make it faster * [hotfix] remove circle import	2 years ago
ver217	a45ddf2d5f	[hotfix] fix sharded optim step and clip_grad_norm (#1226 )	2 years ago
Jiarui Fang	a444633d13	warmup ratio configration (#1192 )	2 years ago
Jiarui Fang	372f791444	[refactor] move chunk and chunkmgr to directory gemini (#1182 )	2 years ago
ver217	9e1daa63d2	[zero] sharded optim supports loading local state dict (#1170 ) * sharded optim supports loading local state dict * polish code * add unit test	2 years ago
ver217	561e90493f	[zero] zero optim supports loading local state dict (#1171 ) * zero optim supports loading local state dict * polish code * add unit test	2 years ago
ver217	8106d7b8c7	[ddp] refactor ColoDDP and ZeroDDP (#1146 ) * ColoDDP supports overwriting default process group * rename ColoDDPV2 to ZeroDDP * add docstr for ZeroDDP * polish docstr	2 years ago
ver217	6690a61b4d	[hotfix] prevent nested ZeRO (#1140 )	2 years ago
Frank Lee	15aab1476e	[zero] avoid zero hook spam by changing log to debug level (#1137 )	2 years ago
ver217	a1a7899cae	[hotfix] fix zero init ctx numel (#1128 )	2 years ago
ver217	f0a954f16d	[ddp] add set_params_to_ignore for ColoDDP (#1122 ) * add set_params_to_ignore for ColoDDP * polish code * fix zero hook v2 * add unit test * polish docstr	2 years ago
Frank Lee	14e5b11d7f	[zero] fixed api consistency (#1098 )	3 years ago
Frank Lee	cb18922c47	[doc] added documentation to chunk and chunk manager (#1094 ) * [doc] added documentation to chunk and chunk manager * polish code * polish code * polish code	3 years ago
ver217	1f894e033f	[gemini] zero supports gemini (#1093 ) * add placement policy * add gemini mgr * update mem stats collector * update zero * update zero optim * fix bugs * zero optim monitor os * polish unit test * polish unit test * add assert	3 years ago
ver217	be01db37c8	[tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077 ) * polish chunk manager * polish unit test * impl add_extern_static_tensor for chunk mgr * add mem stats collector v2 * polish code * polish unit test * polish code * polish get chunks	3 years ago
ver217	c5cd3b0f35	[zero] zero optim copy chunk rather than copy tensor (#1070 )	3 years ago
Jiarui Fang	49832b2344	[refactory] add nn.parallel module (#1068 )	3 years ago
ver217	e3fde4ee6b	fix import error in sharded model v2 (#1053 )	3 years ago
ver217	51b9a49655	[zero] add zero optimizer for ColoTensor (#1046 ) * add zero optimizer * torch ok * unit test ok * polish code * fix bugs * polish unit test * polish zero optim * polish colo ddp v2 * refactor folder structure * add comment * polish unit test * polish zero optim * polish unit test	3 years ago
ver217	9492a561c3	[tensor] ColoTensor supports ZeRo (#1015 ) * impl chunk manager * impl param op hook * add reduce_chunk * add zero hook v2 * add zero dp * fix TensorInfo * impl load balancing when using zero without chunk * fix zero hook * polish chunk * fix bugs * ddp ok * zero ok * polish code * fix bugs about load balancing * polish code * polish code * add ene-to-end test * polish code * polish code * polish code * fix typo * add test_chunk * fix bugs * fix bugs * polish code	3 years ago
ver217	7cfd6c827e	[zero] add load_state_dict for sharded model (#894 ) * add load_state_dict for sharded model * fix bug * fix bug * fix ckpt dtype and device * support load state dict in zero init ctx * fix bugs	3 years ago
ver217	c4d903e64a	[gemini] accelerate adjust_layout() (#878 ) * add lru cache * polish code * update unit test * fix sharded optim	3 years ago
HELSON	425b4a96b8	[gemini] polish stateful_tensor_mgr (#876 )	3 years ago
ver217	d7e0303d1e	[zero] use GeminiMemoryManager when sampling model data (#850 )	3 years ago
ver217	0f7ed8c192	fix _post_init_method of zero init ctx (#847 )	3 years ago
HELSON	e5ea3fdeef	[gemini] add GeminiMemoryManger (#832 ) * refactor StatefulTensor, tensor utilities * add unitest for GeminiMemoryManager	3 years ago
Jiarui Fang	595bedf767	revert zero tensors back (#829 )	3 years ago
Jiarui Fang	294a6060d0	[tensor] ZeRO use ColoTensor as the base class. (#828 ) * [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. * [tensor] ZeRO use ColoTensor as the base class. * polish	3 years ago
Jiarui Fang	eb1b89908c	[refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824 )	3 years ago
Jiarui Fang	3ddbd1bce1	[gemini] collect cpu-gpu moving volume in each iteration (#813 )	3 years ago
Jiarui Fang	61c20b44bc	[log] local throughput metrics (#811 ) * Revert "[zero] add ZeroTensorShardStrategy (#793)" This reverts commit `88759e289e`. * [gemini] set cpu memory capacity * [log] local throughput collecting * polish * polish * polish * polish code * polish	3 years ago
ver217	dd92b90a68	[DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808 ) * init fp16 param directly * polish code	3 years ago
Jiarui Fang	e761ad2cd7	Revert "[zero] add ZeroTensorShardStrategy (#793 )" (#806 )	3 years ago
HELSON	88759e289e	[zero] add ZeroTensorShardStrategy (#793 )	3 years ago

178 Commits (a4ed5b0d0d926f9e3f84711799e21db795a339e9)