ColossalAI

Commit Graph

Author	SHA1	Message	Date
YH	a22407cc02	[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. (#3173 ) * Fix confusing variable name in zero opt * Apply lint * Fix util func * Fix minor util func * Fix zero param optimizer name	2023-04-27 18:43:14 +08:00
Hongxin Liu	50793b35f4	[gemini] accelerate inference (#3641 ) * [gemini] support don't scatter after inference * [chat] update colossalai strategy * [chat] fix opt benchmark * [chat] update opt benchmark * [gemini] optimize inference * [test] add gemini inference test * [chat] fix unit test ci * [chat] fix ci * [chat] fix ci * [chat] skip checkpoint test	2023-04-26 16:32:40 +08:00
Hongxin Liu	4b3240cb59	[booster] add low level zero plugin (#3594 ) * [booster] add low level zero plugin * [booster] fix gemini plugin test * [booster] fix precision * [booster] add low level zero plugin test * [test] fix booster plugin test oom * [test] fix booster plugin test oom * [test] fix googlenet and inception output trans * [test] fix diffuser clip vision model * [test] fix torchaudio_wav2vec2_base * [test] fix low level zero plugin test	2023-04-26 14:37:25 +08:00
digger-yu	b9a8dff7e5	[doc] Fix typo under colossalai and doc(#3618 ) * Fixed several spelling errors under colossalai * Fix the spelling error in colossalai and docs directory * Cautious Changed the spelling error under the example folder * Update runtime_preparation_pass.py revert autograft to autograd * Update search_chunk.py utile to until * Update check_installation.py change misteach to mismatch in line 91 * Update 1D_tensor_parallel.md revert to perceptron * Update 2D_tensor_parallel.md revert to perceptron in line 73 * Update 2p5D_tensor_parallel.md revert to perceptron in line 71 * Update 3D_tensor_parallel.md revert to perceptron in line 80 * Update README.md revert to resnet in line 42 * Update reorder_graph.py revert to indice in line 7 * Update p2p.py revert to megatron in line 94 * Update initialize.py revert to torchrun in line 198 * Update routers.py change to detailed in line 63 * Update routers.py change to detailed in line 146 * Update README.md revert random number in line 402	2023-04-26 11:38:43 +08:00
Hongxin Liu	12eff9eb4c	[gemini] state dict supports fp16 (#3590 ) * [gemini] save state dict support fp16 * [gemini] save state dict shard support fp16 * [gemini] fix state dict * [gemini] fix state dict	2023-04-19 11:01:48 +08:00
Hongxin Liu	f313babd11	[gemini] support save state dict in shards (#3581 ) * [gemini] support state dict shard * [gemini] add test state dict shard * [gemini] polish docstr * [gemini] fix merge * [gemini] polish code	2023-04-17 17:11:09 +08:00
YH	d329c294ec	Add docstr for zero3 chunk search utils (#3572 )	2023-04-17 12:44:17 +08:00
Hongxin Liu	173dad0562	[misc] add verbose arg for zero and op builder (#3552 ) * [misc] add print verbose * [gemini] add print verbose * [zero] add print verbose for low level * [misc] add print verbose for op builder	2023-04-17 11:25:35 +08:00
Hongxin Liu	152239bbfa	[gemini] gemini supports lazy init (#3379 ) * [gemini] fix nvme optimizer init * [gemini] gemini supports lazy init * [gemini] add init example * [gemini] add fool model * [zero] update gemini ddp * [zero] update init example * add chunk method * add chunk method * [lazyinit] fix lazy tensor tolist * [gemini] fix buffer materialization * [misc] remove useless file * [booster] update gemini plugin * [test] update gemini plugin test * [test] fix gemini plugin test * [gemini] fix import * [gemini] fix import * [lazyinit] use new metatensor * [lazyinit] use new metatensor * [lazyinit] fix __set__ method	2023-04-12 16:03:25 +08:00
YH	bcf0cbcbe7	[doc] Add docs for clip args in zero optim (#3504 )	2023-04-10 11:11:28 +08:00
ver217	573af84184	[example] update examples related to zero/gemini (#3431 ) * [zero] update legacy import * [zero] update examples * [example] fix opt tutorial * [example] fix opt tutorial * [example] fix opt tutorial * [example] fix opt tutorial * [example] fix import	2023-04-04 17:32:51 +08:00
ver217	26b7aac0be	[zero] reorganize zero/gemini folder structure (#3424 ) * [zero] refactor low-level zero folder structure * [zero] fix legacy zero import path * [zero] fix legacy zero import path * [zero] remove useless import * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor legacy zero import path * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor legacy zero import path * [zero] fix test import path * [zero] fix test * [zero] fix circular import * [zero] update import	2023-04-04 13:48:16 +08:00
YH	80aed29cd3	[zero] Refactor ZeroContextConfig class using dataclass (#3186 )	2023-03-21 12:36:47 +08:00
YH	9d644ff09f	Fix docstr for zero statedict (#3185 )	2023-03-21 11:48:21 +08:00
ver217	823f3b9cf4	[doc] add deepspeed citation and copyright (#2996 ) * [doc] add deepspeed citation and copyright * [doc] add deepspeed citation and copyright * [doc] add deepspeed citation and copyright	2023-03-04 20:08:11 +08:00
YH	7b13f7db18	[zero] trivial zero optimizer refactoring (#2869 ) * Fix mionr grad store interface * Apply lint	2023-02-27 14:04:53 +08:00
Boyuan Yao	8e3f66a0d1	[zero] fix wrong import (#2777 )	2023-02-17 10:26:07 +08:00
Nikita Shulga	01066152f1	Don't use `torch._six` (#2775 ) * Don't use `torch._six` This is a private API which is gone after https://github.com/pytorch/pytorch/pull/94709 * Update common.py	2023-02-17 09:22:45 +08:00
YH	ae86a29e23	Refact method of grad store (#2687 )	2023-02-15 22:27:58 +08:00
HELSON	df4f020ee3	[zero1&2] only append parameters with gradients (#2681 )	2023-02-13 18:00:16 +08:00
HELSON	b528eea0f0	[zero] add zero wrappers (#2523 ) * [zero] add zero wrappers * change names * add wrapper functions to init	2023-01-29 17:52:58 +08:00
HELSON	077a5cdde4	[zero] fix gradient clipping in hybrid parallelism (#2521 ) * [zero] fix gradient clipping in hybrid parallelism * [testing] change model name to avoid pytest warning * [hotfix] fix unit testing	2023-01-29 15:09:57 +08:00
HELSON	d565a24849	[zero] add unit testings for hybrid parallelism (#2486 )	2023-01-18 10:36:10 +08:00
HELSON	a5dc4253c6	[zero] polish low level optimizer (#2473 )	2023-01-13 14:56:17 +08:00
Jiarui Fang	867c8c2d3a	[zero] low level optim supports ProcessGroup (#2464 )	2023-01-13 10:05:58 +08:00
HELSON	7829aa094e	[ddp] add is_ddp_ignored (#2434 ) [ddp] rename to is_ddp_ignored	2023-01-11 12:22:45 +08:00
HELSON	62c38e3330	[zero] polish low level zero optimizer (#2275 )	2023-01-03 17:22:34 +08:00
HELSON	a7d95b7024	[example] add zero1, zero2 example in GPT examples (#2146 ) * [example] add zero1 and zero2 for GPT * update readme in gpt example * polish code * change init value * update readme	2022-12-20 14:30:27 +08:00
Jiarui Fang	c89c66a858	[Gemini] update API of the chunkmemstatscollector. (#2129 )	2022-12-14 00:47:06 +08:00
Jiarui Fang	2938edf446	[Gemini] update the non model data record method in runtime memory tracer (#2128 )	2022-12-13 17:11:31 +08:00
Jiarui Fang	e99edfcb51	[NFC] polish comments for Chunk class (#2116 )	2022-12-12 15:39:31 +08:00
Jiarui Fang	33f4412102	[Gemini] use MemStats to store the tracing data. Seperate it from Collector. (#2084 )	2022-12-06 16:43:06 +08:00
Jiarui Fang	b3b89865e2	[Gemini] ParamOpHook -> ColoParamOpHook (#2080 )	2022-12-05 17:11:06 +08:00
HELSON	a1ce02d740	[zero] test gradient accumulation (#1964 ) * [zero] fix memory leak for zero2 * [zero] test gradient accumulation * [zero] remove grad clip test	2022-11-29 13:00:30 +08:00
Jiarui Fang	cc0ed7cf33	[Gemini] ZeROHookV2 -> GeminiZeROHook (#1972 )	2022-11-17 14:43:49 +08:00
Jiarui Fang	c4739a725a	[Gemini] polish memstats collector (#1962 )	2022-11-16 15:45:57 +08:00
Jiarui Fang	f7e276fa71	[Gemini] add GeminiAdamOptimizer (#1960 )	2022-11-16 14:44:28 +08:00
HELSON	7066dfbf82	[zero] fix memory leak for zero2 (#1955 )	2022-11-16 11:43:24 +08:00
HELSON	6e51d296f0	[zero] migrate zero1&2 (#1878 ) * add zero1&2 optimizer * rename test ditectory * rename test files * change tolerance in test	2022-11-11 09:26:40 +08:00
Zihao	20e255d4e8	MemStatsCollectorStatic (#1765 )	2022-11-07 16:49:03 +08:00
HELSON	c6a1a62636	[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786 ) * [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 * [zero] add cpu shard init * [zero] add tiny example test * [colo_tensor] fix bugs for torch-1.11	2022-11-02 16:11:34 +08:00
CsRic	ea961d8fd1	[NFC] polish colossalai/zero/sharded_param/__init__.py code style (#1717 ) Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>	2022-10-19 12:20:51 +08:00
HELSON	1468e4bcfc	[zero] add constant placement policy (#1705 ) * fixes memory leak when paramter is in fp16 in ZeroDDP init. * bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release. * adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.	2022-10-14 17:53:16 +08:00
HELSON	b28991dd0a	[feature] A new ZeRO implementation (#1644 )	2022-10-09 09:18:51 +08:00
Jiarui Fang	c5d39215f6	Revert "[feature] new zero implementation (#1623 )" (#1643 ) This reverts commit `5be118f405`.	2022-09-26 10:06:03 +08:00
HELSON	5be118f405	[feature] new zero implementation (#1623 )	2022-09-24 19:58:18 +08:00
HELSON	f7f2248771	[moe] fix MoE bugs (#1628 ) * remove forced FP32 modules * correct no_shard-contexts' positions	2022-09-22 13:56:30 +08:00
ver217	c9e8ce67b8	fix move fp32 shards (#1604 )	2022-09-16 17:33:16 +08:00
Fazzie-Maqianli	06dccdde44	[NFC] polish colossalai/zero/sharded_model/reduce_scatter.py code style (#1554 )	2022-09-08 22:11:04 +08:00
ver217	821c6172e2	[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442 )	2022-08-11 22:58:58 +08:00
ver217	6df3e19be9	[hotfix] zero optim prevents calling inner optim.zero_grad (#1422 )	2022-08-09 16:08:12 +08:00
ver217	8dced41ad0	[zero] zero optim state_dict takes only_rank_0 (#1384 ) * zero optim state_dict takes only_rank_0 * fix unit test	2022-07-29 13:22:50 +08:00
ver217	828b9e5e0d	[hotfix] fix zero optim save/load state dict (#1381 )	2022-07-28 17:19:39 +08:00
ver217	6b43c789fd	fix zero optim backward_by_grad and save/load (#1353 )	2022-07-21 16:43:58 +08:00
ver217	d068af81a3	[doc] update rst and docstring (#1351 ) * update rst * add zero docstr * fix docstr * remove fx.tracer.meta_patch * fix docstr * fix docstr * update fx rst * fix fx docstr * remove useless rst	2022-07-21 15:54:53 +08:00
ver217	ce470ba37e	[checkpoint] sharded optim save/load grad scaler (#1350 )	2022-07-21 15:21:21 +08:00
ver217	7a05367101	[hotfix] shared model returns cpu state_dict (#1328 )	2022-07-15 22:11:37 +08:00
Jiarui Fang	4165eabb1e	[hotfix] remove potiential circle import (#1307 ) * make it faster * [hotfix] remove circle import	2022-07-14 13:44:26 +08:00
ver217	a45ddf2d5f	[hotfix] fix sharded optim step and clip_grad_norm (#1226 )	2022-07-08 13:34:48 +08:00
Jiarui Fang	a444633d13	warmup ratio configration (#1192 )	2022-06-30 15:23:50 +08:00
Jiarui Fang	372f791444	[refactor] move chunk and chunkmgr to directory gemini (#1182 )	2022-06-29 13:31:02 +08:00
ver217	9e1daa63d2	[zero] sharded optim supports loading local state dict (#1170 ) * sharded optim supports loading local state dict * polish code * add unit test	2022-06-24 18:05:16 +08:00
ver217	561e90493f	[zero] zero optim supports loading local state dict (#1171 ) * zero optim supports loading local state dict * polish code * add unit test	2022-06-24 17:25:57 +08:00
ver217	8106d7b8c7	[ddp] refactor ColoDDP and ZeroDDP (#1146 ) * ColoDDP supports overwriting default process group * rename ColoDDPV2 to ZeroDDP * add docstr for ZeroDDP * polish docstr	2022-06-21 16:35:23 +08:00
ver217	6690a61b4d	[hotfix] prevent nested ZeRO (#1140 )	2022-06-21 11:33:53 +08:00
Frank Lee	15aab1476e	[zero] avoid zero hook spam by changing log to debug level (#1137 )	2022-06-21 10:44:01 +08:00
ver217	a1a7899cae	[hotfix] fix zero init ctx numel (#1128 )	2022-06-16 17:17:27 +08:00
ver217	f0a954f16d	[ddp] add set_params_to_ignore for ColoDDP (#1122 ) * add set_params_to_ignore for ColoDDP * polish code * fix zero hook v2 * add unit test * polish docstr	2022-06-16 12:54:46 +08:00
Frank Lee	14e5b11d7f	[zero] fixed api consistency (#1098 )	2022-06-10 16:59:59 +08:00
Frank Lee	cb18922c47	[doc] added documentation to chunk and chunk manager (#1094 ) * [doc] added documentation to chunk and chunk manager * polish code * polish code * polish code	2022-06-10 15:33:06 +08:00
ver217	1f894e033f	[gemini] zero supports gemini (#1093 ) * add placement policy * add gemini mgr * update mem stats collector * update zero * update zero optim * fix bugs * zero optim monitor os * polish unit test * polish unit test * add assert	2022-06-10 14:48:28 +08:00
ver217	be01db37c8	[tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077 ) * polish chunk manager * polish unit test * impl add_extern_static_tensor for chunk mgr * add mem stats collector v2 * polish code * polish unit test * polish code * polish get chunks	2022-06-09 20:56:34 +08:00
ver217	c5cd3b0f35	[zero] zero optim copy chunk rather than copy tensor (#1070 )	2022-06-07 10:30:46 +08:00
Jiarui Fang	49832b2344	[refactory] add nn.parallel module (#1068 )	2022-06-06 15:34:41 +08:00
ver217	e3fde4ee6b	fix import error in sharded model v2 (#1053 )	2022-06-02 13:48:22 +08:00
ver217	51b9a49655	[zero] add zero optimizer for ColoTensor (#1046 ) * add zero optimizer * torch ok * unit test ok * polish code * fix bugs * polish unit test * polish zero optim * polish colo ddp v2 * refactor folder structure * add comment * polish unit test * polish zero optim * polish unit test	2022-06-02 12:13:15 +08:00
ver217	9492a561c3	[tensor] ColoTensor supports ZeRo (#1015 ) * impl chunk manager * impl param op hook * add reduce_chunk * add zero hook v2 * add zero dp * fix TensorInfo * impl load balancing when using zero without chunk * fix zero hook * polish chunk * fix bugs * ddp ok * zero ok * polish code * fix bugs about load balancing * polish code * polish code * add ene-to-end test * polish code * polish code * polish code * fix typo * add test_chunk * fix bugs * fix bugs * polish code	2022-05-31 12:00:12 +08:00
ver217	7cfd6c827e	[zero] add load_state_dict for sharded model (#894 ) * add load_state_dict for sharded model * fix bug * fix bug * fix ckpt dtype and device * support load state dict in zero init ctx * fix bugs	2022-05-27 10:25:08 +08:00
ver217	c4d903e64a	[gemini] accelerate adjust_layout() (#878 ) * add lru cache * polish code * update unit test * fix sharded optim	2022-04-26 18:08:31 +08:00
HELSON	425b4a96b8	[gemini] polish stateful_tensor_mgr (#876 )	2022-04-26 15:05:03 +08:00
ver217	d7e0303d1e	[zero] use GeminiMemoryManager when sampling model data (#850 )	2022-04-24 17:17:22 +08:00
ver217	0f7ed8c192	fix _post_init_method of zero init ctx (#847 )	2022-04-24 14:16:50 +08:00
HELSON	e5ea3fdeef	[gemini] add GeminiMemoryManger (#832 ) * refactor StatefulTensor, tensor utilities * add unitest for GeminiMemoryManager	2022-04-24 13:08:48 +08:00
Jiarui Fang	595bedf767	revert zero tensors back (#829 )	2022-04-22 12:12:35 +08:00
Jiarui Fang	294a6060d0	[tensor] ZeRO use ColoTensor as the base class. (#828 ) * [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. * [tensor] ZeRO use ColoTensor as the base class. * polish	2022-04-22 12:00:48 +08:00
Jiarui Fang	eb1b89908c	[refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824 )	2022-04-21 16:03:18 +08:00
Jiarui Fang	3ddbd1bce1	[gemini] collect cpu-gpu moving volume in each iteration (#813 )	2022-04-20 11:29:48 +08:00
Jiarui Fang	61c20b44bc	[log] local throughput metrics (#811 ) * Revert "[zero] add ZeroTensorShardStrategy (#793)" This reverts commit `88759e289e`. * [gemini] set cpu memory capacity * [log] local throughput collecting * polish * polish * polish * polish code * polish	2022-04-20 10:05:39 +08:00
ver217	dd92b90a68	[DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808 ) * init fp16 param directly * polish code	2022-04-19 16:16:48 +08:00
Jiarui Fang	e761ad2cd7	Revert "[zero] add ZeroTensorShardStrategy (#793 )" (#806 )	2022-04-19 14:40:02 +08:00
HELSON	88759e289e	[zero] add ZeroTensorShardStrategy (#793 )	2022-04-19 14:32:45 +08:00
Jiarui Fang	4d9332b4c5	[refactor] moving memtracer to gemini (#801 )	2022-04-19 10:13:08 +08:00
Jiarui Fang	8711c706f4	[hotfix] fix grad offload when enabling reuse_fp16_shard	2022-04-18 14:58:21 +08:00
ver217	f1fa1a675f	fix grad offload when enabling reuse_fp16_shard	2022-04-18 14:07:39 +08:00
HELSON	4c4388c46e	[hotfix] fix memory leak in zero (#781 )	2022-04-18 13:57:03 +08:00
HELSON	a65cbb7e4e	[zero] refactor shard and gather operation (#773 )	2022-04-15 14:41:31 +08:00
ver217	6e553748a7	polish sharded optim docstr and warning (#770 )	2022-04-14 21:03:59 +08:00
Jiarui Fang	10ef8afdd2	[gemini] init genimi individual directory (#754 )	2022-04-14 16:40:26 +08:00
ver217	dcca614eee	[hotfix] fix test_stateful_tensor_mgr (#762 )	2022-04-14 15:50:09 +08:00
ver217	a93a7d7364	[hotfix] fix reuse_fp16_shard of sharded model (#756 ) * fix reuse_fp16_shard * disable test stm * polish code	2022-04-14 14:56:46 +08:00

269 Commits (af952673f758c71126b27de8b32bdf5df8f74b69)