ColossalAI

Commit Graph

Author	SHA1	Message	Date
Frank Lee	ad678921db	[fx] patched torch.full for huggingface opt (#1386 )	2022-07-29 17:56:28 +08:00
HELSON	527758b2ae	[hotfix] fix a running error in test_colo_checkpoint.py (#1387 )	2022-07-29 15:58:06 +08:00
Jiarui Fang	f792507ff3	[chunk] add PG check for tensor appending (#1383 )	2022-07-29 13:27:05 +08:00
ver217	8dced41ad0	[zero] zero optim state_dict takes only_rank_0 (#1384 ) * zero optim state_dict takes only_rank_0 * fix unit test	2022-07-29 13:22:50 +08:00
ver217	7d5d628e07	[DDP] test ddp state dict uses more strict threshold (#1382 )	2022-07-28 17:29:04 +08:00
YuliangLiu0306	df54481473	[hotfix] fix some bugs during gpt2 testing (#1379 )	2022-07-28 17:21:07 +08:00
ver217	828b9e5e0d	[hotfix] fix zero optim save/load state dict (#1381 )	2022-07-28 17:19:39 +08:00
HELSON	b6fd165f66	[checkpoint] add kwargs for load_state_dict (#1374 )	2022-07-28 15:56:52 +08:00
github-actions[bot]	50dec605e1	Automated submodule synchronization (#1380 ) Co-authored-by: github-actions <github-actions@github.com>	2022-07-28 11:12:52 +08:00
ver217	83328329dd	[hotfix] fix zero ddp buffer cast (#1376 ) * fix zero ddp buffer cast * fix zero ddp ignore params	2022-07-28 10:54:44 +08:00
ver217	5d5031e946	fix zero ddp state dict (#1378 )	2022-07-28 09:31:42 +08:00
Frank Lee	0c1a16ea5b	[util] standard checkpoint function naming (#1377 )	2022-07-28 09:29:30 +08:00
YuliangLiu0306	52bc2dc271	[fx] update split module pass and add customized policy (#1373 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx]update split module pass and add customized policy	2022-07-27 13:40:54 +08:00
Super Daniel	be229217ce	[fx] add torchaudio test (#1369 ) * [fx]add torchaudio test * [fx]add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test and test patches * Delete ~ * [fx] add patches and patches test * [fx] add patches and patches test * [fx] fix patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] merge upstream * [fx] fix import errors	2022-07-27 11:03:14 +08:00
github-actions[bot]	fb6f085907	Automated submodule synchronization (#1372 ) Co-authored-by: github-actions <github-actions@github.com>	2022-07-27 09:25:03 +08:00
Boyuan Yao	bb640ec728	[fx] Add colotracer compatibility test on torchrec (#1370 )	2022-07-26 17:54:39 +08:00
ver217	c415240db6	[nvme] CPUAdam and HybridAdam support NVMe offload (#1360 ) * impl nvme optimizer * update cpu adam * add unit test * update hybrid adam * update docstr * add TODOs * update CI * fix CI * fix CI * fix CI path * fix CI path * fix CI path * fix install tensornvme * fix CI * fix CI path * fix CI env variables * test CI * test CI * fix CI * fix nvme optim __del__ * fix adam __del__ * fix nvme optim * fix CI env variables * fix nvme optim import * test CI * test CI * fix CI	2022-07-26 17:25:24 +08:00
HELSON	8463290642	[checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368 )	2022-07-26 14:41:53 +08:00
github-actions[bot]	c491c2a948	Automated submodule synchronization (#1364 ) Co-authored-by: github-actions <github-actions@github.com>	2022-07-26 14:31:45 +08:00
YuliangLiu0306	5542816690	[fx]add gpt2 passes for pipeline performance test (#1366 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx]add gpt2 passes for pipeline performance test	2022-07-26 14:31:00 +08:00
HELSON	87775a0682	[colotensor] use cpu memory to store state_dict (#1367 )	2022-07-26 14:13:38 +08:00
HELSON	943a96323e	[hotfix] fix no optimizer in save/load (#1363 )	2022-07-26 10:53:53 +08:00
Frank Lee	cd063ac37f	[fx] added activation checkpoint codegen support for torch < 1.12 (#1359 )	2022-07-25 23:35:31 +08:00
HELSON	4417804129	[unit test] add megatron init test in zero_optim (#1358 )	2022-07-25 11:18:08 +08:00
HELSON	7a065dc9f6	[hotfix] fix megatron_init in test_gpt2.py (#1357 )	2022-07-25 10:28:19 +08:00
Frank Lee	644582eee9	[fx] added activation checkpoint codegen (#1355 )	2022-07-25 09:39:10 +08:00
ver217	38fd8844c0	[docker] add tensornvme in docker (#1354 ) * add tensornvme in docker * fix dockerfile * fix dockerfile	2022-07-21 17:44:00 +08:00
ver217	6b43c789fd	fix zero optim backward_by_grad and save/load (#1353 )	2022-07-21 16:43:58 +08:00
ver217	d068af81a3	[doc] update rst and docstring (#1351 ) * update rst * add zero docstr * fix docstr * remove fx.tracer.meta_patch * fix docstr * fix docstr * update fx rst * fix fx docstr * remove useless rst	2022-07-21 15:54:53 +08:00
Frank Lee	274c1a3b5f	[fx] fixed apex normalization patch exception (#1352 )	2022-07-21 15:29:11 +08:00
ver217	ce470ba37e	[checkpoint] sharded optim save/load grad scaler (#1350 )	2022-07-21 15:21:21 +08:00
Frank Lee	05fae1fd56	[fx] added activation checkpointing annotation (#1349 ) * [fx] added activation checkpointing annotation * polish code * polish code	2022-07-21 11:14:28 +08:00
YuliangLiu0306	051592c64e	[fx] update MetaInforProp pass to process more complex node.meta (#1344 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx] update MetaInforProp pass to process more complex node.meta	2022-07-21 10:57:52 +08:00
HELSON	7a8702c06d	[colotensor] add Tensor.view op and its unit test (#1343 ) [colotensor] add megatron initialization for gpt2	2022-07-21 10:53:15 +08:00
github-actions[bot]	6160a1d6a7	Automated submodule synchronization (#1348 ) Co-authored-by: github-actions <github-actions@github.com>	2022-07-21 10:50:27 +08:00
binmakeswell	92b0b139eb	[NFC] add OPT (#1345 )	2022-07-20 15:02:07 +08:00
YuliangLiu0306	942c8cd1fb	[fx] refactor tracer to trace complete graph (#1342 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx] refactor tracer to trace complete graph * add comments and solve conflicts.	2022-07-20 11:20:38 +08:00
Frank Lee	2cc1175c76	[fx] tested the complete workflow for auto-parallel (#1336 ) * [fx] tested the complete workflow for auto-parallel * polish code * polish code * polish code	2022-07-20 10:45:17 +08:00
YuliangLiu0306	4631fef8a0	[fx]refactor tracer (#1335 )	2022-07-19 15:50:42 +08:00
HELSON	bf5066fba7	[refactor] refactor ColoTensor's unit tests (#1340 )	2022-07-19 15:46:24 +08:00
HELSON	f92c100ddd	[checkpoint] use gather_tensor in checkpoint and update its unit test (#1339 )	2022-07-19 14:15:28 +08:00
Frank Lee	f3ce7b8336	[fx] recovered skipped pipeline tests (#1338 )	2022-07-19 09:49:50 +08:00
ver217	0c51ff2c13	[hotfix] ZeroDDP use new process group (#1333 ) * process group supports getting ranks in group * chunk mgr receives a process group * update unit test * fix unit tests	2022-07-18 14:14:52 +08:00
Frank Lee	11d1436a67	[workflow] update docker build workflow to use proxy (#1334 )	2022-07-18 14:09:41 +08:00
Frank Lee	75abc75c15	[fx] fixed compatiblity issue with torch 1.10 (#1331 )	2022-07-18 11:41:27 +08:00
Frank Lee	069d6fdc84	[workflow] update 8-gpu test to use torch 1.11 (#1332 )	2022-07-18 11:41:13 +08:00
fastalgo	7857fd7616	Update README.md	2022-07-16 19:00:59 -07:00
Frank Lee	169954f87e	[test] removed outdated unit test for meta context (#1329 )	2022-07-15 23:16:23 +08:00
ver217	7a05367101	[hotfix] shared model returns cpu state_dict (#1328 )	2022-07-15 22:11:37 +08:00
Frank Lee	b2475d8c5c	[fx] fixed unit tests for torch 1.12 (#1327 )	2022-07-15 18:22:15 +08:00


1063 Commits (06dccdde449e433d83dc42d7898a2ceed654053c)
