YuliangLiu0306
33f0744d51
[tensor] add shape consistency feature to support auto spec transform ( #1418 )
...
* [tensor] add shape consistency feature to support auto sharding spec transform.
* [tensor] remove unused argument in simulator, add doc string for target pair.
2 years ago
HELSON
4fb3c52cf0
[zero] add unit test for AgChunk's append, close, access ( #1423 )
2 years ago
HELSON
c577ed016e
[zero] add AgChunk ( #1417 )
2 years ago
Jiarui Fang
d209aff684
Add FreqAwareEmbeddingBag ( #1421 )
2 years ago
ver217
6df3e19be9
[hotfix] zero optim prevents calling inner optim.zero_grad ( #1422 )
2 years ago
Jiarui Fang
504419d261
[FAW] add cache manager for the cached embedding ( #1419 )
2 years ago
Kirigaya Kazuto
44fd3c83ab
[communication] add p2p_v2.py to support communication with List[Any] ( #1407 )
...
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py (support communication with List[Any]) | pass test
* [communication] add p2p_v2.py to support communication with List[Any]
* Delete _pipeline_schedule_v2.py
* Delete test_cifar_with_data_pipeline_tensor_v2.py
* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule
* [communication] remove print code
2 years ago
github-actions[bot]
1590f59908
Automated submodule synchronization ( #1415 )
...
Co-authored-by: github-actions <github-actions@github.com>
2 years ago
github-actions[bot]
9b442ecdc3
Automated submodule synchronization ( #1404 )
...
Co-authored-by: github-actions <github-actions@github.com>
2 years ago
YuliangLiu0306
7c96055c68
[tensor] build sharding spec to replace distspec in the future ( #1405 )
2 years ago
ver217
12b4887097
[hotfix] fix CPUAdam kernel nullptr ( #1410 )
2 years ago
github-actions[bot]
1e5eb0874c
Automated submodule synchronization ( #1396 )
...
Co-authored-by: github-actions <github-actions@github.com>
2 years ago
YuliangLiu0306
0442f940f0
[device] add DeviceMesh class to support logical device layout ( #1394 )
...
* [device] add DeviceMesh class to support logical device layout
* polish code
* add doc string
2 years ago
ver217
04c9a86af8
[zero] ZeroDDP supports controlling outputs' dtype ( #1399 )
2 years ago
HELSON
4e98e938ce
[zero] alleviate memory usage in ZeroDDP state_dict ( #1398 )
2 years ago
Jiarui Fang
4f5f8f77d1
update nvme on readme ( #1397 )
2 years ago
ver217
56b8863b87
[zero] chunk manager allows filtering ex-large params ( #1393 )
2 years ago
Frank Lee
adf5054ff8
[fx] fixed torchaudio conformer tracing ( #1392 )
2 years ago
Frank Lee
7d6293927f
[fx] patched torch.max and data movement operator ( #1391 )
...
* [fx] patched torch.max and data movement operator
* polish code
2 years ago
fastalgo
db89600cf2
Update README.md
2 years ago
Frank Lee
89e60d1505
[fx] fixed indentation error in checkpointing codegen ( #1385 )
2 years ago
HELSON
c7221cb2d4
[hotfix] adapt ProcessGroup and Optimizer to ColoTensor ( #1388 )
2 years ago
Frank Lee
ad678921db
[fx] patched torch.full for huggingface opt ( #1386 )
2 years ago
HELSON
527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py ( #1387 )
2 years ago
Jiarui Fang
f792507ff3
[chunk] add PG check for tensor appending ( #1383 )
2 years ago
ver217
8dced41ad0
[zero] zero optim state_dict takes only_rank_0 ( #1384 )
...
* zero optim state_dict takes only_rank_0
* fix unit test
2 years ago
ver217
7d5d628e07
[DDP] test ddp state dict uses more strict threshold ( #1382 )
2 years ago
YuliangLiu0306
df54481473
[hotfix] fix some bugs during gpt2 testing ( #1379 )
2 years ago
ver217
828b9e5e0d
[hotfix] fix zero optim save/load state dict ( #1381 )
2 years ago
HELSON
b6fd165f66
[checkpoint] add kwargs for load_state_dict ( #1374 )
2 years ago
github-actions[bot]
50dec605e1
Automated submodule synchronization ( #1380 )
...
Co-authored-by: github-actions <github-actions@github.com>
2 years ago
ver217
83328329dd
[hotfix] fix zero ddp buffer cast ( #1376 )
...
* fix zero ddp buffer cast
* fix zero ddp ignore params
2 years ago
ver217
5d5031e946
fix zero ddp state dict ( #1378 )
2 years ago
Frank Lee
0c1a16ea5b
[util] standard checkpoint function naming ( #1377 )
2 years ago
YuliangLiu0306
52bc2dc271
[fx] update split module pass and add customized policy ( #1373 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx] update split module pass and add customized policy
2 years ago
Super Daniel
be229217ce
[fx] add torchaudio test ( #1369 )
...
* [fx] add torchaudio test
* [fx] add torchaudio test and test patches
* Delete ~
* [fx] add patches and patches test
* [fx] fix patches
* [fx] fix rnn patches
* [fx] merge upstream
* [fx] fix import errors
2 years ago
github-actions[bot]
fb6f085907
Automated submodule synchronization ( #1372 )
...
Co-authored-by: github-actions <github-actions@github.com>
2 years ago
Boyuan Yao
bb640ec728
[fx] Add colotracer compatibility test on torchrec ( #1370 )
2 years ago
ver217
c415240db6
[nvme] CPUAdam and HybridAdam support NVMe offload ( #1360 )
...
* impl nvme optimizer
* update cpu adam
* add unit test
* update hybrid adam
* update docstr
* add TODOs
* update CI
* fix CI
* fix CI path
* fix install tensornvme
* fix CI
* fix CI path
* fix CI env variables
* test CI
* fix CI
* fix nvme optim __del__
* fix adam __del__
* fix nvme optim
* fix CI env variables
* fix nvme optim import
* test CI
* fix CI
2 years ago
HELSON
8463290642
[checkpoint] use args, kwargs in save_checkpoint, load_checkpoint ( #1368 )
2 years ago
github-actions[bot]
c491c2a948
Automated submodule synchronization ( #1364 )
...
Co-authored-by: github-actions <github-actions@github.com>
2 years ago
YuliangLiu0306
5542816690
[fx] add gpt2 passes for pipeline performance test ( #1366 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx] add gpt2 passes for pipeline performance test
2 years ago
HELSON
87775a0682
[colotensor] use cpu memory to store state_dict ( #1367 )
2 years ago
HELSON
943a96323e
[hotfix] fix no optimizer in save/load ( #1363 )
2 years ago
Frank Lee
cd063ac37f
[fx] added activation checkpoint codegen support for torch < 1.12 ( #1359 )
2 years ago
HELSON
4417804129
[unit test] add megatron init test in zero_optim ( #1358 )
2 years ago
HELSON
7a065dc9f6
[hotfix] fix megatron_init in test_gpt2.py ( #1357 )
2 years ago
Frank Lee
644582eee9
[fx] added activation checkpoint codegen ( #1355 )
2 years ago
ver217
38fd8844c0
[docker] add tensornvme in docker ( #1354 )
...
* add tensornvme in docker
* fix dockerfile
* fix dockerfile
2 years ago
ver217
6b43c789fd
fix zero optim backward_by_grad and save/load ( #1353 )
2 years ago