Commit Graph

654 Commits (0442f940f021d024ca390485f0cdf0856fe6cb36)

Author | SHA1 | Message | Date
YuliangLiu0306 | 0442f940f0 | [device] add DeviceMesh class to support logical device layout (#1394) | 2 years ago
ver217 | 04c9a86af8 | [zero] ZeroDDP supports controlling outputs' dtype (#1399) | 2 years ago
HELSON | 4e98e938ce | [zero] alleviate memory usage in ZeRODDP state_dict (#1398) | 2 years ago
ver217 | 56b8863b87 | [zero] chunk manager allows filtering ex-large params (#1393) | 2 years ago
Frank Lee | 7d6293927f | [fx] patched torch.max and data movement operator (#1391) | 2 years ago
Frank Lee | 89e60d1505 | [fx] fixed indentation error in checkpointing codegen (#1385) | 2 years ago
HELSON | c7221cb2d4 | [hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388) | 2 years ago
Frank Lee | ad678921db | [fx] patched torch.full for huggingface opt (#1386) | 2 years ago
HELSON | 527758b2ae | [hotfix] fix a running error in test_colo_checkpoint.py (#1387) | 2 years ago
Jiarui Fang | f792507ff3 | [chunk] add PG check for tensor appending (#1383) | 2 years ago
ver217 | 8dced41ad0 | [zero] zero optim state_dict takes only_rank_0 (#1384) | 2 years ago
YuliangLiu0306 | df54481473 | [hotfix] fix some bugs during gpt2 testing (#1379) | 2 years ago
ver217 | 828b9e5e0d | [hotfix] fix zero optim save/load state dict (#1381) | 2 years ago
HELSON | b6fd165f66 | [checkpoint] add kwargs for load_state_dict (#1374) | 2 years ago
ver217 | 83328329dd | [hotfix] fix zero ddp buffer cast (#1376) | 2 years ago
ver217 | 5d5031e946 | fix zero ddp state dict (#1378) | 2 years ago
Frank Lee | 0c1a16ea5b | [util] standard checkpoint function naming (#1377) | 2 years ago
YuliangLiu0306 | 52bc2dc271 | [fx] update split module pass and add customized policy (#1373) | 2 years ago
Super Daniel | be229217ce | [fx] add torchaudio test (#1369) | 2 years ago
ver217 | c415240db6 | [nvme] CPUAdam and HybridAdam support NVMe offload (#1360) | 2 years ago
HELSON | 8463290642 | [checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368) | 2 years ago
YuliangLiu0306 | 5542816690 | [fx]add gpt2 passes for pipeline performance test (#1366) | 2 years ago
HELSON | 87775a0682 | [colotensor] use cpu memory to store state_dict (#1367) | 2 years ago
HELSON | 943a96323e | [hotfix] fix no optimizer in save/load (#1363) | 2 years ago
Frank Lee | cd063ac37f | [fx] added activation checkpoint codegen support for torch < 1.12 (#1359) | 2 years ago
Frank Lee | 644582eee9 | [fx] added activation checkpoint codegen (#1355) | 2 years ago
ver217 | 6b43c789fd | fix zero optim backward_by_grad and save/load (#1353) | 2 years ago
ver217 | d068af81a3 | [doc] update rst and docstring (#1351) | 2 years ago
Frank Lee | 274c1a3b5f | [fx] fixed apex normalization patch exception (#1352) | 2 years ago
ver217 | ce470ba37e | [checkpoint] sharded optim save/load grad scaler (#1350) | 2 years ago
Frank Lee | 05fae1fd56 | [fx] added activation checkpointing annotation (#1349) | 2 years ago
YuliangLiu0306 | 051592c64e | [fx] update MetaInforProp pass to process more complex node.meta (#1344) | 2 years ago
HELSON | 7a8702c06d | [colotensor] add Tensor.view op and its unit test (#1343) | 2 years ago
YuliangLiu0306 | 942c8cd1fb | [fx] refactor tracer to trace complete graph (#1342) | 2 years ago
Frank Lee | 2cc1175c76 | [fx] tested the complete workflow for auto-parallel (#1336) | 2 years ago
YuliangLiu0306 | 4631fef8a0 | [fx]refactor tracer (#1335) | 2 years ago
HELSON | f92c100ddd | [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) | 2 years ago
ver217 | 0c51ff2c13 | [hotfix] ZeroDDP use new process group (#1333) | 2 years ago
Frank Lee | 75abc75c15 | [fx] fixed compatiblity issue with torch 1.10 (#1331) | 2 years ago
ver217 | 7a05367101 | [hotfix] shared model returns cpu state_dict (#1328) | 2 years ago
Frank Lee | b2475d8c5c | [fx] fixed unit tests for torch 1.12 (#1327) | 2 years ago
HELSON | d49708ae43 | [hotfix] fix ddp for unit test test_gpt2 (#1326) | 2 years ago
Frank Lee | 250be4d31e | [utils] integrated colotensor with lazy init context (#1324) | 2 years ago
YuliangLiu0306 | e8acf55e8b | [fx] add balanced policy v2 (#1251) | 2 years ago
XYE | ca2d3f284f | [fx] Add unit test and fix bugs for transform_mlp_pass (#1299) | 2 years ago
HELSON | 1b41686461 | [hotfix] fix unit test test_module_spec (#1321) | 2 years ago
Jiarui Fang | 9e4c6449b0 | [checkpoint] add ColoOptimizer checkpointing (#1316) | 2 years ago
ver217 | 7c70bfbefa | [hotfix] fix PipelineSharedModuleGradientHandler (#1314) | 2 years ago
Jiarui Fang | 85f933b58b | [Optimizer] Remove useless ColoOptimizer (#1312) | 2 years ago
Jiarui Fang | 9f10524313 | [Optimizer] polish the init method of ColoOptimizer (#1310) | 2 years ago