323 Commits (f92c100ddd4a93a72f710dc476936c6890b8bffe)

Author SHA1 Message Date
HELSON f92c100ddd
[checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) 2 years ago
Frank Lee f3ce7b8336
[fx] recovered skipped pipeline tests (#1338) 2 years ago
ver217 0c51ff2c13
[hotfix] ZeroDDP use new process group (#1333) 2 years ago
Frank Lee 75abc75c15
[fx] fixed compatiblity issue with torch 1.10 (#1331) 2 years ago
Frank Lee 169954f87e
[test] removed outdated unit test for meta context (#1329) 2 years ago
ver217 7a05367101
[hotfix] shared model returns cpu state_dict (#1328) 2 years ago
Frank Lee b2475d8c5c
[fx] fixed unit tests for torch 1.12 (#1327) 2 years ago
HELSON d49708ae43
[hotfix] fix ddp for unit test test_gpt2 (#1326) 2 years ago
Frank Lee 250be4d31e
[utils] integrated colotensor with lazy init context (#1324) 2 years ago
YuliangLiu0306 e8acf55e8b
[fx] add balanced policy v2 (#1251) 2 years ago
XYE ca2d3f284f
[fx] Add unit test and fix bugs for transform_mlp_pass (#1299) 2 years ago
HELSON 1b41686461
[hotfix] fix unit test test_module_spec (#1321) 2 years ago
Jiarui Fang 9e4c6449b0
[checkpoint] add ColoOptimizer checkpointing (#1316) 2 years ago
Jiarui Fang 85f933b58b
[Optimizer] Remove useless ColoOptimizer (#1312) 2 years ago
Jiarui Fang 9f10524313
[Optimizer] polish the init method of ColoOptimizer (#1310) 2 years ago
HELSON 36086927e1
[hotfix] fix ColoTensor GPT2 unitest (#1309) 2 years ago
Jiarui Fang 3ef3791a3b
[checkpoint] add test for bert and hotfix save bugs (#1297) 2 years ago
Jiarui Fang bd71e2a88b
[hotfix] add missing file (#1308) 2 years ago
Frank Lee 4f4d8c3656
[fx] added apex normalization to patched modules (#1300) 2 years ago
Jiarui Fang 4165eabb1e
[hotfix] remove potiential circle import (#1307) 2 years ago
YuliangLiu0306 93a75433df
[hotfix] skip some unittest due to CI environment. (#1301) 2 years ago
HELSON 260a55804a
[hotfix] fix shape error in backward when using ColoTensor (#1298) 2 years ago
Frank Lee 7e8114a8dd
[hotfix] skipped unsafe test cases (#1282) 2 years ago
Jiarui Fang 79fe7b027a
[hotfix] test model unittest hotfix (#1281) 2 years ago
Jiarui Fang e56731e916
[hotfix] test_gpt.py duplicated (#1279) 2 years ago
HELSON abba4d84e1
[hotfix] fix bert model test in unitests (#1272) 2 years ago
YuliangLiu0306 01ea68b2e6
[tests] remove T5 test skip decorator (#1271) 2 years ago
Jiarui Fang ca9d5ee91c
[hotfix] torchvison fx unittests miss import pytest (#1277) 2 years ago
Jiarui Fang c92f84fcdb
[tensor] distributed checkpointing for parameters (#1240) 2 years ago
Frank Lee 4a09fc0947
[fx] fixed tracing with apex-based T5 model (#1252) 2 years ago
YuliangLiu0306 97d713855a
[fx] methods to get fx graph property. (#1246) 2 years ago
YuliangLiu0306 30b4fc0eb0
[fx]add split module pass and unit test from pipeline passes (#1242) 2 years ago
Jiarui Fang 1aad903c15
[tensor] redistribute among different process groups (#1247) 2 years ago
Jiarui Fang 9bcd2fd4af
[tensor] a shorter shard and replicate spec (#1245) 2 years ago
Jiarui Fang 2699dfbbfd
[rename] convert_to_dist -> redistribute (#1243) 2 years ago
HELSON f6add9b720
[tensor] redirect .data.__get__ to a tensor instance (#1239) 2 years ago
Jiarui Fang 20da6e48c8
[checkpoint] save sharded optimizer states (#1237) 2 years ago
Jiarui Fang 4a76084dc9
[tensor] add zero_like colo op, important for Optimizer (#1236) 2 years ago
Jiarui Fang 3b500984b1
[tensor] fix some unittests (#1234) 2 years ago
HELSON 0453776def
[tensor] fix a assertion in colo_tensor cross_entropy (#1232) 2 years ago
Jiarui Fang 0e199d71e8
[hotfix] fx get comm size bugs (#1233) 2 years ago
HELSON 42ab36b762
[tensor] add unitest for colo_tensor 1DTP cross_entropy (#1230) 2 years ago
Yi Zhao 04537bf83e
[checkpoint]support generalized scheduler (#1222) 2 years ago
Jiarui Fang a98319f023
[tensor] torch function return colotensor (#1229) 2 years ago
Frank Lee 5581170890
[fx] fixed huggingface OPT and T5 results misalignment (#1227) 2 years ago
YuliangLiu0306 2b7dca44b5
[fx]get communication size between partitions (#1224) 2 years ago
Frank Lee 84f2298a96
[fx] added patches for tracing swin transformer (#1228) 2 years ago
Frank Lee 37fcf96b7f
[fx] fixed timm tracing result misalignment (#1225) 2 years ago
Frank Lee b6cb5a47ad
[fx] added timm model tracing testing (#1221) 2 years ago
Jiarui Fang 15d988f954
[tensor] sharded global process group (#1219) 2 years ago