890 Commits (44178041299fe2a4d12b4aa85fa4f6d745df19d5)
Author SHA1 Message Date
HELSON 4417804129
[unit test] add megatron init test in zero_optim (#1358)
2 years ago
HELSON 7a065dc9f6
[hotfix] fix megatron_init in test_gpt2.py (#1357)
2 years ago
Frank Lee 644582eee9
[fx] added activation checkpoint codegen (#1355)
2 years ago
ver217 38fd8844c0
[docker] add tensornvme in docker (#1354)
2 years ago
ver217 6b43c789fd
fix zero optim backward_by_grad and save/load (#1353)
2 years ago
ver217 d068af81a3
[doc] update rst and docstring (#1351)
2 years ago
Frank Lee 274c1a3b5f
[fx] fixed apex normalization patch exception (#1352)
2 years ago
ver217 ce470ba37e
[checkpoint] sharded optim save/load grad scaler (#1350)
2 years ago
Frank Lee 05fae1fd56
[fx] added activation checkpointing annotation (#1349)
2 years ago
YuliangLiu0306 051592c64e
[fx] update MetaInforProp pass to process more complex node.meta (#1344)
2 years ago
HELSON 7a8702c06d
[colotensor] add Tensor.view op and its unit test (#1343)
2 years ago
github-actions[bot] 6160a1d6a7
Automated submodule synchronization (#1348)
2 years ago
binmakeswell 92b0b139eb
[NFC] add OPT (#1345)
2 years ago
YuliangLiu0306 942c8cd1fb
[fx] refactor tracer to trace complete graph (#1342)
2 years ago
Frank Lee 2cc1175c76
[fx] tested the complete workflow for auto-parallel (#1336)
2 years ago
YuliangLiu0306 4631fef8a0
[fx]refactor tracer (#1335)
2 years ago
HELSON bf5066fba7
[refactor] refactor ColoTensor's unit tests (#1340)
2 years ago
HELSON f92c100ddd
[checkpoint] use gather_tensor in checkpoint and update its unit test (#1339)
2 years ago
Frank Lee f3ce7b8336
[fx] recovered skipped pipeline tests (#1338)
2 years ago
ver217 0c51ff2c13
[hotfix] ZeroDDP use new process group (#1333)
2 years ago
Frank Lee 11d1436a67
[workflow] update docker build workflow to use proxy (#1334)
2 years ago
Frank Lee 75abc75c15
[fx] fixed compatiblity issue with torch 1.10 (#1331)
2 years ago
Frank Lee 069d6fdc84
[workflow] update 8-gpu test to use torch 1.11 (#1332)
2 years ago
fastalgo 7857fd7616
Update README.md
2 years ago
Frank Lee 169954f87e
[test] removed outdated unit test for meta context (#1329)
2 years ago
ver217 7a05367101
[hotfix] shared model returns cpu state_dict (#1328)
2 years ago
Frank Lee b2475d8c5c
[fx] fixed unit tests for torch 1.12 (#1327)
2 years ago
HELSON d49708ae43
[hotfix] fix ddp for unit test test_gpt2 (#1326)
2 years ago
Frank Lee 250be4d31e
[utils] integrated colotensor with lazy init context (#1324)
2 years ago
Frank Lee 659a740738
[workflow] roll back to use torch 1.11 for unit testing (#1325)
2 years ago
Frank Lee 4d5dbf48a6
[workflow] fixed trigger condition for 8-gpu unit test (#1323)
2 years ago
YuliangLiu0306 e8acf55e8b
[fx] add balanced policy v2 (#1251)
2 years ago
XYE ca2d3f284f
[fx] Add unit test and fix bugs for transform_mlp_pass (#1299)
2 years ago
HELSON 1b41686461
[hotfix] fix unit test test_module_spec (#1321)
2 years ago
Jiarui Fang 9e4c6449b0
[checkpoint] add ColoOptimizer checkpointing (#1316)
2 years ago
Frank Lee 7c2634f4b3
[workflow] updated release bdist workflow (#1318)
2 years ago
github-actions[bot] 869cf3d3b8
Automated submodule synchronization (#1319)
2 years ago
Frank Lee efdc240f1f
[workflow] disable SHM for compatibility CI on rtx3080 (#1315)
2 years ago
ver217 7c70bfbefa
[hotfix] fix PipelineSharedModuleGradientHandler (#1314)
2 years ago
Jiarui Fang 85f933b58b
[Optimizer] Remove useless ColoOptimizer (#1312)
2 years ago
Frank Lee c9c37dcc4d
[workflow] updated pytorch compatibility test (#1311)
2 years ago
Jiarui Fang 9f10524313
[Optimizer] polish the init method of ColoOptimizer (#1310)
2 years ago
HELSON 36086927e1
[hotfix] fix ColoTensor GPT2 unitest (#1309)
2 years ago
Jiarui Fang 3ef3791a3b
[checkpoint] add test for bert and hotfix save bugs (#1297)
2 years ago
Jiarui Fang bd71e2a88b
[hotfix] add missing file (#1308)
2 years ago
Frank Lee 4f4d8c3656
[fx] added apex normalization to patched modules (#1300)
2 years ago
Jiarui Fang 4165eabb1e
[hotfix] remove potiential circle import (#1307)
2 years ago
github-actions[bot] 6f2f9eb214
Automated submodule synchronization (#1305)
2 years ago
YuliangLiu0306 93a75433df
[hotfix] skip some unittest due to CI environment. (#1301)
2 years ago
lucasliunju 339520c6e0
[NFC] polish build_colossalai_wheel.py code style (#1306)
2 years ago