94 Commits (d6df19bae7cdb9e116c1f218a4465855623c80b1)

Author SHA1 Message Date
Hongxin Liu 079bf3cb26
[misc] update pre-commit and run all files (#4752)
1 year ago
Hongxin Liu b5f9e37c70
[legacy] clean up legacy code (#4743)
1 year ago
Hongxin Liu 554aa9592e
[legacy] move communication and nn to legacy and refactor logger (#4671)
1 year ago
Hongxin Liu a39a5c66fe
Merge branch 'main' into feature/shardformer
1 year ago
Hongxin Liu 27061426f7
[gemini] improve compatibility and add static placement policy (#4479)
1 year ago
Hongxin Liu 26e29d58f0
[devops] add large-scale distributed test marker (#4452)
1 year ago
flybird1111 7a3dfd0c64
[shardformer] update shardformer to use flash attention 2 (#4392)
1 year ago
flybird1111 906426cb44
[Shardformer] Merge flash attention branch to pipeline branch (#4362)
1 year ago
flybird1111 458ae331ad
[kernel] updated unittests for coloattention (#4389)
1 year ago
flybird1111 38b792aab2
[coloattention] fix import error (#4380)
1 year ago
flybird1111 25c57b9fb4
[fix] coloattention support flash attention 2 (#4347)
1 year ago
Hongxin Liu 16bf4c0221
[test] remove useless tests (#4359)
1 year ago
Hongxin Liu dbb32692d2
[lazy] refactor lazy init (#3891)
2 years ago
Frank Lee 615e2e5fc1
[test] fixed lazy init test import error (#3799)
2 years ago
Hongxin Liu afb239bbf8
[devops] update torch version of CI (#3725)
2 years ago
digger-yu b7141c36dd
[CI] fix some spelling errors (#3707)
2 years ago
Frank Lee 80eba05b0a
[test] refactor tests with spawn (#3452)
2 years ago
ver217 26b7aac0be
[zero] reorganize zero/gemini folder structure (#3424)
2 years ago
YuliangLiu0306 045afa3ea2
[hotfix] skip torchaudio tracing test (#3211)
2 years ago
ver217 f8289d4221
[lazyinit] combine lazy tensor with dtensor (#3204)
2 years ago
zbian 7bc0afc901
updated flash attention usage
2 years ago
ver217 6ae8ed0407
[lazyinit] add correctness verification (#3147)
2 years ago
アマデウス 077a66dd81
updated attention kernel (#2133)
2 years ago
zbian 6877121377
updated flash attention api
2 years ago
ver217 99870726b1
[CheckpointIO] a uniform checkpoint I/O module (#1689)
2 years ago
oahzxl 9639ea88fc
[kernel] more flexible flashatt interface (#1804)
2 years ago
oahzxl 501a9e9cd2
[hotfix] polish flash attention (#1802)
2 years ago
Jiarui Fang c248800359
[kernel] skip tests of flash_attn and triton when they are not available (#1798)
2 years ago
oahzxl 25952b67d7
[feat] add flash attention (#1762)
2 years ago
Boyuan Yao 47fd8e4a02
[utils] Add use_reetrant=False in utils.activation_checkpoint (#1460)
2 years ago
Jiarui Fang 36824a304c
[Doc] add more doc for ColoTensor. (#1458)
2 years ago
ver217 821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442)
2 years ago
HELSON 527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py (#1387)
2 years ago
HELSON 7a8702c06d
[colotensor] add Tensor.view op and its unit test (#1343)
2 years ago
HELSON f92c100ddd
[checkpoint] use gather_tensor in checkpoint and update its unit test (#1339)
2 years ago
Frank Lee 169954f87e
[test] removed outdated unit test for meta context (#1329)
2 years ago
Frank Lee 250be4d31e
[utils] integrated colotensor with lazy init context (#1324)
2 years ago
Jiarui Fang 9e4c6449b0
[checkpoint] add ColoOptimizer checkpointing (#1316)
2 years ago
Jiarui Fang 85f933b58b
[Optimizer] Remove useless ColoOptimizer (#1312)
2 years ago
Jiarui Fang 9f10524313
[Optimizer] polish the init method of ColoOptimizer (#1310)
2 years ago
Jiarui Fang 3ef3791a3b
[checkpoint] add test for bert and hotfix save bugs (#1297)
2 years ago
Frank Lee 7e8114a8dd
[hotfix] skipped unsafe test cases (#1282)
2 years ago
Jiarui Fang c92f84fcdb
[tensor] distributed checkpointing for parameters (#1240)
2 years ago
Jiarui Fang 9bcd2fd4af
[tensor] a shorter shard and replicate spec (#1245)
2 years ago
Jiarui Fang 20da6e48c8
[checkpoint] save sharded optimizer states (#1237)
2 years ago
Jiarui Fang 3b500984b1
[tensor] fix some unittests (#1234)
2 years ago
Yi Zhao 04537bf83e
[checkpoint]support generalized scheduler (#1222)
2 years ago
Jiarui Fang 52736205d9
[checkpoint] make unitest faster (#1217)
2 years ago
Jiarui Fang f38006ea83
[checkpoint] checkpoint for ColoTensor Model (#1196)
2 years ago
YuliangLiu0306 63d2a93878
[context]support arbitary module materialization. (#1193)
2 years ago