1235 Commits (e8a9bebc8770b9430f4150a400e6fef43cf02d4f)

Author SHA1 Message Date
YuliangLiu0306 33f0744d51
[tensor] add shape consistency feature to support auto spec transform (#1418)
2 years ago
HELSON 4fb3c52cf0
[zero] add unit test for AgChunk's append, close, access (#1423)
2 years ago
HELSON c577ed016e
[zero] add AgChunk (#1417)
2 years ago
Jiarui Fang d209aff684
Add FreqAwareEmbeddingBag (#1421)
2 years ago
ver217 6df3e19be9
[hotfix] zero optim prevents calling inner optim.zero_grad (#1422)
2 years ago
Jiarui Fang 504419d261
[FAW] add cache manager for the cached embedding (#1419)
2 years ago
Kirigaya Kazuto 44fd3c83ab
[communication] add p2p_v2.py to support communication with List[Any] (#1407)
2 years ago
github-actions[bot] 1590f59908
Automated submodule synchronization (#1415)
2 years ago
github-actions[bot] 9b442ecdc3
Automated submodule synchronization (#1404)
2 years ago
YuliangLiu0306 7c96055c68
[tensor]build sharding spec to replace distspec in future. (#1405)
2 years ago
ver217 12b4887097
[hotfix] fix CPUAdam kernel nullptr (#1410)
2 years ago
github-actions[bot] 1e5eb0874c
Automated submodule synchronization (#1396)
2 years ago
YuliangLiu0306 0442f940f0
[device] add DeviceMesh class to support logical device layout (#1394)
2 years ago
ver217 04c9a86af8
[zero] ZeroDDP supports controlling outputs' dtype (#1399)
2 years ago
HELSON 4e98e938ce
[zero] alleviate memory usage in ZeRODDP state_dict (#1398)
2 years ago
Jiarui Fang 4f5f8f77d1
update nvme on readme (#1397)
2 years ago
ver217 56b8863b87
[zero] chunk manager allows filtering ex-large params (#1393)
2 years ago
Frank Lee adf5054ff8
[fx] fixed torchaudio conformer tracing (#1392)
2 years ago
Frank Lee 7d6293927f
[fx] patched torch.max and data movement operator (#1391)
2 years ago
fastalgo db89600cf2
Update README.md
2 years ago
Frank Lee 89e60d1505
[fx] fixed indentation error in checkpointing codegen (#1385)
2 years ago
HELSON c7221cb2d4
[hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388)
2 years ago
Frank Lee ad678921db
[fx] patched torch.full for huggingface opt (#1386)
2 years ago
HELSON 527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py (#1387)
2 years ago
Jiarui Fang f792507ff3
[chunk] add PG check for tensor appending (#1383)
2 years ago
ver217 8dced41ad0
[zero] zero optim state_dict takes only_rank_0 (#1384)
2 years ago
ver217 7d5d628e07
[DDP] test ddp state dict uses more strict threshold (#1382)
2 years ago
YuliangLiu0306 df54481473
[hotfix] fix some bugs during gpt2 testing (#1379)
2 years ago
ver217 828b9e5e0d
[hotfix] fix zero optim save/load state dict (#1381)
2 years ago
HELSON b6fd165f66
[checkpoint] add kwargs for load_state_dict (#1374)
2 years ago
github-actions[bot] 50dec605e1
Automated submodule synchronization (#1380)
2 years ago
ver217 83328329dd
[hotfix] fix zero ddp buffer cast (#1376)
2 years ago
ver217 5d5031e946
fix zero ddp state dict (#1378)
2 years ago
Frank Lee 0c1a16ea5b
[util] standard checkpoint function naming (#1377)
2 years ago
YuliangLiu0306 52bc2dc271
[fx] update split module pass and add customized policy (#1373)
2 years ago
Super Daniel be229217ce
[fx] add torchaudio test (#1369)
2 years ago
github-actions[bot] fb6f085907
Automated submodule synchronization (#1372)
2 years ago
Boyuan Yao bb640ec728
[fx] Add colotracer compatibility test on torchrec (#1370)
2 years ago
ver217 c415240db6
[nvme] CPUAdam and HybridAdam support NVMe offload (#1360)
2 years ago
HELSON 8463290642
[checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368)
2 years ago
github-actions[bot] c491c2a948
Automated submodule synchronization (#1364)
2 years ago
YuliangLiu0306 5542816690
[fx]add gpt2 passes for pipeline performance test (#1366)
2 years ago
HELSON 87775a0682
[colotensor] use cpu memory to store state_dict (#1367)
2 years ago
HELSON 943a96323e
[hotfix] fix no optimizer in save/load (#1363)
2 years ago
Frank Lee cd063ac37f
[fx] added activation checkpoint codegen support for torch < 1.12 (#1359)
2 years ago
HELSON 4417804129
[unit test] add megatron init test in zero_optim (#1358)
2 years ago
HELSON 7a065dc9f6
[hotfix] fix megatron_init in test_gpt2.py (#1357)
2 years ago
Frank Lee 644582eee9
[fx] added activation checkpoint codegen (#1355)
2 years ago
ver217 38fd8844c0
[docker] add tensornvme in docker (#1354)
2 years ago
ver217 6b43c789fd
fix zero optim backward_by_grad and save/load (#1353)
2 years ago