956 Commits (6474e31556ae011410f29d8d0a2d80de67ed6956)

Author SHA1 Message Date
Frank Lee 6474e31556
[workflow] added TensorNVMe to compatibility test (#1449)
2 years ago
Geng Zhang 9f3eed66eb
[FAW] reorganize the inheritance struct of FreqCacheEmbedding (#1448)
2 years ago
Frank Lee 5a52e21fe3
[test] fixed the activation codegen test (#1447)
2 years ago
YuliangLiu0306 0f3042363c
[tensor] shape consistency generate transform path and communication cost (#1435)
2 years ago
Boyuan Yao 5774fe0270
[fx] Use colossalai checkpoint and add offload recognition in codegen (#1439)
2 years ago
Kirigaya Kazuto e9460b45c8
[engin/schedule] use p2p_v2 to recontruct pipeline_schedule (#1408)
2 years ago
Frank Lee ae1b58cd16
[tensor] added linear implementation for the new sharding spec (#1416)
2 years ago
Super Daniel d40a9392ba
[fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. (#1446)
2 years ago
ver217 821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442)
2 years ago
Frank Lee 74bee5f7e8
[release] update version.txt (#1444)
2 years ago
HELSON b80340168e
[zero] add chunk_managerV2 for all-gather chunk (#1441)
2 years ago
Super Daniel 3b26516c69
[fx] add vanilla activation checkpoint search with test on resnet and densenet (#1433)
2 years ago
Jiarui Fang 30b4dd17c0
[FAW] export FAW in _ops (#1438)
2 years ago
HELSON 9056677b13
[zero] add chunk size searching algorithm for parameters in different groups (#1436)
2 years ago
Jiarui Fang c9427a323f
hotfix #1434 (#1437)
2 years ago
HELSON 039b7ed3bc
[polish] add update directory in gemini; rename AgChunk to ChunkV2 (#1432)
2 years ago
Super Daniel f20cb4e893
[fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages (#1425)
2 years ago
Jiarui Fang 89c434a0a6
[polish] add test_ops directory (#1431)
2 years ago
Jiarui Fang 10b3df65c8
[FAW] move coloparam setting in test code. (#1429)
2 years ago
Jiarui Fang cb98cf5558
[FAW] parallel FreqAwareEmbedding (#1424)
2 years ago
HELSON 0d212183c4
[zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426)
2 years ago
YuliangLiu0306 33f0744d51
[tensor] add shape consistency feature to support auto spec transform (#1418)
2 years ago
HELSON 4fb3c52cf0
[zero] add unit test for AgChunk's append, close, access (#1423)
2 years ago
HELSON c577ed016e
[zero] add AgChunk (#1417)
2 years ago
Jiarui Fang d209aff684
Add FreqAwareEmbeddingBag (#1421)
2 years ago
ver217 6df3e19be9
[hotfix] zero optim prevents calling inner optim.zero_grad (#1422)
2 years ago
Jiarui Fang 504419d261
[FAW] add cache manager for the cached embedding (#1419)
2 years ago
Kirigaya Kazuto 44fd3c83ab
[communication] add p2p_v2.py to support communication with List[Any] (#1407)
2 years ago
github-actions[bot] 1590f59908
Automated submodule synchronization (#1415)
2 years ago
github-actions[bot] 9b442ecdc3
Automated submodule synchronization (#1404)
2 years ago
YuliangLiu0306 7c96055c68
[tensor]build sharding spec to replace distspec in future. (#1405)
2 years ago
ver217 12b4887097
[hotfix] fix CPUAdam kernel nullptr (#1410)
2 years ago
github-actions[bot] 1e5eb0874c
Automated submodule synchronization (#1396)
2 years ago
YuliangLiu0306 0442f940f0
[device] add DeviceMesh class to support logical device layout (#1394)
2 years ago
ver217 04c9a86af8
[zero] ZeroDDP supports controlling outputs' dtype (#1399)
2 years ago
HELSON 4e98e938ce
[zero] alleviate memory usage in ZeRODDP state_dict (#1398)
2 years ago
Jiarui Fang 4f5f8f77d1
update nvme on readme (#1397)
2 years ago
ver217 56b8863b87
[zero] chunk manager allows filtering ex-large params (#1393)
2 years ago
Frank Lee adf5054ff8
[fx] fixed torchaudio conformer tracing (#1392)
2 years ago
Frank Lee 7d6293927f
[fx] patched torch.max and data movement operator (#1391)
2 years ago
fastalgo db89600cf2
Update README.md
2 years ago
Frank Lee 89e60d1505
[fx] fixed indentation error in checkpointing codegen (#1385)
2 years ago
HELSON c7221cb2d4
[hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388)
2 years ago
Frank Lee ad678921db
[fx] patched torch.full for huggingface opt (#1386)
2 years ago
HELSON 527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py (#1387)
2 years ago
Jiarui Fang f792507ff3
[chunk] add PG check for tensor appending (#1383)
2 years ago
ver217 8dced41ad0
[zero] zero optim state_dict takes only_rank_0 (#1384)
2 years ago
ver217 7d5d628e07
[DDP] test ddp state dict uses more strict threshold (#1382)
2 years ago
YuliangLiu0306 df54481473
[hotfix] fix some bugs during gpt2 testing (#1379)
2 years ago
ver217 828b9e5e0d
[hotfix] fix zero optim save/load state dict (#1381)
2 years ago