109 Commits (5f8c0a0ac3b52a71b664c3e36dd1a8cef40f428d)

Author SHA1 Message Date
Edenzzzz 5f8c0a0ac3
[Feature] auto-cast optimizers to distributed version (#5746)
6 months ago
botbw 2fc85abf43
[gemini] async grad chunk reduce (all-reduce&reduce-scatter) (#5713)
6 months ago
Edenzzzz 43995ee436
[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694)
7 months ago
flybird11111 77ec773388
[zero]remove registered gradients hooks (#5687)
7 months ago
Hongxin Liu 7f8b16635b
[misc] refactor launch API and tensor constructor (#5666)
7 months ago
linsj20 91fa553775
[Feature] qlora support (#5586)
7 months ago
flybird11111 8954a0c2e2
[LowLevelZero] low level zero support lora (#5153)
7 months ago
Baizhou Zhang 14b0d4c7e5
[lora] add lora APIs for booster, support lora for TorchDDP (#4981)
7 months ago
Hongxin Liu 1b387ca9fe
[shardformer] refactor pipeline grad ckpt config (#5646)
7 months ago
Hongxin Liu 4de4e31818
[exampe] update llama example (#5626)
7 months ago
flybird11111 a0ad587c24
[shardformer] refactor embedding resize (#5603)
7 months ago
Hongxin Liu 641b1ee71a
[devops] remove post commit ci (#5566)
8 months ago
Zhongkai Zhao 8e412a548e
[shardformer] Sequence Parallelism Optimization (#5533)
8 months ago
Wenhao Chen e614aa34f3
[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508)
8 months ago
Wenhao Chen bb0a668fee
[hotfix] set return_outputs=False in examples and polish code (#5404)
8 months ago
flybird11111 5e16bf7980
[shardformer] fix gathering output when using tensor parallelism (#5431)
8 months ago
Hongxin Liu f2e8b9ef9f
[devops] fix compatibility (#5444)
9 months ago
digger yu 5e1c93d732
[hotfix] fix typo change MoECheckpintIO to MoECheckpointIO (#5335)
9 months ago
flybird11111 29695cf70c
[example]add gpt2 benchmark example script. (#5295)
9 months ago
QinLuo bf34c6fef6
[fsdp] impl save/load shard model/optimizer (#5357)
9 months ago
Frank Lee efef43b53c
Merge pull request #5372 from hpcaitech/exp/mixtral
10 months ago
Hongxin Liu da39d21b71
[moe] support mixtral (#5309)
10 months ago
Xuanlei Zhao 7d8e0338a4
[moe] init mixtral impl
10 months ago
Hongxin Liu 6c0fa7b9a8
[llama] fix dataloader for hybrid parallel (#5358)
10 months ago
Wenhao Chen 1c790c0877
[fix] remove unnecessary dp_size assert (#5351)
10 months ago
ver217 148469348a
Merge branch 'main' into sync/npu
10 months ago
flybird11111 46e091651b
[shardformer] hybridparallelplugin support gradients accumulation. (#5246)
10 months ago
flybird11111 e830ef917d
[ci] fix shardformer tests. (#5255)
11 months ago
Frank Lee 9102d655ab
[hotfix] removed unused flag (#5242)
11 months ago
Hongxin Liu d202cc28c0
[npu] change device to accelerator api (#5239)
11 months ago
flybird11111 365671be10
fix-test (#5210)
11 months ago
Wenhao Chen d799a3088f
[pipeline]: add p2p fallback order and fix interleaved pp deadlock (#5214)
11 months ago
Wenhao Chen 4fa689fca1
[pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134)
11 months ago
flybird11111 21aa5de00b
[gemini] hotfix NaN loss while using Gemini + tensor_parallel (#5150)
12 months ago
flybird11111 2a2ec49aa7
[plugin]fix 3d checkpoint load when booster boost without optimizer. (#5135)
1 year ago
github-actions[bot] d10ee42f68
[format] applied code formatting on changed files in pull request 5088 (#5127)
1 year ago
Wenhao Chen 7172459e74
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088)
1 year ago
Xuanlei Zhao 3acbf6d496
[npu] add npu support for hybrid plugin and llama (#5090)
1 year ago
github-actions[bot] 8921a73c90
[format] applied code formatting on changed files in pull request 5067 (#5072)
1 year ago
Hongxin Liu e5ce4c8ea6
[npu] add npu support for gemini and zero (#5067)
1 year ago
flybird11111 3e02154710
[gemini] gemini support extra-dp (#5043)
1 year ago
flybird11111 576a2f7b10
[gemini] gemini support tensor parallelism. (#4942)
1 year ago
Xuanlei Zhao f71e63b0f3
[moe] support optimizer checkpoint (#5015)
1 year ago
littsk 1a3315e336
[hotfix] Add layer norm gradients all-reduce for sequence parallel (#4926)
1 year ago
Xuanlei Zhao dc003c304c
[moe] merge moe into main (#4978)
1 year ago
Baizhou Zhang c040d70aa0
[hotfix] fix the bug of repeatedly storing param group (#4951)
1 year ago
Baizhou Zhang 21ba89cab6
[gemini] support gradient accumulation (#4869)
1 year ago
Zhongkai Zhao a0684e7bd6
[feature] support no master weights option for low level zero plugin (#4816)
1 year ago
littsk 83b52c56cd
[feature] Add clip_grad_norm for hybrid_parallel_plugin (#4837)
1 year ago
Hongxin Liu df63564184
[gemini] support amp o3 for gemini (#4872)
1 year ago