1029 Commits (868afdb31191ef7b3fa48d6fa71e7758c8707786)

Author SHA1 Message Date
pre-commit-ci[bot] df612434c9 [pre-commit.ci] auto fixes from pre-commit.com hooks 5 months ago
Wang Binluo 4c69e2dc91 support qwen model 5 months ago
Wenhao Chen 32e642bf40 revert: enable return_outputs when necessary 5 months ago
Wenhao Chen 6fa181ebef feat: add qwen2 to model_zoo 5 months ago
Wenhao Chen 14305c9449 test: add qwen2 shard test 5 months ago
Wenhao Chen 5512bdf1fc fix: modify model config and add Qwen2RMSNorm 5 months ago
Wenhao Chen 6ceaf4f1f8 tests: add `sub_dp_group` test 8 months ago
Wenhao Chen e614aa34f3 [shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508) 8 months ago
Insu Jang 00525f7772 [shardformer] fix pipeline forward error if custom layer distribution is used (#5189) 8 months ago
Hongxin Liu 19e1a5cf16 [shardformer] update colo attention to support custom mask (#5510) 8 months ago
Edenzzzz 61da3fbc52 fixed layout converter caching and updated tester 8 months ago
flybird11111 0688d92e2d [shardformer]Fix lm parallel. (#5480) 8 months ago
Wenhao Chen bb0a668fee [hotfix] set return_outputs=False in examples and polish code (#5404) 8 months ago
flybird11111 5e16bf7980 [shardformer] fix gathering output when using tensor parallelism (#5431) 8 months ago
Hongxin Liu f2e8b9ef9f [devops] fix compatibility (#5444) 8 months ago
flybird11111 29695cf70c [example]add gpt2 benchmark example script. (#5295) 9 months ago
QinLuo bf34c6fef6 [fsdp] impl save/load shard model/optimizer (#5357) 9 months ago
ver217 06db94fbc9 [moe] fix tests 10 months ago
Xuanlei Zhao 7d8e0338a4 [moe] init mixtral impl 10 months ago
Hongxin Liu c53ddda88f [lr-scheduler] fix load state dict and add test (#5369) 10 months ago
Wenhao Chen 1c790c0877 [fix] remove unnecessary dp_size assert (#5351) 10 months ago
Hongxin Liu ffffc32dc7 [checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347) 10 months ago
Frank Lee 7cfed5f076 [feat] refactored extension module (#5298) 10 months ago
Hongxin Liu d7f8db8e21 [hotfix] fix 3d plugin test (#5292) 10 months ago
Zhongkai Zhao 5d9a0ae75b [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230) 10 months ago
flybird11111 46e091651b [shardformer] hybridparallelplugin support gradients accumulation. (#5246) 10 months ago
flybird11111 2a0558d8ec [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276) 10 months ago
Frank Lee d69cd2eb89 [workflow] fixed oom tests (#5275) 10 months ago
Wenhao Chen ef4f0ee854 [hotfix]: add pp sanity check and fix mbs arg (#5268) 10 months ago
flybird11111 e830ef917d [ci] fix shardformer tests. (#5255) 11 months ago
Frank Lee 2b83418719 [ci] fixed ddp test (#5254) 11 months ago
Frank Lee d5eeeb1416 [ci] fixed booster test (#5251) 11 months ago
Frank Lee edf94a35c3 [workflow] fixed build CI (#5240) 11 months ago
Hongxin Liu d202cc28c0 [npu] change device to accelerator api (#5239) 11 months ago
Elsa Granger d565df3821 [pipeline] A more general _communicate in p2p (#5062) 11 months ago
Xuanlei Zhao dd2c28a323 [npu] use extension for op builder (#5172) 11 months ago
Wenhao Chen d799a3088f [pipeline]: add p2p fallback order and fix interleaved pp deadlock (#5214) 11 months ago
Wenhao Chen 4fa689fca1 [pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134) 11 months ago
flybird11111 79718fae04 [shardformer] llama support DistCrossEntropy (#5176) 12 months ago
flybird11111 21aa5de00b [gemini] hotfix NaN loss while using Gemini + tensor_parallel (#5150) 12 months ago
flybird11111 2a2ec49aa7 [plugin]fix 3d checkpoint load when booster boost without optimizer. (#5135) 12 months ago
github-actions[bot] d10ee42f68 [format] applied code formatting on changed files in pull request 5088 (#5127) 12 months ago
Wenhao Chen 7172459e74 [shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088) 1 year ago
Zhongkai Zhao 75af66cd81 [Hotfix] Fix model policy matching strategy in ShardFormer (#5064) 1 year ago
Xu Kai fb103cfd6e [inference] update examples and engine (#5073) 1 year ago
Bin Jia 0c7d8bebd5 [hotfix/hybridengine] fix bug when tp*pp size = 1 (#5069) 1 year ago
Hongxin Liu e5ce4c8ea6 [npu] add npu support for gemini and zero (#5067) 1 year ago
Xu Kai fd6482ad8c [inference] Refactor inference architecture (#5057) 1 year ago
Wenhao Chen 3c08f17348 [hotfix]: modify create_ep_hierarchical_group and add test (#5032) 1 year ago
flybird11111 3e02154710 [gemini] gemini support extra-dp (#5043) 1 year ago