189 Commits (ckpt)

Author SHA1 Message Date
botbw c54c4fcd15
[hotfix] moe hybrid parallelism benchmark & follow-up fix (#6048)
3 months ago
Wenxuan Tan 8fd25d6e09
[Feature] Split cross-entropy computation in SP (#5959)
3 months ago
Wang Binluo eea37da6fa
[fp8] Merge feature/fp8_comm to main branch of Colossalai (#6016)
3 months ago
flybird11111 0a51319113
[fp8] zero support fp8 linear. (#6006)
3 months ago
Hongxin Liu 8241c0c054
[fp8] support gemini plugin (#5978)
4 months ago
Hanks b480eec738
[Feature]: support FP8 communication in DDP, FSDP, Gemini (#5928)
4 months ago
flybird11111 0c10afd372
[FP8] rebase main (#5963)
4 months ago
BurkeHulk 66018749f3
add fp8_communication flag in the script
5 months ago
Haze188 416580b314
[MoE/ZeRO] Moe refactor with zero refactor (#5821)
5 months ago
botbw 8e718a1421
[gemini] fixes for benchmarking (#5847)
5 months ago
Edenzzzz 2a25a2aff7
[Feature] optimize PP overlap (#5735)
5 months ago
Edenzzzz 8795bb2e80
Support 4d parallel + flash attention (#5789)
5 months ago
hxwang 154720ba6e
[chore] refactor profiler utils
6 months ago
genghaozhe 87665d7922
correct argument help message
6 months ago
genghaozhe b9269d962d
add args.prefetch_num for benchmark
6 months ago
genghaozhe fba04e857b
[bugs] fix args.profile=False DummyProfiler errro
6 months ago
hxwang ca674549e0
[chore] remove unnecessary test & changes
6 months ago
hxwang ff507b755e
Merge branch 'main' of github.com:hpcaitech/ColossalAI into prefetch
6 months ago
hxwang 63c057cd8e
[example] add profile util for llama
6 months ago
botbw 2fc85abf43
[gemini] async grad chunk reduce (all-reduce&reduce-scatter) (#5713)
6 months ago
hxwang 15d21a077a
Merge remote-tracking branch 'origin/main' into prefetch
6 months ago
Yuanheng Zhao 8633c15da9
[sync] Sync feature/colossal-infer with main
6 months ago
genghaozhe a280517dd9
remove unrelated file
6 months ago
genghaozhe df63db7e63
remote comments
6 months ago
hxwang 2e68eebdfe
[chore] refactor & sync
7 months ago
Yuanheng Zhao 12e7c28d5e
[hotfix] fix OpenMOE example import path (#5697)
7 months ago
Yuanheng Zhao 55cc7f3df7
[Fix] Fix Inference Example, Tests, and Requirements (#5688)
7 months ago
Edenzzzz c25f83c85f
fix missing pad token (#5690)
7 months ago
Yuanheng Zhao 56ed09aba5
[sync] resolve conflicts of merging main
7 months ago
Hongxin Liu 7f8b16635b
[misc] refactor launch API and tensor constructor (#5666)
7 months ago
Tong Li 68ec99e946
[hotfix] add soft link to support required files (#5661)
7 months ago
Hongxin Liu 1b387ca9fe
[shardformer] refactor pipeline grad ckpt config (#5646)
7 months ago
傅剑寒 279300dc5f
[Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613)
7 months ago
binmakeswell f4c5aafe29
[example] llama3 (#5631)
7 months ago
Hongxin Liu 4de4e31818
[exampe] update llama example (#5626)
7 months ago
Edenzzzz d83c633ca6
[hotfix] Fix examples no pad token & auto parallel codegen bug; (#5606)
7 months ago
Hongxin Liu 641b1ee71a
[devops] remove post commit ci (#5566)
8 months ago
digger yu 341263df48
[hotfix] fix typo s/get_defualt_parser /get_default_parser (#5548)
8 months ago
digger yu a799ca343b
[fix] fix typo s/muiti-node /multi-node etc. (#5448)
8 months ago
Wenhao Chen e614aa34f3
[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508)
8 months ago
Yuanheng Zhao 36c4bb2893
[Fix] Grok-1 use tokenizer from the same pretrained path (#5532)
8 months ago
Insu Jang 00525f7772
[shardformer] fix pipeline forward error if custom layer distribution is used (#5189)
8 months ago
Yuanheng Zhao 131f32a076
[fix] fix grok-1 example typo (#5506)
8 months ago
binmakeswell 34e909256c
[release] grok-1 inference benchmark (#5500)
8 months ago
Wenhao Chen bb0a668fee
[hotfix] set return_outputs=False in examples and polish code (#5404)
8 months ago
Yuanheng Zhao 5fcd7795cd
[example] update Grok-1 inference (#5495)
8 months ago
binmakeswell 6df844b8c4
[release] grok-1 314b inference (#5490)
8 months ago
Hongxin Liu 848a574c26
[example] add grok-1 inference (#5485)
8 months ago
Luo Yihang e239cf9060
[hotfix] fix typo of openmoe model source (#5403)
9 months ago
Hongxin Liu 070df689e6
[devops] fix extention building (#5427)
9 months ago