Commit Graph

3530 Commits (moe_sp)

Author SHA1 Message Date
haze188 befe3100da [bugfix] colo attn bug fix 2024-07-24 08:43:36 +00:00
haze188 2d73efdfdd [bugfix] colo attn bug fix 2024-07-24 06:53:24 +00:00
hxwang e521890d32 [test] add check 2024-07-23 09:38:05 +00:00
haze188 4b6fbaf956 [moe] deepseek moe sp support 2024-07-23 06:39:49 +00:00
botbw 91f84f6a5f [bug] fix: somehow logger hangs the program 2024-07-23 06:17:51 +00:00
hxwang e31d2ebcf7 [test] fix test: test_zero1_2 2024-07-22 05:36:20 +00:00
hxwang c67e553fd3 [moe] remove ops 2024-07-22 04:00:42 +00:00
hxwang 05a78d2f41 [chore] solve moe ckpt test failure and some other arg pass failure 2024-07-22 03:53:02 +00:00
pre-commit-ci[bot] 9f9e268265 [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2024-07-19 07:54:41 +00:00
hxwang c27f5d9731 [chore] minor fix after rebase 2024-07-19 07:53:40 +00:00
hxwang 783aafa327 [moe] full test for deepseek and mixtral (pp + sp to fix) 2024-07-19 07:32:56 +00:00
hxwang 162e2d935c [moe] finalize test (no pp) 2024-07-19 07:32:56 +00:00
haze188 b91cdccf2e moe sp + ep bug fix 2024-07-19 07:32:55 +00:00
hxwang 8e85523a42 [moe] init moe plugin comm setting with sp 2024-07-19 07:32:54 +00:00
hxwang f0599a0c19 [chore] minor fix 2024-07-19 07:32:02 +00:00
Haze188 633849f438 [Feature] MoE Ulysses Support (#5918)
* moe sp support

* moe sp bug solve

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-07-19 07:32:01 +00:00
hxwang c8bf2681e3 [moe] clean legacy code 2024-07-19 07:32:01 +00:00
hxwang 8d3d7f3cbd [moe] test deepseek 2024-07-19 07:32:00 +00:00
botbw 335ad3c6fb [moe] implement tp 2024-07-19 07:30:17 +00:00
botbw d4a64e355e [test] add mixtral modelling test 2024-07-19 07:30:16 +00:00
hxwang 18be903ed9 [chore] arg pass & remove drop token 2024-07-19 07:30:16 +00:00
botbw cbcc818d5a [chore] trivial fix 2024-07-19 07:30:15 +00:00
botbw 5bc085fc01 [chore] manually revert unintended commit 2024-07-19 07:30:14 +00:00
botbw 1b15cc97f5 [moe] add mixtral dp grad scaling when not all experts are activated 2024-07-19 07:30:14 +00:00
botbw 2f9bce6686 [moe] implement submesh initialization 2024-07-19 07:30:13 +00:00
haze188 a613edd517 solve hang when parallel mode = pp + dp 2024-07-19 07:30:13 +00:00
haze188 0210bead8c [misc] solve booster hang by rename the variable 2024-07-19 07:29:36 +00:00
botbw b303ffe9f3 [zero] solve hang 2024-07-19 07:29:36 +00:00
botbw 2431694564 [moe] implement transit between non moe tp and ep 2024-07-19 07:29:35 +00:00
botbw dec6e25e99 [test] pass mixtral shardformer test 2024-07-19 07:29:35 +00:00
hxwang 61109c7843 [zero] solve hang 2024-07-19 07:29:07 +00:00
hxwang 000456bf94 [chore] handle non member group 2024-07-19 07:29:07 +00:00
hxwang 4fc6f9aa98 [test] mixtra pp shard test 2024-07-19 07:29:06 +00:00
hxwang 5a9490a46b [moe] fix plugin 2024-07-19 07:29:06 +00:00
hxwang 6a9164a477 [test] add mixtral transformer test 2024-07-19 07:29:05 +00:00
hxwang 229db4bc16 [test] add mixtral for sequence classification 2024-07-19 07:29:05 +00:00
Tong Li f585d4e38e [ColossalChat] Hotfix for ColossalChat (#5910)
* add ignore and tiny llama

* fix path issue

* run style

* fix issue

* update bash

* add ignore and tiny llama

* fix path issue

* run style

* fix issue

* update bash

* fix ddp issue

* add Qwen 1.5 32B
2024-07-19 13:40:07 +08:00
Edenzzzz 8cc8f645cd [Examples] Add lazy init to OPT and GPT examples (#5924)
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
2024-07-19 10:10:08 +08:00
Hongxin Liu e86127925a [plugin] support all-gather overlap for hybrid parallel (#5919)
* [plugin] fixed all-gather overlap support for hybrid parallel
2024-07-18 15:33:03 +08:00
Hongxin Liu 73494de577 [release] update version (#5912) 2024-07-17 17:29:59 +08:00
Hongxin Liu 27a72f0de1 [misc] support torch2.3 (#5893)
* [misc] support torch2.3

* [devops] update compatibility ci

* [devops] update compatibility ci

* [devops] add debug

* [devops] add debug

* [devops] add debug

* [devops] add debug

* [devops] remove debug

* [devops] remove debug
2024-07-16 13:59:25 +08:00
アマデウス 530283dba0 fix object_to_tensor usage when torch>=2.3.0 (#5820) 2024-07-16 13:59:25 +08:00
Guangyao Zhang 2e28c793ce [compatibility] support torch 2.2 (#5875)
* Support Pytorch 2.2.2

* keep build_on_pr file and update .compatibility
2024-07-16 13:59:25 +08:00
YeAnbang d8bf7e09a2 Merge pull request #5901 from hpcaitech/colossalchat
[Chat] fix eval: add in training evaluation, fix orpo sft loss bug
2024-07-16 11:07:32 +08:00
Guangyao Zhang 1c961b20f3 [ShardFormer] fix qwen2 sp (#5903) 2024-07-15 13:58:06 +08:00
Stephan Kö 45c49dde96 [Auto Parallel]: Speed up intra-op plan generation by 44% (#5446)
* Remove unnecessary calls to deepcopy

* Build DimSpec's difference dict only once

This change considerably speeds up construction speed of DimSpec objects. The difference_dict is the same for each DimSpec object, so a single copy of it is enough.

* Fix documentation of DimSpec's difference method
2024-07-15 12:05:06 +08:00
YeAnbang b3594d4d68 fix orpo cross entropy loss 2024-07-15 02:12:05 +00:00
Hongxin Liu c068ef0fa0 [zero] support all-gather overlap (#5898)
* [zero] support all-gather overlap

* [zero] add overlap all-gather flag

* [misc] fix typo

* [zero] update api
2024-07-11 18:59:59 +08:00
YeAnbang 115c4cc5a4 hotfix citation 2024-07-11 06:05:05 +00:00
YeAnbang e7a8634636 fix eval 2024-07-11 03:35:03 +00:00