wangbluo
3fab92166e
fix
2024-09-26 18:03:09 +08:00
duanjunwen
bb0390c90d
[fix] remove duplicate arg; rm comments;
2024-09-26 09:45:44 +00:00
duanjunwen
c5503b0d80
[fix] fix test_pipeline_utils ci;
2024-09-26 07:18:16 +00:00
duanjunwen
45f17fc6cc
[fix] rm comments;
2024-09-26 06:13:56 +00:00
duanjunwen
a92e16719b
[fix] fix zerobubble; support shardformer model type;
2024-09-26 06:11:56 +00:00
binmakeswell
f4daf04270
add funding news (#6072)
* add funding news
* add funding news
* add funding news
2024-09-26 12:29:27 +08:00
wangbluo
6705dad41b
fix
2024-09-25 19:02:21 +08:00
wangbluo
91ed32c256
fix
2024-09-25 19:00:38 +08:00
wangbluo
6fb1322db1
fix
2024-09-25 18:56:18 +08:00
wangbluo
65c8297710
fix the attn
2024-09-25 18:51:03 +08:00
wangbluo
cfd9eda628
fix the ring attn
2024-09-25 18:34:29 +08:00
duanjunwen
83163fa70c
[fix] fix traverse; traverse dict --> traverse tensor List;
2024-09-25 06:38:11 +00:00
duanjunwen
fc8b016887
[fix] fix stage_indices;
2024-09-25 06:15:45 +00:00
binmakeswell
cbaa104216
release FP8 news (#6068)
* add FP8 news
* release FP8 news
* release FP8 news
2024-09-25 11:57:16 +08:00
duanjunwen
8501202a35
Merge pull request #6065 from duanjunwen/dev/zero_bubble
[Feat] Support zero bubble with shardformer input
2024-09-24 19:17:37 +08:00
duanjunwen
7e6f793c51
[fix] fix detach_output_obj clone;
2024-09-24 08:08:32 +00:00
duanjunwen
6c1e1550ae
[fix] fix dumb clone;
2024-09-23 06:43:49 +00:00
duanjunwen
a875212a42
[fix] fix ci --> oom in 4096 hidden dim;
2024-09-23 05:55:16 +00:00
duanjunwen
c114d1429a
[fix] fix detach clone release order;
2024-09-23 04:00:24 +00:00
duanjunwen
da3220f48c
[fix] fix pipeline util func deallocate --> release_tensor_data; fix bwd_b loss bwd branch;
2024-09-20 09:48:35 +00:00
duanjunwen
1739df423c
[fix] fix fwd branch, fwd pass both micro_batch & internal_inputs
2024-09-20 07:34:43 +00:00
duanjunwen
b6616f544e
[fix] rm comments;
2024-09-20 07:29:41 +00:00
duanjunwen
c6d6ee39bd
[fix] use tree_flatten replace dict traverse;
2024-09-20 07:18:49 +00:00
duanjunwen
26783776f1
[fix] fix input_tensors buffer append input_obj(dict) --> Tuple (microbatch, input_obj), and all bwd b related cal logic;
2024-09-20 06:41:19 +00:00
duanjunwen
4753bf7add
[fix] fix mem assert;
2024-09-19 08:27:47 +00:00
duanjunwen
a115106f8d
[fix] fix bwd w input;
2024-09-19 08:10:05 +00:00
duanjunwen
349272c71f
[fix] update bwd b&w input; dict --> list[torch.Tensor]
2024-09-19 07:47:01 +00:00
duanjunwen
6ee9584b9a
[fix] fix require_grad & deallocate call;
2024-09-19 05:53:03 +00:00
duanjunwen
1f5c7258aa
Merge remote-tracking branch 'upstream/feature/zerobubble' into dev/zero_bubble
2024-09-19 03:52:13 +00:00
Hongxin Liu
dabc2e7430
[release] update version (#6062)
2024-09-19 10:45:32 +08:00
Camille Zhong
f9546ba0be
[ColossalEval] support for vllm (#6056)
* support vllm
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* modify vllm and update readme
* run pre-commit
* remove duplicated lines and refine code
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update param name
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* refine code
* update readme
* refine code
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-09-18 17:09:45 +08:00
duanjunwen
af2c2f8092
[feat] add more test;
2024-09-18 07:51:54 +00:00
duanjunwen
3dbad102cf
[fix] fix zerobubble pp for shardformer type input;
2024-09-18 07:14:34 +00:00
botbw
4fa6b9509c
[moe] add parallel strategy for shared_expert && fix test for deepseek (#6063)
2024-09-18 10:09:01 +08:00
Wang Binluo
63314ce4e4
Merge pull request #6064 from wangbluo/fix_attn
[sp] : fix the attention kernel for sp
2024-09-18 10:08:15 +08:00
wangbluo
10e4f7da72
fix
2024-09-16 13:45:04 +08:00
Wang Binluo
37e35230ff
Merge pull request #6061 from wangbluo/sp_fix
[sp] : fix the attention kernel for sp
2024-09-14 20:54:35 +08:00
wangbluo
827ef3ee9a
fix
2024-09-14 10:40:35 +00:00
Guangyao Zhang
bdb125f83f
[doc] FP8 training and communication document (#6050)
* Add FP8 training and communication document
* add fp8 docstring for plugins
* fix typo
* fix typo
2024-09-14 11:01:05 +08:00
Guangyao Zhang
f20b066c59
[fp8] Disable all_gather intranode. Disable Redundant all_gather fp8 (#6059)
* all_gather only internode, fix pytest
* fix cuda arch <89 compile pytest error
* fix pytest failure
* disable all_gather_into_tensor_flat_fp8
* fix fp8 format
* fix pytest
* fix conversations
* fix chunk tuple to list
2024-09-14 10:40:01 +08:00
wangbluo
b582319273
fix
2024-09-13 10:24:41 +00:00
wangbluo
0ad3129cb9
fix
2024-09-13 09:01:26 +00:00
wangbluo
0b14a5512e
fix
2024-09-13 07:06:14 +00:00
botbw
696fced0d7
[fp8] fix missing fp8_comm flag in mixtral (#6057)
2024-09-13 14:30:05 +08:00
wangbluo
dc032172c3
fix
2024-09-13 06:00:58 +00:00
wangbluo
f393867cff
fix
2024-09-13 05:24:52 +00:00
wangbluo
6eb8832366
fix
2024-09-13 05:06:56 +00:00
wangbluo
683179cefd
fix
2024-09-13 03:40:56 +00:00
wangbluo
0a01e2a453
fix the attn
2024-09-13 03:38:35 +00:00
pre-commit-ci[bot]
216d54e374
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2024-09-13 02:38:40 +00:00