Hongxin Liu
a15ab139ad
[plugin] support get_grad_norm (#6115)
3 weeks ago
Hongxin Liu
c2e8f61592
[checkpointio] fix hybrid plugin model save (#6106)
4 weeks ago
Hanks
dee63cc5ef
Merge pull request #6096 from BurkeHulk/hotfix/lora_ckpt
...
[hotfix] fix lora ckpt saving format
1 month ago
BurkeHulk
6d6cafabe2
pre-commit fix
1 month ago
BurkeHulk
b10339df7c
fix lora ckpt save format (ColoTensor to Tensor)
1 month ago
Hongxin Liu
58d8b8a2dd
[misc] fit torch api upgradation and remove legacy import (#6093)
...
* [amp] fit torch's new api
* [amp] fix api call
* [amp] fix api call
* [misc] fit torch pytree api upgrade
* [misc] remove legacy import
* [misc] fit torch amp api
* [misc] fit torch amp api
1 month ago
Hongxin Liu
5ddad486ca
[fp8] add fallback and make compile option configurable (#6092)
1 month ago
botbw
3b1d7d1ae8
[chore] refactor
1 month ago
botbw
2bcd0b6844
[ckpt] add safetensors util
1 month ago
Hongxin Liu
cd61353bae
[pipeline] hotfix backward for multiple outputs (#6090)
...
* [pipeline] hotfix backward for multiple outputs
* [pipeline] hotfix backward for multiple outputs
1 month ago
Wenxuan Tan
62c13e7969
[Ring Attention] Improve comments (#6085)
...
* improve comments
* improve comments
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
1 month ago
Wang Binluo
dcd41d0973
Merge pull request #6071 from wangbluo/ring_attention
...
[Ring Attention] fix the 2d ring attn when using multiple machine
1 month ago
wangbluo
83cf2f84fb
fix
2 months ago
wangbluo
bc7eeade33
fix
2 months ago
wangbluo
fd92789af2
fix
2 months ago
wangbluo
6be9862aaf
fix
2 months ago
wangbluo
3dc08c8a5a
fix
2 months ago
wangbluo
fe9208feac
fix
2 months ago
wangbluo
3201377e94
fix
2 months ago
wangbluo
23199e34cc
fix
2 months ago
wangbluo
d891e50617
fix
2 months ago
wangbluo
e1e86f9f1f
fix
2 months ago
Tong Li
4c8e85ee0d
[Coati] Train DPO using PP (#6054)
...
* update dpo
* remove unsupport plugin
* update msg
* update dpo
* remove unsupport plugin
* update msg
* update template
* update dataset
* add pp for dpo
* update dpo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add dpo fn
* update dpo
* update dpo
* update dpo
* update dpo
* minor update
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update loss
* update help
* polish code
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2 months ago
wangbluo
1507a7528f
fix
2 months ago
wangbluo
0002ae5956
fix
2 months ago
Hongxin Liu
dc2cdaf3e8
[shardformer] optimize seq parallelism (#6086)
...
* [shardformer] optimize seq parallelism
* [shardformer] fix gpt2 fused linear col
* [plugin] update gemini plugin
* [plugin] update moe hybrid plugin
* [test] update gpt2 fused linear test
* [shardformer] fix gpt2 fused linear reduce
2 months ago
wangbluo
efe3042bb2
fix
2 months ago
wangbluo
5ecc27e150
fix
2 months ago
wangbluo
f98384aef6
fix
2 months ago
Hongxin Liu
646b3c5a90
[shardformer] fix linear 1d row and support uneven splits for fused qkv linear (#6084)
...
* [tp] hotfix linear row
* [tp] support uneven split for fused linear
* [tp] support sp for fused linear
* [tp] fix gpt2 mlp policy
* [tp] fix gather fused and add fused linear row
2 months ago
wangbluo
b635dd0669
fix
2 months ago
wangbluo
3532f77b90
fix
2 months ago
wangbluo
3fab92166e
fix
2 months ago
wangbluo
6705dad41b
fix
2 months ago
wangbluo
91ed32c256
fix
2 months ago
wangbluo
6fb1322db1
fix
2 months ago
wangbluo
65c8297710
fix the attn
2 months ago
wangbluo
cfd9eda628
fix the ring attn
2 months ago
botbw
4fa6b9509c
[moe] add parallel strategy for shared_expert && fix test for deepseek (#6063)
2 months ago
wangbluo
10e4f7da72
fix
2 months ago
Wang Binluo
37e35230ff
Merge pull request #6061 from wangbluo/sp_fix
...
[sp]: fix the attention kernel for sp
3 months ago
wangbluo
827ef3ee9a
fix
3 months ago
Guangyao Zhang
bdb125f83f
[doc] FP8 training and communication document (#6050)
...
* Add FP8 training and communication document
* add fp8 docstring for plugins
* fix typo
* fix typo
3 months ago
Guangyao Zhang
f20b066c59
[fp8] Disable all_gather intranode. Disable Redundant all_gather fp8 (#6059)
...
* all_gather only internode, fix pytest
* fix cuda arch <89 compile pytest error
* fix pytest failure
* disable all_gather_into_tensor_flat_fp8
* fix fp8 format
* fix pytest
* fix conversations
* fix chunk tuple to list
3 months ago
wangbluo
b582319273
fix
3 months ago
wangbluo
0ad3129cb9
fix
3 months ago
wangbluo
0b14a5512e
fix
3 months ago
botbw
696fced0d7
[fp8] fix missing fp8_comm flag in mixtral (#6057)
3 months ago
wangbluo
dc032172c3
fix
3 months ago
wangbluo
f393867cff
fix
3 months ago