ver217
184a653704
[checkpointio] fix pinned state dict
6 days ago
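This fix touches the pinned staging buffers that async checkpointing copies into before writing to disk. A minimal generic sketch of that pattern (plain PyTorch; ColossalAI's internal helper will differ):

```python
import torch

def create_pinned_state_dict(state_dict):
    # Allocate page-locked (pinned) CPU buffers matching each tensor so
    # that device-to-host copies can run asynchronously.
    return {
        name: torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
        for name, t in state_dict.items()
    }

def stage_to_pinned(state_dict, pinned):
    # Non-blocking D2H copies overlap with compute; synchronize (or
    # record a CUDA event) before handing the buffers to the writer thread.
    for name, t in state_dict.items():
        pinned[name].copy_(t, non_blocking=True)
    torch.cuda.synchronize()
    return pinned
```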
ver217
5fa657f0a1
[checkpointio] fix size compute
6 days ago
flybird11111
eb69e640e5
[async io] support async io ( #6137 )
* support async optimizer save/load
* fix
* fix
* support pin mem
* Update low_level_zero_plugin.py
* fix
* fix
* fix
* fix
* fix
6 days ago
Hongxin Liu
b90835bd32
[checkpointio] fix performance issue ( #6139 )
6 days ago
Wang Binluo
8e08c27e19
[ckpt] Add async ckpt api ( #6136 )
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
6 days ago
Hongxin Liu
d4a436051d
[checkpointio] support async model save ( #6131 )
* [checkpointio] support async model save
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
6 days ago
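Taken together, the async checkpoint commits above ( #6137, #6136, #6131 ) suggest usage along these lines. A hedged sketch: the `use_async` flag name is an assumption read off this PR series, and the distributed launch boilerplate (`colossalai.launch`) is omitted:

```python
import torch
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())

booster = Booster(plugin=LowLevelZeroPlugin())
model, optimizer, *_ = booster.boost(model, optimizer)

# Assumed flag name: the call returns quickly while the state dict is
# staged through pinned CPU buffers and written to disk in the background.
booster.save_model(model, "ckpt/model.safetensors", use_async=True)
```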
Hongxin Liu
5a03d2696d
[cli] support run as module option ( #6135 )
2 weeks ago
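(Presumably this mirrors torchrun's `-m`, e.g. `colossalai run --nproc_per_node 8 -m my_pkg.train` in place of a script path; the exact flag shape is an assumption from the PR title.)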
Hanks
cc40fe0e6f
[fix] multi-node backward slowdown ( #6134 )
* remove redundant memcpy during backward
* get back record_stream
2 weeks ago
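The `record_stream` bullet refers to the standard CUDA-stream safety pattern for async collectives. A generic sketch (not the plugin's actual code; assumes an initialized process group):

```python
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()

def all_reduce_on_side_stream(grad: torch.Tensor) -> torch.Tensor:
    # Run the gradient all-reduce on a side stream so it overlaps with
    # backward compute on the default stream.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_reduce(grad)
        # Tell the caching allocator that `grad` is still in use on
        # comm_stream; otherwise its memory could be reallocated by the
        # default stream before the async all-reduce finishes.
        grad.record_stream(comm_stream)
    return grad
```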
duanjunwen
c2fe3137e2
[hotfix] fix flash attn window_size err ( #6132 )
* [fix] fix flash attn
* [hotfix] fix flash-attn version
* [fix] fix flash-attn version
* [fix] fix flash-attn versions
* [fix] fix flash-attn "not enough values to unpack" error
* [fix] fix test_ring_attn
* [fix] fix test ring attn
2 weeks ago
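The version churn in these bullets comes from flash-attn only accepting `window_size` from 2.3.0 onward. A hedged sketch of the guard (`flash_attn_func` and its `window_size` kwarg are the real API; the surrounding shapes are illustrative):

```python
import torch
from packaging import version
import flash_attn
from flash_attn import flash_attn_func

# flash_attn_func expects (batch, seqlen, nheads, headdim) fp16/bf16 CUDA tensors.
q = k = v = torch.randn(2, 128, 8, 64, dtype=torch.float16, device="cuda")

kwargs = {}
if version.parse(flash_attn.__version__) >= version.parse("2.3.0"):
    # (-1, -1) disables the sliding window, i.e. full attention.
    kwargs["window_size"] = (-1, -1)

out = flash_attn_func(q, k, v, causal=True, **kwargs)
```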
Hongxin Liu
a2596519fd
[zero] support extra dp ( #6123 )
* [zero] support extra dp
* [zero] update checkpoint
* fix bugs
* fix bugs
2 weeks ago
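A hedged configuration sketch: `extra_dp_size` is the knob name suggested by the PR title and may differ in the released API:

```python
from colossalai.booster.plugin import LowLevelZeroPlugin

# Assumed parameter name: with 16 GPUs and extra_dp_size=2, ZeRO shards
# optimizer states across 8 ranks and replicates them across 2 groups,
# shrinking each collective at the cost of some extra memory.
plugin = LowLevelZeroPlugin(stage=1, extra_dp_size=2)
```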
Hongxin Liu
a15ab139ad
[plugin] support get_grad_norm ( #6115 )
3 weeks ago
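A hedged usage sketch, continuing the boosted setup from the async-save example above; the method name and its placement on the boosted optimizer are assumptions read off the PR title:

```python
# Assumed API: after boosting, the optimizer wrapper exposes the total
# gradient norm it computed during its internal clipping step.
loss = model(torch.randn(4, 1024, device="cuda")).sum()
booster.backward(loss, optimizer)
optimizer.step()
grad_norm = optimizer.get_grad_norm()  # assumption: float, or None if not computed
```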
Hongxin Liu
c2e8f61592
[checkpointio] fix hybrid plugin model save ( #6106 )
4 weeks ago
Hanks
dee63cc5ef
Merge pull request #6096 from BurkeHulk/hotfix/lora_ckpt
[hotfix] fix lora ckpt saving format
1 month ago
BurkeHulk
6d6cafabe2
pre-commit fix
1 month ago
BurkeHulk
b10339df7c
fix lora ckpt save format (ColoTensor to Tensor)
1 month ago
Hongxin Liu
58d8b8a2dd
[misc] fit torch api upgrade and remove legacy import ( #6093 )
* [amp] fit torch's new api
* [amp] fix api call
* [amp] fix api call
* [misc] fit torch pytree api upgrade
* [misc] remove legacy import
* [misc] fit torch amp api
* [misc] fit torch amp api
1 month ago
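These commits track PyTorch's move from the CUDA-specific AMP namespace to the device-generic one. The old and new spellings side by side:

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

scaler = torch.amp.GradScaler("cuda")  # was: torch.cuda.amp.GradScaler()
with torch.amp.autocast("cuda", dtype=torch.float16):  # was: torch.cuda.amp.autocast()
    loss = model(torch.randn(4, 8, device="cuda")).sum()

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```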
Hongxin Liu
5ddad486ca
[fp8] add fallback and make compile option configurable ( #6092 )
1 month ago
botbw
3b1d7d1ae8
[chore] refactor
1 month ago
botbw
2bcd0b6844
[ckpt] add safetensors util
1 month ago
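The util wraps the standard safetensors API, which stores raw tensor buffers behind a JSON header (mmap-friendly loading, no pickle execution). Basic usage of the underlying library:

```python
import torch
from safetensors.torch import save_file, load_file

state = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}
save_file(state, "ckpt.safetensors")
loaded = load_file("ckpt.safetensors")  # dict of CPU tensors
```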
Hongxin Liu
cd61353bae
[pipeline] hotfix backward for multiple outputs ( #6090 )
* [pipeline] hotfix backward for multiple outputs
* [pipeline] hotfix backward for multiple outputs
1 month ago
Wenxuan Tan
62c13e7969
[Ring Attention] Improve comments ( #6085 )
* improve comments
* improve comments
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
1 month ago
Wang Binluo
dcd41d0973
Merge pull request #6071 from wangbluo/ring_attention
[Ring Attention] fix the 2D ring attn when using multiple machines
1 month ago
wangbluo
83cf2f84fb
fix
1 month ago
wangbluo
bc7eeade33
fix
1 month ago
wangbluo
fd92789af2
fix
1 month ago
wangbluo
6be9862aaf
fix
1 month ago
wangbluo
3dc08c8a5a
fix
1 month ago
wangbluo
fe9208feac
fix
1 month ago
wangbluo
3201377e94
fix
1 month ago
wangbluo
23199e34cc
fix
1 month ago
wangbluo
d891e50617
fix
1 month ago
wangbluo
e1e86f9f1f
fix
1 month ago
Tong Li
4c8e85ee0d
[Coati] Train DPO using PP ( #6054 )
* update dpo
* remove unsupported plugin
* update msg
* update dpo
* remove unsupported plugin
* update msg
* update template
* update dataset
* add pp for dpo
* update dpo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add dpo fn
* update dpo
* update dpo
* update dpo
* update dpo
* minor update
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update loss
* update help
* polish code
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 month ago
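For reference, the objective this trainer optimizes is the standard DPO loss (a generic restatement, not Coati's exact code):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # The reward margin is the log-ratio of policy to reference
    # probabilities on chosen vs. rejected responses.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```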
wangbluo
1507a7528f
fix
2 months ago
wangbluo
0002ae5956
fix
2 months ago
Hongxin Liu
dc2cdaf3e8
[shardformer] optimize seq parallelism ( #6086 )
* [shardformer] optimize seq parallelism
* [shardformer] fix gpt2 fused linear col
* [plugin] update gemini plugin
* [plugin] update moe hybrid plugin
* [test] update gpt2 fused linear test
* [shardformer] fix gpt2 fused linear reduce
2 months ago
wangbluo
efe3042bb2
fix
2 months ago
wangbluo
5ecc27e150
fix
2 months ago
wangbluo
f98384aef6
fix
2 months ago
Hongxin Liu
646b3c5a90
[shardformer] fix linear 1d row and support uneven splits for fused qkv linear ( #6084 )
* [tp] hotfix linear row
* [tp] support uneven split for fused linear
* [tp] support sp for fused linear
* [tp] fix gpt2 mlp policy
* [tp] fix gather fused and add fused linear row
2 months ago
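Uneven splits arise whenever the fused QKV projection mixes different head counts, e.g. grouped-query attention. A generic illustration (shapes are hypothetical) of why the fused output dim cannot be cut into three equal parts:

```python
import torch

# GQA example: Q has 32 heads of dim 128, K/V have 8 heads each, so the
# fused output dim splits unevenly as 4096 / 1024 / 1024.
hidden, q_dim, kv_dim = 4096, 4096, 1024
fused_qkv = torch.nn.Linear(hidden, q_dim + 2 * kv_dim)

x = torch.randn(2, 16, hidden)
q, k, v = fused_qkv(x).split([q_dim, kv_dim, kv_dim], dim=-1)
```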
wangbluo
b635dd0669
fix
2 months ago
wangbluo
3532f77b90
fix
2 months ago
wangbluo
3fab92166e
fix
2 months ago
wangbluo
6705dad41b
fix
2 months ago
wangbluo
91ed32c256
fix
2 months ago
wangbluo
6fb1322db1
fix
2 months ago
wangbluo
65c8297710
fix the attn
2 months ago
wangbluo
cfd9eda628
fix the ring attn
2 months ago
botbw
4fa6b9509c
[moe] add parallel strategy for shared_expert && fix test for deepseek ( #6063 )
2 months ago
wangbluo
10e4f7da72
fix
2 months ago