Hongxin Liu
b90835bd32
[checkpointio] fix performance issue (#6139)
1 week ago
Wang Binluo
8e08c27e19
[ckpt] Add async ckpt api (#6136)
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
1 week ago
Hongxin Liu
d4a436051d
[checkpointio] support async model save (#6131)
* [checkpointio] support async model save
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 week ago
Hongxin Liu
5a03d2696d
[cli] support run as module option (#6135)
2 weeks ago
Hanks
cc40fe0e6f
[fix] multi-node backward slowdown (#6134)
* remove redundant memcpy during backward
* get back record_stream
2 weeks ago
duanjunwen
c2fe3137e2
[hotfix] fix flash attn window_size err (#6132)
* [fix] fix flash attn
* [hotfix] fix flash-atten version
* [fix] fix flash_atten version
* [fix] fix flash-atten versions
* [fix] fix flash-attn not enough values to unpack error
* [fix] fix test_ring_attn
* [fix] fix test ring attn
2 weeks ago
Hongxin Liu
a2596519fd
[zero] support extra dp (#6123)
* [zero] support extra dp
* [zero] update checkpoint
* fix bugs
* fix bugs
2 weeks ago
Tong Li
30a9443132
[Coati] Refine prompt for better inference (#6117)
* refine prompt
* update prompt
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 weeks ago
Tong Li
7a60161035
update readme (#6116)
3 weeks ago
Hongxin Liu
a15ab139ad
[plugin] support get_grad_norm (#6115)
3 weeks ago
Hongxin Liu
13ffa08cfa
[release] update version (#6109)
3 weeks ago
pre-commit-ci[bot]
2f583c1549
[pre-commit.ci] pre-commit autoupdate (#6078)
updates:
- [github.com/psf/black-pre-commit-mirror: 24.8.0 → 24.10.0](https://github.com/psf/black-pre-commit-mirror/compare/24.8.0...24.10.0)
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.2](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.2)
- [github.com/pre-commit/pre-commit-hooks: v4.6.0 → v5.0.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.6.0...v5.0.0)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
4 weeks ago
Hongxin Liu
c2e8f61592
[checkpointio] fix hybrid plugin model save (#6106)
4 weeks ago
Tong Li
89a9a600bc
[MCTS] Add self-refined MCTS (#6098)
* add reasoner
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update code
* delete llama
* update prompts
* update readme
* update readme
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 month ago
binmakeswell
4294ae83bb
[doc] sora solution news (#6100)
* [doc] sora solution news
* [doc] sora solution news
1 month ago
Hongxin Liu
80a8ca916a
[extension] hotfix compile check (#6099)
1 month ago
Hanks
dee63cc5ef
Merge pull request #6096 from BurkeHulk/hotfix/lora_ckpt
[hotfix] fix lora ckpt saving format
1 month ago
BurkeHulk
6d6cafabe2
pre-commit fix
1 month ago
BurkeHulk
b10339df7c
fix lora ckpt save format (ColoTensor to Tensor)
1 month ago
Hongxin Liu
19baab5fd5
[release] update version (#6094)
1 month ago
Hongxin Liu
58d8b8a2dd
[misc] fit torch api upgradation and remove legacy import (#6093)
* [amp] fit torch's new api
* [amp] fix api call
* [amp] fix api call
* [misc] fit torch pytree api upgrade
* [misc] remove legacy import
* [misc] fit torch amp api
* [misc] fit torch amp api
1 month ago
Hongxin Liu
5ddad486ca
[fp8] add fallback and make compile option configurable (#6092)
1 month ago
botbw
3b1d7d1ae8
[chore] refactor
1 month ago
botbw
2bcd0b6844
[ckpt] add safetensors util
1 month ago
Hongxin Liu
cd61353bae
[pipeline] hotfix backward for multiple outputs (#6090)
* [pipeline] hotfix backward for multiple outputs
* [pipeline] hotfix backward for multiple outputs
1 month ago
Wenxuan Tan
62c13e7969
[Ring Attention] Improve comments (#6085)
* improve comments
* improve comments
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
1 month ago
Wang Binluo
dcd41d0973
Merge pull request #6071 from wangbluo/ring_attention
[Ring Attention] fix the 2d ring attn when using multiple machine
1 month ago
wangbluo
83cf2f84fb
fix
1 month ago
wangbluo
bc7eeade33
fix
1 month ago
wangbluo
fd92789af2
fix
1 month ago
wangbluo
6be9862aaf
fix
1 month ago
wangbluo
3dc08c8a5a
fix
1 month ago
wangbluo
8ff7d0c780
fix
2 months ago
wangbluo
fe9208feac
fix
2 months ago
wangbluo
3201377e94
fix
2 months ago
wangbluo
23199e34cc
fix
2 months ago
wangbluo
d891e50617
fix
2 months ago
wangbluo
e1e86f9f1f
fix
2 months ago
Tong Li
4c8e85ee0d
[Coati] Train DPO using PP (#6054)
* update dpo
* remove unsupport plugin
* update msg
* update dpo
* remove unsupport plugin
* update msg
* update template
* update dataset
* add pp for dpo
* update dpo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add dpo fn
* update dpo
* update dpo
* update dpo
* update dpo
* minor update
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update loss
* update help
* polish code
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2 months ago
wangbluo
703bb5c18d
fix the test
2 months ago
wangbluo
4e0e99bb6a
fix the test
2 months ago
wangbluo
1507a7528f
fix
2 months ago
wangbluo
0002ae5956
fix
2 months ago
Hongxin Liu
dc2cdaf3e8
[shardformer] optimize seq parallelism (#6086)
* [shardformer] optimize seq parallelism
* [shardformer] fix gpt2 fused linear col
* [plugin] update gemini plugin
* [plugin] update moe hybrid plugin
* [test] update gpt2 fused linear test
* [shardformer] fix gpt2 fused linear reduce
2 months ago
wangbluo
efe3042bb2
fix
2 months ago
梁爽
6b2c506fc5
Update README.md (#6087)
add HPC-AI.COM activity
2 months ago
wangbluo
5ecc27e150
fix
2 months ago
wangbluo
f98384aef6
fix
2 months ago
Hongxin Liu
646b3c5a90
[shardformer] fix linear 1d row and support uneven splits for fused qkv linear (#6084)
* [tp] hotfix linear row
* [tp] support uneven split for fused linear
* [tp] support sp for fused linear
* [tp] fix gpt2 mlp policy
* [tp] fix gather fused and add fused linear row
2 months ago
wangbluo
b635dd0669
fix
2 months ago