3763 Commits (5caad13055e802e2665f1d70593116103a72395a)
 

Author SHA1 Message Date
Sze-qq 5caad13055 [doc] add hpc cloud intro (#6147) 2 days ago
duanjunwen e0c68ab6d3 [Zerobubble] merge main. (#6142) 3 days ago
ver217 184a653704 [checkpointio] fix pinned state dict 3 days ago
ver217 5fa657f0a1 [checkpointio] fix size compute 3 days ago
flybird11111 eb69e640e5 [async io]supoort async io (#6137) 3 days ago
Hongxin Liu b90835bd32 [checkpointio] fix performance issue (#6139) 3 days ago
Wang Binluo 8e08c27e19 [ckpt] Add async ckpt api (#6136) 3 days ago
Hongxin Liu d4a436051d [checkpointio] support async model save (#6131) 3 days ago
Hongxin Liu 5a03d2696d [cli] support run as module option (#6135) 1 week ago
Hanks cc40fe0e6f [fix] multi-node backward slowdown (#6134) 1 week ago
duanjunwen c2fe3137e2 [hotfix] fix flash attn window_size err (#6132) 1 week ago
Hongxin Liu a2596519fd [zero] support extra dp (#6123) 1 week ago
Tong Li 30a9443132 [Coati] Refine prompt for better inference (#6117) 2 weeks ago
Tong Li 7a60161035 update readme (#6116) 2 weeks ago
Hongxin Liu a15ab139ad [plugin] support get_grad_norm (#6115) 2 weeks ago
Hongxin Liu 13ffa08cfa [release] update version (#6109) 3 weeks ago
pre-commit-ci[bot] 2f583c1549 [pre-commit.ci] pre-commit autoupdate (#6078) 3 weeks ago
Hongxin Liu c2e8f61592 [checkpointio] fix hybrid plugin model save (#6106) 3 weeks ago
Tong Li 89a9a600bc [MCTS] Add self-refined MCTS (#6098) 4 weeks ago
binmakeswell 4294ae83bb [doc] sora solution news (#6100) 4 weeks ago
Hongxin Liu 80a8ca916a [extension] hotfix compile check (#6099) 4 weeks ago
Hanks dee63cc5ef Merge pull request #6096 from BurkeHulk/hotfix/lora_ckpt 1 month ago
BurkeHulk 6d6cafabe2 pre-commit fix 1 month ago
BurkeHulk b10339df7c fix lora ckpt save format (ColoTensor to Tensor) 1 month ago
Hongxin Liu 19baab5fd5 [release] update version (#6094) 1 month ago
Hongxin Liu 58d8b8a2dd [misc] fit torch api upgradation and remove legecy import (#6093) 1 month ago
Hongxin Liu 5ddad486ca [fp8] add fallback and make compile option configurable (#6092) 1 month ago
botbw 3b1d7d1ae8 [chore] refactor 1 month ago
botbw 2bcd0b6844 [ckpt] add safetensors util 1 month ago
Hongxin Liu cd61353bae [pipeline] hotfix backward for multiple outputs (#6090) 1 month ago
Wenxuan Tan 62c13e7969 [Ring Attention] Improve comments (#6085) 1 month ago
Wang Binluo dcd41d0973 Merge pull request #6071 from wangbluo/ring_attention 1 month ago
wangbluo 83cf2f84fb fix 1 month ago
wangbluo bc7eeade33 fix 1 month ago
wangbluo fd92789af2 fix 1 month ago
wangbluo 6be9862aaf fix 1 month ago
wangbluo 3dc08c8a5a fix 1 month ago
wangbluo 8ff7d0c780 fix 1 month ago
wangbluo fe9208feac fix 1 month ago
wangbluo 3201377e94 fix 1 month ago
wangbluo 23199e34cc fix 1 month ago
wangbluo d891e50617 fix 1 month ago
wangbluo e1e86f9f1f fix 1 month ago
Tong Li 4c8e85ee0d [Coati] Train DPO using PP (#6054) 1 month ago
wangbluo 703bb5c18d fix the test 1 month ago
wangbluo 4e0e99bb6a fix the test 1 month ago
wangbluo 1507a7528f fix 1 month ago
wangbluo 0002ae5956 fix 1 month ago
Hongxin Liu dc2cdaf3e8 [shardformer] optimize seq parallelism (#6086) 1 month ago
wangbluo efe3042bb2 fix 1 month ago