Hongxin Liu
ffffc32dc7
[checkpointio] fix gemini and hybrid parallel optim checkpoint ( #5347 )
...
* [checkpointio] fix hybrid parallel optim checkpoint
* [extension] fix cuda extension
* [checkpointio] fix gemini optimizer checkpoint
* polish code
10 months ago
YeAnbang
c5239840e6
[Chat] fix sft loss nan ( #5345 )
...
* fix script
* fix script
* fix chat nan
* fix chat nan
10 months ago
Frank Lee
abd8e77ad8
[extension] fixed exception catch ( #5342 )
10 months ago
digger yu
71321a07cf
fix typo change dosen't to doesn't ( #5308 )
10 months ago
digger yu
6a3086a505
fix typo under extensions/ ( #5330 )
10 months ago
Frank Lee
febed23288
[doc] added docs for extensions ( #5324 )
...
* [doc] added docs for extensions
* polish
* polish
10 months ago
flybird11111
388179f966
[tests] fix t5 test. ( #5322 )
...
* [ci] fix shardformer tests. (#5255 )
* fix ci
fix
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* fix t5 test
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
10 months ago
Frank Lee
a6709afe66
Merge pull request #5321 from FrankLeeeee/hotfix/accelerator-api
...
[accelerator] fixed npu api
10 months ago
FrankLeeeee
087d0cb1fc
[accelerator] fixed npu api
10 months ago
Frank Lee
8823cc4831
Merge pull request #5310 from hpcaitech/feature/npu
...
Feature/npu
10 months ago
Frank Lee
73f4dc578e
[workflow] updated CI image ( #5318 )
10 months ago
Frank Lee
7cfed5f076
[feat] refactored extension module ( #5298 )
...
* [feat] refactored extension module
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
10 months ago
digger yu
bce9499ed3
fix some typo ( #5307 )
10 months ago
李文军
ec912b1ba9
[NFC] polish applications/Colossal-LLaMA-2/colossal_llama2/tokenizer/init_tokenizer.py code style ( #5228 )
10 months ago
Desperado-Jia
ddf879e2db
fix bug for mefture ( #5299 )
10 months ago
Hongxin Liu
d7f8db8e21
[hotfix] fix 3d plugin test ( #5292 )
10 months ago
flybird11111
f7e3f82a7e
fix llama pretrain ( #5287 )
10 months ago
Desperado-Jia
6a56967855
[doc] add llama2-13B disyplay ( #5285 )
...
* Update README.md
* fix 13b typo
---------
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
10 months ago
Michelle
32cb74493a
fix auto loading gpt2 tokenizer ( #5279 )
11 months ago
Frank Lee
d66e6988bc
Merge pull request #5278 from ver217/sync/npu
...
[sync] sync npu branch with main
11 months ago
ver217
148469348a
Merge branch 'main' into sync/npu
11 months ago
Zhongkai Zhao
5d9a0ae75b
[hotfix] Fix ShardFormer test execution path when using sequence parallelism ( #5230 )
11 months ago
flybird11111
46e091651b
[shardformer] hybridparallelplugin support gradients accumulation. ( #5246 )
...
* support gradients acc
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
fix
* fix
fix
* fix
fix
fix
11 months ago
flybird11111
2a0558d8ec
[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py ( #5276 )
...
* fix ci
fix
* fix test
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
* fix
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
11 months ago
Frank Lee
d69cd2eb89
[workflow] fixed oom tests ( #5275 )
...
* [workflow] fixed oom tests
* polish
* polish
* polish
11 months ago
Frank Lee
04244aaaf1
[workflow] fixed incomplete bash command ( #5272 )
11 months ago
Wenhao Chen
ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg ( #5268 )
...
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check
11 months ago
binmakeswell
c174c4fc5f
[doc] fix doc typo ( #5256 )
...
* [doc] fix annotation display
* [doc] fix llama2 doc
11 months ago
flybird11111
e830ef917d
[ci] fix shardformer tests. ( #5255 )
...
* fix ci
fix
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
11 months ago
digger yu
756c400ad2
fix typo in applications/ColossalEval/README.md ( #5250 )
11 months ago
Frank Lee
2b83418719
[ci] fixed ddp test ( #5254 )
...
* [ci] fixed ddp test
* polish
11 months ago
Frank Lee
d5eeeb1416
[ci] fixed booster test ( #5251 )
...
* [ci] fixed booster test
* [ci] fixed booster test
* [ci] fixed booster test
11 months ago
Frank Lee
edf94a35c3
[workflow] fixed build CI ( #5240 )
...
* [workflow] fixed build CI
* polish
* polish
* polish
* polish
* polish
11 months ago
digger yu
41e52c1c6e
[doc] fix typo in Colossal-LLaMA-2/README.md ( #5247 )
11 months ago
Frank Lee
9102d655ab
[hotfix] removed unused flag ( #5242 )
11 months ago
Hongxin Liu
d202cc28c0
[npu] change device to accelerator api ( #5239 )
...
* update accelerator
* fix timer
* fix amp
* update
* fix
* update bug
* add error raise
* fix autocast
* fix set device
* remove doc accelerator
* update doc
* update doc
* update doc
* use nullcontext
* update cpu
* update null context
* change time limit for example
* udpate
* update
* update
* update
* [npu] polish accelerator code
---------
Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>
11 months ago
Elsa Granger
d565df3821
[pipeline] A more general _communicate in p2p ( #5062 )
...
* A more general _communicate
* feat: finish tree_flatten version p2p
* fix: update p2p api calls
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
11 months ago
Xuanlei Zhao
dd2c28a323
[npu] use extension for op builder ( #5172 )
...
* update extension
* update cpu adam
* update is
* add doc for cpu adam
* update kernel
* update commit
* update flash
* update memory efficient
* update flash attn
* update flash attention loader
* update api
* fix
* update doc
* update example time limit
* reverse change
* fix doc
* remove useless kernel
* fix
* not use warning
* update
* update
11 months ago
binmakeswell
7bc6969ce6
[doc] SwiftInfer release ( #5236 )
...
* [doc] SwiftInfer release
* [doc] SwiftInfer release
* [doc] SwiftInfer release
* [doc] SwiftInfer release
* [doc] SwiftInfer release
11 months ago
github-actions[bot]
4fb4a22a72
[format] applied code formatting on changed files in pull request 5234 ( #5235 )
...
Co-authored-by: github-actions <github-actions@github.com>
11 months ago
binmakeswell
b9b32b15e6
[doc] add Colossal-LLaMA-2-13B ( #5234 )
...
* [doc] add Colossal-LLaMA-2-13B
* [doc] add Colossal-LLaMA-2-13B
* [doc] add Colossal-LLaMA-2-13B
11 months ago
JIMMY ZHAO
ce651270f1
[doc] Make leaderboard format more uniform and good-looking ( #5231 )
...
* Make leaderboard format more unifeid and good-looking
* Update README.md
* Update README.md
11 months ago
Camille Zhong
915b4652f3
[doc] Update README.md of Colossal-LLAMA2 ( #5233 )
...
* Update README.md
* Update README.md
11 months ago
Tong Li
d992b55968
[Colossal-LLaMA-2] Release Colossal-LLaMA-2-13b-base model ( #5224 )
...
* update readme
* update readme
* update link
* update
* update readme
* update
* update
* update
* update title
* update example
* update example
* fix content
* add conclusion
* add license
* update
* update
* update version
* fix minor
11 months ago
digger yu
b0b53a171c
[nfc] fix typo colossalai/shardformer/ ( #5133 )
11 months ago
flybird11111
451e9142b8
fix flash attn ( #5209 )
11 months ago
flybird11111
365671be10
fix-test ( #5210 )
...
fix-test
fix-test
11 months ago
Hongxin Liu
7f3400b560
[devops] update torch versoin in ci ( #5217 )
11 months ago
Wenhao Chen
d799a3088f
[pipeline]: add p2p fallback order and fix interleaved pp deadlock ( #5214 )
...
* fix: add fallback order option and update 1f1b
* fix: fix deadlock comm in interleaved pp
* test: modify p2p test
11 months ago
Wenhao Chen
3c0d82b19b
[pipeline]: support arbitrary batch size in forward_only mode ( #5201 )
...
* fix: remove drop last in val & test dataloader
* feat: add run_forward_only, support arbitrary bs
* chore: modify ci script
11 months ago