3096 Commits (785cd9a9c971aa58e6f8c76575111a4aa4d9513b)

Author SHA1 Message Date
Hongxin Liu 7303801854
[llama] fix training and inference scripts (#5384)
* [llama] refactor inference example to fit sft

* [llama] fix training script to fit gemini

* [llama] fix inference script
2024-02-19 16:41:04 +08:00
Hongxin Liu adae123df3
[release] update version (#5380) 2024-02-08 18:50:09 +08:00
Frank Lee efef43b53c
Merge pull request #5372 from hpcaitech/exp/mixtral 2024-02-08 16:30:05 +08:00
Frank Lee 4c03347fc7
Merge pull request #5377 from hpcaitech/example/llama-npu
[llama] support npu for Colossal-LLaMA-2
2024-02-08 14:12:11 +08:00
ver217 06db94fbc9 [moe] fix tests 2024-02-08 12:46:37 +08:00
Hongxin Liu 65e5d6baa5 [moe] fix mixtral optim checkpoint (#5344) 2024-02-07 19:21:02 +08:00
Hongxin Liu 956b561b54 [moe] fix mixtral forward default value (#5329) 2024-02-07 19:21:02 +08:00
Hongxin Liu b60be18dcc [moe] fix mixtral checkpoint io (#5314) 2024-02-07 19:21:02 +08:00
Hongxin Liu da39d21b71 [moe] support mixtral (#5309)
* [moe] add mixtral block for single expert

* [moe] mixtral block fwd support uneven ep

* [moe] mixtral block bwd support uneven ep

* [moe] add mixtral moe layer

* [moe] simplify replace

* [moe] support save sharded mixtral

* [moe] support load sharded mixtral

* [moe] support save sharded optim

* [moe] integrate moe manager into plug

* [moe] fix optimizer load

* [moe] fix mixtral layer
2024-02-07 19:21:02 +08:00
Hongxin Liu c904d2ae99 [moe] update capacity computing (#5253)
* [moe] top2 allow uneven input

* [moe] update capacity computing

* [moe] remove debug info

* [moe] update capacity computing

* [moe] update capacity computing
2024-02-07 19:21:02 +08:00
Xuanlei Zhao 7d8e0338a4 [moe] init mixtral impl 2024-02-07 19:21:02 +08:00
Hongxin Liu 084c91246c
[llama] fix memory issue (#5371)
* [llama] fix memory issue

* [llama] add comment
2024-02-06 19:02:37 +08:00
Hongxin Liu c53ddda88f
[lr-scheduler] fix load state dict and add test (#5369) 2024-02-06 14:23:32 +08:00
Hongxin Liu eb4f2d90f9
[llama] polish training script and fix optim ckpt (#5368) 2024-02-06 11:52:17 +08:00
Camille Zhong a5756a8720
[eval] update llama npu eval (#5366) 2024-02-06 10:53:03 +08:00
Camille Zhong 44ca61a22b
[llama] fix neftune & pbar with start_step (#5364) 2024-02-05 18:04:23 +08:00
Hongxin Liu a4cec1715b
[llama] add flash attn patch for npu (#5362) 2024-02-05 16:48:34 +08:00
Hongxin Liu 73f9f23fc6
[llama] update training script (#5360)
* [llama] update training script

* [doc] polish docstr
2024-02-05 16:33:18 +08:00
Hongxin Liu 6c0fa7b9a8
[llama] fix dataloader for hybrid parallel (#5358)
* [plugin] refactor prepare dataloader

* [plugin] update train script
2024-02-05 15:14:56 +08:00
Hongxin Liu 2dd01e3a14
[gemini] fix param op hook when output is tuple (#5355)
* [gemini] fix param op hook when output is tuple

* [gemini] fix param op hook
2024-02-04 11:58:26 +08:00
Wenhao Chen 1c790c0877
[fix] remove unnecessary dp_size assert (#5351)
* fix: remove unnecessary assert

* test: add more 3d plugin tests

* fix: add warning
2024-02-02 14:40:20 +08:00
Hongxin Liu ffffc32dc7
[checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347)
* [checkpointio] fix hybrid parallel optim checkpoint

* [extension] fix cuda extension

* [checkpointio] fix gemini optimizer checkpoint

* polish code
2024-02-01 16:13:06 +08:00
YeAnbang c5239840e6
[Chat] fix sft loss nan (#5345)
* fix script

* fix script

* fix chat nan

* fix chat nan
2024-02-01 14:25:16 +08:00
Frank Lee abd8e77ad8
[extension] fixed exception catch (#5342) 2024-01-31 18:09:49 +08:00
digger yu 71321a07cf
fix typo change dosen't to doesn't (#5308) 2024-01-30 09:57:38 +08:00
digger yu 6a3086a505
fix typo under extensions/ (#5330) 2024-01-30 09:55:16 +08:00
Frank Lee febed23288
[doc] added docs for extensions (#5324)
* [doc] added docs for extensions

* polish

* polish
2024-01-29 17:39:23 +08:00
flybird11111 388179f966
[tests] fix t5 test. (#5322)
* [ci] fix shardformer tests. (#5255)

* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* fix t5 test

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-29 17:38:46 +08:00
Frank Lee a6709afe66
Merge pull request #5321 from FrankLeeeee/hotfix/accelerator-api
[accelerator] fixed npu api
2024-01-29 14:29:58 +08:00
FrankLeeeee 087d0cb1fc [accelerator] fixed npu api 2024-01-29 14:27:52 +08:00
Frank Lee 8823cc4831
Merge pull request #5310 from hpcaitech/feature/npu
Feature/npu
2024-01-29 13:49:39 +08:00
Frank Lee 73f4dc578e
[workflow] updated CI image (#5318) 2024-01-29 11:53:07 +08:00
Frank Lee 7cfed5f076
[feat] refactored extension module (#5298)
* [feat] refactored extension module

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish
2024-01-25 17:01:48 +08:00
digger yu bce9499ed3
fix some typo (#5307) 2024-01-25 13:56:27 +08:00
李文军 ec912b1ba9
[NFC] polish applications/Colossal-LLaMA-2/colossal_llama2/tokenizer/init_tokenizer.py code style (#5228) 2024-01-25 13:14:48 +08:00
Desperado-Jia ddf879e2db
fix bug for mefture (#5299) 2024-01-22 22:17:54 +08:00
Hongxin Liu d7f8db8e21
[hotfix] fix 3d plugin test (#5292) 2024-01-22 15:19:04 +08:00
flybird11111 f7e3f82a7e
fix llama pretrain (#5287) 2024-01-19 17:49:02 +08:00
Desperado-Jia 6a56967855
[doc] add llama2-13B display (#5285)
* Update README.md

* fix 13b typo

---------

Co-authored-by: binmakeswell <binmakeswell@gmail.com>
2024-01-19 16:04:08 +08:00
Michelle 32cb74493a
fix auto loading gpt2 tokenizer (#5279) 2024-01-18 14:08:29 +08:00
Frank Lee d66e6988bc
Merge pull request #5278 from ver217/sync/npu
[sync] sync npu branch with main
2024-01-18 13:11:45 +08:00
ver217 148469348a Merge branch 'main' into sync/npu 2024-01-18 12:05:21 +08:00
Zhongkai Zhao 5d9a0ae75b
[hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230) 2024-01-17 17:42:29 +08:00
flybird11111 46e091651b
[shardformer] hybridparallelplugin support gradients accumulation. (#5246)
* support gradients acc

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

* fix

fix

* fix

fix

fix
2024-01-17 15:22:33 +08:00
flybird11111 2a0558d8ec
[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)
* fix ci

fix

* fix test

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

* fix

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-17 13:38:55 +08:00
Frank Lee d69cd2eb89
[workflow] fixed oom tests (#5275)
* [workflow] fixed oom tests

* polish

* polish

* polish
2024-01-16 18:55:13 +08:00
Frank Lee 04244aaaf1
[workflow] fixed incomplete bash command (#5272) 2024-01-16 11:54:44 +08:00
Wenhao Chen ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg (#5268)
* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check
2024-01-15 15:57:40 +08:00
binmakeswell c174c4fc5f
[doc] fix doc typo (#5256)
* [doc] fix annotation display

* [doc] fix llama2 doc
2024-01-11 21:01:11 +08:00
flybird11111 e830ef917d
[ci] fix shardformer tests. (#5255)
* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-11 19:07:45 +08:00