Commit Graph

1761 Commits (df5e9c53cf23d44656470cc319ee0b470c40712f)

Author SHA1 Message Date
Insu Jang 00525f7772
[shardformer] fix pipeline forward error if custom layer distribution is used (#5189)
* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution

* Change static methods for t5 layer distribution to member functions

* Change static methods for whisper layer distribution to member functions

* Replace whisper policy usage with self one

* Fix test case to use non-static layer distribution methods

* fix: fix typo

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-03-27 13:57:00 +08:00
github-actions[bot] e6707a6e8d
[format] applied code formatting on changed files in pull request 5510 (#5517)
Co-authored-by: github-actions <github-actions@github.com>
2024-03-27 11:21:03 +08:00
Hongxin Liu 19e1a5cf16
[shardformer] update colo attention to support custom mask (#5510)
* [feature] refactor colo attention (#5462)

* [extension] update api

* [feature] add colo attention

* [feature] update sdpa

* [feature] update npu attention

* [feature] update flash-attn

* [test] add flash attn test

* [test] update flash attn test

* [shardformer] update modeling to fit colo attention (#5465)

* [misc] refactor folder structure

* [shardformer] update llama flash-attn

* [shardformer] fix llama policy

* [devops] update tensornvme install

* [test] update llama test

* [shardformer] update colo attn kernel dispatch

* [shardformer] update blip2

* [shardformer] update chatglm

* [shardformer] update gpt2

* [shardformer] update gptj

* [shardformer] update opt

* [shardformer] update vit

* [shardformer] update colo attention mask prep

* [shardformer] update whisper

* [test] fix shardformer tests (#5514)

* [test] fix shardformer tests

* [test] fix shardformer tests
2024-03-27 11:19:32 +08:00
Edenzzzz 9a3321e9f4
Merge pull request #5515 from Edenzzzz/fix_layout_convert
Fix layout convertor caching
2024-03-26 19:51:02 +08:00
Edenzzzz 61da3fbc52 fixed layout converter caching and updated tester 2024-03-26 17:22:27 +08:00
Rocky Duan cbe34c557c
Fix ColoTensorSpec for py11 (#5440) 2024-03-26 15:56:49 +08:00
flybird11111 0688d92e2d
[shardformer]Fix lm parallel. (#5480)
* fix

* padding vocab_size when using pipeline parallelism

padding vocab_size when using pipeline parallelism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert

* fix lm forward distribution

* fix

* test ci

* fix
2024-03-25 17:21:51 +08:00
Wenhao Chen bb0a668fee
[hotfix] set return_outputs=False in examples and polish code (#5404)
* fix: simplify merge_batch

* fix: use return_outputs=False to eliminate extra memory consumption

* feat: add return_outputs warning

* style: remove `return_outputs=False` as it is the default value
2024-03-25 12:31:09 +08:00
flybird11111 5e16bf7980
[shardformer] fix gathering output when using tensor parallelism (#5431)
* fix

* padding vocab_size when using pipeline parallelism

padding vocab_size when using pipeline parallelism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert
2024-03-18 15:55:11 +08:00
Hongxin Liu f2e8b9ef9f
[devops] fix compatibility (#5444)
* [devops] fix compatibility

* [hotfix] update compatibility test on pr

* [devops] fix compatibility

* [devops] record duration during comp test

* [test] decrease test duration

* fix falcon
2024-03-13 15:24:13 +08:00
digger yu 385e85afd4
[hotfix] fix typo s/keywrods/keywords etc. (#5429) 2024-03-12 11:25:16 +08:00
digger yu 5e1c93d732
[hotfix] fix typo change MoECheckpintIO to MoECheckpointIO (#5335)
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
2024-03-05 21:52:30 +08:00
digger yu 049121d19d
[hotfix] fix typo change enabel to enable under colossalai/shardformer/ (#5317) 2024-03-05 21:48:46 +08:00
digger yu 16c96d4d8c
[hotfix] fix typo change _descrption to _description (#5331) 2024-03-05 21:47:48 +08:00
Hongxin Liu 070df689e6
[devops] fix extension building (#5427) 2024-03-05 15:35:54 +08:00
flybird11111 29695cf70c
[example]add gpt2 benchmark example script. (#5295)
* benchmark gpt2

* fix

fix

fix

fix

* [doc] fix typo in Colossal-LLaMA-2/README.md (#5247)

* [workflow] fixed build CI (#5240)

* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish

* [ci] fixed booster test (#5251)

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed ddp test (#5254)

* [ci] fixed ddp test

* polish

* fix typo in  applications/ColossalEval/README.md (#5250)

* [ci] fix shardformer tests. (#5255)

* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [doc] fix doc typo (#5256)

* [doc] fix annotation display

* [doc] fix llama2 doc

* [hotfix]: add pp sanity check and fix mbs arg (#5268)

* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check

* [workflow] fixed incomplete bash command (#5272)

* [workflow] fixed oom tests (#5275)

* [workflow] fixed oom tests

* polish

* polish

* polish

* [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)

* fix ci

fix

* fix test

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

* fix

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [shardformer] hybridparallelplugin support gradients accumulation. (#5246)

* support gradients acc

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

* fix

fix

* fix

fix

fix

* [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230)

* fix auto loading gpt2 tokenizer (#5279)

* [doc] add llama2-13B display (#5285)

* Update README.md

* fix 13b typo

---------

Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* fix llama pretrain (#5287)

* fix

* fix

* fix

fix

* fix

fix

fix

* fix

fix

* benchmark gpt2

* fix

fix

fix

fix

* [workflow] fixed build CI (#5240)

* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish

* [ci] fixed booster test (#5251)

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test

* fix

fix

* fix

fix

fix

* fix

* fix

fix

fix

fix

fix

* fix

* Update shardformer.py

---------

Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com>
Co-authored-by: Desperado-Jia <502205863@qq.com>
2024-03-04 16:18:13 +08:00
flybird11111 0a25e16e46
[shardformer]gather llama logits (#5398)
* gather llama logits

* fix
2024-02-27 22:44:07 +08:00
QinLuo bf34c6fef6
[fsdp] impl save/load shard model/optimizer (#5357) 2024-02-27 13:51:14 +08:00
Stephan Kölker 5d380a1a21
[hotfix] Fix wrong import in meta_registry (#5392) 2024-02-20 19:24:43 +08:00
Hongxin Liu 7303801854
[llama] fix training and inference scripts (#5384)
* [llama] refactor inference example to fit sft

* [llama] fix training script to fit gemini

* [llama] fix inference script
2024-02-19 16:41:04 +08:00
Frank Lee efef43b53c
Merge pull request #5372 from hpcaitech/exp/mixtral 2024-02-08 16:30:05 +08:00
Frank Lee 4c03347fc7
Merge pull request #5377 from hpcaitech/example/llama-npu
[llama] support npu for Colossal-LLaMA-2
2024-02-08 14:12:11 +08:00
ver217 06db94fbc9 [moe] fix tests 2024-02-08 12:46:37 +08:00
Hongxin Liu da39d21b71 [moe] support mixtral (#5309)
* [moe] add mixtral block for single expert

* [moe] mixtral block fwd support uneven ep

* [moe] mixtral block bwd support uneven ep

* [moe] add mixtral moe layer

* [moe] simplify replace

* [moe] support save sharded mixtral

* [moe] support load sharded mixtral

* [moe] support save sharded optim

* [moe] integrate moe manager into plug

* [moe] fix optimizer load

* [moe] fix mixtral layer
2024-02-07 19:21:02 +08:00
Hongxin Liu c904d2ae99 [moe] update capacity computing (#5253)
* [moe] top2 allow uneven input

* [moe] update capacity computing

* [moe] remove debug info

* [moe] update capacity computing

* [moe] update capacity computing
2024-02-07 19:21:02 +08:00
Xuanlei Zhao 7d8e0338a4 [moe] init mixtral impl 2024-02-07 19:21:02 +08:00
Hongxin Liu c53ddda88f
[lr-scheduler] fix load state dict and add test (#5369) 2024-02-06 14:23:32 +08:00
Hongxin Liu eb4f2d90f9
[llama] polish training script and fix optim ckpt (#5368) 2024-02-06 11:52:17 +08:00
Hongxin Liu 6c0fa7b9a8
[llama] fix dataloader for hybrid parallel (#5358)
* [plugin] refactor prepare dataloader

* [plugin] update train script
2024-02-05 15:14:56 +08:00
Hongxin Liu 2dd01e3a14
[gemini] fix param op hook when output is tuple (#5355)
* [gemini] fix param op hook when output is tuple

* [gemini] fix param op hook
2024-02-04 11:58:26 +08:00
Wenhao Chen 1c790c0877
[fix] remove unnecessary dp_size assert (#5351)
* fix: remove unnecessary assert

* test: add more 3d plugin tests

* fix: add warning
2024-02-02 14:40:20 +08:00
Hongxin Liu ffffc32dc7
[checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347)
* [checkpointio] fix hybrid parallel optim checkpoint

* [extension] fix cuda extension

* [checkpointio] fix gemini optimizer checkpoint

* polish code
2024-02-01 16:13:06 +08:00
digger yu 71321a07cf
fix typo change dosen't to doesn't (#5308) 2024-01-30 09:57:38 +08:00
flybird11111 388179f966
[tests] fix t5 test. (#5322)
* [ci] fix shardformer tests. (#5255)

* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* fix t5 test

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-29 17:38:46 +08:00
FrankLeeeee 087d0cb1fc [accelerator] fixed npu api 2024-01-29 14:27:52 +08:00
Frank Lee 8823cc4831
Merge pull request #5310 from hpcaitech/feature/npu
Feature/npu
2024-01-29 13:49:39 +08:00
Frank Lee 7cfed5f076
[feat] refactored extension module (#5298)
* [feat] refactored extension module

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish
2024-01-25 17:01:48 +08:00
digger yu bce9499ed3
fix some typo (#5307) 2024-01-25 13:56:27 +08:00
ver217 148469348a Merge branch 'main' into sync/npu 2024-01-18 12:05:21 +08:00
flybird11111 46e091651b
[shardformer] hybridparallelplugin support gradients accumulation. (#5246)
* support gradients acc

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

* fix

fix

* fix

fix

fix
2024-01-17 15:22:33 +08:00
Wenhao Chen ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg (#5268)
* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check
2024-01-15 15:57:40 +08:00
binmakeswell c174c4fc5f
[doc] fix doc typo (#5256)
* [doc] fix annotation display

* [doc] fix llama2 doc
2024-01-11 21:01:11 +08:00
flybird11111 e830ef917d
[ci] fix shardformer tests. (#5255)
* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-11 19:07:45 +08:00
Frank Lee 9102d655ab
[hotfix] removed unused flag (#5242) 2024-01-09 14:57:07 +08:00
Hongxin Liu d202cc28c0
[npu] change device to accelerator api (#5239)
* update accelerator

* fix timer

* fix amp

* update

* fix

* update bug

* add error raise

* fix autocast

* fix set device

* remove doc accelerator

* update doc

* update doc

* update doc

* use nullcontext

* update cpu

* update null context

* change time limit for example

* update

* update

* update

* update

* [npu] polish accelerator code

---------

Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>
2024-01-09 10:20:05 +08:00
Elsa Granger d565df3821
[pipeline] A more general _communicate in p2p (#5062)
* A more general _communicate

* feat: finish tree_flatten version p2p

* fix: update p2p api calls

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-08 15:37:27 +08:00
Xuanlei Zhao dd2c28a323
[npu] use extension for op builder (#5172)
* update extension

* update cpu adam

* update is

* add doc for cpu adam

* update kernel

* update commit

* update flash

* update memory efficient

* update flash attn

* update flash attention loader

* update api

* fix

* update doc

* update example time limit

* reverse change

* fix doc

* remove useless kernel

* fix

* not use warning

* update

* update
2024-01-08 11:39:16 +08:00
digger yu b0b53a171c
[nfc] fix typo colossalai/shardformer/ (#5133) 2024-01-04 16:21:55 +08:00
flybird11111 451e9142b8
fix flash attn (#5209) 2024-01-03 14:39:53 +08:00
flybird11111 365671be10
fix-test (#5210)
fix-test

fix-test
2024-01-03 14:26:13 +08:00