Commit Graph

185 Commits (544b7a38a167cb05cdc7590cfc100e23c0ed5ab7)

Author SHA1 Message Date
Edenzzzz 936d0b0f7b
[doc] Update llama + sp compatibility; fix dist optim table
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
2024-07-01 17:07:22 +08:00
flybird11111 773d9f964a
[shardformer]delete xformers (#5859)
* delete xformers

* fix

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-06-28 11:20:04 +08:00
binmakeswell 4ccaaaab63
[doc] add GPU cloud playground (#5851)
* [doc] add GPU cloud playground

* [doc] add GPU cloud playground

* [doc] add GPU cloud playground

* [doc] add GPU cloud playground

* [doc] add GPU cloud playground

* [doc] add GPU cloud playground

* [doc] add GPU cloud playground

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-06-25 11:03:16 +08:00
binmakeswell 7266f82d03
[doc] fix open sora model weight link (#5848)
* [doc] fix open sora model weight link

* [doc] fix open sora model weight link
2024-06-21 22:48:34 +08:00
binmakeswell 8f445729a4
[doc] opensora v1.2 news (#5846)
* [doc] opensora v1.2 news

* [doc] opensora v1.2 news
2024-06-21 14:20:45 +08:00
Edenzzzz 7f9ec599be
[misc] Add dist optim to doc sidebar (#5806)
* add to sidebar

* fix chinese
2024-06-18 13:52:47 +08:00
Edenzzzz 5f8c0a0ac3
[Feature] auto-cast optimizers to distributed version (#5746)
* auto-cast optimizers to distributed

* fix galore casting

* logger

---------

Co-authored-by: Edenzzzz <wtan45@wisc.edu>
2024-05-24 17:24:16 +08:00
binmakeswell 4647ec28c8
[inference] release (#5747)
* [inference] release

* [inference] release

* [inference] release

* [inference] release

* [inference] release

* [inference] release

* [inference] release
2024-05-23 17:44:06 +08:00
binmakeswell 2011b1356a
[misc] Update PyTorch version in docs (#5724)
* [misc] Update PyTorch version in docs

* [misc] Update PyTorch version in docs
2024-05-16 13:54:32 +08:00
Edenzzzz 43995ee436
[Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694)
* [feat] Add distributed lamb; minor fixes in DeviceMesh (#5476)

* init: add dist lamb; add debiasing for lamb

* dist lamb tester mostly done

* all tests passed

* add comments

* all tests passed. Removed debugging statements

* moved setup_distributed inside plugin. Added dist layout caching

* organize better

---------

Co-authored-by: Edenzzzz <wtan45@wisc.edu>

* [hotfix] Improve tester precision by removing ZeRO on vanilla lamb (#5576)

Co-authored-by: Edenzzzz <wtan45@wisc.edu>

* [optim] add distributed came (#5526)

* test CAME under LowLevelZeroOptimizer wrapper

* test CAME TP row and col pass

* test CAME zero pass

* came zero add master and worker param id convert

* came zero test pass

* came zero test pass

* test distributed came passed

* reform code, Modify some expressions and add comments

* minor fix of test came

* minor fix of dist_came and test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix of dist_came and test

* rebase dist-optim

* rebase dist-optim

* fix remaining comments

* add test dist came using booster api

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [optim] Distributed Adafactor (#5484)

* [feature] solve conflict; update optimizer readme;

* [feature] update optimize readme;

* [fix] fix testcase;

* [feature] Add transformer-bert to testcase;solve a bug related to indivisible shape (induction in use_zero and tp is row parallel);

* [feature] Add transformers_bert model zoo in testcase;

* [feature] add user documentation to docs/source/feature.

* [feature] add API Reference & Sample to optimizer Readme; add state check for bert exam;

* [feature] modify user documentation;

* [fix] fix readme format issue;

* [fix] add zero=0 in testcase; cached augment in dict;

* [fix] fix percision issue;

* [feature] add distributed rms;

* [feature] remove useless comment in testcase;

* [fix] Remove useless test; open zero test; remove fp16 test in bert exam;

* [feature] Extract distributed rms function;

* [feature] add booster + lowlevelzeroPlugin in test;

* [feature] add Start_with_booster_API case in md; add Supporting Information in md;

* [fix] Also remove state movement in base adafactor;

* [feature] extract factor function;

* [feature] add LowLevelZeroPlugin test;

* [fix] add tp=False and zero=True in logic;

* [fix] fix use zero logic;

* [feature] add row residue logic in column parallel factor;

* [feature] add check optim state func;

* [feature] Remove duplicate logic;

* [feature] update optim state check func and percision test bug;

* [fix] update/fix optim state; Still exist percision issue;

* [fix] Add use_zero check in _rms; Add plugin support info in Readme; Add Dist Adafactor init Info;

* [feature] removed print & comments in utils;

* [feature] uodate Readme;

* [feature] add LowLevelZeroPlugin test with Bert model zoo;

* [fix] fix logic in _rms;

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [fix] remove comments in testcase;

* [feature] add zh-Han Readme;

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Feature] refractor dist came; fix percision error; add low level zero test with bert model zoo; (#5676)

* [feature] daily update;

* [fix] fix dist came;

* [feature] refractor dist came; fix percision error; add low level zero test with bert model zoo;

* [fix] open rms; fix low level zero test; fix dist came test function name;

* [fix] remove redundant test;

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Feature] Add Galore (Adam, Adafactor) and distributed GaloreAdamW8bit (#5570)

* init: add dist lamb; add debiasing for lamb

* dist lamb tester mostly done

* all tests passed

* add comments

* all tests passed. Removed debugging statements

* moved setup_distributed inside plugin. Added dist layout caching

* organize better

* update comments

* add initial distributed galore

* add initial distributed galore

* add galore set param utils; change setup_distributed interface

* projected grad precision passed

* basic precision tests passed

* tests passed; located svd precision issue in fwd-bwd; banned these tests

* Plugin DP + TP tests passed

* move get_shard_dim to d_tensor

* add comments

* remove useless files

* remove useless files

* fix zero typo

* improve interface

* remove moe changes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix import

* fix deepcopy

* update came & adafactor to main

* fix param map

* fix typo

---------

Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Hotfix] Remove one buggy test case from dist_adafactor for now (#5692)


Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

---------

Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: chongqichuizi875 <107315010+chongqichuizi875@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: duanjunwen <54985467+duanjunwen@users.noreply.github.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
2024-05-14 13:52:45 +08:00
Edenzzzz 785cd9a9c9
[misc] Update PyTorch version in docs (#5711)
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
2024-05-13 12:02:52 +08:00
Hongxin Liu 7f8b16635b
[misc] refactor launch API and tensor constructor (#5666)
* [misc] remove config arg from initialize

* [misc] remove old tensor contrusctor

* [plugin] add npu support for ddp

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [devops] fix doc test ci

* [test] fix test launch

* [doc] update launch doc

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-29 10:40:11 +08:00
binmakeswell b8a711aa2d
[news] llama3 and open-sora v1.1 (#5655)
* [news] llama3 and open-sora v1.1

* [news] llama3 and open-sora v1.1
2024-04-26 15:36:37 +08:00
Hongxin Liu bbb2c21f16
[shardformer] fix chatglm implementation (#5644)
* [shardformer] fix chatglm policy

* [shardformer] fix chatglm flash attn

* [shardformer] update readme

* [shardformer] fix chatglm init

* [shardformer] fix chatglm test

* [pipeline] fix chatglm merge batch
2024-04-25 14:41:17 +08:00
binmakeswell f4c5aafe29
[example] llama3 (#5631)
* release llama3

* [release] llama3

* [release] llama3

* [release] llama3

* [release] llama3
2024-04-23 18:48:07 +08:00
Hongxin Liu 641b1ee71a
[devops] remove post commit ci (#5566)
* [devops] remove post commit ci

* [misc] run pre-commit on all files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-08 15:09:40 +08:00
binmakeswell 34e909256c
[release] grok-1 inference benchmark (#5500)
* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark
2024-03-25 14:42:51 +08:00
Wenhao Chen bb0a668fee
[hotfix] set return_outputs=False in examples and polish code (#5404)
* fix: simplify merge_batch

* fix: use return_outputs=False to eliminate extra memory consumption

* feat: add return_outputs warning

* style: remove `return_outputs=False` as it is the default value
2024-03-25 12:31:09 +08:00
binmakeswell 6df844b8c4
[release] grok-1 314b inference (#5490)
* [release] grok-1 inference

* [release] grok-1 inference

* [release] grok-1 inference
2024-03-22 15:48:12 +08:00
binmakeswell d158fc0e64
[doc] update open-sora demo (#5479)
* [doc] update open-sora demo

* [doc] update open-sora demo

* [doc] update open-sora demo
2024-03-20 16:08:41 +08:00
binmakeswell bd998ced03
[doc] release Open-Sora 1.0 with model weights (#5468)
* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights
2024-03-18 18:31:18 +08:00
digger yu 70cce5cbed
[doc] update some translations with README-zh-Hans.md (#5382) 2024-03-05 21:45:55 +08:00
Hongxin Liu 070df689e6
[devops] fix extention building (#5427) 2024-03-05 15:35:54 +08:00
binmakeswell 822241a99c
[doc] sora release (#5425)
* [doc] sora release

* [doc] sora release

* [doc] sora release

* [doc] sora release
2024-03-05 12:08:58 +08:00
binmakeswell a1c6cdb189 [doc] fix blog link 2024-02-29 15:01:43 +08:00
Frank Lee 705a62a565
[doc] updated installation command (#5389) 2024-02-19 16:54:03 +08:00
yixiaoer 69e3ad01ed
[doc] Fix typo (#5361) 2024-02-19 16:53:28 +08:00
Frank Lee 8823cc4831
Merge pull request #5310 from hpcaitech/feature/npu
Feature/npu
2024-01-29 13:49:39 +08:00
digger yu bce9499ed3
fix some typo (#5307) 2024-01-25 13:56:27 +08:00
ver217 148469348a Merge branch 'main' into sync/npu 2024-01-18 12:05:21 +08:00
Hongxin Liu d202cc28c0
[npu] change device to accelerator api (#5239)
* update accelerator

* fix timer

* fix amp

* update

* fix

* update bug

* add error raise

* fix autocast

* fix set device

* remove doc accelerator

* update doc

* update doc

* update doc

* use nullcontext

* update cpu

* update null context

* change time limit for example

* udpate

* update

* update

* update

* [npu] polish accelerator code

---------

Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>
2024-01-09 10:20:05 +08:00
binmakeswell 7bc6969ce6
[doc] SwiftInfer release (#5236)
* [doc] SwiftInfer release

* [doc] SwiftInfer release

* [doc] SwiftInfer release

* [doc] SwiftInfer release

* [doc] SwiftInfer release
2024-01-08 09:55:12 +08:00
binmakeswell b9b32b15e6
[doc] add Colossal-LLaMA-2-13B (#5234)
* [doc] add Colossal-LLaMA-2-13B

* [doc] add Colossal-LLaMA-2-13B

* [doc] add Colossal-LLaMA-2-13B
2024-01-07 20:53:12 +08:00
flybird11111 681d9b12ef
[doc] update pytorch version in documents. (#5177)
* fix

aaa

fix

fix

fix

* fix

* fix

* test ci

* fix ci

fix

* update pytorch version in documents
2023-12-15 18:16:48 +08:00
binmakeswell 177c79f2d1
[doc] add moe news (#5128)
* [doc] add moe news

* [doc] add moe news

* [doc] add moe news
2023-11-28 17:44:06 +08:00
Wenhao Chen 7172459e74
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088)
* [shardformer] implement policy for all GPT-J models and test

* [shardformer] support interleaved pipeline parallel for bert finetune

* [shardformer] shardformer support falcon (#4883)

* [shardformer]: fix interleaved pipeline for bert model (#5048)

* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093)

* Add Mistral support for Shardformer (#5103)

* [shardformer] add tests to mistral (#5105)

---------

Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>
2023-11-28 16:54:42 +08:00
digger yu d5661f0f25
[nfc] fix typo change directoty to directory (#5111) 2023-11-27 18:25:53 +08:00
digger yu 2bdf76f1f2
fix typo change lazy_iniy to lazy_init (#5099) 2023-11-24 19:15:59 +08:00
digger yu 0d482302a1
[nfc] fix typo and author name (#5089) 2023-11-22 10:39:01 +08:00
digger yu fd3567e089
[nfc] fix typo in docs/ (#4972) 2023-11-21 22:06:20 +08:00
ppt0011 335cb105e2
[doc] add supported feature diagram for hybrid parallel plugin (#4996) 2023-10-31 19:56:42 +08:00
digger yu 11009103be
[nfc] fix some typo with colossalai/ docs/ etc. (#4920) 2023-10-18 15:44:04 +08:00
Baizhou Zhang 21ba89cab6
[gemini] support gradient accumulation (#4869)
* add test

* fix no_sync bug in low level zero plugin

* fix test

* add argument for grad accum

* add grad accum in backward hook for gemini

* finish implementation, rewrite tests

* fix test

* skip stuck model in low level zero test

* update doc

* optimize communication & fix gradient checkpoint

* modify doc

* cleaning codes

* update cpu adam fp16 case
2023-10-17 14:07:21 +08:00
flybird11111 6a21f96a87
[doc] update advanced tutorials, training gpt with hybrid parallelism (#4866)
* [doc]update advanced tutorials, training gpt with hybrid parallelism

* [doc]update advanced tutorials, training gpt with hybrid parallelism

* update vit tutorials

* update vit tutorials

* update vit tutorials

* update vit tutorials

* update en/train_vit_with_hybrid_parallel.py

* fix

* resolve comments

* fix
2023-10-10 08:18:55 +00:00
Zhongkai Zhao db40e086c8 [test] modify model supporting part of low_level_zero plugin (including correspoding docs) 2023-10-05 15:10:31 +08:00
binmakeswell 822051d888
[doc] update slack link (#4823) 2023-09-27 17:37:39 +08:00
Hongxin Liu da15fdb9ca
[doc] add lazy init docs (#4808) 2023-09-27 10:24:04 +08:00
Baizhou Zhang 64a08b2dc3
[checkpointio] support unsharded checkpointIO for hybrid parallel (#4774)
* support unsharded saving/loading for model

* support optimizer unsharded saving

* update doc

* support unsharded loading for optimizer

* small fix
2023-09-26 10:58:03 +08:00
Baizhou Zhang a2db75546d
[doc] polish shardformer doc (#4779)
* fix example format in docstring

* polish shardformer doc
2023-09-26 10:57:47 +08:00
binmakeswell d512a4d38d
[doc] add llama2 domain-specific solution news (#4789)
* [doc] add llama2 domain-specific solution news
2023-09-25 10:44:15 +08:00