127 Commits (4c4482f3adb56943a150b8b7ed886e2218fc98d5)

Author SHA1 Message Date
Hongxin Liu 21e29e2212
[doc] add tutorial for booster plugins (#3758)
* [doc] add en booster plugins doc

* [doc] add booster plugins doc in sidebar

* [doc] add zh booster plugins doc

* [doc] fix zh booster plugin translation

* [doc] reorganize tutorials order of basic section

* [devops] force sync to test ci
2023-05-19 12:12:42 +08:00
Hongxin Liu 5ce6c9d86f
[doc] add tutorial for cluster utils (#3763)
* [doc] add en cluster utils doc

* [doc] add zh cluster utils doc

* [doc] add cluster utils doc in sidebar
2023-05-19 12:12:20 +08:00
jiangmingyan 48bd056761
[doc] update hybrid parallelism doc (#3770) 2023-05-18 14:16:13 +08:00
jiangmingyan d449525acf
[doc] update booster tutorials (#3718)
* [booster] update booster tutorials#3717

* [booster] update booster tutorials#3717, fix

* [booster] update booster tutorials#3717, update setup doc

* [booster] update booster tutorials#3717, update setup doc

* [booster] update booster tutorials#3717, update setup doc

* [booster] update booster tutorials#3717, update setup doc

* [booster] update booster tutorials#3717, update setup doc

* [booster] update booster tutorials#3717, update setup doc

* [booster] update booster tutorials#3717, rename colossalai booster.md

* [booster] update booster tutorials#3717, rename colossalai booster.md

* [booster] update booster tutorials#3717, rename colossalai booster.md

* [booster] update booster tutorials#3717, fix

* [booster] update booster tutorials#3717, fix

* [booster] update tutorials#3717, update booster api doc

* [booster] update tutorials#3717, modify file

* [booster] update tutorials#3717, modify file

* [booster] update tutorials#3717, modify file

* [booster] update tutorials#3717, modify file

* [booster] update tutorials#3717, modify file

* [booster] update tutorials#3717, modify file

* [booster] update tutorials#3717, modify file

* [booster] update tutorials#3717, fix reference link

* [booster] update tutorials#3717, fix reference link

* [booster] update tutorials#3717, fix reference link

* [booster] update tutorials#3717, fix reference link

* [booster] update tutorials#3717, fix reference link

* [booster] update tutorials#3717, fix reference link

* [booster] update tutorials#3717, fix reference link

* [booster] update tutorials#3713

* [booster] update tutorials#3713, modify file
2023-05-18 11:41:56 +08:00
Hongxin Liu 5dd573c6b6
[devops] fix ci for document check (#3751)
* [doc] add test info

* [devops] update doc check ci

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] remove debug info and update invalid doc

* [devops] add essential comments
2023-05-17 11:24:22 +08:00
digger-yu b9a8dff7e5
[doc] Fix typo under colossalai and doc (#3618)
* Fixed several spelling errors under colossalai

* Fix the spelling error in colossalai and docs directory

* Cautiously changed the spelling errors under the example folder

* Update runtime_preparation_pass.py

revert autograft to autograd

* Update search_chunk.py

utile to until

* Update check_installation.py

change misteach to mismatch in line 91

* Update 1D_tensor_parallel.md

revert to perceptron

* Update 2D_tensor_parallel.md

revert to perceptron in line 73

* Update 2p5D_tensor_parallel.md

revert to perceptron in line 71

* Update 3D_tensor_parallel.md

revert to perceptron in line 80

* Update README.md

revert to resnet in line 42

* Update reorder_graph.py

revert to indice in line 7

* Update p2p.py

revert to megatron in line 94

* Update initialize.py

revert to torchrun in line 198

* Update routers.py

change to detailed in line 63

* Update routers.py

change to detailed in line 146

* Update README.md

revert random number in line 402
2023-04-26 11:38:43 +08:00
digger-yu 9edeadfb24
[doc] Update 1D_tensor_parallel.md (#3573)
Display format optimization, same as fix #3562
Simultaneous modification of en version
2023-04-17 12:19:53 +08:00
digger-yu 1c7734bc94
[doc] Update 1D_tensor_parallel.md (#3563)
Display format optimization, fix bug #3562
Specific changes
1. "This is called a column-parallel fashion" Translate to Chinese
2. use the ```math code block syntax to display a math expression as a block, No modification of formula content

Please check that the math formula is displayed correctly
If OK, I will change the format of the English version of the formula in parallel
2023-04-14 22:12:32 +08:00
digger-yu a3ac48ef3d
[doc] Update README-zh-Hans.md (#3541)
Fixing document link errors using absolute paths
2023-04-12 23:09:30 +08:00
binmakeswell 0c0455700f
[doc] add requirement and highlight application (#3516)
* [doc] add requirement and highlight application

* [doc] link example and application
2023-04-10 17:37:16 +08:00
Frank Lee 4e9989344d
[doc] updated contributor list (#3474) 2023-04-06 17:47:59 +08:00
Frank Lee 80eba05b0a
[test] refactor tests with spawn (#3452)
* [test] added spawn decorator

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-04-06 14:51:35 +08:00
ver217 26b7aac0be
[zero] reorganize zero/gemini folder structure (#3424)
* [zero] refactor low-level zero folder structure

* [zero] fix legacy zero import path

* [zero] fix legacy zero import path

* [zero] remove useless import

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] fix test import path

* [zero] fix test

* [zero] fix circular import

* [zero] update import
2023-04-04 13:48:16 +08:00
binmakeswell 15a74da79c
[doc] add Intel cooperation news (#3333)
* [doc] add Intel cooperation news

* [doc] add Intel cooperation news
2023-03-30 11:45:01 +08:00
binmakeswell 31c78f2be3
[doc] add ColossalChat news (#3304)
* [doc] add ColossalChat news

* [doc] add ColossalChat news
2023-03-29 09:27:55 +08:00
binmakeswell 682af61396
[doc] add ColossalChat (#3297)
* [doc] add ColossalChat
2023-03-29 02:35:10 +08:00
Saurav Maheshkar 20d1c99444
[refactor] update docs (#3174)
* refactor: README-zh-Hans

* refactor: REFERENCE

* docs: update paths in README
2023-03-20 10:52:01 +08:00
Frank Lee 3213347b49
[doc] fixed typos in docs/README.md (#3082) 2023-03-10 10:32:14 +08:00
Frank Lee 416a50dbd7
[doc] moved doc test command to bottom (#3075) 2023-03-09 18:10:45 +08:00
Frank Lee ea0b52c12e
[doc] specified operating system requirement (#3019)
* [doc] specified operating system requirement

* polish code
2023-03-07 18:04:10 +08:00
ver217 378d827c6b
[doc] update nvme offload doc (#3014)
* [doc] update nvme offload doc

* [doc] add doc testing cmd and requirements

* [doc] add api reference

* [doc] add dependencies
2023-03-07 17:49:01 +08:00
Frank Lee 8fedc8766a
[workflow] supported conda package installation in doc test (#3028)
* [workflow] supported conda package installation in doc test

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-03-07 14:21:26 +08:00
Frank Lee e0a1c1321c
[doc] added reference to related works (#2994)
* [doc] added reference to related works

* polish code
2023-03-04 17:32:22 +08:00
github-actions[bot] dca98937f8
[format] applied code formatting on changed files in pull request 2933 (#2939)
Co-authored-by: github-actions <github-actions@github.com>
2023-02-28 15:41:52 +08:00
binmakeswell 8264cd7ef1
[doc] add env scope (#2933) 2023-02-28 15:39:51 +08:00
Frank Lee b8804aa60c
[doc] added readme for documentation (#2935) 2023-02-28 14:04:52 +08:00
Frank Lee 9e3b8b7aff
[doc] removed read-the-docs (#2932) 2023-02-28 11:28:24 +08:00
Frank Lee 77b88a3849
[workflow] added auto doc test on PR (#2929)
* [workflow] added auto doc test on PR

* [workflow] added doc test workflow

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-02-28 11:10:38 +08:00
binmakeswell 0afb55fc5b
[doc] add os scope, update tutorial install and tips (#2914) 2023-02-27 14:59:27 +08:00
YuliangLiu0306 cf6409dd40
Hotfix/auto parallel zh doc (#2820)
* [hotfix] fix autoparallel zh docs

* polish

* polish
2023-02-19 15:57:14 +08:00
YuliangLiu0306 2059fdd6b0
[hotfix] add copyright for solver and device mesh (#2803)
* [hotfix] add copyright for solver and device mesh

* add readme

* add alpa license

* polish
2023-02-18 21:14:38 +08:00
Frank Lee e376954305
[doc] add opt service doc (#2747) 2023-02-16 15:45:26 +08:00
Frank Lee 5479fdd5b8
[doc] updated documentation version list (#2730) 2023-02-15 17:39:50 +08:00
Frank Lee 2045d45ab7
[doc] updated documentation version list (#2715) 2023-02-15 11:24:18 +08:00
Frank Lee 0966008839
[doc] fixed the sidebar item key (#2672) 2023-02-13 10:45:16 +08:00
Frank Lee 6d60634433
[doc] added documentation sidebar translation (#2670) 2023-02-13 10:10:12 +08:00
Frank Lee 81ea66d25d
[release] v0.2.3 (#2669)
* [release] v0.2.3

* polish code
2023-02-13 09:51:25 +08:00
YuliangLiu0306 8de85051b3
[Docs] layout converting management (#2665) 2023-02-10 18:38:32 +08:00
Frank Lee b673e5f78b
[release] v0.2.2 (#2661) 2023-02-10 11:01:24 +08:00
Frank Lee cd4f02bed8
[doc] fixed compatibility with docusaurus (#2657) 2023-02-09 17:06:29 +08:00
Frank Lee a4ae43f071
[doc] added docusaurus-based version control (#2656) 2023-02-09 16:38:49 +08:00
Frank Lee 85b2303b55
[doc] migrate the markdown files (#2652) 2023-02-09 14:21:38 +08:00
Frank Lee d3480396f8
[doc] updated the sphinx theme (#2635) 2023-02-08 13:48:08 +08:00
binmakeswell a01278e810
Update requirements.txt 2022-11-18 18:57:18 +08:00
Jiarui Fang cc0ed7cf33
[Gemini] ZeROHookV2 -> GeminiZeROHook (#1972) 2022-11-17 14:43:49 +08:00
Ziyue Jiang 63f250bbd4
fix file name (#1759)
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-10-25 16:48:48 +08:00
ver217 d068af81a3
[doc] update rst and docstring (#1351)
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00
Jiarui Fang 4165eabb1e
[hotfix] remove potential circle import (#1307)
* make it faster

* [hotfix] remove circle import
2022-07-14 13:44:26 +08:00
Jiarui Fang 4d9332b4c5
[refactor] moving memtracer to gemini (#801) 2022-04-19 10:13:08 +08:00
ver217 f69507dd22
update rst (#615) 2022-04-01 15:46:38 +08:00
Liang Bowen 2c45efc398
html refactor (#555) 2022-03-31 11:36:56 +08:00
LuGY c44d797072
[docs] updated docs of hybrid adam and cpu adam (#552) 2022-03-30 18:14:59 +08:00
ver217 ffca99d187
[doc] update apidoc (#530) 2022-03-25 18:29:43 +08:00
ver217 9caa8b6481
docs get correct release version (#489) 2022-03-22 14:24:41 +08:00
ver217 7e30068a22
[doc] update rst (#470)
* update rst

* remove empty rst
2022-03-21 10:52:45 +08:00
binmakeswell ce7b2c9ae3 update README and images path (#384) 2022-03-11 15:50:28 +08:00
binmakeswell 08eccfe681 add community group and update issue template (#271) 2022-03-11 15:50:28 +08:00
Sze-qq 3312d716a0 update experimental visualization (#253) 2022-03-11 15:50:28 +08:00
binmakeswell 753035edd3 add Chinese README 2022-03-11 15:50:28 +08:00
WANG-CR 6fb550acdb update logo 2022-01-21 12:31:07 +08:00
ver217 1949d3a889
update doc requirements and rtd conf (#165) 2022-01-19 19:46:43 +08:00
Frank Lee be85a0f366 removed tutorial markdown and refreshed rst files for consistency 2022-01-19 17:01:37 +08:00
binmakeswell 17ce8569a8
add logo at homepage, add forum in issue template (#161) 2022-01-19 14:29:31 +08:00
puck_WCR 9473a1b9c8
AMP docstring/markdown update (#160) 2022-01-18 18:33:36 +08:00
ver217 96780e6ee4
Optimize pipeline schedule (#94)
* add pipeline shared module wrapper and update load batch

* added model parallel process group for amp and clip grad (#86)

* added model parallel process group for amp and clip grad

* update amp and clip with model parallel process group

* remove pipeline_prev/next group (#88)

* micro batch offload

* optimize pipeline gpu memory usage

* pipeline can receive tensor shape (#93)

* optimize pipeline gpu memory usage

* fix grad accumulation step counter

* rename classes and functions

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
2021-12-30 15:56:46 +08:00
ver217 8f02a88db2
add interleaved pipeline, fix naive amp and update pipeline model initializer (#80) 2021-12-20 23:26:19 +08:00
Frank Lee 35813ed3c4
update examples and sphinx docs for the new api (#63) 2021-12-13 22:07:01 +08:00
ver217 7d3711058f
fix zero3 fp16 and add zero3 model context (#62) 2021-12-10 17:48:50 +08:00
Frank Lee 9a0466534c
update markdown docs (english) (#60) 2021-12-10 14:37:33 +08:00
Frank Lee da01c234e1
Develop/experiments (#59)
* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

* Split conv2d, class token, positional embedding in 2d, Fix random number in ddp
Fix convergence in cifar10, Imagenet1000

* Integrate 1d tensor parallel in Colossal-AI (#39)

* fixed 1D and 2D convergence (#38)

* optimized 2D operations

* fixed 1D ViT convergence problem

* Feature/ddp (#49)

* remove redundancy func in setup (#19) (#20)

* use env to control the language of doc (#24) (#25)

* Support TP-compatible Torch AMP and Update trainer API (#27)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)

* add explanation for ViT example (#35) (#36)

* support torch ddp

* fix loss accumulation

* add log for ddp

* change seed

* modify timing hook

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* Feature/pipeline (#40)

* remove redundancy func in setup (#19) (#20)

* use env to control the language of doc (#24) (#25)

* Support TP-compatible Torch AMP and Update trainer API (#27)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)

* add explanation for ViT example (#35) (#36)

* optimize communication of pipeline parallel

* fix grad clip for pipeline

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* optimized 3d layer to fix slow computation; tested imagenet performance with 3d; reworked lr_scheduler config definition; fixed launch args; fixed some printing issues; simplified apis of 3d layers (#51)

* Update 2.5d layer code to get a similar accuracy on imagenet-1k dataset

* update api for better usability (#58)

update api for better usability

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
2021-12-09 15:08:29 +08:00
Frank Lee 3defa32aee
Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
2021-11-18 19:45:06 +08:00
ver217 2b05de4c64
use env to control the language of doc (#24) (#25) 2021-11-15 16:53:56 +08:00
binmakeswell 05e7069a5b fixed some typos in the documents, added blog link and paper author information in README 2021-11-03 17:18:43 +08:00
Fan Cui 18ba66e012 added Chinese documents and fixed some typos in English documents 2021-11-02 23:28:44 +08:00
ver217 50982c0b7d reorder parallelization methods in parallelization documentation 2021-11-01 14:31:55 +08:00
ver217 3c7604ba30 update documentation 2021-10-29 09:29:20 +08:00
zbian 404ecbdcc6 Migrated project 2021-10-28 18:21:23 +02:00