Xuanlei Zhao
dc003c304c
[moe] merge moe into main ( #4978 )
* update moe module
* support openmoe
1 year ago
Hongxin Liu
079bf3cb26
[misc] update pre-commit and run all files ( #4752 )
* [misc] update pre-commit
* [misc] run pre-commit
* [misc] remove useless configuration files
* [misc] ignore cuda for clang-format
1 year ago
Hongxin Liu
b5f9e37c70
[legacy] clean up legacy code ( #4743 )
* [legacy] remove outdated codes of pipeline (#4692 )
* [legacy] remove cli of benchmark and update optim (#4690 )
* [legacy] remove cli of benchmark and update optim
* [doc] fix cli doc test
* [legacy] fix engine clip grad norm
* [legacy] remove outdated colo tensor (#4694 )
* [legacy] remove outdated colo tensor
* [test] fix test import
* [legacy] move outdated zero to legacy (#4696 )
* [legacy] clean up utils (#4700 )
* [legacy] clean up utils
* [example] update examples
* [legacy] clean up amp
* [legacy] fix amp module
* [legacy] clean up gpc (#4742 )
* [legacy] clean up context
* [legacy] clean core, constants and global vars
* [legacy] refactor initialize
* [example] fix examples ci
* [example] fix examples ci
* [legacy] fix tests
* [example] fix gpt example
* [example] fix examples ci
* [devops] fix ci installation
* [example] fix examples ci
1 year ago
Hongxin Liu
ac178ca5c1
[legacy] move builder and registry to legacy ( #4603 )
1 year ago
digger-yu
b7141c36dd
[CI] fix some spelling errors ( #3707 )
* fix spelling errors under examples/community/
* fix spelling errors under tests/
* fix some spelling errors under tests/, colossalai/, etc.
2 years ago
digger-yu
b9a8dff7e5
[doc] Fix typos under colossalai and doc ( #3618 )
* Fixed several spelling errors under colossalai
* Fixed spelling errors in the colossalai and docs directories
* Cautiously changed spelling errors under the example folder
* Update runtime_preparation_pass.py
revert autograft to autograd
* Update search_chunk.py
utile to until
* Update check_installation.py
change misteach to mismatch in line 91
* Update 1D_tensor_parallel.md
revert to perceptron
* Update 2D_tensor_parallel.md
revert to perceptron in line 73
* Update 2p5D_tensor_parallel.md
revert to perceptron in line 71
* Update 3D_tensor_parallel.md
revert to perceptron in line 80
* Update README.md
revert to resnet in line 42
* Update reorder_graph.py
revert to indice in line 7
* Update p2p.py
revert to megatron in line 94
* Update initialize.py
revert to torchrun in line 198
* Update routers.py
change to detailed in line 63
* Update routers.py
change to detailed in line 146
* Update README.md
revert random number in line 402
2 years ago
yuxuan-lou
198a74b9fd
[NFC] polish colossalai/context/random/__init__.py code style ( #3327 )
2 years ago
RichardoLuo
1ce9d0c531
[NFC] polish initializer_data.py code style ( #3287 )
2 years ago
Kai Wang (Victor Kai)
964a28678f
[NFC] polish initializer_3d.py code style ( #3279 )
2 years ago
Arsmart1
8af977f223
[NFC] polish colossalai/context/parallel_context.py code style ( #3276 )
2 years ago
Zirui Zhu
c9e3ee389e
[NFC] polish colossalai/context/process_group_initializer/initializer_2d.py code style ( #2726 )
2 years ago
Ziyue Jiang
4603538ddd
[NFC] polish colossalai/context/process_group_initializer/initializer_sequence.py code style ( #2712 )
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2 years ago
アマデウス
534f68c83c
[NFC] polish pipeline process group code style ( #2694 )
2 years ago
LuGY
56ff1921e9
[NFC] polish colossalai/context/moe_context.py code style ( #2693 )
2 years ago
アマデウス
99d9713b02
Revert "Update parallel_context.py ( #2408 )"
This reverts commit 7d5640b9db.
2 years ago
Haofan Wang
7d5640b9db
Update parallel_context.py ( #2408 )
2 years ago
Tongping Liu
8e22c38b89
[hotfix] fix the bug related to IPv6 support
Co-authored-by: ByteDance <tongping.liu@bytedance.com>
2 years ago
kurisusnowdeng
0b8161fab8
updated TP layers
2 years ago
HELSON
1468e4bcfc
[zero] add constant placement policy ( #1705 )
* fixes a memory leak when parameters are in fp16 during ZeroDDP init
* bans chunk release in CUDA memory; a chunk may be released only when it is about to be offloaded
* adds a constant placement policy, which lets users allocate a reserved caching memory space for parameters
2 years ago
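[note] The constant placement policy described above boils down to a fixed, user-reserved CUDA memory budget for parameter chunks, with a chunk released only when it must be offloaded to make room. Below is a minimal illustrative sketch of that idea in plain Python, assuming a toy LRU chunk cache; ConstPlacementPolicy and access() are hypothetical names and do not mirror ColossalAI's actual classes or API.

    from collections import OrderedDict

    class ConstPlacementPolicy:
        """Toy model of a constant placement policy (hypothetical, for illustration)."""

        def __init__(self, reserved_cuda_bytes):
            # Fixed caching space reserved for parameters, chosen once by the user.
            self.reserved_cuda_bytes = reserved_cuda_bytes
            self.cuda_bytes_used = 0
            self.resident = OrderedDict()  # chunk_id -> size, kept in LRU order

        def access(self, chunk_id, size):
            """Mark a chunk as needed on CUDA; return the chunk ids to offload."""
            evicted = []
            if chunk_id in self.resident:
                self.resident.move_to_end(chunk_id)  # refresh LRU position
                return evicted
            # Release chunks only when the reserved space would overflow,
            # mirroring "release only when a chunk is about to be offloaded".
            while self.cuda_bytes_used + size > self.reserved_cuda_bytes and self.resident:
                victim_id, victim_size = self.resident.popitem(last=False)
                self.cuda_bytes_used -= victim_size
                evicted.append(victim_id)
            self.resident[chunk_id] = size
            self.cuda_bytes_used += size
            return evicted

    policy = ConstPlacementPolicy(reserved_cuda_bytes=2 * 1024**3)  # 2 GiB budget
    policy.access(chunk_id=0, size=1536 * 1024**2)                  # fits within the budget
    print(policy.access(chunk_id=1, size=1024 * 1024**2))           # prints [0]: chunk 0 is offloaded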
HELSON
95c35f73bd
[moe] initialize MoE groups by ProcessGroup ( #1640 )
2 years ago
Frank Lee
27fe8af60c
[autoparallel] refactored shape consistency to remove redundancy ( #1591 )
* [autoparallel] refactored shape consistency to remove redundancy
* polish code
* polish code
* polish code
2 years ago
ver217
d068af81a3
[doc] update rst and docstring ( #1351 )
* update rst
* add zero docstr
* fix docstr
* remove fx.tracer.meta_patch
* fix docstr
* fix docstr
* update fx rst
* fix fx docstr
* remove useless rst
2 years ago
Frank Lee
2238758c2e
[usability] improved error messages in the context module ( #856 )
3 years ago
Frank Lee
920fe31526
[compatibility] used backward-compatible API for global process group ( #758 )
3 years ago
Frank Lee
04ff5ea546
[utils] support detection of number of processes on current node ( #723 )
3 years ago
Cautiousss
055d0270c8
[NFC] polish colossalai/context/process_group_initializer/initializer_sequence.py and colossalai/context/process_group_initializer/initializer_tensor.py code style ( #639 )
Co-authored-by: 何晓昕 <cautious@r-236-100-25-172.comp.nus.edu.sg>
3 years ago
Jiang Zhuo
0a96338b13
[NFC] polish colossalai/context/process_group_initializer/initializer_data.py code style ( #626 )
Co-authored-by: 姜卓 <jiangzhuo@jiangzhuodeMacBook-Pro.local>
3 years ago
ziyu huang
701bad439b
[NFC] polish colossalai/context/process_group_initializer/process_group_initializer.py code style ( #617 )
Co-authored-by: Arsmart123 <202476410arsmart@gmail.com>
3 years ago
アマデウス
297b8baae2
[model checkpoint] add gloo groups for cpu tensor communication ( #589 )
3 years ago
Liang Bowen
2c45efc398
HTML refactor ( #555 )
3 years ago
Liang Bowen
ec5086c49c
Refactored docstrings to Google style
3 years ago
Jiarui Fang
a445e118cf
[polish] polish singleton and global context ( #500 )
3 years ago
HELSON
f24b5ed201
[MOE] remove legacy MoE code ( #493 )
3 years ago
Jiarui Fang
65c0f380c2
[format] polish name format for MOE ( #481 )
3 years ago
HELSON
7544347145
[MOE] add unit tests for MoE experts layout, gradient handler and kernel ( #469 )
3 years ago
HELSON
84fd7c1d4d
add MoE context and utilities; refactor gradient handler ( #455 )
3 years ago
Frank Lee
b72b8445c6
reduced context test time consumption ( #446 )
3 years ago
Frank Lee
1e4bf85cdb
fixed bug in activation checkpointing test ( #387 )
3 years ago
RichardoLuo
8539898ec6
flake8 style change ( #363 )
3 years ago
ziyu huang
a77d73f22b
fix format of parallel_context.py ( #359 )
Co-authored-by: huangziyu <202476410arsmart@gmail.com>
3 years ago
Maruyama_Aya
e83970e3dc
fix format of ColossalAI/colossalai/context/process_group_initializer
3 years ago
アマデウス
9ee197d0e9
moved env variables to global variables; ( #215 )
* added branch context
* added vocab parallel layers
* moved split_batch from load_batch to tensor parallel embedding layers
* updated gpt model
* updated unit test cases
* fixed a few collective communicator bugs
3 years ago
HELSON
0f8c7f9804
Fixed docstrings in colossalai ( #171 )
3 years ago
Frank Lee
e2089c5c15
adapted for sequence parallel ( #163 )
3 years ago
HELSON
dceae85195
Added MoE parallel ( #127 )
3 years ago
ver217
a951bc6089
update default logger ( #100 ) ( #101 )
3 years ago
ver217
96780e6ee4
Optimize pipeline schedule ( #94 )
* add pipeline shared module wrapper and update load batch
* added model parallel process group for amp and clip grad (#86 )
* added model parallel process group for amp and clip grad
* update amp and clip with model parallel process group
* remove pipeline_prev/next group (#88 )
* micro batch offload
* optimize pipeline gpu memory usage
* pipeline can receive tensor shape (#93 )
* optimize pipeline gpu memory usage
* fix grad accumulation step counter
* rename classes and functions
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
3 years ago
アマデウス
01a80cd86d
Hotfix/Colossalai layers ( #92 )
* optimized 1D layer APIs; reorganized nn.layer modules; fixed tests
* fixed 2.5D runtime issue
* reworked split batch, now called in trainer.schedule.load_batch
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
3 years ago
アマデウス
0fedef4f3c
Layer integration ( #83 )
* integrated parallel layers for ease of building models
* integrated 2.5D layers
* cleaned code and unit tests
* added log-metric-by-step hook; updated imagenet benchmark; fixed some bugs
* reworked initialization; cleaned code
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
3 years ago
ver217
8f02a88db2
add interleaved pipeline, fix naive AMP and update pipeline model initializer ( #80 )
3 years ago