binmakeswell
089c365fa0
[doc] add Series A Funding and NeurIPS news ( #4377 )
* [doc] add Series A Funding and NeurIPS news
* [kernel] fix mha kernel
* [CI] skip moe
* [CI] fix requirements
1 year ago
flybird1111
f40b718959
[doc] Fix gradient accumulation doc. ( #4349 )
* [doc] fix gradient accumulation doc
* [doc] fix gradient accumulation doc
1 year ago
flybird1111
38b792aab2
[coloattention] fix import error ( #4380 )
fixed an import error
1 year ago
flybird1111
25c57b9fb4
[fix] coloattention support flash attention 2 ( #4347 )
Improved ColoAttention interface to support flash attention 2. Solved #4322
1 year ago
Wenhao Chen
da4f7b855f
[chat] fix bugs and add unit tests ( #4213 )
* style: rename replay buffer
Experience replay is typically used for off-policy algorithms.
Using this name in PPO may be misleading.
* fix: fix wrong zero2 default arg
* test: update experience tests
* style: rename zero_pad fn
* fix: defer init in CycledDataLoader
* test: add benchmark test
* style: rename internal fn of generation
* style: rename internal fn of lora
* fix: remove unused loss fn
* fix: remove unused utils fn
* refactor: remove generate_with_actor fn
* fix: fix type annotation
* test: add models tests
* fix: skip llama due to long execution time
* style: modify dataset
* style: apply formatter
* perf: update reward dataset
* fix: fix wrong IGNORE_INDEX in sft dataset
* fix: remove DataCollatorForSupervisedDataset
* test: add dataset tests
* style: apply formatter
* style: rename test_ci to test_train
* feat: add llama in inference
* test: add inference tests
* test: change test scripts directory
* fix: update ci
* fix: fix typo
* fix: skip llama due to oom
* fix: fix file mod
* style: apply formatter
* refactor: remove duplicated llama_gptq
* style: apply formatter
* to: update rm test
* feat: add tokenizer arg
* feat: add download model script
* test: update train tests
* fix: modify gemini load and save pretrained
* test: update checkpoint io test
* to: modify nproc_per_node
* fix: do not remove existing dir
* fix: modify save path
* test: add random choice
* fix: fix sft path
* fix: enlarge nproc_per_node to avoid oom
* fix: add num_retry
* fix: make lora config of rm and critic consistent
* fix: add warning about lora weights
* fix: skip some gpt2 tests
* fix: remove grad ckpt in rm and critic due to errors
* refactor: directly use Actor in train_sft
* test: add more arguments
* fix: disable grad ckpt when using lora
* fix: fix save_pretrained and related tests
* test: enable zero2 tests
* revert: remove useless fn
* style: polish code
* test: modify test args
1 year ago
Hongxin Liu
16bf4c0221
[test] remove useless tests ( #4359 )
* [test] remove legacy zero test
* [test] remove lazy distribute test
* [test] remove outdated checkpoint io
1 year ago
caption
16c0acc01b
[hotfix] update gradio 3.11 to 3.34.0 ( #4329 )
1 year ago
Hongxin Liu
806477121d
[release] update version ( #4332 )
* [release] update version
* [devops] hotfix cuda extension building
* [devops] pytest ignore useless folders
1 year ago
Wenhao Chen
75c5389037
[chat] fix compute_approx_kl ( #4338 )
1 year ago
LuGY
03654c0ce2
fix localhost measurement ( #4320 )
1 year ago
LuGY
45b08f08cb
[zero] optimize the optimizer step time ( #4221 )
* optimize the optimizer step time
* fix corner case
* polish
* replace all-reduce with all-gather
* set comm device to cuda
1 year ago
LuGY
1a49a5ea00
[zero] support shard optimizer state dict of zero ( #4194 )
* support shard optimizer of zero
* polish code
* support sync grad manually
1 year ago
LuGY
dd7cc58299
[zero] add state dict for low level zero ( #4179 )
* add state dict for zero
* fix unit test
* polish
1 year ago
LuGY
c668801d36
[zero] allow passing process group to zero12 ( #4153 )
* allow passing process group to zero12
* union tp-zero and normal-zero
* polish code
1 year ago
LuGY
79cf1b5f33
[zero] support no_sync method for zero1 plugin ( #4138 )
* support no sync for zero1 plugin
* polish
* polish
1 year ago
LuGY
c6ab96983a
[zero] refactor low level zero for shard evenly ( #4030 )
* refactor low level zero
* fix zero2 and support cpu offload
* avg gradient and modify unit test
* refactor grad store, support layer drop
* refactor bucket store, support grad accumulation
* fix and update unit test of zero and ddp
* compatible with tp, ga and unit test
* fix memory leak and polish
* add zero layer drop unittest
* polish code
* fix import err in unit test
* support different comm dtype, modify docstring style
* polish code
* test padding and fix
* fix unit test of low level zero
* fix pad recording in bucket store
* support some models
* polish
1 year ago
Yuanchen
5187c96b7c
support session-based training ( #4313 )
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
1 year ago
binmakeswell
ef4b99ebcd
add llama example CI
1 year ago
yuxuan-lou
0991405361
[NFC] polish applications/Chat/coati/models/utils.py codestyle ( #4277 )
* [NFC] polish colossalai/context/random/__init__.py code style
* [NFC] polish applications/Chat/coati/models/utils.py code style
1 year ago
Zirui Zhu
9e512938f6
[NFC] polish applications/Chat/coati/trainer/strategies/base.py code style ( #4278 )
1 year ago
Ziheng Qin
c972d65311
applications/Chat/.gitignore ( #4279 )
Co-authored-by: henryqin1997 <henryqin1997@gamil.com>
1 year ago
RichardoLuo
709e121cd5
[NFC] polish applications/Chat/coati/models/generation.py code style ( #4275 )
1 year ago
Yuanchen
dc1b6127f9
[NFC] polish applications/Chat/inference/server.py code style ( #4274 )
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
1 year ago
アマデウス
caa4433072
[NFC] fix format of applications/Chat/coati/trainer/utils.py ( #4273 )
1 year ago
Xu Kai
1ce997daaf
[NFC] polish applications/Chat/examples/train_reward_model.py code style ( #4271 )
1 year ago
dayellow
a50d39a143
[NFC] fix: format ( #4270 )
* [NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style
* [NFC] polish colossalai/communication/utils.py code style
---------
Co-authored-by: Minghao Huang <huangminghao@luchentech.com>
1 year ago
Wenhao Chen
fee553288b
[NFC] polish runtime_preparation_pass style ( #4266 )
1 year ago
YeAnbang
3883db452c
[NFC] polish unary_elementwise_generator.py code style ( #4267 )
Co-authored-by: aye42 <aye42@gatech.edu>
1 year ago
shenggan
798cb72907
[NFC] polish applications/Chat/coati/trainer/base.py code style ( #4260 )
1 year ago
Zheng Zangwei (Alex Zheng)
b2debdc09b
[NFC] polish applications/Chat/coati/dataset/sft_dataset.py code style ( #4259 )
1 year ago
梁爽
abe4f971e0
[NFC] polish colossalai/booster/plugin/low_level_zero_plugin.py code style ( #4256 )
Co-authored-by: supercooledith <893754954@qq.com>
1 year ago
Yanjia0
c614a99d28
[NFC] polish colossalai/auto_parallel/offload/amp_optimizer.py code style ( #4255 )
1 year ago
ocd_with_naming
85774f0c1f
[NFC] polish colossalai/cli/benchmark/utils.py code style ( #4254 )
1 year ago
CZYCW
dee1c96344
[NFC] polish applications/Chat/examples/ray/mmmt_prompt.py code style ( #4250 )
1 year ago
Junming Wu
77c469e1ba
[NFC] polish applications/Chat/coati/models/base/actor.py code style ( #4248 )
1 year ago
Camille Zhong
915ed8bed1
[NFC] polish applications/Chat/inference/requirements.txt code style ( #4265 )
1 year ago
Michelle
86cf6aed5b
Fix/format ( #4261 )
* revise shardformer readme (#4246 )
* [example] add llama pretraining (#4257 )
* [NFC] polish colossalai/communication/p2p.py code style
---------
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Qianran Ma <qianranm@luchentech.com>
1 year ago
Jianghai
b366f1d99f
[NFC] Fix format for mixed precision ( #4253 )
* [NFC] polish colossalai/booster/mixed_precision/mixed_precision_base.py code style
1 year ago
Hongxin Liu
02192a632e
[ci] support testmon core pkg change detection ( #4305 )
1 year ago
Baizhou Zhang
c6f6005990
[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin ( #4302 )
* sharded optimizer checkpoint for gemini plugin
* modify test to reduce testing time
* update doc
* fix bug when keep_gatherd is true under GeminiPlugin
1 year ago
Hongxin Liu
fc5cef2c79
[lazy] support init on cuda ( #4269 )
* [lazy] support init on cuda
* [test] update lazy init test
* [test] fix transformer version
1 year ago
Cuiqing Li
4b977541a8
[Kernels] added Triton implementation of self attention for colossal-ai ( #4241 )
* added softmax kernel
* added qkv_kernel
* added ops
* adding tests
* upload tests
* fix tests
* debugging
* debugging tests
* debugging
* added
* fixed errors
* added softmax kernel
* clean codes
* added tests
* update tests
* update tests
* added attention
* add
* fixed pytest checking
* add cuda check
* fix cuda version
* fix typo
1 year ago
binmakeswell
7ff11b5537
[example] add llama pretraining ( #4257 )
1 year ago
Jianghai
9a4842c571
revise shardformer readme ( #4246 )
1 year ago
github-actions[bot]
4e9b09c222
Automated submodule synchronization ( #4217 )
Co-authored-by: github-actions <github-actions@github.com>
1 year ago
Frank Lee
c1cf752021
[docker] fixed ninja build command ( #4203 )
* [docker] fixed ninja build command
* polish code
1 year ago
Baizhou Zhang
58913441a1
[checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin ( #4141 )
* [checkpointio] unsharded optimizer checkpoint for Gemini plugin
* [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather
1 year ago
Frank Lee
fee32a3b78
[docker] added ssh and rdma support for docker ( #4192 )
1 year ago
Frank Lee
190a6ea9c2
[dtensor] fixed readme file name and removed deprecated file ( #4162 )
1 year ago
Frank Lee
cc3cbe9f6f
[workflow] show test duration ( #4159 )
1 year ago