ColossalAI

Commit Graph

Author	SHA1	Message	Date
LuGY	d86ddd9b29	[hotfix] fix unsafe async comm in zero (#4404 ) * improve stablility of zero * fix wrong index * add record stream	2023-08-11 15:09:24 +08:00
Baizhou Zhang	6ccecc0c69	[gemini] fix tensor storage cleaning in state dict collection (#4396 )	2023-08-10 15:36:46 +08:00
flybird1111	458ae331ad	[kernel] updated unittests for coloattention (#4389 ) Updated coloattention tests of checking outputs and gradients	2023-08-09 14:24:45 +08:00
binmakeswell	089c365fa0	[doc] add Series A Funding and NeurIPS news (#4377 ) * [doc] add Series A Funding and NeurIPS news * [kernal] fix mha kernal * [CI] skip moe * [CI] fix requirements	2023-08-04 17:42:07 +08:00
flybird1111	f40b718959	[doc] Fix gradient accumulation doc. (#4349 ) * [doc] fix gradient accumulation doc * [doc] fix gradient accumulation doc	2023-08-04 17:24:35 +08:00
flybird1111	38b792aab2	[coloattention] fix import error (#4380 ) fixed an import error	2023-08-04 16:28:41 +08:00
flybird1111	25c57b9fb4	[fix] coloattention support flash attention 2 (#4347 ) Improved ColoAttention interface to support flash attention 2. Solved #4322	2023-08-04 13:46:22 +08:00
Wenhao Chen	da4f7b855f	[chat] fix bugs and add unit tests (#4213 ) * style: rename replay buffer Experience replay is typically for off policy algorithms. Use this name in PPO maybe misleading. * fix: fix wrong zero2 default arg * test: update experience tests * style: rename zero_pad fn * fix: defer init in CycledDataLoader * test: add benchmark test * style: rename internal fn of generation * style: rename internal fn of lora * fix: remove unused loss fn * fix: remove unused utils fn * refactor: remove generate_with_actor fn * fix: fix type annotation * test: add models tests * fix: skip llama due to long execution time * style: modify dataset * style: apply formatter * perf: update reward dataset * fix: fix wrong IGNORE_INDEX in sft dataset * fix: remove DataCollatorForSupervisedDataset * test: add dataset tests * style: apply formatter * style: rename test_ci to test_train * feat: add llama in inference * test: add inference tests * test: change test scripts directory * fix: update ci * fix: fix typo * fix: skip llama due to oom * fix: fix file mod * style: apply formatter * refactor: remove duplicated llama_gptq * style: apply formatter * to: update rm test * feat: add tokenizer arg * feat: add download model script * test: update train tests * fix: modify gemini load and save pretrained * test: update checkpoint io test * to: modify nproc_per_node * fix: do not remove existing dir * fix: modify save path * test: add random choice * fix: fix sft path * fix: enlarge nproc_per_node to avoid oom * fix: add num_retry * fix: make lora config of rm and critic consistent * fix: add warning about lora weights * fix: skip some gpt2 tests * fix: remove grad ckpt in rm and critic due to errors * refactor: directly use Actor in train_sft * test: add more arguments * fix: disable grad ckpt when using lora * fix: fix save_pretrained and related tests * test: enable zero2 tests * revert: remove useless fn * style: polish code * test: modify test args	2023-08-02 10:17:36 +08:00
Hongxin Liu	16bf4c0221	[test] remove useless tests (#4359 ) * [test] remove legacy zero test * [test] remove lazy distribute test * [test] remove outdated checkpoint io	2023-08-01 18:52:14 +08:00
caption	16c0acc01b	[hotfix] update gradio 3.11 to 3.34.0 (#4329 )	2023-08-01 16:25:25 +08:00
Hongxin Liu	806477121d	[release] update version (#4332 ) * [release] update version * [devops] hotfix cuda extension building * [devops] pytest ignore useless folders	2023-08-01 15:01:19 +08:00
Wenhao Chen	75c5389037	[chat] fix compute_approx_kl (#4338 )	2023-08-01 10:21:45 +08:00
LuGY	03654c0ce2	fix localhost measurement (#4320 )	2023-08-01 10:14:00 +08:00
LuGY	45b08f08cb	[zero] optimize the optimizer step time (#4221 ) * optimize the optimizer step time * fix corner case * polish * replace all-reduce with all-gather * set comm device to cuda	2023-07-31 22:13:29 +08:00
LuGY	1a49a5ea00	[zero] support shard optimizer state dict of zero (#4194 ) * support shard optimizer of zero * polish code * support sync grad manually	2023-07-31 22:13:29 +08:00
LuGY	dd7cc58299	[zero] add state dict for low level zero (#4179 ) * add state dict for zero * fix unit test * polish	2023-07-31 22:13:29 +08:00
LuGY	c668801d36	[zero] allow passing process group to zero12 (#4153 ) * allow passing process group to zero12 * union tp-zero and normal-zero * polish code	2023-07-31 22:13:29 +08:00
LuGY	79cf1b5f33	[zero]support no_sync method for zero1 plugin (#4138 ) * support no sync for zero1 plugin * polish * polish	2023-07-31 22:13:29 +08:00
LuGY	c6ab96983a	[zero] refactor low level zero for shard evenly (#4030 ) * refactor low level zero * fix zero2 and support cpu offload * avg gradient and modify unit test * refactor grad store, support layer drop * refactor bucket store, support grad accumulation * fix and update unit test of zero and ddp * compatible with tp, ga and unit test * fix memory leak and polish * add zero layer drop unittest * polish code * fix import err in unit test * support diffenert comm dtype, modify docstring style * polish code * test padding and fix * fix unit test of low level zero * fix pad recording in bucket store * support some models * polish	2023-07-31 22:13:29 +08:00
Yuanchen	5187c96b7c	support session-based training (#4313 ) Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>	2023-07-28 11:29:55 +08:00
binmakeswell	ef4b99ebcd	add llama example CI	2023-07-26 14:12:57 +08:00
yuxuan-lou	0991405361	[NFC] polish applications/Chat/coati/models/utils.py codestyle (#4277 ) * [NFC] polish colossalai/context/random/__init__.py code style * [NFC] polish applications/Chat/coati/models/utils.py code style	2023-07-26 14:12:57 +08:00
Zirui Zhu	9e512938f6	[NFC] polish applications/Chat/coati/trainer/strategies/base.py code style (#4278 )	2023-07-26 14:12:57 +08:00
Ziheng Qin	c972d65311	applications/Chat/.gitignore (#4279 ) Co-authored-by: henryqin1997 <henryqin1997@gamil.com>	2023-07-26 14:12:57 +08:00
RichardoLuo	709e121cd5	[NFC] polish applications/Chat/coati/models/generation.py code style (#4275 )	2023-07-26 14:12:57 +08:00
Yuanchen	dc1b6127f9	[NFC] polish applications/Chat/inference/server.py code style (#4274 ) Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>	2023-07-26 14:12:57 +08:00
アマデウス	caa4433072	[NFC] fix format of application/Chat/coati/trainer/utils.py (#4273 )	2023-07-26 14:12:57 +08:00
Xu Kai	1ce997daaf	[NFC] polish applications/Chat/examples/train_reward_model.py code style (#4271 )	2023-07-26 14:12:57 +08:00
dayellow	a50d39a143	[NFC] fix: format (#4270 ) * [NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style * [NFC] polish colossalai/communication/utils.py code style --------- Co-authored-by: Minghao Huang <huangminghao@luchentech.com>	2023-07-26 14:12:57 +08:00
Wenhao Chen	fee553288b	[NFC] polish runtime_preparation_pass style (#4266 )	2023-07-26 14:12:57 +08:00
YeAnbang	3883db452c	[NFC] polish unary_elementwise_generator.py code style (#4267 ) Co-authored-by: aye42 <aye42@gatech.edu>	2023-07-26 14:12:57 +08:00
shenggan	798cb72907	[NFC] polish applications/Chat/coati/trainer/base.py code style (#4260 )	2023-07-26 14:12:57 +08:00
Zheng Zangwei (Alex Zheng)	b2debdc09b	[NFC] polish applications/Chat/coati/dataset/sft_dataset.py code style (#4259 )	2023-07-26 14:12:57 +08:00
梁爽	abe4f971e0	[NFC] polish colossalai/booster/plugin/low_level_zero_plugin.py code style (#4256 ) Co-authored-by: supercooledith <893754954@qq.com>	2023-07-26 14:12:57 +08:00
Yanjia0	c614a99d28	[NFC] polish colossalai/auto_parallel/offload/amp_optimizer.py code style (#4255 )	2023-07-26 14:12:57 +08:00
ocd_with_naming	85774f0c1f	[NFC] polish colossalai/cli/benchmark/utils.py code style (#4254 )	2023-07-26 14:12:57 +08:00
CZYCW	dee1c96344	[NFC] policy applications/Chat/examples/ray/mmmt_prompt.py code style (#4250 )	2023-07-26 14:12:57 +08:00
Junming Wu	77c469e1ba	[NFC] polish applications/Chat/coati/models/base/actor.py code style (#4248 )	2023-07-26 14:12:57 +08:00
Camille Zhong	915ed8bed1	[NFC] polish applications/Chat/inference/requirements.txt code style (#4265 )	2023-07-26 14:12:57 +08:00
Michelle	86cf6aed5b	Fix/format (#4261 ) * revise shardformer readme (#4246) * [example] add llama pretraining (#4257) * [NFC] polish colossalai/communication/p2p.py code style --------- Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com> Co-authored-by: binmakeswell <binmakeswell@gmail.com> Co-authored-by: Qianran Ma <qianranm@luchentech.com>	2023-07-26 14:12:57 +08:00
Jianghai	b366f1d99f	[NFC] Fix format for mixed precision (#4253 ) * [NFC] polish colossalai/booster/mixed_precision/mixed_precision_base.py code style	2023-07-26 14:12:57 +08:00
Hongxin Liu	02192a632e	[ci] support testmon core pkg change detection (#4305 )	2023-07-21 18:36:35 +08:00
Baizhou Zhang	c6f6005990	[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302 ) * sharded optimizer checkpoint for gemini plugin * modify test to reduce testing time * update doc * fix bug when keep_gatherd is true under GeminiPlugin	2023-07-21 14:39:01 +08:00
Hongxin Liu	fc5cef2c79	[lazy] support init on cuda (#4269 ) * [lazy] support init on cuda * [test] update lazy init test * [test] fix transformer version	2023-07-19 16:43:01 +08:00
Cuiqing Li	4b977541a8	[Kernels] added triton-implemented of self attention for colossal-ai (#4241 ) * added softmax kernel * added qkv_kernel * added ops * adding tests * upload tets * fix tests * debugging * debugging tests * debugging * added * fixed errors * added softmax kernel * clean codes * added tests * update tests * update tests * added attention * add * fixed pytest checking * add cuda check * fix cuda version * fix typo	2023-07-18 23:53:38 +08:00
binmakeswell	7ff11b5537	[example] add llama pretraining (#4257 )	2023-07-17 21:07:44 +08:00
Jianghai	9a4842c571	revise shardformer readme (#4246 )	2023-07-17 17:30:57 +08:00
github-actions[bot]	4e9b09c222	Automated submodule synchronization (#4217 ) Co-authored-by: github-actions <github-actions@github.com>	2023-07-12 17:35:58 +08:00
Frank Lee	c1cf752021	[docker] fixed ninja build command (#4203 ) * [docker] fixed ninja build command * polish code	2023-07-10 11:48:27 +08:00
Baizhou Zhang	58913441a1	[checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141 ) * [checkpointio] unsharded optimizer checkpoint for Gemini plugin * [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather	2023-07-07 16:33:06 +08:00


2689 Commits (376533a56411d3826df2a5b3aabc5471016496bf)