ColossalAI

Commit Graph

Author	SHA1	Message	Date
Baizhou Zhang	2a2eacfaf1	[pipeline] support shardformer for GPT2ForQuestionAnswering & complete pipeline support for GPT2 (#4245 ) * change for transformers loggers * add forward for GPT2ForQuestionAnswering * fix assert * fix torchrec test	2023-08-15 23:25:14 +08:00
Jianghai	d9be0472ef	[bugs] hot fix some testing bugs for new models (#4268 ) * hot fix * hot fx tracer	2023-08-15 23:25:14 +08:00
Jianghai	34f0e34a4c	[pipeline] finish bloom models pipeline and tests (#4223 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * finish bloom model * test shard gpt2 * clear cache * support all bloom models * add bloom models policies * finish bloom pipeline and tests * add set pipeline * finish bloom	2023-08-15 23:25:14 +08:00
Jianghai	e7cc62d735	[pipeline] All bert models (#4233 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt * finish llama * causal lm and sequence classification * revision * add pure pipeline test * finish some bert models * finish all bert models * finish bert tests * fix bugs * fix bugs * fix test pipeline * fix data gen for qa * update the set pipeline forward * shared params * fix bugs	2023-08-15 23:25:14 +08:00
Baizhou Zhang	a14d352088	[pipeline] add pipeline forward for variants of gpt2 (#4238 ) * add forward for GPTLMHeadModel * add test for gpt_lm * arranging get_held_layers method * arrange forward replacement * add forward for GPT2ForTokenClassification * add forward for GPT2ForSequenceClassification * fix test_shard_gpt2.py * add GPT2DoubleHeadsmodel & fix bugs * add id checking in get_shared_params	2023-08-15 23:25:14 +08:00
Hongxin Liu	7e4de520e1	[shardformer] fix base policy (#4229 )	2023-08-15 23:25:14 +08:00
Baizhou Zhang	208ac8f2ba	[pipeline] Add Pipeline Forward for GPT2Model Shardformer (#4224 ) * * fix typehint & docstring in sharder.py * * update pipeline forward for GPT2Model * * add test for pipeline forward of GPT2Model * * add cache cleaning in gpt2 test * * change assert to raise command	2023-08-15 23:25:14 +08:00
Jianghai	37d22f6878	[pipeline] add bloom model pipeline (#4210 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * finish bloom model * test shard gpt2 * clear cache	2023-08-15 23:25:14 +08:00
Jianghai	31bcf867ae	[pipeline] Llama causal lm and llama for sequence classification pipeline (#4208 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt * finish llama * causal lm and sequence classification * revision	2023-08-15 23:25:14 +08:00
Jianghai	1622031058	[pipeline] Llama pipeline (#4205 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt	2023-08-15 23:25:14 +08:00
Jianghai	1094e0f0d3	[pipeline] Bert pipeline for shardformer and its tests (#4197 ) * add pipeline forward * complete pipeline forward check * fix bert forward without pipeline * fix comments * discard useless line * add todo * clean prints * fix distribute layers	2023-08-15 23:25:14 +08:00
Hongxin Liu	890774b2fb	[shardformer] support lazy init (#4202 ) * [shardformer] support lazy init * [shardformer] linear support lazy init * [shardformer] embedding support lazy init * [shardformer] norm support lazy init * [shardformer] fused linear support lazy init * [test] update shardformer test layer * [test] shardformer with lazy init fit ddp * [lazy] hotfix deepcopy of param * [shardformer] fix bert policy and update test * [shardformer] fix bloom policy and update test * [shardformer] fix opt policy and update test * [shardformer] fix t5 policy and update test * [shardformer] fix gpt2 policy and update test * [shardformer] fix llama policy and update test	2023-08-15 23:25:14 +08:00
Jianghai	f3bcc292c8	[pipeline] move bert related pipeline components to shardformer (#4187 ) * move bert related pipeline components to shardformer * fix bugs * revision * fix bert model tests * fix bert_lm_head model tests * fix tests * fix tests * done checks * skip bloom	2023-08-15 23:25:14 +08:00
Jianghai	c5ea728016	[pipeline] add bert_for_pretraining bert_lmhead forward and policy (#4172 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining * add bert_for_pretraining forward and policy * fix typos * cancel warning * change the imediate output to default dict * change the default output of get_shared_params	2023-08-15 23:25:14 +08:00
ver217	d35bd7d0e6	[shardformer] fix type hint	2023-08-15 23:25:14 +08:00
ver217	1ed3f8a24f	[shardformer] rename policy file name	2023-08-15 23:25:14 +08:00
ver217	5fc60a3a04	[test] add shard util tests	2023-08-15 23:25:14 +08:00
ver217	2d6cc07feb	[test] update shardformer tests	2023-08-15 23:25:14 +08:00
ver217	b0b8ad2823	[pipeline] update shardformer docstring	2023-08-15 23:25:14 +08:00
ver217	59f6f573f1	[pipeline] update shardformer policy	2023-08-15 23:25:14 +08:00
Jianghai	90a65ea682	[pipeline] build bloom model and policy , revise the base class of policy (#4161 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining	2023-08-15 23:25:14 +08:00
Jianghai	c552cefa93	[pipeline]add pipeline policy and bert forward (#4130 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt	2023-08-15 23:25:14 +08:00
Hongxin Liu	5c897ddb94	[pipeline] add stage manager (#4093 ) * [pipeline] add stage manager * [test] add pipeline stage manager test * [pipeline] add docstring for stage manager	2023-08-15 23:25:14 +08:00
Jianghai	e8e7e49243	[pipeline]add pipeline policy and bert forward (#4130 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt	2023-08-15 23:25:14 +08:00
Hongxin Liu	f51ce1bc8e	[pipeline] refactor 1f1b schedule (#4115 ) * [api] update optimizer wrapper to fit pipeline * [pipeline] add base schedule * [pipeline] add 1f1b schedule * [test] add pipeline schedule utils test * [pipeline] fix import	2023-08-15 23:25:14 +08:00
Hongxin Liu	45fdc9b42c	[pipeline] implement p2p communication (#4100 ) * [pipeline] add p2p communication * [test] add p2p communication test * [test] add rerun decorator * [test] rename to avoid conflict	2023-08-15 23:25:14 +08:00
Hongxin Liu	422544222f	[pipeline] add stage manager (#4093 ) * [pipeline] add stage manager * [test] add pipeline stage manager test * [pipeline] add docstring for stage manager	2023-08-15 23:25:14 +08:00
Hongxin Liu	5e1a9d48dd	[cluster] add process group mesh (#4039 ) * [cluster] add process group mesh * [test] add process group mesh test * force sync	2023-08-15 23:25:14 +08:00
Tian Siyuan	ff836790ae	[doc] fix a typo in examples/tutorial/auto_parallel/README.md (#4430 ) Co-authored-by: Siyuan Tian <siyuant@vmware.com>	2023-08-15 00:22:57 +08:00
Wenhao Chen	6d41c3f2aa	[doc] update Coati README (#4405 ) * style: apply formatter * fix: add outdated warnings * docs: add dataset format and polish * docs: polish README * fix: fix json format * fix: fix typos * revert: revert 7b example	2023-08-14 15:26:27 +08:00
LuGY	d86ddd9b29	[hotfix] fix unsafe async comm in zero (#4404 ) * improve stablility of zero * fix wrong index * add record stream	2023-08-11 15:09:24 +08:00
Baizhou Zhang	6ccecc0c69	[gemini] fix tensor storage cleaning in state dict collection (#4396 )	2023-08-10 15:36:46 +08:00
flybird1111	458ae331ad	[kernel] updated unittests for coloattention (#4389 ) Updated coloattention tests of checking outputs and gradients	2023-08-09 14:24:45 +08:00
binmakeswell	089c365fa0	[doc] add Series A Funding and NeurIPS news (#4377 ) * [doc] add Series A Funding and NeurIPS news * [kernal] fix mha kernal * [CI] skip moe * [CI] fix requirements	2023-08-04 17:42:07 +08:00
flybird1111	f40b718959	[doc] Fix gradient accumulation doc. (#4349 ) * [doc] fix gradient accumulation doc * [doc] fix gradient accumulation doc	2023-08-04 17:24:35 +08:00
flybird1111	38b792aab2	[coloattention] fix import error (#4380 ) fixed an import error	2023-08-04 16:28:41 +08:00
flybird1111	25c57b9fb4	[fix] coloattention support flash attention 2 (#4347 ) Improved ColoAttention interface to support flash attention 2. Solved #4322	2023-08-04 13:46:22 +08:00
Wenhao Chen	da4f7b855f	[chat] fix bugs and add unit tests (#4213 ) * style: rename replay buffer Experience replay is typically for off policy algorithms. Use this name in PPO maybe misleading. * fix: fix wrong zero2 default arg * test: update experience tests * style: rename zero_pad fn * fix: defer init in CycledDataLoader * test: add benchmark test * style: rename internal fn of generation * style: rename internal fn of lora * fix: remove unused loss fn * fix: remove unused utils fn * refactor: remove generate_with_actor fn * fix: fix type annotation * test: add models tests * fix: skip llama due to long execution time * style: modify dataset * style: apply formatter * perf: update reward dataset * fix: fix wrong IGNORE_INDEX in sft dataset * fix: remove DataCollatorForSupervisedDataset * test: add dataset tests * style: apply formatter * style: rename test_ci to test_train * feat: add llama in inference * test: add inference tests * test: change test scripts directory * fix: update ci * fix: fix typo * fix: skip llama due to oom * fix: fix file mod * style: apply formatter * refactor: remove duplicated llama_gptq * style: apply formatter * to: update rm test * feat: add tokenizer arg * feat: add download model script * test: update train tests * fix: modify gemini load and save pretrained * test: update checkpoint io test * to: modify nproc_per_node * fix: do not remove existing dir * fix: modify save path * test: add random choice * fix: fix sft path * fix: enlarge nproc_per_node to avoid oom * fix: add num_retry * fix: make lora config of rm and critic consistent * fix: add warning about lora weights * fix: skip some gpt2 tests * fix: remove grad ckpt in rm and critic due to errors * refactor: directly use Actor in train_sft * test: add more arguments * fix: disable grad ckpt when using lora * fix: fix save_pretrained and related tests * test: enable zero2 tests * revert: remove useless fn * style: polish code * test: modify test args	2023-08-02 10:17:36 +08:00
Hongxin Liu	16bf4c0221	[test] remove useless tests (#4359 ) * [test] remove legacy zero test * [test] remove lazy distribute test * [test] remove outdated checkpoint io	2023-08-01 18:52:14 +08:00
caption	16c0acc01b	[hotfix] update gradio 3.11 to 3.34.0 (#4329 )	2023-08-01 16:25:25 +08:00
Hongxin Liu	806477121d	[release] update version (#4332 ) * [release] update version * [devops] hotfix cuda extension building * [devops] pytest ignore useless folders	2023-08-01 15:01:19 +08:00
Wenhao Chen	75c5389037	[chat] fix compute_approx_kl (#4338 )	2023-08-01 10:21:45 +08:00
LuGY	03654c0ce2	fix localhost measurement (#4320 )	2023-08-01 10:14:00 +08:00
LuGY	45b08f08cb	[zero] optimize the optimizer step time (#4221 ) * optimize the optimizer step time * fix corner case * polish * replace all-reduce with all-gather * set comm device to cuda	2023-07-31 22:13:29 +08:00
LuGY	1a49a5ea00	[zero] support shard optimizer state dict of zero (#4194 ) * support shard optimizer of zero * polish code * support sync grad manually	2023-07-31 22:13:29 +08:00
LuGY	dd7cc58299	[zero] add state dict for low level zero (#4179 ) * add state dict for zero * fix unit test * polish	2023-07-31 22:13:29 +08:00
LuGY	c668801d36	[zero] allow passing process group to zero12 (#4153 ) * allow passing process group to zero12 * union tp-zero and normal-zero * polish code	2023-07-31 22:13:29 +08:00
LuGY	79cf1b5f33	[zero]support no_sync method for zero1 plugin (#4138 ) * support no sync for zero1 plugin * polish * polish	2023-07-31 22:13:29 +08:00
LuGY	c6ab96983a	[zero] refactor low level zero for shard evenly (#4030 ) * refactor low level zero * fix zero2 and support cpu offload * avg gradient and modify unit test * refactor grad store, support layer drop * refactor bucket store, support grad accumulation * fix and update unit test of zero and ddp * compatible with tp, ga and unit test * fix memory leak and polish * add zero layer drop unittest * polish code * fix import err in unit test * support diffenert comm dtype, modify docstring style * polish code * test padding and fix * fix unit test of low level zero * fix pad recording in bucket store * support some models * polish	2023-07-31 22:13:29 +08:00
Yuanchen	5187c96b7c	support session-based training (#4313 ) Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>	2023-07-28 11:29:55 +08:00

2919 Commits (af952673f758c71126b27de8b32bdf5df8f74b69)
