ColossalAI

Commit Graph

Author	SHA1	Message	Date
Hongxin Liu	411cf1d2db	[hotfix] fix gemini and zero test (#4333 ) * [hotfix] fix gemini and zero test * [hotfix] fix lazy init test * [hotfix] fix lazy init test	1 year ago
Hongxin Liu	261eab02fb	[plugin] add 3d parallel plugin (#4295 ) * [amp] add mixed precision optimizer * [plugin] add 3d parallel plugin * [booster] support pipeline * [plugin] 3d parallel plugin support clip grad norm * [shardformer] fix sharder and add plugin test * [plugin] rename 3d parallel plugin * [ci] support testmon core pkg change detection (#4305) * [hotfix] debug testmon * [hotfix] fix llama * [hotfix] fix p2p bugs * [hotfix] fix requirements	1 year ago
FoolPlayer	b3f5d7a3ba	[shardformer] support pipeline base vit model (#4284 ) * Feature/vit support (#4182) * [shardformer] added tests * [shardformer] vit test finish and support * fix attention dropout * support base vit pipeline * support vit downstream model * fix vit shard test * modify hidden states return type --------- Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com>	1 year ago
Baizhou Zhang	083d7da33d	[pipeline] add pipeline support for all T5 models (#4310 ) * complete policy for T5Model & T5ForConditionalGeneration * modify function signature in forwards * add forward for T5model * add forward for T5ForConditionalGeneration * fix a bug * fix hidden_states transporting in decoder * fix the passing of encoder_outputs	1 year ago
Jianghai	d0807122e2	[pipeline] test pure pipeline process using llama (#4218 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt * finish llama * causal lm and sequence classification * revision * add pure pipeline test * fixed version * fixed version * pure pipeline	1 year ago
Baizhou Zhang	36e546b2cc	[pipeline] add pipeline support for T5Stack/T5EncoderModel (#4300 ) * modify t5 policy & add test * pipeline stage distribution for t5 * complete t5 base policy * t5 stack: halfway * modify gpt2 pipeline test * complete pipeline forward for T5Stack/T5EncoderModel * fix docstring * move t5 util tests to test_pipeline	1 year ago
Jianghai	18ebcf406a	[pipeline] reformat for unified design (#4283 ) * bert_reformat * reformat * reformat * fix a typo * format * format * fix bug	1 year ago
Jianghai	0a8f3c851a	[hotfix] fix opt pipeline (#4293 ) * opt forward and test * pause * finish opt model pipeline * finish opt pipeline * opt forward and test * pause * finish opt model pipeline * finish opt pipeline * fix opt * set transformers version * refactor the test pipeline * fix bug	1 year ago
Jianghai	d8408d185c	[pipeline] OPT model pipeline (#4258 ) * opt forward and test * pause * finish opt model pipeline * finish opt pipeline * opt forward and test * pause * finish opt model pipeline * finish opt pipeline * fix opt * set transformers version * refactor the test pipeline	1 year ago
Baizhou Zhang	b774d5ea0f	[pipeline] refactor gpt2 pipeline forwards (#4287 ) * move gpt2 pipeline forwards to modeling folder * check pipeline status when adding replacing policy * fix typehint * fix arguments processing in gpt2_model_forward	1 year ago
Hongxin Liu	d921ce8391	[shardformer] support inplace sharding (#4251 ) * [shardformer] embedding support inplace sharding * [shardformer] linear support inplace sharding * [shardformer] layernorm support inplace sharding * [shardformer] qkv support inplace sharding * [test] update shardformer layer test * [shardformer] fix shared param sharding * [shardformer] fix bert policy * [shardformer] fix bloom policy * [shardformer] fix llama policy * [shardformer] fix opt policy * [shardformer] fix t5 policy * [shardformer] fix fused qkv linear * [shardformer] fix bugs * force sync * [test] fix bugs * [test] fix transformer version	1 year ago
Baizhou Zhang	2a2eacfaf1	[pipeline] support shardformer for GPT2ForQuestionAnswering & complete pipeline support for GPT2 (#4245 ) * change for transformers loggers * add forward for GPT2ForQuestionAnswering * fix assert * fix torchrec test	1 year ago
Jianghai	d9be0472ef	[bugs] hot fix some testing bugs for new models (#4268 ) * hot fix * hot fx tracer	1 year ago
Jianghai	34f0e34a4c	[pipeline] finish bloom models pipeline and tests (#4223 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * finish bloom model * test shard gpt2 * clear cache * support all bloom models * add bloom models policies * finish bloom pipeline and tests * add set pipeline * finish bloom	1 year ago
Jianghai	e7cc62d735	[pipeline] All bert models (#4233 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt * finish llama * causal lm and sequence classification * revision * add pure pipeline test * finish some bert models * finish all bert models * finish bert tests * fix bugs * fix bugs * fix test pipeline * fix data gen for qa * update the set pipeline forward * shared params * fix bugs	1 year ago
Baizhou Zhang	a14d352088	[pipeline] add pipeline forward for variants of gpt2 (#4238 ) * add forward for GPTLMHeadModel * add test for gpt_lm * arranging get_held_layers method * arrange forward replacement * add forward for GPT2ForTokenClassification * add forward for GPT2ForSequenceClassification * fix test_shard_gpt2.py * add GPT2DoubleHeadsmodel & fix bugs * add id checking in get_shared_params	1 year ago
Hongxin Liu	7e4de520e1	[shardformer] fix base policy (#4229 )	1 year ago
Baizhou Zhang	208ac8f2ba	[pipeline] Add Pipeline Forward for GPT2Model Shardformer (#4224 ) * * fix typehint & docstring in sharder.py * * update pipeline forward for GPT2Model * * add test for pipeline forward of GPT2Model * * add cache cleaning in gpt2 test * * change assert to raise command	1 year ago
Jianghai	37d22f6878	[pipeline] add bloom model pipeline (#4210 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * finish bloom model * test shard gpt2 * clear cache	1 year ago
Jianghai	31bcf867ae	[pipeline] Llama causal lm and llama for sequence classification pipeline (#4208 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt * finish llama * causal lm and sequence classification * revision	1 year ago
Jianghai	1622031058	[pipeline] Llama pipeline (#4205 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt	1 year ago
Jianghai	1094e0f0d3	[pipeline] Bert pipeline for shardformer and its tests (#4197 ) * add pipeline forward * complete pipeline forward check * fix bert forward without pipeline * fix comments * discard useless line * add todo * clean prints * fix distribute layers	1 year ago
Hongxin Liu	890774b2fb	[shardformer] support lazy init (#4202 ) * [shardformer] support lazy init * [shardformer] linear support lazy init * [shardformer] embedding support lazy init * [shardformer] norm support lazy init * [shardformer] fused linear support lazy init * [test] update shardformer test layer * [test] shardformer with lazy init fit ddp * [lazy] hotfix deepcopy of param * [shardformer] fix bert policy and update test * [shardformer] fix bloom policy and update test * [shardformer] fix opt policy and update test * [shardformer] fix t5 policy and update test * [shardformer] fix gpt2 policy and update test * [shardformer] fix llama policy and update test	1 year ago
Jianghai	f3bcc292c8	[pipeline] move bert related pipeline components to shardformer (#4187 ) * move bert related pipeline components to shardformer * fix bugs * revision * fix bert model tests * fix bert_lm_head model tests * fix tests * fix tests * done checks * skip bloom	1 year ago
Jianghai	c5ea728016	[pipeline] add bert_for_pretraining bert_lmhead forward and policy (#4172 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining * add bert_for_pretraining forward and policy * fix typos * cancel warning * change the imediate output to default dict * change the default output of get_shared_params	1 year ago
ver217	d35bd7d0e6	[shardformer] fix type hint	1 year ago
ver217	1ed3f8a24f	[shardformer] rename policy file name	1 year ago
ver217	5fc60a3a04	[test] add shard util tests	1 year ago
ver217	2d6cc07feb	[test] update shardformer tests	1 year ago
ver217	b0b8ad2823	[pipeline] update shardformer docstring	1 year ago
ver217	59f6f573f1	[pipeline] update shardformer policy	1 year ago
Jianghai	90a65ea682	[pipeline] build bloom model and policy , revise the base class of policy (#4161 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining	1 year ago
Jianghai	c552cefa93	[pipeline]add pipeline policy and bert forward (#4130 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt	1 year ago
Hongxin Liu	5c897ddb94	[pipeline] add stage manager (#4093 ) * [pipeline] add stage manager * [test] add pipeline stage manager test * [pipeline] add docstring for stage manager	1 year ago
Jianghai	e8e7e49243	[pipeline]add pipeline policy and bert forward (#4130 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt	1 year ago
Hongxin Liu	f51ce1bc8e	[pipeline] refactor 1f1b schedule (#4115 ) * [api] update optimizer wrapper to fit pipeline * [pipeline] add base schedule * [pipeline] add 1f1b schedule * [test] add pipeline schedule utils test * [pipeline] fix import	1 year ago
Hongxin Liu	45fdc9b42c	[pipeline] implement p2p communication (#4100 ) * [pipeline] add p2p communication * [test] add p2p communication test * [test] add rerun decorator * [test] rename to avoid conflict	1 year ago
Hongxin Liu	422544222f	[pipeline] add stage manager (#4093 ) * [pipeline] add stage manager * [test] add pipeline stage manager test * [pipeline] add docstring for stage manager	1 year ago
Hongxin Liu	5e1a9d48dd	[cluster] add process group mesh (#4039 ) * [cluster] add process group mesh * [test] add process group mesh test * force sync	1 year ago
Tian Siyuan	ff836790ae	[doc] fix a typo in examples/tutorial/auto_parallel/README.md (#4430 ) Co-authored-by: Siyuan Tian <siyuant@vmware.com>	1 year ago
Wenhao Chen	6d41c3f2aa	[doc] update Coati README (#4405 ) * style: apply formatter * fix: add outdated warnings * docs: add dataset format and polish * docs: polish README * fix: fix json format * fix: fix typos * revert: revert 7b example	1 year ago
LuGY	d86ddd9b29	[hotfix] fix unsafe async comm in zero (#4404 ) * improve stablility of zero * fix wrong index * add record stream	1 year ago
Baizhou Zhang	6ccecc0c69	[gemini] fix tensor storage cleaning in state dict collection (#4396 )	1 year ago
flybird1111	458ae331ad	[kernel] updated unittests for coloattention (#4389 ) Updated coloattention tests of checking outputs and gradients	1 year ago
binmakeswell	089c365fa0	[doc] add Series A Funding and NeurIPS news (#4377 ) * [doc] add Series A Funding and NeurIPS news * [kernal] fix mha kernal * [CI] skip moe * [CI] fix requirements	1 year ago
flybird1111	f40b718959	[doc] Fix gradient accumulation doc. (#4349 ) * [doc] fix gradient accumulation doc * [doc] fix gradient accumulation doc	1 year ago
flybird1111	38b792aab2	[coloattention] fix import error (#4380 ) fixed an import error	1 year ago
flybird1111	25c57b9fb4	[fix] coloattention support flash attention 2 (#4347 ) Improved ColoAttention interface to support flash attention 2. Solved #4322	1 year ago
Wenhao Chen	da4f7b855f	[chat] fix bugs and add unit tests (#4213 ) * style: rename replay buffer Experience replay is typically for off policy algorithms. Use this name in PPO maybe misleading. * fix: fix wrong zero2 default arg * test: update experience tests * style: rename zero_pad fn * fix: defer init in CycledDataLoader * test: add benchmark test * style: rename internal fn of generation * style: rename internal fn of lora * fix: remove unused loss fn * fix: remove unused utils fn * refactor: remove generate_with_actor fn * fix: fix type annotation * test: add models tests * fix: skip llama due to long execution time * style: modify dataset * style: apply formatter * perf: update reward dataset * fix: fix wrong IGNORE_INDEX in sft dataset * fix: remove DataCollatorForSupervisedDataset * test: add dataset tests * style: apply formatter * style: rename test_ci to test_train * feat: add llama in inference * test: add inference tests * test: change test scripts directory * fix: update ci * fix: fix typo * fix: skip llama due to oom * fix: fix file mod * style: apply formatter * refactor: remove duplicated llama_gptq * style: apply formatter * to: update rm test * feat: add tokenizer arg * feat: add download model script * test: update train tests * fix: modify gemini load and save pretrained * test: update checkpoint io test * to: modify nproc_per_node * fix: do not remove existing dir * fix: modify save path * test: add random choice * fix: fix sft path * fix: enlarge nproc_per_node to avoid oom * fix: add num_retry * fix: make lora config of rm and critic consistent * fix: add warning about lora weights * fix: skip some gpt2 tests * fix: remove grad ckpt in rm and critic due to errors * refactor: directly use Actor in train_sft * test: add more arguments * fix: disable grad ckpt when using lora * fix: fix save_pretrained and related tests * test: enable zero2 tests * revert: remove useless fn * style: polish code * test: modify test args	1 year ago
Hongxin Liu	16bf4c0221	[test] remove useless tests (#4359 ) * [test] remove legacy zero test * [test] remove lazy distribute test * [test] remove outdated checkpoint io	1 year ago


2980 Commits (c904d2ae997b161a5c6ddbf2057a7e194472c525)
