Baizhou Zhang
b774d5ea0f
[pipeline] refactor gpt2 pipeline forwards ( #4287 )
* move gpt2 pipeline forwards to modeling folder
* check pipeline status when adding replacement policy
* fix typehint
* fix arguments processing in gpt2_model_forward
2023-08-15 23:25:14 +08:00
Hongxin Liu
d921ce8391
[shardformer] support inplace sharding ( #4251 )
* [shardformer] embedding support inplace sharding
* [shardformer] linear support inplace sharding
* [shardformer] layernorm support inplace sharding
* [shardformer] qkv support inplace sharding
* [test] update shardformer layer test
* [shardformer] fix shared param sharding
* [shardformer] fix bert policy
* [shardformer] fix bloom policy
* [shardformer] fix llama policy
* [shardformer] fix opt policy
* [shardformer] fix t5 policy
* [shardformer] fix fused qkv linear
* [shardformer] fix bugs
* force sync
* [test] fix bugs
* [test] fix transformer version
2023-08-15 23:25:14 +08:00
Baizhou Zhang
2a2eacfaf1
[pipeline] support shardformer for GPT2ForQuestionAnswering & complete pipeline support for GPT2 ( #4245 )
* change for transformers loggers
* add forward for GPT2ForQuestionAnswering
* fix assert
* fix torchrec test
2023-08-15 23:25:14 +08:00
Jianghai
d9be0472ef
[bugs] hot fix some testing bugs for new models ( #4268 )
* hot fix
* hot fix fx tracer
2023-08-15 23:25:14 +08:00
Jianghai
34f0e34a4c
[pipeline] finish bloom models pipeline and tests ( #4223 )
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* finish bloom model
* test shard gpt2
* clear cache
* support all bloom models
* add bloom models policies
* finish bloom pipeline and tests
* add set pipeline
* finish bloom
2023-08-15 23:25:14 +08:00
Jianghai
e7cc62d735
[pipeline] All bert models ( #4233 )
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* Revert "bloom policy"
This reverts commit 8dee68a0a2.
This policy should be reverted and copied to feature/bloom
* revert the bloom changes
* cancel unneeded inputs
* gpt
* finish llama
* causal lm and sequence classification
* revision
* add pure pipeline test
* finish some bert models
* finish all bert models
* finish bert tests
* fix bugs
* fix bugs
* fix test pipeline
* fix data gen for qa
* update the set pipeline forward
* shared params
* fix bugs
2023-08-15 23:25:14 +08:00
Baizhou Zhang
a14d352088
[pipeline] add pipeline forward for variants of gpt2 ( #4238 )
* add forward for GPTLMHeadModel
* add test for gpt_lm
* arranging get_held_layers method
* arrange forward replacement
* add forward for GPT2ForTokenClassification
* add forward for GPT2ForSequenceClassification
* fix test_shard_gpt2.py
* add GPT2DoubleHeadsModel & fix bugs
* add id checking in get_shared_params
2023-08-15 23:25:14 +08:00
Hongxin Liu
7e4de520e1
[shardformer] fix base policy ( #4229 )
2023-08-15 23:25:14 +08:00
Baizhou Zhang
208ac8f2ba
[pipeline] Add Pipeline Forward for GPT2Model Shardformer ( #4224 )
* fix typehint & docstring in sharder.py
* update pipeline forward for GPT2Model
* add test for pipeline forward of GPT2Model
* add cache cleaning in gpt2 test
* change assert to raise command
2023-08-15 23:25:14 +08:00
Jianghai
37d22f6878
[pipeline] add bloom model pipeline ( #4210 )
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* finish bloom model
* test shard gpt2
* clear cache
2023-08-15 23:25:14 +08:00
Jianghai
31bcf867ae
[pipeline] Llama causal lm and llama for sequence classification pipeline ( #4208 )
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* Revert "bloom policy"
This reverts commit 8dee68a0a2.
This policy should be reverted and copied to feature/bloom
* revert the bloom changes
* cancel unneeded inputs
* gpt
* finish llama
* causal lm and sequence classification
* revision
2023-08-15 23:25:14 +08:00
Jianghai
1622031058
[pipeline] Llama pipeline ( #4205 )
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* Revert "bloom policy"
This reverts commit 8dee68a0a2.
This policy should be reverted and copied to feature/bloom
* revert the bloom changes
* cancel unneeded inputs
* gpt
2023-08-15 23:25:14 +08:00
Jianghai
1094e0f0d3
[pipeline] Bert pipeline for shardformer and its tests ( #4197 )
* add pipeline forward
* complete pipeline forward check
* fix bert forward without pipeline
* fix comments
* discard useless line
* add todo
* clean prints
* fix distribute layers
2023-08-15 23:25:14 +08:00
Hongxin Liu
890774b2fb
[shardformer] support lazy init ( #4202 )
* [shardformer] support lazy init
* [shardformer] linear support lazy init
* [shardformer] embedding support lazy init
* [shardformer] norm support lazy init
* [shardformer] fused linear support lazy init
* [test] update shardformer test layer
* [test] shardformer with lazy init fit ddp
* [lazy] hotfix deepcopy of param
* [shardformer] fix bert policy and update test
* [shardformer] fix bloom policy and update test
* [shardformer] fix opt policy and update test
* [shardformer] fix t5 policy and update test
* [shardformer] fix gpt2 policy and update test
* [shardformer] fix llama policy and update test
2023-08-15 23:25:14 +08:00
Jianghai
f3bcc292c8
[pipeline] move bert related pipeline components to shardformer ( #4187 )
* move bert related pipeline components to shardformer
* fix bugs
* revision
* fix bert model tests
* fix bert_lm_head model tests
* fix tests
* fix tests
* done checks
* skip bloom
2023-08-15 23:25:14 +08:00
Jianghai
c5ea728016
[pipeline] add bert_for_pretraining bert_lmhead forward and policy ( #4172 )
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
* add bloom model and policy, revise the base class of policy
* revise
* revision
* add bert_for_pretraining
* add bert_for_pretraining forward and policy
* fix typos
* cancel warning
* change the immediate output to default dict
* change the default output of get_shared_params
2023-08-15 23:25:14 +08:00
ver217
d35bd7d0e6
[shardformer] fix type hint
2023-08-15 23:25:14 +08:00
ver217
1ed3f8a24f
[shardformer] rename policy file name
2023-08-15 23:25:14 +08:00
ver217
5fc60a3a04
[test] add shard util tests
2023-08-15 23:25:14 +08:00
ver217
2d6cc07feb
[test] update shardformer tests
2023-08-15 23:25:14 +08:00
ver217
b0b8ad2823
[pipeline] update shardformer docstring
2023-08-15 23:25:14 +08:00
ver217
59f6f573f1
[pipeline] update shardformer policy
2023-08-15 23:25:14 +08:00
Jianghai
90a65ea682
[pipeline] build bloom model and policy, revise the base class of policy ( #4161 )
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
* add bloom model and policy, revise the base class of policy
* revise
* revision
* add bert_for_pretraining
2023-08-15 23:25:14 +08:00
Jianghai
c552cefa93
[pipeline] add pipeline policy and bert forward ( #4130 )
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
2023-08-15 23:25:14 +08:00
Hongxin Liu
5c897ddb94
[pipeline] add stage manager ( #4093 )
* [pipeline] add stage manager
* [test] add pipeline stage manager test
* [pipeline] add docstring for stage manager
2023-08-15 23:25:14 +08:00
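The stage manager's core job is mapping model layers onto pipeline stages. A minimal sketch of an even layer partition (`distribute_layers` and `get_stage_layers` are hypothetical helper names, not the actual ColossalAI implementation; giving earlier stages the extra layers is an assumption):

```python
def distribute_layers(num_layers: int, num_stages: int) -> list:
    """Split num_layers across num_stages as evenly as possible.

    Earlier stages receive the extra layers when the split is uneven
    (an assumption; real implementations may balance differently).
    """
    base, remainder = divmod(num_layers, num_stages)
    return [base + (1 if stage < remainder else 0) for stage in range(num_stages)]


def get_stage_layers(num_layers: int, num_stages: int, stage: int) -> tuple:
    """Return the (start, end) layer indices held by a given stage."""
    sizes = distribute_layers(num_layers, num_stages)
    start = sum(sizes[:stage])
    return start, start + sizes[stage]
```

With 10 layers over 4 stages this yields partitions [3, 3, 2, 2], so stage 1 holds layers 3 through 5.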
Jianghai
e8e7e49243
[pipeline] add pipeline policy and bert forward ( #4130 )
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
2023-08-15 23:25:14 +08:00
Hongxin Liu
f51ce1bc8e
[pipeline] refactor 1f1b schedule ( #4115 )
* [api] update optimizer wrapper to fit pipeline
* [pipeline] add base schedule
* [pipeline] add 1f1b schedule
* [test] add pipeline schedule utils test
* [pipeline] fix import
2023-08-15 23:25:14 +08:00
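For context, a 1F1B schedule runs a stage-dependent warmup of forwards, then alternates one forward with one backward, then drains the remaining backwards. A sketch of how the phase lengths are typically derived (illustrative only; `one_f_one_b_phases` is a hypothetical name, not the PR's code):

```python
def one_f_one_b_phases(num_microbatches: int, num_stages: int, stage: int) -> tuple:
    """Return (warmup, steady, cooldown) step counts for one pipeline stage.

    Earlier stages run more warmup forwards so that every stage can
    alternate one forward with one backward during the steady phase.
    """
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup  # number of 1F1B pairs
    cooldown = warmup                   # drain the backwards left over from warmup
    return warmup, steady, cooldown
```

For 8 microbatches on 4 stages, stage 0 warms up with 3 forwards while the last stage goes straight into 1F1B pairs.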
Hongxin Liu
45fdc9b42c
[pipeline] implement p2p communication ( #4100 )
* [pipeline] add p2p communication
* [test] add p2p communication test
* [test] add rerun decorator
* [test] rename to avoid conflict
2023-08-15 23:25:14 +08:00
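Pipeline p2p layers commonly ship arbitrary Python objects between ranks by serializing them into a length-prefixed byte buffer, so the receiver knows how much to read before deserializing. A minimal sketch of that round trip (an illustration of the idea using `pickle` and `struct`, not ColossalAI's actual p2p module):

```python
import pickle
import struct

HEADER = struct.Struct("!Q")  # 8-byte big-endian payload length


def pack_message(obj) -> bytes:
    """Serialize an object and prefix it with its length; a p2p send
    typically transmits the size first so the receiver can allocate a buffer."""
    payload = pickle.dumps(obj)
    return HEADER.pack(len(payload)) + payload


def unpack_message(buf: bytes):
    """Split off the length header and deserialize the payload on the receiving rank."""
    (size,) = HEADER.unpack(buf[:HEADER.size])
    return pickle.loads(buf[HEADER.size:HEADER.size + size])
```

In a real pipeline the two halves would be wired to point-to-point primitives such as `torch.distributed.send`/`recv` rather than returned as plain bytes.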
Hongxin Liu
422544222f
[pipeline] add stage manager ( #4093 )
* [pipeline] add stage manager
* [test] add pipeline stage manager test
* [pipeline] add docstring for stage manager
2023-08-15 23:25:14 +08:00
Hongxin Liu
5e1a9d48dd
[cluster] add process group mesh ( #4039 )
* [cluster] add process group mesh
* [test] add process group mesh test
* force sync
2023-08-15 23:25:14 +08:00
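A process group mesh arranges flat global ranks into a logical grid (e.g. pipeline-parallel and tensor-parallel axes) so subgroups can be carved out along each axis. A sketch of the underlying rank/coordinate arithmetic, assuming row-major layout (an assumption; the real class may order axes differently):

```python
def rank_to_coord(rank: int, shape: tuple) -> tuple:
    """Convert a flat global rank to mesh coordinates, row-major order."""
    coord = []
    for dim in reversed(shape):
        rank, idx = divmod(rank, dim)
        coord.append(idx)
    return tuple(reversed(coord))


def coord_to_rank(coord: tuple, shape: tuple) -> int:
    """Inverse mapping: mesh coordinates back to the flat global rank."""
    rank = 0
    for idx, dim in zip(coord, shape):
        rank = rank * dim + idx
    return rank
```

On a 2x4 mesh (2 pipeline stages, 4 tensor-parallel ranks), global rank 5 sits at coordinate (1, 1).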
Tian Siyuan
ff836790ae
[doc] fix a typo in examples/tutorial/auto_parallel/README.md ( #4430 )
Co-authored-by: Siyuan Tian <siyuant@vmware.com>
2023-08-15 00:22:57 +08:00
Wenhao Chen
6d41c3f2aa
[doc] update Coati README ( #4405 )
* style: apply formatter
* fix: add outdated warnings
* docs: add dataset format and polish
* docs: polish README
* fix: fix json format
* fix: fix typos
* revert: revert 7b example
2023-08-14 15:26:27 +08:00
LuGY
d86ddd9b29
[hotfix] fix unsafe async comm in zero ( #4404 )
* improve stability of zero
* fix wrong index
* add record stream
2023-08-11 15:09:24 +08:00
Baizhou Zhang
6ccecc0c69
[gemini] fix tensor storage cleaning in state dict collection ( #4396 )
2023-08-10 15:36:46 +08:00
flybird1111
458ae331ad
[kernel] updated unittests for coloattention ( #4389 )
Updated coloattention tests to check outputs and gradients
2023-08-09 14:24:45 +08:00
binmakeswell
089c365fa0
[doc] add Series A Funding and NeurIPS news ( #4377 )
* [doc] add Series A Funding and NeurIPS news
* [kernel] fix mha kernel
* [CI] skip moe
* [CI] fix requirements
2023-08-04 17:42:07 +08:00
flybird1111
f40b718959
[doc] Fix gradient accumulation doc. ( #4349 )
* [doc] fix gradient accumulation doc
* [doc] fix gradient accumulation doc
2023-08-04 17:24:35 +08:00
flybird1111
38b792aab2
[coloattention] fix import error ( #4380 )
fixed an import error
2023-08-04 16:28:41 +08:00
flybird1111
25c57b9fb4
[fix] coloattention support flash attention 2 ( #4347 )
Improved ColoAttention interface to support flash attention 2. Solved #4322
2023-08-04 13:46:22 +08:00
Wenhao Chen
da4f7b855f
[chat] fix bugs and add unit tests ( #4213 )
* style: rename replay buffer
Experience replay is typically for off-policy algorithms.
Using this name in PPO may be misleading.
* fix: fix wrong zero2 default arg
* test: update experience tests
* style: rename zero_pad fn
* fix: defer init in CycledDataLoader
* test: add benchmark test
* style: rename internal fn of generation
* style: rename internal fn of lora
* fix: remove unused loss fn
* fix: remove unused utils fn
* refactor: remove generate_with_actor fn
* fix: fix type annotation
* test: add models tests
* fix: skip llama due to long execution time
* style: modify dataset
* style: apply formatter
* perf: update reward dataset
* fix: fix wrong IGNORE_INDEX in sft dataset
* fix: remove DataCollatorForSupervisedDataset
* test: add dataset tests
* style: apply formatter
* style: rename test_ci to test_train
* feat: add llama in inference
* test: add inference tests
* test: change test scripts directory
* fix: update ci
* fix: fix typo
* fix: skip llama due to oom
* fix: fix file mod
* style: apply formatter
* refactor: remove duplicated llama_gptq
* style: apply formatter
* to: update rm test
* feat: add tokenizer arg
* feat: add download model script
* test: update train tests
* fix: modify gemini load and save pretrained
* test: update checkpoint io test
* to: modify nproc_per_node
* fix: do not remove existing dir
* fix: modify save path
* test: add random choice
* fix: fix sft path
* fix: enlarge nproc_per_node to avoid oom
* fix: add num_retry
* fix: make lora config of rm and critic consistent
* fix: add warning about lora weights
* fix: skip some gpt2 tests
* fix: remove grad ckpt in rm and critic due to errors
* refactor: directly use Actor in train_sft
* test: add more arguments
* fix: disable grad ckpt when using lora
* fix: fix save_pretrained and related tests
* test: enable zero2 tests
* revert: remove useless fn
* style: polish code
* test: modify test args
2023-08-02 10:17:36 +08:00
Hongxin Liu
16bf4c0221
[test] remove useless tests ( #4359 )
* [test] remove legacy zero test
* [test] remove lazy distribute test
* [test] remove outdated checkpoint io
2023-08-01 18:52:14 +08:00
caption
16c0acc01b
[hotfix] update gradio 3.11 to 3.34.0 ( #4329 )
2023-08-01 16:25:25 +08:00
Hongxin Liu
806477121d
[release] update version ( #4332 )
* [release] update version
* [devops] hotfix cuda extension building
* [devops] pytest ignore useless folders
2023-08-01 15:01:19 +08:00
Wenhao Chen
75c5389037
[chat] fix compute_approx_kl ( #4338 )
2023-08-01 10:21:45 +08:00
LuGY
03654c0ce2
fix localhost measurement ( #4320 )
2023-08-01 10:14:00 +08:00
LuGY
45b08f08cb
[zero] optimize the optimizer step time ( #4221 )
* optimize the optimizer step time
* fix corner case
* polish
* replace all-reduce with all-gather
* set comm device to cuda
2023-07-31 22:13:29 +08:00
LuGY
1a49a5ea00
[zero] support shard optimizer state dict of zero ( #4194 )
* support shard optimizer of zero
* polish code
* support sync grad manually
2023-07-31 22:13:29 +08:00
LuGY
dd7cc58299
[zero] add state dict for low level zero ( #4179 )
* add state dict for zero
* fix unit test
* polish
2023-07-31 22:13:29 +08:00
LuGY
c668801d36
[zero] allow passing process group to zero12 ( #4153 )
* allow passing process group to zero12
* unify tp-zero and normal-zero
* polish code
2023-07-31 22:13:29 +08:00
LuGY
79cf1b5f33
[zero] support no_sync method for zero1 plugin ( #4138 )
* support no sync for zero1 plugin
* polish
* polish
2023-07-31 22:13:29 +08:00
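The no_sync pattern lets gradient-accumulation steps skip cross-rank gradient synchronization, syncing only on the final backward. A toy sketch of the context-manager mechanism (`ToyZeroOptimizer` and its fields are illustrative stand-ins, not the plugin's real API):

```python
from contextlib import contextmanager


class ToyZeroOptimizer:
    """Toy stand-in showing the no_sync pattern: gradients are only
    synchronized across ranks when require_grad_sync is True."""

    def __init__(self):
        self.require_grad_sync = True
        self.synced_steps = 0

    @contextmanager
    def no_sync(self):
        """Temporarily disable gradient sync, restoring the prior state on exit."""
        old = self.require_grad_sync
        self.require_grad_sync = False
        try:
            yield
        finally:
            self.require_grad_sync = old

    def backward(self):
        if self.require_grad_sync:
            self.synced_steps += 1  # a real plugin would launch the collective here
```

Backwards issued inside `with opt.no_sync():` accumulate locally; the next backward outside the block triggers the (simulated) synchronization.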