Bin Jia
86d22581e4
[shardformer] Add overlap optional for HybridParallelPlugin ( #4615 )
...
* add optional overlap for plugin
* remove fixed todo
1 year ago
Hongxin Liu
a39a5c66fe
Merge branch 'main' into feature/shardformer
1 year ago
Baizhou Zhang
e79b1e80e2
[checkpointio] support huggingface from_pretrained for all plugins ( #4606 )
1 year ago
flybird11111
0a94fcd351
[shardformer] update bert finetune example with HybridParallelPlugin ( #4584 )
...
* [shardformer] fix opt test hanging
* fix
* test
* test
* test
* fix test
* fix test
* remove print
* add fix
* [shardformer] add bert finetune example
* [shardformer] add bert finetune example
* [shardformer] add bert finetune example
* [shardformer] add bert finetune example
* [shardformer] add bert finetune example
* [shardformer] add bert finetune example
* [shardformer] fix epoch change
* [shardformer] broadcast add pp group
* [shardformer] fix opt test hanging
* fix
* test
* test
* [shardformer] zero1+pp and the corresponding tests (#4517 )
* pause
* finish pp+zero1
* Update test_shard_vit.py
* [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516 )
* fix overlap bug and support bert, add overlap as an option in shardconfig
* support overlap for chatglm and bloom
* [shardformer] fix emerged bugs after updating transformers (#4526 )
* test
* fix test
* fix test
* remove print
* add fix
* [shardformer] add bert finetune example
* [shardformer] add bert finetune example
* [shardformer] Add overlap support for gpt2 (#4535 )
* add overlap support for gpt2
* remove unused code
* remove unused code
* [shardformer] support pp+tp+zero1 tests (#4531 )
* [shardformer] fix opt test hanging
* fix
* test
* test
* test
* fix test
* fix test
* remove print
* add fix
* [shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
* [shardformer] fix submodule replacement bug when enabling pp (#4544 )
* [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540 )
* implement sharded optimizer saving
* add more param info
* finish implementation of sharded optimizer saving
* fix bugs in optimizer sharded saving
* add pp+zero test
* param group loading
* greedy loading of optimizer
* fix bug when loading
* implement optimizer sharded saving
* add optimizer test & arrange checkpointIO utils
* fix gemini sharding state_dict
* add verbose option
* add loading of master params
* fix typehint
* fix master/working mapping in fp16 amp
* [shardformer] add bert finetune example
* [shardformer] add bert finetune example
* [shardformer] add bert finetune example
* [shardformer] add bert finetune example
* [shardformer] fix epoch change
* [shardformer] broadcast add pp group
* rebase feature/shardformer
* update pipeline
* [shardformer] fix
* [shardformer] fix
* [shardformer] bert finetune fix
* [shardformer] add all_reduce operation to loss
add all_reduce operation to loss
* [shardformer] make compatible with pytree.
make compatible with pytree.
* [shardformer] disable tp
disable tp
* [shardformer] add 3d plugin to ci test
* [shardformer] update num_microbatches to None
* [shardformer] update microbatchsize
* [shardformer] update assert
* update scheduler
* update scheduler
---------
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
1 year ago
Jianghai
24c0768795
[shardformer] Pytree fix ( #4533 )
...
* pytree test
* test bert
* test bert
* test bert
* revise
* add register
* add register
1 year ago
yingliu-hpc
aaeb520ce3
Merge pull request #4542 from hpcaitech/chatglm
...
[coati] Add chatglm in coati
1 year ago
binmakeswell
8d7b02290f
[doc] add llama2 benchmark ( #4604 )
...
* [doc] add llama2 benchmark
* [doc] add llama2 benchmark
1 year ago
binmakeswell
7a978eb3d0
[DOC] hotfix/llama2news ( #4595 )
...
* [doc] add llama2 news
* [doc] add llama2 news
* [doc] add llama2 news
1 year ago
Hongxin Liu
63ecafb1fb
[checkpointio] optimize zero optim checkpoint io ( #4591 )
...
* [zero] update checkpoint io to save memory
* [checkpointio] add device map to save memory
1 year ago
Hongxin Liu
508ca36fe3
[pipeline] 1f1b schedule receive microbatch size ( #4589 )
1 year ago
Mashiro
cfa607080f
[Fix] Fix compile error ( #4357 )
1 year ago
栾鹏
eb952ea88d
Update Dockerfile ( #4499 )
...
fix dockerfile build
1 year ago
LuGY
cbac782254
[zero]fix zero ckptIO with offload ( #4529 )
...
* fix zero ckptio with offload
* fix load device
* saved tensors in ckpt should be on CPU
* fix unit test
* fix unit test
* add clear cache
* save memory for CI
1 year ago
Baizhou Zhang
38ccb8b1a3
[shardformer] support from_pretrained when loading model with HybridParallelPlugin ( #4575 )
...
* hybrid plugin support huggingface from_pretrained
* add huggingface compatibility tests
* add folder cleaning
* fix bugs
1 year ago
Baizhou Zhang
c9625dbb63
[shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin ( #4540 )
...
* implement sharded optimizer saving
* add more param info
* finish implementation of sharded optimizer saving
* fix bugs in optimizer sharded saving
* add pp+zero test
* param group loading
* greedy loading of optimizer
* fix bug when loading
* implement optimizer sharded saving
* add optimizer test & arrange checkpointIO utils
* fix gemini sharding state_dict
* add verbose option
* add loading of master params
* fix typehint
* fix master/working mapping in fp16 amp
1 year ago
Baizhou Zhang
2c787d7f47
[shardformer] fix submodule replacement bug when enabling pp ( #4544 )
1 year ago
Hongxin Liu
c7b60f7547
[devops] cancel previous runs in the PR ( #4546 )
1 year ago
Tian Siyuan
f1ae8c9104
[example] change accelerate version ( #4431 )
...
Co-authored-by: Siyuan Tian <siyuant@vmware.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
1 year ago
ChengDaqi2023
8e2e1992b8
[example] update streamlit 0.73.1 to 1.11.1 ( #4386 )
1 year ago
flybird11111
ec18fc7340
[shardformer] support pp+tp+zero1 tests ( #4531 )
...
* [shardformer] fix opt test hanging
* fix
* test
* test
* test
* fix test
* fix test
* remove print
* add fix
* [shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
[shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
* [shardformer] pp+tp+zero1
1 year ago
Lufang Chen
12c95a9fed
fix runtime prepare pass ( #4502 )
...
Co-authored-by: lufang.chen <lufang.chen@nio.com>
1 year ago
Ying Liu
9f852f2489
keep requirements same with main branch
1 year ago
flybird11111
d367b88785
[shardformer] fix opt test hanging ( #4521 )
...
* [shardformer] fix opt test hanging
* fix
* test
* test
* test
* fix test
* fix test
* remove print
* add fix
1 year ago
Ying Liu
c648dc093f
fix colossalai version in coati examples
1 year ago
yingliu-hpc
661a1ef712
Merge pull request #4541 from ver217/coati/chatglm
...
[coati] update ci
1 year ago
ver217
1c43bfd54e
[coati] update ci
1 year ago
Bin Jia
e241b74f24
[shardformer] Add overlap support for gpt2 ( #4535 )
...
* add overlap support for gpt2
* remove unused code
* remove unused code
1 year ago
yingliu-hpc
1467e3b41b
[coati] add chatglm model ( #4539 )
...
* update configuration of chatglm and add support in coati
* add unit test & update chatglm default config & fix bos index issue
* remove chatglm due to oom
* add dataset pkg in requirement-text
* fix parameter issue in test_models
* add ref in tokenize & rm unnecessary parts
* separate source & target tokenization in chatglm
* add unit test to chatglm
* fix test dataset issue
* update truncation of chatglm
* fix Colossalai version
* fix colossal ai version in test
1 year ago
Baizhou Zhang
0387a47e63
[shardformer] fix emerged bugs after updating transformers ( #4526 )
1 year ago
Hongxin Liu
0b00def881
[example] add llama2 example ( #4527 )
...
* [example] transfer llama-1 example
* [example] fit llama-2
* [example] refactor scripts folder
* [example] fit new gemini plugin
* [cli] fix multinode runner
* [example] fit gemini optim checkpoint
* [example] refactor scripts
* [example] update requirements
* [example] update requirements
* [example] rename llama to llama2
* [example] update readme and pretrain script
* [example] refactor scripts
1 year ago
Bin Jia
c554b7f559
[shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… ( #4516 )
...
* fix overlap bug and support bert, add overlap as an option in shardconfig
* support overlap for chatglm and bloom
1 year ago
Jianghai
376533a564
[shardformer] zero1+pp and the corresponding tests ( #4517 )
...
* pause
* finish pp+zero1
* Update test_shard_vit.py
1 year ago
Baizhou Zhang
44eab2b27f
[shardformer] support sharded checkpoint IO for models of HybridParallelPlugin ( #4506 )
...
* add APIs
* implement save_sharded_model
* add test for hybrid checkpointio
* implement naive loading for sharded model
* implement efficient sharded model loading
* open a new file for hybrid checkpoint_io
* small fix
* fix circular importing
* fix docstring
* arrange arguments and apis
* small fix
1 year ago
flybird11111
de8a65babc
[shardformer] opt fix. ( #4514 )
...
* [shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
* fix
fix
fix
fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* activate checks
* [Test] test ci
* test ci
* test ci
* test ci
* test ci
* test ci
* test ci
* fix
1 year ago
LuGY
839847b7d7
[zero]support zero2 with gradient accumulation ( #4511 )
...
* support gradient accumulation with zero2
* fix type
1 year ago
github-actions[bot]
c0efc3ebcb
[format] applied code formatting on changed files in pull request 4479 ( #4504 )
...
Co-authored-by: github-actions <github-actions@github.com>
1 year ago
flybird11111
3353e55c80
[shardformer] vit/llama/t5 ignore the sequence parallelism flag and some fix. ( #4498 )
...
* [shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
* fix
fix
fix
fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* activate checks
1 year ago
Hongxin Liu
27061426f7
[gemini] improve compatibility and add static placement policy ( #4479 )
...
* [gemini] remove distributed-related part from colotensor (#4379 )
* [gemini] remove process group dependency
* [gemini] remove tp part from colo tensor
* [gemini] patch inplace op
* [gemini] fix param op hook and update tests
* [test] remove useless tests
* [test] remove useless tests
* [misc] fix requirements
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [misc] update requirements
* [gemini] refactor gemini optimizer and gemini ddp (#4398 )
* [gemini] update optimizer interface
* [gemini] renaming gemini optimizer
* [gemini] refactor gemini ddp class
* [example] update gemini related example
* [example] update gemini related example
* [plugin] fix gemini plugin args
* [test] update gemini ckpt tests
* [gemini] fix checkpoint io
* [example] fix opt example requirements
* [example] fix opt example
* [example] fix opt example
* [example] fix opt example
* [gemini] add static placement policy (#4443 )
* [gemini] add static placement policy
* [gemini] fix param offload
* [test] update gemini tests
* [plugin] update gemini plugin
* [plugin] update gemini plugin docstr
* [misc] fix flash attn requirement
* [test] fix gemini checkpoint io test
* [example] update resnet example result (#4457 )
* [example] update bert example result (#4458 )
* [doc] update gemini doc (#4468 )
* [example] update gemini related examples (#4473 )
* [example] update gpt example
* [example] update dreambooth example
* [example] update vit
* [example] update opt
* [example] update palm
* [example] update vit and opt benchmark
* [hotfix] fix bert in model zoo (#4480 )
* [hotfix] fix bert in model zoo
* [test] remove chatglm gemini test
* [test] remove sam gemini test
* [test] remove vit gemini test
* [hotfix] fix opt tutorial example (#4497 )
* [hotfix] fix opt tutorial example
* [hotfix] fix opt tutorial example
1 year ago
Jianghai
e04436a82a
[shardformer] tests for 3d parallel ( #4493 )
1 year ago
flybird11111
59e252ecdb
[shardformer] chatglm support sequence parallel ( #4482 )
...
* [shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
* fix
fix
fix
fix
1 year ago
Bin Jia
351351a36e
[shardformer/sequence parallel] not support opt of seq-parallel, add warning and fix a bug in gpt2 pp ( #4488 )
1 year ago
Jianghai
5545114fd8
rename chatglm to chatglm2 ( #4484 )
1 year ago
Michelle
285fe7ba71
[chat] update config and prompt ( #4139 )
...
* update config and prompt
* update config
---------
Co-authored-by: Qianran Ma <qianranm@luchentech.com>
1 year ago
Baizhou Zhang
1c7df566e2
[shardformer] support tp+zero for shardformer ( #4472 )
...
* support tp+zero/input type cast for hybridplugin
* add tp+zero tests
* fix bucket arguments
1 year ago
Jianghai
8739aa7fa0
[shardformer] Pipeline/whisper ( #4456 )
...
* add some base tests and policies
* finish whisper base model
* add conditional generation
* finish basic tests
* whisper
* finish whisper
* finish whisper
* del useless whisper test
* fix
* add argmin to replace
* finish revision
1 year ago
flybird11111
a27e0bb494
[shardformer] bert support sequence parallel. ( #4455 )
...
* [shardformer] bert support sequence parallel
[shardformer] bert support sequence parallel
[shardformer] bert support sequence parallel
[shardformer] bert support sequence parallel
[shardformer] bert support sequence parallel
[shardformer] bert support sequence parallel
[shardformer] bert support sequence parallel
[shardformer] bert support sequence parallel
[shardformer] bert support sequence parallel
* [shardformer] bert support sequence parallel
[shardformer] bert support sequence parallel
[shardformer] bert support sequence parallel
* [shardformer] bert support sequence parallel
1 year ago
flybird11111
0ecd71e041
[shardformer] bloom support sequence parallel ( #4465 )
...
[shardformer] bloom support sequence parallel
1 year ago
Bin Jia
7c8be77081
[shardformer/sequence parallel] support gpt2 seq parallel with pp/dp/tp ( #4460 )
...
* support gpt2 seq parallel with pp/dp/tp
* fix a bug when waiting for stream done
* delete unused gpt2_seq file
1 year ago
LuGY
a78daf6180
[shardformer] support interleaved pipeline ( #4448 )
...
* support interleaved pipeline
* fix unit test
* remove virtual stage test in stage mgr
* add dropped type hint and updated bwd
1 year ago
Hongxin Liu
26e29d58f0
[devops] add large-scale distributed test marker ( #4452 )
...
* [test] remove cpu marker
* [test] remove gpu marker
* [test] update pytest markers
* [ci] update unit test ci
1 year ago