Commit Graph

2788 Commits (d512a4d38df375990591d58dad282481b6cfab05)

Author SHA1 Message Date
Hongxin Liu a686f9ddc8
[devops] fix concurrency group and compatibility test (#4665)
* [devops] fix concurrency group

* [devops] fix compatibility test

* [devops] fix tensornvme install

* [devops] fix tensornvme install

* [devops] fix colossalai install
2023-09-08 13:49:40 +08:00
Baizhou Zhang 295b38fecf
[example] update vit example for hybrid parallel plugin (#4641)
* update vit example for hybrid plugin

* reset tp/pp size

* fix dataloader iteration bug

* update optimizer passing in evaluation/add grad_accum

* change criterion

* wrap tqdm

* change grad_accum to grad_checkpoint

* fix pbar
2023-09-07 17:38:45 +08:00
Baizhou Zhang 660eed9124
[pipeline] set optimizer to optional in execute_pipeline (#4630)
* set optimizer to optional in execute_pipeline

* arrange device and mixed precision in booster init

* fix execute_pipeline in booster.py
2023-09-07 10:42:59 +08:00
eric8607242 c3d5fa3bac
[shardformer] Support customized policy for llamav2 based model with HybridParallelPlugin (#4624)
* Enable policy assignment in HybridPlugin and enable llama policy for llamav2

* Remove Policy from Plugin

* revert changes of plugin

HybridParallelModule

* revert changes in plugin

* upgrade transformers

* revert transformers version

---------

Co-authored-by: flybird11111 <1829166702@qq.com>
2023-09-07 10:15:13 +08:00
Hongxin Liu 9709b8f502
[release] update version (#4623) 2023-09-06 23:41:04 +08:00
Hongxin Liu efba0f44b9
Merge pull request #4612 from hpcaitech/feature/shardformer
[shardformer] update hybrid parallel plugin and fix bugs
2023-09-05 23:20:00 +08:00
Hongxin Liu fae6c92ead
Merge branch 'main' into feature/shardformer 2023-09-05 21:54:08 +08:00
Hongxin Liu ac178ca5c1 [legacy] move builder and registry to legacy (#4603) 2023-09-05 21:53:10 +08:00
Hongxin Liu 8accecd55b [legacy] move engine to legacy (#4560)
* [legacy] move engine to legacy

* [example] fix seq parallel example

* [example] fix seq parallel example

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [example] update seq parallel requirements
2023-09-05 21:53:10 +08:00
Hongxin Liu 89fe027787 [legacy] move trainer to legacy (#4545)
* [legacy] move trainer to legacy

* [doc] update docs related to trainer

* [test] ignore legacy test
2023-09-05 21:53:10 +08:00
Hongxin Liu bd18678478
[test] fix gemini checkpoint and gpt test (#4620) 2023-09-05 16:02:23 +08:00
Hongxin Liu 807e01a4ba
[zero] hotfix master param sync (#4618)
* [zero] add method to update master params

* [zero] update zero plugin

* [plugin] update low level zero plugin
2023-09-05 15:04:02 +08:00
Hongxin Liu e71d245293
[test] ignore gpt2 shardformer test (#4619) 2023-09-05 14:21:31 +08:00
flybird11111 ec0866804c
[shardformer] update shardformer readme (#4617)
[shardformer] update shardformer readme

[shardformer] update shardformer readme
2023-09-05 13:14:41 +08:00
Bin Jia 86d22581e4
[shardformer] Add overlap optional for HybridParallelPlugin (#4615)
* add optional overlap for plugin

* remove fixed todo
2023-09-05 11:52:23 +08:00
Hongxin Liu a39a5c66fe
Merge branch 'main' into feature/shardformer 2023-09-04 23:43:13 +08:00
Baizhou Zhang e79b1e80e2
[checkpointio] support huggingface from_pretrained for all plugins (#4606) 2023-09-04 23:25:01 +08:00
flybird11111 0a94fcd351
[shardformer] update bert finetune example with HybridParallelPlugin (#4584)
* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] fix epoch change

* [shardformer] broadcast add pp group

* [shardformer] fix opt test hanging

* fix

* test

* test

* [shardformer] zero1+pp and the corresponding tests (#4517)

* pause

* finish pp+zero1

* Update test_shard_vit.py

* [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516)

* fix overlap bug and support bert, add overlap as an option in shardconfig

* support overlap for chatglm and bloom

* [shardformer] fix emerged bugs after updating transformers (#4526)

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] Add overlap support for gpt2 (#4535)

* add overlap support for gpt2

* remove unused code

* remove unused code

* [shardformer] support pp+tp+zero1 tests (#4531)

* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] fix submodule replacement bug when enabling pp (#4544)

* [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540)

* implement sharded optimizer saving

* add more param info

* finish implementation of sharded optimizer saving

* fix bugs in optimizer sharded saving

* add pp+zero test

* param group loading

* greedy loading of optimizer

* fix bug when loading

* implement optimizer sharded saving

* add optimizer test & arrange checkpointIO utils

* fix gemini sharding state_dict

* add verbose option

* add loading of master params

* fix typehint

* fix master/working mapping in fp16 amp

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] fix epoch change

* [shardformer] broadcast add pp group

* rebase feature/shardformer

* update pipeline

* [shardformer] fix

* [shardformer] fix

* [shardformer] bert finetune fix

* [shardformer] add all_reduce operation to loss

add all_reduce operation to loss

* [shardformer] make compatible with pytree.

make compatible with pytree.

* [shardformer] disable tp

disable tp

* [shardformer] add 3d plugin to ci test

* [shardformer] update num_microbatches to None

* [shardformer] update microbatchsize

* [shardformer] update assert

* update scheduler

* update scheduler

---------

Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
2023-09-04 21:46:29 +08:00
Jianghai 24c0768795
[shardformer] Pytree fix (#4533)
* pytree test

* test bert

* test bert

* test bert

* revise

* add register

* add register
2023-09-04 17:52:23 +08:00
yingliu-hpc aaeb520ce3
Merge pull request #4542 from hpcaitech/chatglm
[coati] Add chatglm in coati
2023-09-04 16:09:45 +08:00
binmakeswell 8d7b02290f
[doc] add llama2 benchmark (#4604)
* [doc] add llama2 benchmark

* [doc] add llama2 benchmark
2023-09-04 13:49:33 +08:00
binmakeswell 7a978eb3d0
[DOC] hotfix/llama2news (#4595)
* [doc] add llama2 news

* [doc] add llama2 news

* [doc] add llama2 news
2023-09-04 11:50:27 +08:00
Hongxin Liu 63ecafb1fb
[checkpointio] optimize zero optim checkpoint io (#4591)
* [zero] update checkpoint io to save memory

* [checkpointio] add device map to save memory
2023-09-04 11:26:45 +08:00
Hongxin Liu 508ca36fe3
[pipeline] 1f1b schedule receive microbatch size (#4589) 2023-09-01 21:45:14 +08:00
Mashiro cfa607080f
[Fix] Fix compile error (#4357) 2023-09-01 18:12:58 +08:00
栾鹏 eb952ea88d
Update Dockerfile (#4499)
fix dockerfile build
2023-09-01 18:12:34 +08:00
LuGY cbac782254
[zero]fix zero ckptIO with offload (#4529)
* fix zero ckptio with offload

* fix load device

* saved tensors in ckpt should be on CPU

* fix unit test

* fix unit test

* add clear cache

* save memory for CI
2023-09-01 17:41:19 +08:00
Baizhou Zhang 38ccb8b1a3
[shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575)
* hybrid plugin support huggingface from_pretrained

* add huggingface compatibility tests

* add folder cleaning

* fix bugs
2023-09-01 17:40:01 +08:00
Baizhou Zhang c9625dbb63
[shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540)
* implement sharded optimizer saving

* add more param info

* finish implementation of sharded optimizer saving

* fix bugs in optimizer sharded saving

* add pp+zero test

* param group loading

* greedy loading of optimizer

* fix bug when loading

* implement optimizer sharded saving

* add optimizer test & arrange checkpointIO utils

* fix gemini sharding state_dict

* add verbose option

* add loading of master params

* fix typehint

* fix master/working mapping in fp16 amp
2023-08-31 14:50:47 +08:00
Baizhou Zhang 2c787d7f47
[shardformer] fix submodule replacement bug when enabling pp (#4544) 2023-08-31 09:57:18 +08:00
Hongxin Liu c7b60f7547
[devops] cancel previous runs in the PR (#4546) 2023-08-30 23:07:21 +08:00
Tian Siyuan f1ae8c9104
[example] change accelerate version (#4431)
Co-authored-by: Siyuan Tian <siyuant@vmware.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
2023-08-30 22:56:13 +08:00
ChengDaqi2023 8e2e1992b8
[example] update streamlit 0.73.1 to 1.11.1 (#4386) 2023-08-30 22:54:45 +08:00
flybird11111 ec18fc7340
[shardformer] support pp+tp+zero1 tests (#4531)
* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1
2023-08-30 21:29:18 +08:00
Lufang Chen 12c95a9fed
fix runtime prepare pass (#4502)
Co-authored-by: lufang.chen <lufang.chen@nio.com>
2023-08-30 17:29:38 +08:00
Ying Liu 9f852f2489 keep requirements same with main branch 2023-08-30 16:27:12 +08:00
flybird11111 d367b88785
[shardformer] fix opt test hanging (#4521)
* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix
2023-08-30 14:50:34 +08:00
Ying Liu c648dc093f fix colossalai version in coati examples 2023-08-30 11:14:19 +08:00
yingliu-hpc 661a1ef712
Merge pull request #4541 from ver217/coati/chatglm
[coati] update ci
2023-08-30 11:01:41 +08:00
ver217 1c43bfd54e [coati] update ci 2023-08-30 10:55:56 +08:00
Bin Jia e241b74f24
[shardformer] Add overlap support for gpt2 (#4535)
* add overlap support for gpt2

* remove unused code

* remove unused code
2023-08-29 18:30:50 +08:00
yingliu-hpc 1467e3b41b
[coati] add chatglm model (#4539)
* update configuration of chatglm and add support in coati

* add unit test & update chatglm default config & fix bos index issue

* remove chatglm due to oom

* add dataset pkg in requirement-text

* fix parameter issue in test_models

* add ref in tokenize & rm unnecessary parts

* separate source & target tokenization in chatglm

* add unit test to chatglm

* fix test dataset issue

* update truncation of chatglm

* fix Colossalai version

* fix colossal ai version in test
2023-08-29 17:58:51 +08:00
Baizhou Zhang 0387a47e63
[shardformer] fix emerged bugs after updating transformers (#4526) 2023-08-29 11:25:05 +08:00
Hongxin Liu 0b00def881
[example] add llama2 example (#4527)
* [example] transfer llama-1 example

* [example] fit llama-2

* [example] refactor scripts folder

* [example] fit new gemini plugin

* [cli] fix multinode runner

* [example] fit gemini optim checkpoint

* [example] refactor scripts

* [example] update requirements

* [example] update requirements

* [example] rename llama to llama2

* [example] update readme and pretrain script

* [example] refactor scripts
2023-08-28 17:59:11 +08:00
Bin Jia c554b7f559
[shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516)
* fix overlap bug and support bert, add overlap as an option in shardconfig

* support overlap for chatglm and bloom
2023-08-28 17:16:40 +08:00
Jianghai 376533a564
[shardformer] zero1+pp and the corresponding tests (#4517)
* pause

* finish pp+zero1

* Update test_shard_vit.py
2023-08-28 10:51:16 +08:00
Baizhou Zhang 44eab2b27f
[shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506)
* add APIs

* implement save_sharded_model

* add test for hybrid checkpointio

* implement naive loading for sharded model

* implement efficient sharded model loading

* open a new file for hybrid checkpoint_io

* small fix

* fix circular importing

* fix docstring

* arrange arguments and apis

* small fix
2023-08-25 22:04:57 +08:00
flybird11111 de8a65babc
[shardformer] opt fix. (#4514)
* [shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

* fix

fix

fix

fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* activate checks

* [Test] test ci

* test ci

* test ci

* test ci

* test ci

* test ci

* test ci

* fix
2023-08-25 19:41:24 +08:00
LuGY 839847b7d7
[zero]support zero2 with gradient accumulation (#4511)
* support gradient accumulation with zero2

* fix type
2023-08-25 13:44:07 +08:00
github-actions[bot] c0efc3ebcb
[format] applied code formatting on changed files in pull request 4479 (#4504)
Co-authored-by: github-actions <github-actions@github.com>
2023-08-25 10:00:53 +08:00