ColossalAI/examples
flybird11111 0a94fcd351
[shardformer] update bert finetune example with HybridParallelPlugin (#4584)
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
2023-09-04 21:46:29 +08:00
community [hotfix]fix argument naming in docs and examples (#4083) 2023-06-26 23:50:04 +08:00
images [hotfix] update gradio 3.11 to 3.34.0 (#4329) 2023-08-01 16:25:25 +08:00
language [shardformer] update bert finetune example with HybridParallelPlugin (#4584) 2023-09-04 21:46:29 +08:00
tutorial [doc] fix a typo in examples/tutorial/auto_parallel/README.md (#4430) 2023-08-15 00:22:57 +08:00
README.md [example] reorganize for community examples (#3557) 2023-04-14 16:27:48 +08:00

README.md

Colossal-AI Examples

Overview

This folder provides several examples accelerated by Colossal-AI. Folders such as images and language cover a wide range of deep learning tasks and applications. The community folder aims to be a collaborative platform where developers contribute exotic features built on top of Colossal-AI. The tutorial folder lets everyone quickly try out the different features of Colossal-AI.

You can find applications such as Chatbot, AIGC and Biomedicine in the Applications directory.

Folder Structure

└─ examples
  └─ images
      └─ vit
        └─ test_ci.sh
        └─ train.py
        └─ README.md
      └─ ...
  └─ ...

Invitation to open-source contribution

Following the successful examples of BLOOM and Stable Diffusion, all developers and partners with computing power, datasets, or models are welcome to join and build the Colossal-AI community, working together towards the era of big AI models!

You may contact us or participate in the following ways:

  1. Leave a Star to show your support. Thanks!
  2. Post an issue or submit a PR on GitHub, following the guidelines in Contributing.
  3. Join the Colossal-AI community on Slack and WeChat(微信) to share your ideas.
  4. Send your official proposal by email to contact@hpcaitech.com

Thanks so much to all of our amazing contributors!

Integrate Your Example With Testing

Regular checks are important to ensure that all examples run without apparent bugs and stay compatible with the latest API. Colossal-AI runs workflows that check the examples on every pull request and on a weekly basis. When an example is added or changed, the workflow runs that example to test whether it still works; in addition, all examples are tested every week.

Therefore, it is essential for example contributors to know how to integrate their examples with the testing workflow. You can simply follow the steps below.

  1. Create a script called test_ci.sh in your example folder.
  2. Configure your testing parameters, such as the number of steps and the batch size, in test_ci.sh. Keep these parameters small so that each example takes only a few minutes.
  3. Export your dataset path with the prefix /data and make sure you have a copy of the dataset in the /data/scratch/examples-data directory on the CI machine. Community contributors can contact us via Slack to request that the dataset be downloaded to the CI machine.
  4. Implement the logic, such as dependency setup and example execution.
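
The steps above could look roughly like the following test_ci.sh. This is a minimal sketch, not a script taken from the repository: the dataset sub-folder my_dataset, the requirements.txt file, and the train.py flags (--data, --max_steps, --batch_size) are hypothetical placeholders that you would replace with whatever your example actually uses. The DRY_RUN switch is also an assumption added for illustration; it defaults to printing the commands so the sketch is safe to try locally.

```shell
#!/usr/bin/env bash
# Hypothetical test_ci.sh sketch. File names, flags and the dataset
# sub-folder are placeholders, not part of the Colossal-AI repository.
set -euo pipefail

# Step 2: keep the run tiny so the example finishes within a few minutes.
MAX_STEPS=5
BATCH_SIZE=4

# Step 3: export a dataset path under /data (sub-folder name is assumed).
export DATA="${DATA:-/data/scratch/examples-data/my_dataset}"

# Helper: print the command when DRY_RUN=1 (the default in this sketch),
# execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 4: dependency setup, then example execution.
run pip install -r requirements.txt
run python train.py --data "$DATA" --max_steps "$MAX_STEPS" --batch_size "$BATCH_SIZE"
```

With the default settings the script only echoes the two commands. Setting DRY_RUN=0 makes it run them for real, and `set -euo pipefail` makes the script exit with a non-zero status as soon as any command fails.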

Community Dependency

We are happy to introduce the following nice community dependency repos that are powered by Colossal-AI: