Commit Graph

320 Commits (1e0e080837478e95bc2d835c58ccd025a0013c00)

Author SHA1 Message Date
github-actions[bot] 353566c198
Automated submodule synchronization (#483)
Co-authored-by: github-actions <github-actions@github.com>
2022-03-22 09:34:26 +08:00
github-actions[bot] cfcc8271f3
[Bot] Automated submodule synchronization (#451)
Co-authored-by: github-actions <github-actions@github.com>
2022-03-18 09:51:43 +08:00
github-actions 6098bc4cce Automated submodule synchronization 2022-03-14 00:01:12 +00:00
github-actions b9f8521f8c Automated submodule synchronization 2022-02-15 11:35:37 +08:00
github-actions[bot] 5420809f43
Automated submodule synchronization (#203)
Co-authored-by: github-actions <github-actions@github.com>
2022-02-04 10:19:38 +08:00
Frank Lee ca4ae52d6b
Set examples as submodule (#162)
* remove examples folder

* added examples as submodule

* update .gitmodules
2022-01-19 16:35:36 +08:00
LuGY_mac d143396cac Added rand augment and update the dataloader 2022-01-18 16:14:46 +08:00
HELSON 1ff5be36c2
Added moe parallel example (#140) 2022-01-17 15:34:04 +08:00
ver217 f03bcb359b
update vit example for new API (#98) (#99) 2022-01-04 20:35:33 +08:00
アマデウス 0fedef4f3c
Layer integration (#83)
* integrated parallel layers for ease of building models

* integrated 2.5d layers

* cleaned codes and unit tests

* added log metric by step hook; updated imagenet benchmark; fixed some bugs

* reworked initialization; cleaned codes

Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
2021-12-27 15:04:32 +08:00
Xin Zhang 648f806315
add example of self-supervised SimCLR training - V2 (#50)
* add example of self-supervised SimCLR training

* simclr v2, replace nvidia dali dataloader

* updated

* sync to latest code writing style

* sync to latest code writing style and modify README

* detail README & standardize dataset path
2021-12-21 08:07:18 +08:00
Frank Lee 35813ed3c4
update examples and sphnix docs for the new api (#63) 2021-12-13 22:07:01 +08:00
Frank Lee da01c234e1
Develop/experiments (#59)
* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

* Split conv2d, class token, positional embedding in 2d, Fix random number in ddp
Fix convergence in cifar10, Imagenet1000

* Integrate 1d tensor parallel in Colossal-AI (#39)

* fixed 1D and 2D convergence (#38)

* optimized 2D operations

* fixed 1D ViT convergence problem

* Feature/ddp (#49)

* remove redundancy func in setup (#19) (#20)

* use env to control the language of doc (#24) (#25)

* Support TP-compatible Torch AMP and Update trainer API (#27)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)

* add explanation for ViT example (#35) (#36)

* support torch ddp

* fix loss accumulation

* add log for ddp

* change seed

* modify timing hook

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* Feature/pipeline (#40)

* remove redundancy func in setup (#19) (#20)

* use env to control the language of doc (#24) (#25)

* Support TP-compatible Torch AMP and Update trainer API (#27)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)

* add explanation for ViT example (#35) (#36)

* optimize communication of pipeline parallel

* fix grad clip for pipeline

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* optimized 3d layer to fix slow computation ; tested imagenet performance with 3d; reworked lr_scheduler config definition; fixed launch args; fixed some printing issues; simplified apis of 3d layers (#51)

* Update 2.5d layer code to get a similar accuracy on imagenet-1k dataset

* update api for better usability (#58)

update api for better usability

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
2021-12-09 15:08:29 +08:00
ver217 eb2f8b1f6b
add how to build tfrecord dataset (#48) 2021-12-02 16:31:23 +08:00
ver217 4da256a584
add some details in vit-b16 example (#46) 2021-12-02 09:29:27 +08:00
ver217 e67dab92a9
add some details in vit-b16 example (#43) (#44) 2021-12-02 08:55:11 +08:00
binmakeswell 2528adc62f
add explanation for ViT example (#35) (#36) 2021-11-29 10:25:38 +08:00
ver217 dbe62c67b8
add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29) 2021-11-18 23:45:09 +08:00
Frank Lee 3defa32aee
Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
2021-11-18 19:45:06 +08:00
zbian 404ecbdcc6 Migrated project 2021-10-28 18:21:23 +02:00