418 Commits (955463e542e2b986da3280be7daad7ebb760a7ea)

Author SHA1 Message Date
HELSON 425b4a96b8
[gemini] polish stateful_tensor_mgr (#876) 2022-04-26 15:05:03 +08:00
Jiarui Fang e43f83aa5c
[Tensor] get named parameters for model using ColoTensors (#874) 2022-04-26 14:08:01 +08:00
Jiarui Fang 96211c2cc8
[tensor] customized op returns ColoTensor (#875)
* [tensor] customized op returns ColoTensor

* polish

* polish code
2022-04-26 13:23:59 +08:00
Ziyue Jiang 26d4ab8b03
[Tensor] Add function to spec and update linear 1Drow and unit tests (#869) 2022-04-26 10:15:26 +08:00
Frank Lee 11f54c7b6b
[doc] improved docstring and assertion messages for the engine module (#871) 2022-04-26 10:00:18 +08:00
Frank Lee 1c34382678
[doc] improved assertion messages in trainer (#873) 2022-04-26 10:00:12 +08:00
Frank Lee 7a64fae33a
[doc] improved error messages in initialize (#872) 2022-04-26 10:00:03 +08:00
Jiarui Fang 1190b2c4a4
[tensor] add cross_entrophy_loss (#868) 2022-04-25 16:01:52 +08:00
HELSON 3107817172
[gemini] add stateful tensor container (#867) 2022-04-25 14:58:16 +08:00
Jiarui Fang d01d3b8cb0
colo init context add device attr. (#866) 2022-04-25 14:24:26 +08:00
Frank Lee 2238758c2e
[usability] improved error messages in the context module (#856) 2022-04-25 13:42:31 +08:00
Frank Lee 9fdebadd69
[doc] improved docstring in the amp module (#857) 2022-04-25 13:42:17 +08:00
Frank Lee b862d89d00
[doc] improved docstring in the logging module (#861) 2022-04-25 13:42:00 +08:00
Frank Lee 8004c8e938
[doc] improved docstring in the communication module (#863) 2022-04-25 13:41:43 +08:00
Jiarui Fang 8af5f7423d
[tensor] an initial idea of tensor spec (#865)
* an initial idea of tensor spec

* polish

* polish
2022-04-25 13:33:52 +08:00
Jiarui Fang 126ba573a8
[Tensor] add layer norm Op (#852) 2022-04-25 11:49:20 +08:00
Frank Lee a82da26f7e
[cli] refactored micro-benchmarking cli and added more metrics (#858) 2022-04-25 11:48:07 +08:00
Frank Lee ee222dfbf3
[usability] added assertion message in registry (#864) 2022-04-25 11:45:15 +08:00
HELSON f0e654558f
[gemini] polish code (#855) 2022-04-25 10:40:14 +08:00
Jiarui Fang 29159d9b5b
hotfix tensor unittest bugs (#862) 2022-04-25 10:06:53 +08:00
YuliangLiu0306 c6930d8ddf
[pipelinable]use ColoTensor to replace dummy tensor. (#853) 2022-04-24 18:31:22 +08:00
Ziyue Jiang bcc8655021
[Tensor ] Add 1Drow weight reshard by spec (#854) 2022-04-24 18:30:20 +08:00
ver217 d7e0303d1e
[zero] use GeminiMemoryManager when sampling model data (#850) 2022-04-24 17:17:22 +08:00
ver217 232142f402
[utils] refactor profiler (#837)
* add model data profiler

* add a subclass of torch.profiler.profile

* refactor folder structure

* remove redundant codes

* polish code

* use GeminiMemoryManager

* fix import path

* fix stm profiler ext

* polish comments

* remove useless file
2022-04-24 17:03:59 +08:00
Jiarui Fang 62f059251b
[Tensor] init a tp network training unittest (#849) 2022-04-24 16:43:44 +08:00
ver217 0dea140760
[hotfix] add deconstructor for stateful tensor (#848)
* add deconstructor for stateful tensor

* fix colo init context
2022-04-24 15:03:04 +08:00
ver217 0f7ed8c192
fix _post_init_method of zero init ctx (#847) 2022-04-24 14:16:50 +08:00
Ziyue Jiang 2a0a427e04
[tensor]add assert for colo_tensor 1Drow (#846) 2022-04-24 14:12:45 +08:00
Ziyue Jiang 05023ecfee
[Tensor] TP Linear 1D row (#843) 2022-04-24 13:43:12 +08:00
Frank Lee cf6d1c9284
[CLI] refactored the launch CLI and fixed bugs in multi-node launching (#844)
* [cli] fixed multi-node job launching

* [cli] fixed a bug in version comparison

* [cli] support launching with env var

* [cli] fixed multi-node job launching

* [cli] fixed a bug in version comparison

* [cli] support launching with env var

* added docstring

* [cli] added extra launch arguments

* [cli] added default launch rdzv args

* [cli] fixed version comparison

* [cli] added docstring examples and requirement

* polish docstring

* polish code

* polish code
2022-04-24 13:26:26 +08:00
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManager (#832)
* refactor StatefulTensor, tensor utilities

* add unit test for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
YuliangLiu0306 35ea6e1023
[pipelinable]use pipelinable context to initialize non-pipeline model (#816)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [pipeline]add module lazy init feature to support large model initialization.

* [pipeline]add to_layer_list and partition method to support arbitrary non-pp model

* refactor the module structure

* polish

* [pipelinable]add unit test for pipelinable

* polish

* polish

* Fix CodeFactor issues.
2022-04-24 13:03:12 +08:00
Jiarui Fang ea0a2ed25f
[hotfix] the bug of numel() in ColoTensor (#845) 2022-04-24 12:32:10 +08:00
LuGY c1e8d2001e
modified the pp build for ckpt adaptation (#803) 2022-04-24 12:23:16 +08:00
Jiarui Fang 8789850eea
Init Context supports lazy allocation of model memory (#842) 2022-04-22 18:03:35 +08:00
Jiarui Fang 4575a3298b
[hotfix] ColoTensor pin_memory (#840) 2022-04-22 17:07:46 +08:00
Frank Lee 01e9f834f5
[dependency] removed torchvision (#833)
* [dependency] removed torchvision

* fixed transforms
2022-04-22 15:24:35 +08:00
Jiarui Fang cb5a4778e1
Revert "[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)" (#835)
This reverts commit ac88de6dfc.
2022-04-22 14:45:57 +08:00
Jiarui Fang ac88de6dfc
[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)
* revert zero tensors back

* [tensor] init row 1d linear
2022-04-22 14:03:26 +08:00
Jiarui Fang 595bedf767
revert zero tensors back (#829) 2022-04-22 12:12:35 +08:00
Jiarui Fang 294a6060d0
[tensor] ZeRO use ColoTensor as the base class. (#828)
* [refactor] moving InsertPostInitMethodToModuleSubClasses to utils.

* [tensor] ZeRO use ColoTensor as the base class.

* polish
2022-04-22 12:00:48 +08:00
Ziyue Jiang 8e6fdb4f29
[tensor]fix test_linear (#826) 2022-04-21 17:18:56 +08:00
Ziyue Jiang 1a9e2c2dff
[tensor] fix kwargs in colo_tensor torch_function (#825) 2022-04-21 16:47:35 +08:00
Jiarui Fang eb1b89908c
[refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824) 2022-04-21 16:03:18 +08:00
Jiarui Fang 2ecc3d7a55
[tensor] lazy init (#823) 2022-04-21 15:40:23 +08:00
Jiarui Fang 68dcd51d41
[Tensor] update ColoTensor torch_function (#822)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish

* polish

* polish code

* polish

* polish code

* add a new tensor structure and override linear for it

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* [tensor] renaming and reorganize directory structure.

* rm useless dir

* polish

* polish

* [tensor] handle the function not wrapped

* polish
2022-04-21 14:25:27 +08:00
Jiarui Fang 0ce8924ceb
[tensor] reorganize files (#820) 2022-04-21 14:15:48 +08:00
Jiarui Fang ab962b9735
[gemini] a new tensor structure (#818)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish

* polish

* polish code

* polish

* polish code

* add a new tensor structure and override linear for it

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish
2022-04-21 11:42:37 +08:00
FrankLeeeee 70ed11d07e
[cli] added check installation cli 2022-04-20 12:13:27 +08:00
YuliangLiu0306 c7eca40f51
Merge pull request #812 from FrankLeeeee/feature/cli
[cli] fixed single-node process launching
2022-04-20 11:40:07 +08:00
Jiarui Fang 3ddbd1bce1
[gemini] collect cpu-gpu moving volume in each iteration (#813) 2022-04-20 11:29:48 +08:00
FrankLeeeee d522cb704e
[cli] fixed single-node process launching 2022-04-20 10:46:51 +08:00
Jiarui Fang 61c20b44bc
[log] local throughput metrics (#811)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish

* polish

* polish code

* polish
2022-04-20 10:05:39 +08:00
ver217 dd92b90a68
[DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808)
* init fp16 param directly

* polish code
2022-04-19 16:16:48 +08:00
Jiarui Fang 227d1cd4b3
[gemini] APIs to set cpu memory capacity (#809) 2022-04-19 16:05:22 +08:00
FrankLeeeee f63e91d280
[cli] fixed a bug in user args and refactored the module structure 2022-04-19 15:15:16 +08:00
Jiarui Fang e761ad2cd7
Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806) 2022-04-19 14:40:02 +08:00
HELSON 88759e289e
[zero] add ZeroTensorShardStrategy (#793) 2022-04-19 14:32:45 +08:00
Jiarui Fang 681addb512
[refactor] moving grad acc logic to engine (#804) 2022-04-19 14:03:21 +08:00
Frank Lee 05d9ae5999
[cli] add missing requirement (#805) 2022-04-19 13:56:59 +08:00
YuliangLiu0306 de2f581d43
[cli] added micro benchmarking for tp (#789)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [CLI]add cli benchmark feature

* fix CodeFactor issues.

* refactor the module structure.
2022-04-19 12:08:28 +08:00
YuliangLiu0306 cfadc9df8e
[cli] added distributed launcher command (#791)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [CLI]add cli launcher feature

* remove testing message used during developing

* refactor the module structure.
2022-04-19 10:59:44 +08:00
Jiarui Fang 4d9332b4c5
[refactor] moving memtracer to gemini (#801) 2022-04-19 10:13:08 +08:00
Jiarui Fang 8711c706f4
[hotfix] fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:58:21 +08:00
ver217 f1fa1a675f
fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:07:39 +08:00
HELSON 4c4388c46e
[hotfix] fix memory leak in zero (#781) 2022-04-18 13:57:03 +08:00
Ziyue Jiang 4b01da24cd
[TP] change the check assert in split batch 2d (#772) 2022-04-16 21:29:57 +08:00
ver217 846406a07a
[gemini] fix auto tensor placement policy (#775) 2022-04-16 21:29:31 +08:00
HELSON a65cbb7e4e
[zero] refactor shard and gather operation (#773) 2022-04-15 14:41:31 +08:00
ver217 6e553748a7
polish sharded optim docstr and warning (#770) 2022-04-14 21:03:59 +08:00
LuGY 80e37eec42
fix the ckpt bugs when using DDP (#769) 2022-04-14 21:03:24 +08:00
Frank Lee 920fe31526
[compatibility] used backward-compatible API for global process group (#758) 2022-04-14 17:20:35 +08:00
Frank Lee 4ea49cb536
[test] added a decorator for address already in use error with backward compatibility (#760)
* [test] added a decorator for address already in use error with backward compatibility

* [test] added a decorator for address already in use error with backward compatibility
2022-04-14 16:48:44 +08:00
Jiarui Fang 10ef8afdd2
[gemini] init gemini individual directory (#754) 2022-04-14 16:40:26 +08:00
ver217 dcca614eee
[hotfix] fix test_stateful_tensor_mgr (#762) 2022-04-14 15:50:09 +08:00
ver217 a93a7d7364
[hotfix] fix reuse_fp16_shard of sharded model (#756)
* fix reuse_fp16_shard

* disable test stm

* polish code
2022-04-14 14:56:46 +08:00
ver217 8f7ce94b8e
[hotfix] fix auto tensor placement policy (#753) 2022-04-14 12:04:45 +08:00
HELSON 84c6700b2a
[zero] refactor memstats_collector (#746) 2022-04-14 12:01:12 +08:00
アマデウス b8899e0905
[TP] allow layernorm without bias (#750) 2022-04-14 11:43:56 +08:00
Jiarui Fang 3d7dc46d33
[zero] use factory pattern for tensor_placement_policy (#752) 2022-04-14 11:07:29 +08:00
ver217 4b048a8728
fix prepare grads in sharded optim (#749) 2022-04-13 22:36:11 +08:00
ver217 097772546e
fix initialize about zero 2022-04-13 19:10:21 +08:00
ver217 e396bb71f2
[zero] add tensor placement policies (#743)
* add tensor placement policies

* polish comments

* polish comments

* update moe unit tests
2022-04-13 15:00:48 +08:00
HELSON 22c4b88d56
[zero] refactor ShardedParamV2 for convenience (#742) 2022-04-13 14:54:26 +08:00
HELSON 340e59f968
[utils] add synchronized cuda memory monitor (#740) 2022-04-13 10:50:54 +08:00
ver217 e6212f56cd
[hotfix] fix memory leak in backward of sharded model (#741) 2022-04-13 09:59:05 +08:00
Frank Lee a4e91bc87f
[bug] fixed grad scaler compatibility with torch 1.8 (#735) 2022-04-12 16:04:21 +08:00
Jiarui Fang 53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process (#726) 2022-04-12 14:57:54 +08:00
Jiarui Fang 7db3ccc79b
[hotfix] remove duplicated param register to stateful tensor manager (#728) 2022-04-12 13:55:25 +08:00
Frank Lee 1cb7bdad3b
[util] fixed communication API depth with PyTorch 1.9 (#721) 2022-04-12 09:44:40 +08:00
Frank Lee 2412429d54
[util] fixed activation checkpointing on torch 1.9 (#719) 2022-04-12 09:35:45 +08:00
Frank Lee 04ff5ea546
[utils] support detection of number of processes on current node (#723) 2022-04-12 09:28:19 +08:00
Jiarui Fang 4d90a7b513
[refactor] zero directory (#724) 2022-04-11 23:13:02 +08:00
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715) 2022-04-11 16:47:57 +08:00
HELSON dbd96fe90a
[zero] check whether gradients have inf and nan in gpu (#712) 2022-04-11 15:40:13 +08:00
ver217 715b86eadd
[hotfix] fix stm cuda model data size (#710) 2022-04-11 15:10:39 +08:00
LuGY 140263a394
[hotfix]fixed bugs of assigning grad states to non leaf nodes (#711)
* fixed bugs of assigning grad states to non leaf nodes

* use detach()
2022-04-11 14:04:58 +08:00
Frank Lee eda30a058e
[compatibility] fixed tensor parallel compatibility with torch 1.9 (#700) 2022-04-11 13:44:50 +08:00
HELSON a9b8300d54
[zero] improve adaptability for not-shard parameters (#708)
* adapt post grad hooks for not-shard parameters
* adapt optimizer for not-shard parameters
* offload gradients for not-replicated parameters
2022-04-11 13:38:51 +08:00
ver217 ab8c6b4a0e
[zero] refactor memstats collector (#706)
* refactor memstats collector

* fix disposable

* polish code
2022-04-11 10:46:08 +08:00