* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [fx] add conv metainfo class
* [fx] restore profiler
* [fx] restore meta profiler
* [autoparallel] modify unit test
* [fx] modify unit test
* [autoparallel] add batchnorm metainfo class
* [autoparallel] fix batchnorm unit test function declaration
* [fx] restore profiler
* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [fx] add conv metainfo class
* [fx] restore profiler
* [fx] restore meta profiler
* [autoparallel] modify unit test
* [fx] modify unit test
* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [autoparallel] add getattr haandler
* polish code
* add extra processes for Parameters
* add unit test for param resharding cost
* add docstring and polish test
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12
* [zero] add cpu shard init
* [zero] add tiny example test
* [colo_tensor] fix bugs for torch-1.11
* fixes memory leak when paramter is in fp16 in ZeroDDP init.
* bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
* [pipeline/chimera] test chimera | fix bug of initializing
* [pipeline/pytree] add pytree to process args and kwargs | provide to process args and kwargs after forward
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
* [pipeline/chimera] test chimera | fix bug of initializing
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
* [fx] add input activation offload to codegen
* [fx] modify unit test
* [fx] remove two skips in torch11
* [fx] use all_input_nodes instead of _input_nodes
* [fx] add some comment and docstrings.
* [fx] add dataflow analysis for an autograd graph.
* add intepretation for graph analysis.
* [fx] before doing save_tensor_hooks.
* [fx] provide an accurate estimation of memory except for GPT-2.
* [fx] provide an accurate estimation of memory except for GPT-2.
* [fx] provide an accurate estimation of memory except for GPT-2.
* [fx] a very accurate version on GPT-2.
* [fx] refactor code.
* [fx] remove redundant inplace=True.
* [fx] refactor code.
* [fx] refactor code.
* [fx] refactor code.
* [fx] dive into backward memory.
* [fx] fix variable names in ckpt_solvers and unskip tests.
* [fx] commit my changes.
* [fx] restore skips.
* [fx] restore skips.
* [fx] chaange stage into phase.
* [fx] chaange stage into phase.
* [fx] chaange stage into phase.
* [fx] compute memory stat and flop count for MetaInfoProp.
* [fx] modify node attribute.
* [fx] modify ckpt_chen.
* [fx] fix compatibility.
* [fx] fix import error.
* [fx] skip test for MetaInfoProp.
* [fx] skip test for MetaInfoProp.
* [fx] skip test for MetaInfoProp.
* [fx] skip test for MetaInfoProp.
* [fx] skip if torch 1.11.0.
* [fx] recover MetaInfoProp support for PyTorch 1.11.
* [fx] provide a stable but not accurate enough version of profiler.
* [fx] provide a stable but not accurate enough version of profiler.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix import error.
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [pipeline/rpc] implement a demo for PP with cuda rpc framework
* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B
* [pipeline/rpc] implement distributed optimizer | test with assert_close
* [pipeline/rpc] implement distributed optimizer | test with assert_close
* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy
* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy
* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy
* [pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local abd global rank in TP,DP and PP
* [pipeline/pipleline_process_group] remove comment
* [pipeline/pipleline_process_group] remove comment
* [pipeline/pipleline_process_group] skip process group test
* [pipeline/pipleline_process_group] remove test named function
* [fx] fix wrong variable name in solver rotor
* [fx] fix wrong variable name in solver rotor
* [fx] fix the discretize bug
* [fx] fix the first op in activation checkpoint codegen
* [fx] fix some bugs of ckpt solver
* [fx] modify test_ckpt_torchvision
* [fx] set sequence to __sequence__ attr of GraphModule
* [fx] docstring modification
* [fx] remove performance test
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [pipeline/rpc] implement a demo for PP with cuda rpc framework
* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B
* [pipeline/rpc] implement distributed optimizer | test with assert_close
* [pipeline/rpc] implement distributed optimizer | test with assert_close
* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy
* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy
* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [pipeline/rpc] implement a demo for PP with cuda rpc framework
* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B
* [pipeline/rpc] implement distributed optimizer | test with assert_close
* [pipeline/rpc] implement distributed optimizer | test with assert_close
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] merge development into main (#1)
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.
* [fx] fix lowercase naming conventions.
* [fx] simplify test for ckpt.
* [fx] add rules to linearize computation graphs for searching. (#2)
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] merge development into main (#1)
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.
* [fx] fix lowercase naming conventions.
* [fx] simplify test for ckpt.
* [fx] fix test and algorithm bugs in activation checkpointing.
* [fx] polish ckpt_test.
* [fx] add rules to linearize computation graphs for searching.
* [fx] remove chen_sqrt for sake of simplicity
* [fx] remove chen_sqrt for sake of simplicity
* [fx] remove chen_sqrt for sake of simplicity
* [fx] remove chen_sqrt for sake of simplicity
* [fx] fix inconsistencies.
* [fx] fix MetaInfoProp.
* [fx] fix MetaInfoProp.
* [fx] consider MetaInfoProp for inplace operands.
* [fx] consider MetaInfoProp for inplace operands.
* [fx] consider MetaInfoProp for inplace operands.
* [fx] consider MetaInfoProp for inplace operands.
* [fx] consider MetaInfoProp for inplace operands.
* [fx] add profiler for fx nodes.
* [fx] add profiler for fx nodes.
* [fx] add profiler for fx nodes.
* [fx] add profiler for fx nodes.
* [fx] add profiler for fx nodes.
* [fx] add profiler for fx nodes.
* [fx] add profiler for fx nodes.
* [fx] fix error in tests.
* [fx] unfix bug.
* [fx] unfix bug.
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [pipeline/rpc] implement a demo for PP with cuda rpc framework
* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [pipeline/rpc] implement a demo for PP with cuda rpc framework
* Delete p2p_v2.py
* Delete _pipeline_schedule_v2.py
* Delete test_object_list_p2p_v2.py
* Delete test_boardcast_send_recv_v2.py
* Delete test_cifar_with_data_pipeline_tensor_v2.py
* [utils] Add use_reetrant=False into colossalai checkpoint
* [utils] add some annotation in utils.activaion_checkpoint
* [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py
* [test] modify test_activation_checkpoint.py
* [test] modify test for reentrant=False
* [fx] Add use_reentrant=False of checkpoint into codegen
* [utils] Add use_reetrant=False into colossalai checkpoint
* [utils] add some annotation in utils.activaion_checkpoint
* [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py
* [test] modify test_activation_checkpoint.py
* [test] modify test for reentrant=False
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] merge development into main (#1)
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.
* [fx] fix lowercase naming conventions.
* [fx] simplify test for ckpt.
* [fx] fix test and algorithm bugs in activation checkpointing.
* mend
[fx] fix test and algorithm bugs in activation checkpointing.
* mend
[fx] fix test and algorithm bugs in activation checkpointing.
* mend
[fx] fix test and algorithm bugs in activation checkpointing.
* mend
[fx] fix test and algorithm bugs in activation checkpointing.
* [fx] polish ckpt_test.
* [fx] polish ckpt_test.
* [fx] polish ckpt_test.
* [fx] Use colossalai.utils.checkpoint to replace torch.utils.checkpoint for offload activation and add offload annotation recognition in codegen
* [fx] Use colossalai.utils.checkpoint to replace torch.utils.checkpoint for offload activation and add offload annotation recognition in codegen
* Modification of test and add TODO in codegen
* [fx] Modification of colossal ckpt usage
* [fx] add gpc.destroy() to test_codegen
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [communication] add p2p_v2.py to support communication with List[Any]
* Delete _pipeline_schedule_v2.py
* Delete test_cifar_with_data_pipeline_tensor_v2.py
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* Delete p2p_v2.py
* Delete test_boardcast_send_recv_v2.py
* Delete test_object_list_p2p_v2.py
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [communication] remove print code
* [communication] remove print code
* [engin/schedule] shorten the running time of testing file to prevent cancelling in CI
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* mend
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.
* [fx] fix lowercase naming conventions.
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [communication] add p2p_v2.py to support communication with List[Any]
* Delete _pipeline_schedule_v2.py
* Delete test_cifar_with_data_pipeline_tensor_v2.py
* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule
* [communication] remove print code
* [communication] remove print code
* impl nvme optimizer
* update cpu adam
* add unit test
* update hybrid adam
* update docstr
* add TODOs
* update CI
* fix CI
* fix CI
* fix CI path
* fix CI path
* fix CI path
* fix install tensornvme
* fix CI
* fix CI path
* fix CI env variables
* test CI
* test CI
* fix CI
* fix nvme optim __del__
* fix adam __del__
* fix nvme optim
* fix CI env variables
* fix nvme optim import
* test CI
* test CI
* fix CI
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* manipulation
* [fx]add graph manipulation methods.
* [fx]methods to get fx graph property.
* add unit test
* add docstring to explain top node and leaf node in this context
* init a checkpoint dir
* [checkpoint]support resume for cosinewarmuplr
* [checkpoint]add unit test
* fix some bugs but still not OK
* fix bugs
* make it faster
* [checkpoint]support generalized scheduler
* polish
* [tensor] torch function return colotensor
* polish
* fix bugs
* remove debug info
* polish
* polish
* [tensor] test_model pass unittests
* polish
* [hotfix] fx get comm size bug
Co-authored-by: ZhaoYi1222 <zhaoyi9499@gmail.com>
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [context]use meta tensor to init model lazily.
* polish
* make module with device kwargs bypass the normal init.
* change unit test to adapt updated context.
* update gemini mgr
* update chunk
* add docstr
* polish placement policy
* update test chunk
* update test zero
* polish unit test
* remove useless unit test
* add placement policy
* add gemini mgr
* update mem stats collector
* update zero
* update zero optim
* fix bugs
* zero optim monitor os
* polish unit test
* polish unit test
* add assert
* polish chunk manager
* polish unit test
* impl add_extern_static_tensor for chunk mgr
* add mem stats collector v2
* polish code
* polish unit test
* polish code
* polish get chunks
* add zero optimizer
* torch ok
* unit test ok
* polish code
* fix bugs
* polish unit test
* polish zero optim
* polish colo ddp v2
* refactor folder structure
* add comment
* polish unit test
* polish zero optim
* polish unit test
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]add module lazy init feature to support large model initization.
* [pipeline]add to_layer_list and partition method to support arbitrary non-pp model
* refactor the module structure
* polish
* [pipelinable]add unit test for pipelinable
* polish
* polish
* Fix CodeFactor issues.
only process module's own parameters in Zero context
add zero hooks for all modules that contrain parameters
gather parameters only belonging to module itself
* place params on cpu after zero init context
* polish code
* bucketzed cpu gpu tensor transter
* find a bug in sharded optim unittest
* add offload unittest for ShardedOptimV2.
* polish code and make it more robust
* Added CPU Adam
* finished the cpu adam
* updated the license
* delete useless parameters, removed resnet
* modified the method off cpu adam unittest
* deleted some useless codes
* removed useless codes
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
* add zero1 (#209)
* add zero1
* add test zero1
* update zero stage 1 develop (#212)
* Implement naive zero3 (#240)
* naive zero3 works well
* add zero3 param manager
* add TODOs in comments
* add gather full param ctx
* fix sub module streams
* add offload
* fix bugs of hook and add unit tests
* fix bugs of hook and add unit tests (#252)
* add gather full param ctx
* fix sub module streams
* add offload
* fix bugs of hook and add unit tests
* polish code and add state dict hook
* fix bug
* update unit test
* refactor reconstructed zero code
* clip_grad support zero3 and add unit test
* add unit test for Zero3ParameterManager
* [WIP] initialize the shard param class
* [WIP] Yet another sharded model implementation (#274)
* [WIP] initialize the shard param class
* [WIP] Yes another implementation of shardModel. Using a better hook method.
* torch.concat -> torch.cat
* fix test_zero_level_1.py::test_zero_level_1 unitest
* remove deepspeed implementation and refactor for the reconstructed zero module
* polish zero dp unittests
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
added branch context;
added vocab parallel layers;
moved split_batch from load_batch to tensor parallel embedding layers;
updated gpt model;
updated unit test cases;
fixed few collective communicator bugs
* add pipeline shared module wrapper and update load batch
* added model parallel process group for amp and clip grad (#86)
* added model parallel process group for amp and clip grad
* update amp and clip with model parallel process group
* remove pipeline_prev/next group (#88)
* micro batch offload
* optimize pipeline gpu memory usage
* pipeline can receive tensor shape (#93)
* optimize pipeline gpu memory usage
* fix grad accumulation step counter
* rename classes and functions
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"
This reverts commit 2e0b0b7699.
* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
* Split conv2d, class token, positional embedding in 2d, Fix random number in ddp
Fix convergence in cifar10, Imagenet1000
* Integrate 1d tensor parallel in Colossal-AI (#39)
* fixed 1D and 2D convergence (#38)
* optimized 2D operations
* fixed 1D ViT convergence problem
* Feature/ddp (#49)
* remove redundancy func in setup (#19) (#20)
* use env to control the language of doc (#24) (#25)
* Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"
This reverts commit 2e0b0b7699.
* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)
* add explanation for ViT example (#35) (#36)
* support torch ddp
* fix loss accumulation
* add log for ddp
* change seed
* modify timing hook
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
* Feature/pipeline (#40)
* remove redundancy func in setup (#19) (#20)
* use env to control the language of doc (#24) (#25)
* Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"
This reverts commit 2e0b0b7699.
* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)
* add explanation for ViT example (#35) (#36)
* optimize communication of pipeline parallel
* fix grad clip for pipeline
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
* optimized 3d layer to fix slow computation ; tested imagenet performance with 3d; reworked lr_scheduler config definition; fixed launch args; fixed some printing issues; simplified apis of 3d layers (#51)
* Update 2.5d layer code to get a similar accuracy on imagenet-1k dataset
* update api for better usability (#58)
update api for better usability
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"
This reverts commit 2e0b0b7699.
* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>