diff --git a/docs/README.md b/docs/README.md index f520608d5..f0cb50ffe 100644 --- a/docs/README.md +++ b/docs/README.md @@ -98,7 +98,7 @@ Lastly, if you want to skip some code, you just need to add the following annota ``` -If you have any dependency required, please add it to `requriements-doc-test.txt` for pip and `conda-doc-test-deps.yml` for Conda. +If you have any dependency required, please add it to `requirements-doc-test.txt` for pip and `conda-doc-test-deps.yml` for Conda. ### 💉 Auto Documentation diff --git a/docs/REFERENCE.md b/docs/REFERENCE.md index 268119819..0984b2dc3 100644 --- a/docs/REFERENCE.md +++ b/docs/REFERENCE.md @@ -1,6 +1,6 @@ # References -The Colossal-AI project aims to provide a wide array of parallelism techniques for the machine learning community in the big-model era. This project is inspired by quite a few reserach works, some are conducted by some of our developers and the others are research projects open-sourced by other organizations. We would like to credit these amazing projects below in the IEEE citation format. +The Colossal-AI project aims to provide a wide array of parallelism techniques for the machine learning community in the big-model era. This project is inspired by quite a few research works, some are conducted by some of our developers and the others are research projects open-sourced by other organizations. We would like to credit these amazing projects below in the IEEE citation format. ## By Our Team diff --git a/docs/source/en/advanced_tutorials/add_your_parallel.md b/docs/source/en/advanced_tutorials/add_your_parallel.md index be7284a7a..1caf58c87 100644 --- a/docs/source/en/advanced_tutorials/add_your_parallel.md +++ b/docs/source/en/advanced_tutorials/add_your_parallel.md @@ -56,7 +56,7 @@ follow the steps below to create a new distributed initialization. world_size: int, config: Config, data_parallel_size: int, - pipeline_parlalel_size: int, + pipeline_parallel_size: int, tensor_parallel_size: int, arg1, arg2): diff --git a/docs/source/en/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md b/docs/source/en/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md index e01caf76d..d5edd135c 100644 --- a/docs/source/en/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md +++ b/docs/source/en/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md @@ -121,7 +121,7 @@ Inside the initialization of Experts, the local expert number of each GPU will b ## Train Your Model -Do not to forget to use `colossalai.initialize` function in `colosalai` to add gradient handler for the engine. +Do not to forget to use `colossalai.initialize` function in `colossalai` to add gradient handler for the engine. We handle the back-propagation of MoE models for you. In `colossalai.initialize`, we will automatically create a `MoeGradientHandler` object to process gradients. You can find more information about the handler `MoeGradientHandler` in colossal directory. 
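For context on the `colossalai.initialize` call documented in the MoE hunk above, a minimal, hedged sketch of the typical flow is shown below. The toy model, optimizer, criterion, and empty config are placeholder assumptions (a GPU and a torchrun-style launcher are assumed), not part of this patch; for a real MoE model, `colossalai.initialize` is also expected to register the `MoeGradientHandler` mentioned in the docs.

```python
import colossalai
import torch
import torch.nn as nn

# Launch the distributed environment (a torchrun-style launcher and GPU are assumed).
colossalai.launch_from_torch(config={})

# A toy model stands in for an MoE model built with Colossal-AI's MoE layers.
model = nn.Linear(16, 4).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# colossalai.initialize wraps model/optimizer/criterion into an Engine; for MoE
# models it also attaches a gradient handler to process expert gradients in backward.
engine, *_ = colossalai.initialize(model=model, optimizer=optimizer, criterion=criterion)
```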
diff --git a/docs/source/en/advanced_tutorials/opt_service.md b/docs/source/en/advanced_tutorials/opt_service.md index a43ec7fdd..eccfa12f9 100644 --- a/docs/source/en/advanced_tutorials/opt_service.md +++ b/docs/source/en/advanced_tutorials/opt_service.md @@ -53,7 +53,7 @@ export CHECKPOINT_DIR="your_opt_checkpoint_path" # the ${CONFIG_DIR} must contain a server.sh file as the entry of service export CONFIG_DIR="config_file_path" -docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:lastest +docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:latest ``` Then open `https://[IP-ADDRESS]:8020/docs#` in your browser to try out! diff --git a/docs/source/en/advanced_tutorials/parallelize_your_training_like_Megatron.md b/docs/source/en/advanced_tutorials/parallelize_your_training_like_Megatron.md index e7698e5e9..1a7ab9a65 100644 --- a/docs/source/en/advanced_tutorials/parallelize_your_training_like_Megatron.md +++ b/docs/source/en/advanced_tutorials/parallelize_your_training_like_Megatron.md @@ -69,7 +69,7 @@ After the forward operation of the embedding module, each word in all sequences
The embedding module
-Each transformer layer contains two blocks. The self-attention operation is called in the first block and a two-layer percepton is located in the second block.
+Each transformer layer contains two blocks. The self-attention operation is called in the first block and a two-layer perceptron is located in the second block.
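To make the corrected sentence concrete, the sketch below shows what the second block of a transformer layer — a two-layer perceptron (MLP) — typically looks like in plain PyTorch; the hidden size and expansion ratio are illustrative assumptions, not values taken from this patch.

```python
import torch.nn as nn

class TwoLayerPerceptron(nn.Module):
    """The MLP block of a transformer layer: expand the hidden size, then project it back."""

    def __init__(self, hidden_size: int = 768, mlp_ratio: int = 4):
        super().__init__()
        self.dense_1 = nn.Linear(hidden_size, hidden_size * mlp_ratio)
        self.activation = nn.GELU()
        self.dense_2 = nn.Linear(hidden_size * mlp_ratio, hidden_size)

    def forward(self, x):
        # Applied position-wise to every token after the self-attention block.
        return self.dense_2(self.activation(self.dense_1(x)))
```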
diff --git a/docs/source/en/advanced_tutorials/train_vit_using_pipeline_parallelism.md b/docs/source/en/advanced_tutorials/train_vit_using_pipeline_parallelism.md index b26599740..6adfe4f11 100644 --- a/docs/source/en/advanced_tutorials/train_vit_using_pipeline_parallelism.md +++ b/docs/source/en/advanced_tutorials/train_vit_using_pipeline_parallelism.md @@ -195,7 +195,7 @@ def build_cifar(batch_size): ## Training ViT using pipeline -You can set the size of pipeline parallel and number of microbatches in config. `NUM_CHUNKS` is useful when using interleved-pipeline (for more details see [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) ). The original batch will be split into `num_microbatches`, and each stage will load a micro batch each time. Then we will generate an approriate schedule for you to execute the pipeline training. If you don't need the output and label of model, you can set `return_output_label` to `False` when calling `trainer.fit()` which can further reduce GPU memory usage. +You can set the size of pipeline parallel and number of microbatches in config. `NUM_CHUNKS` is useful when using interleaved-pipeline (for more details see [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) ). The original batch will be split into `num_microbatches`, and each stage will load a micro batch each time. Then we will generate an appropriate schedule for you to execute the pipeline training. If you don't need the output and label of model, you can set `return_output_label` to `False` when calling `trainer.fit()` which can further reduce GPU memory usage. You should `export DATA=/path/to/cifar`. diff --git a/docs/source/en/advanced_tutorials/train_vit_with_hybrid_parallelism.md b/docs/source/en/advanced_tutorials/train_vit_with_hybrid_parallelism.md index b2438a1cf..a2deaeb88 100644 --- a/docs/source/en/advanced_tutorials/train_vit_with_hybrid_parallelism.md +++ b/docs/source/en/advanced_tutorials/train_vit_with_hybrid_parallelism.md @@ -16,14 +16,14 @@ In this example for ViT model, Colossal-AI provides three different parallelism We will show you how to train ViT on CIFAR-10 dataset with these parallelism techniques. To run this example, you will need 2-4 GPUs. -## Tabel of Contents +## Table of Contents 1. Colossal-AI installation 2. Steps to train ViT with data parallelism 3. Steps to train ViT with pipeline parallelism 4. Steps to train ViT with tensor parallelism or hybrid parallelism ## Colossal-AI Installation -You can install Colossal-AI pacakage and its dependencies with PyPI. +You can install Colossal-AI package and its dependencies with PyPI. ```bash pip install colossalai ``` @@ -31,7 +31,7 @@ pip install colossalai ## Data Parallelism -Data parallism is one basic way to accelerate model training process. You can apply data parallelism to training by only two steps: +Data parallelism is one basic way to accelerate model training process. You can apply data parallelism to training by only two steps: 1. Define a configuration file 2. Change a few lines of code in train script @@ -94,7 +94,7 @@ from torchvision import transforms from torchvision.datasets import CIFAR10 ``` -#### Lauch Colossal-AI +#### Launch Colossal-AI In train script, you need to initialize the distributed environment for Colossal-AI after your config file is prepared. We call this process `launch`. 
In Colossal-AI, we provided several launch methods to initialize the distributed backend. In most cases, you can use `colossalai.launch` and `colossalai.get_default_parser` to pass the parameters via command line. Besides, Colossal-AI can utilize the existing launch tool provided by PyTorch as many users are familiar with by using `colossalai.launch_from_torch`. For more details, you can view the related [documents](https://www.colossalai.org/docs/basics/launch_colossalai). @@ -613,7 +613,7 @@ NUM_MICRO_BATCHES = parallel['pipeline'] TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE) ``` -Ohter configs: +Other configs: ```python # hyper parameters # BATCH_SIZE is as per GPU diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md index cafcb6d43..a446ff31b 100644 --- a/docs/source/en/basics/booster_api.md +++ b/docs/source/en/basics/booster_api.md @@ -14,9 +14,9 @@ In our new design, `colossalai.booster` replaces the role of `colossalai.initial ### Plugin Plugin is an important component that manages parallel configuration (eg: The gemini plugin encapsulates the gemini acceleration solution). Currently supported plugins are as follows: -***GeminiPlugin:*** This plugin wrapps the Gemini acceleration solution, that ZeRO with chunk-based memory management. +***GeminiPlugin:*** This plugin wraps the Gemini acceleration solution, that ZeRO with chunk-based memory management. -***TorchDDPPlugin:*** This plugin wrapps the DDP acceleration solution, it implements data parallelism at the module level which can run across multiple machines. +***TorchDDPPlugin:*** This plugin wraps the DDP acceleration solution, it implements data parallelism at the module level which can run across multiple machines. ***LowLevelZeroPlugin:*** This plugin wraps the 1/2 stage of Zero Redundancy Optimizer. Stage 1 : Shards optimizer states across data parallel workers/GPUs. Stage 2 : Shards optimizer states + gradients across data parallel workers/GPUs. diff --git a/docs/source/en/basics/colotensor_concept.md b/docs/source/en/basics/colotensor_concept.md index 050f2ef9f..abe470fe0 100644 --- a/docs/source/en/basics/colotensor_concept.md +++ b/docs/source/en/basics/colotensor_concept.md @@ -52,7 +52,7 @@ An instance of class [ComputeSpec](https://colossalai.readthedocs.io/en/latest/c ## Example -Let's see an example. A ColoTensor is initialized and sharded on 8 GPUs using tp_degree=4, dp_dgree=2. And then the tensor is sharded along the last dim among the TP process groups. Finally, we reshard it along the first dim (0 dim) among the TP process groups. We encourage users to run the code and observe the shape of each tensor. +Let's see an example. A ColoTensor is initialized and sharded on 8 GPUs using tp_degree=4, dp_degree=2. And then the tensor is sharded along the last dim among the TP process groups. Finally, we reshard it along the first dim (0 dim) among the TP process groups. We encourage users to run the code and observe the shape of each tensor. ```python diff --git a/docs/source/en/features/3D_tensor_parallel.md b/docs/source/en/features/3D_tensor_parallel.md index b9e98eac9..0e28f08b2 100644 --- a/docs/source/en/features/3D_tensor_parallel.md +++ b/docs/source/en/features/3D_tensor_parallel.md @@ -67,7 +67,7 @@ Given $P=q \times q \times q$ processors, we present the theoretical computation ## Usage -To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallism setting as below. 
+To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below. ```python CONFIG = dict(parallel=dict( data=1, diff --git a/docs/source/en/features/gradient_clipping_with_booster.md b/docs/source/en/features/gradient_clipping_with_booster.md index 8686eb06f..341a608a5 100644 --- a/docs/source/en/features/gradient_clipping_with_booster.md +++ b/docs/source/en/features/gradient_clipping_with_booster.md @@ -75,7 +75,7 @@ Build your model, optimizer, loss function, lr scheduler and dataloaders. Note t NUM_EPOCHS = 200 BATCH_SIZE = 128 GRADIENT_CLIPPING = 0.1 -# build resnetå +# build resnet model = resnet34(num_classes=10) # build dataloaders train_dataset = CIFAR10(root=Path(os.environ.get('DATA', './data')), diff --git a/docs/source/en/features/nvme_offload.md b/docs/source/en/features/nvme_offload.md index 4374da3c9..d940fd5ec 100644 --- a/docs/source/en/features/nvme_offload.md +++ b/docs/source/en/features/nvme_offload.md @@ -53,7 +53,7 @@ It's compatible with all parallel methods in ColossalAI. > ⚠ It only offloads optimizer states on CPU. This means it only affects CPU training or Zero/Gemini with offloading. -## Exampls +## Examples Let's start from two simple examples -- training GPT with different methods. These examples relies on `transformers`. diff --git a/docs/source/en/features/pipeline_parallel.md b/docs/source/en/features/pipeline_parallel.md index ac49863b3..30654b0b0 100644 --- a/docs/source/en/features/pipeline_parallel.md +++ b/docs/source/en/features/pipeline_parallel.md @@ -156,4 +156,4 @@ trainer.fit(train_dataloader=train_dataloader, display_progress=True) ``` -We use `2` pipeline stages and the batch will be splitted into `4` micro batches. +We use `2` pipeline stages and the batch will be split into `4` micro batches. diff --git a/docs/source/en/features/zero_with_chunk.md b/docs/source/en/features/zero_with_chunk.md index a105831a5..8448c52ac 100644 --- a/docs/source/en/features/zero_with_chunk.md +++ b/docs/source/en/features/zero_with_chunk.md @@ -72,7 +72,7 @@ chunk_manager = init_chunk_manager(model=module, gemini_manager = GeminiManager(placement_policy, chunk_manager) ``` -`hidden_dim` is the hidden dimension of DNN. Users can provide this argument to speed up searching. If users do not know this argument before training, it is ok. We will use a default value 1024. `min_chunk_size_mb` is the the minimum chunk size in MegaByte. If the aggregate size of parameters is still samller than the minimum chunk size, all parameters will be compacted into one small chunk. +`hidden_dim` is the hidden dimension of DNN. Users can provide this argument to speed up searching. If users do not know this argument before training, it is ok. We will use a default value 1024. `min_chunk_size_mb` is the the minimum chunk size in MegaByte. If the aggregate size of parameters is still smaller than the minimum chunk size, all parameters will be compacted into one small chunk. Initialization of the optimizer. 
```python diff --git a/docs/source/zh-Hans/advanced_tutorials/add_your_parallel.md b/docs/source/zh-Hans/advanced_tutorials/add_your_parallel.md index 4825a6fa1..059eb014a 100644 --- a/docs/source/zh-Hans/advanced_tutorials/add_your_parallel.md +++ b/docs/source/zh-Hans/advanced_tutorials/add_your_parallel.md @@ -48,7 +48,7 @@ Colossal-AI 为用户提供了一个全局 context,使他们能够轻松地管 world_size: int, config: Config, data_parallel_size: int, - pipeline_parlalel_size: int, + pipeline_parallel_size: int, tensor_parallel_size: int, arg1, arg2): diff --git a/docs/source/zh-Hans/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md b/docs/source/zh-Hans/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md index 456878caa..276fcc261 100644 --- a/docs/source/zh-Hans/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md +++ b/docs/source/zh-Hans/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md @@ -122,7 +122,7 @@ Inside the initialization of Experts, the local expert number of each GPU will b ## Train Your Model -Do not to forget to use `colossalai.initialize` function in `colosalai` to add gradient handler for the engine. +Do not to forget to use `colossalai.initialize` function in `colossalai` to add gradient handler for the engine. We handle the back-propagation of MoE models for you. In `colossalai.initialize`, we will automatically create a `MoeGradientHandler` object to process gradients. You can find more information about the handler `MoeGradientHandler` in colossal directory. diff --git a/docs/source/zh-Hans/advanced_tutorials/meet_gemini.md b/docs/source/zh-Hans/advanced_tutorials/meet_gemini.md index 2bf0a9c98..a52bc6ac7 100644 --- a/docs/source/zh-Hans/advanced_tutorials/meet_gemini.md +++ b/docs/source/zh-Hans/advanced_tutorials/meet_gemini.md @@ -48,7 +48,7 @@ zero = dict(
-ColossalAI设计了Gemini,就像双子星一样,它管理CPU和GPU二者内存空间。它可以让张量在训练过程中动态分布在CPU-GPU的存储空间内,从而让模型训练突破GPU的内存墙。内存管理器由两部分组成,分别是MemStatsCollector(MSC)和StatefuleTensorMgr(STM)。 +ColossalAI设计了Gemini,就像双子星一样,它管理CPU和GPU二者内存空间。它可以让张量在训练过程中动态分布在CPU-GPU的存储空间内,从而让模型训练突破GPU的内存墙。内存管理器由两部分组成,分别是MemStatsCollector(MSC)和StatefulTensorMgr(STM)。 我们利用了深度学习网络训练过程的迭代特性。我们将迭代分为warmup和non-warmup两个阶段,开始时的一个或若干迭代步属于预热阶段,其余的迭代步属于正式阶段。在warmup阶段我们为MSC收集信息,而在non-warmup阶段STM入去MSC收集的信息来移动tensor,以达到最小化CPU-GPU数据移动volume的目的。 @@ -75,7 +75,7 @@ STM管理所有model data tensor的信息。在模型的构造过程中,Coloss 我们在算子的开始和结束计算时,触发内存采样操作,我们称这个时间点为**采样时刻(sampling moment)**,两个采样时刻之间的时间我们称为**period**。计算过程是一个黑盒,由于可能分配临时buffer,内存使用情况很复杂。但是,我们可以较准确的获取period的系统最大内存使用。非模型数据的使用可以通过两个统计时刻之间系统最大内存使用-模型内存使用获得。 -我们如何设计采样时刻呢。我们选择preOp的model data layout adjust之前。如下图所示。我们采样获得上一个period的system memory used,和下一个period的model data memoy used。并行策略会给MSC的工作造成障碍。如图所示,比如对于ZeRO或者Tensor Parallel,由于Op计算前需要gather模型数据,会带来额外的内存需求。因此,我们要求在模型数据变化前进行采样系统内存,这样在一个period内,MSC会把preOp的模型变化内存捕捉。比如在period 2-3内,我们考虑的tensor gather和shard带来的内存变化。 +我们如何设计采样时刻呢。我们选择preOp的model data layout adjust之前。如下图所示。我们采样获得上一个period的system memory used,和下一个period的model data memory used。并行策略会给MSC的工作造成障碍。如图所示,比如对于ZeRO或者Tensor Parallel,由于Op计算前需要gather模型数据,会带来额外的内存需求。因此,我们要求在模型数据变化前进行采样系统内存,这样在一个period内,MSC会把preOp的模型变化内存捕捉。比如在period 2-3内,我们考虑的tensor gather和shard带来的内存变化。 尽管可以将采样时刻放在其他位置,比如排除gather buffer的变动新信息,但是会给造成麻烦。不同并行方式Op的实现有差异,比如对于Linear Op,Tensor Parallel中gather buffer的分配在Op中。而对于ZeRO,gather buffer的分配是在PreOp中。将放在PreOp开始时采样有利于将两种情况统一。 diff --git a/docs/source/zh-Hans/advanced_tutorials/opt_service.md b/docs/source/zh-Hans/advanced_tutorials/opt_service.md index a213584fd..1f8324a53 100644 --- a/docs/source/zh-Hans/advanced_tutorials/opt_service.md +++ b/docs/source/zh-Hans/advanced_tutorials/opt_service.md @@ -52,7 +52,7 @@ export CHECKPOINT_DIR="your_opt_checkpoint_path" # the ${CONFIG_DIR} must contain a server.sh file as the entry of service export CONFIG_DIR="config_file_path" -docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:lastest +docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:latest ``` 接下来,您就可以在您的浏览器中打开 `https://[IP-ADDRESS]:8020/docs#` 进行测试。 diff --git a/docs/source/zh-Hans/advanced_tutorials/train_vit_with_hybrid_parallelism.md b/docs/source/zh-Hans/advanced_tutorials/train_vit_with_hybrid_parallelism.md index 6dc5eccf4..e2f2c90a3 100644 --- a/docs/source/zh-Hans/advanced_tutorials/train_vit_with_hybrid_parallelism.md +++ b/docs/source/zh-Hans/advanced_tutorials/train_vit_with_hybrid_parallelism.md @@ -477,7 +477,7 @@ def build_cifar(batch_size): return train_dataloader, test_dataloader -# craete dataloaders +# create dataloaders train_dataloader , test_dataloader = build_cifar() # create loss function criterion = CrossEntropyLoss(label_smoothing=0.1) @@ -492,7 +492,7 @@ lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer, #### 启动 Colossal-AI 引擎 ```python -# intiailize +# initialize engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model=model, optimizer=optimizer, criterion=criterion, diff --git a/docs/source/zh-Hans/basics/colotensor_concept.md b/docs/source/zh-Hans/basics/colotensor_concept.md index b725e48a7..ab2413e99 100644 --- a/docs/source/zh-Hans/basics/colotensor_concept.md +++ b/docs/source/zh-Hans/basics/colotensor_concept.md @@ -53,7 +53,7 @@ ColoTensor 包含额外的属性[ColoTensorSpec](https://colossalai.readthedocs. 
## Example -让我们看一个例子。 使用 tp_degree=4, dp_dgree=2 在 8 个 GPU 上初始化并Shard一个ColoTensor。 然后tensor被沿着 TP 进程组中的最后一个维度进行分片。 最后,我们沿着 TP 进程组中的第一个维度(dim 0)对其进行重新Shard。 我们鼓励用户运行代码并观察每个张量的形状。 +让我们看一个例子。 使用 tp_degree=4, dp_degree=2 在 8 个 GPU 上初始化并Shard一个ColoTensor。 然后tensor被沿着 TP 进程组中的最后一个维度进行分片。 最后,我们沿着 TP 进程组中的第一个维度(dim 0)对其进行重新Shard。 我们鼓励用户运行代码并观察每个张量的形状。 ```python diff --git a/docs/source/zh-Hans/features/mixed_precision_training.md b/docs/source/zh-Hans/features/mixed_precision_training.md index c4df6271b..4628b09cd 100644 --- a/docs/source/zh-Hans/features/mixed_precision_training.md +++ b/docs/source/zh-Hans/features/mixed_precision_training.md @@ -203,7 +203,7 @@ Naive AMP 的默认参数: - initial_scale(int): gradient scaler 的初始值 - growth_factor(int): loss scale 的增长率 - backoff_factor(float): loss scale 的下降率 -- hysterisis(int): 动态 loss scaling 的延迟偏移 +- hysteresis(int): 动态 loss scaling 的延迟偏移 - max_scale(int): loss scale 的最大允许值 - verbose(bool): 如果被设为`True`,将打印调试信息 diff --git a/docs/source/zh-Hans/features/nvme_offload.md b/docs/source/zh-Hans/features/nvme_offload.md index fd75ed1f5..db5f10184 100644 --- a/docs/source/zh-Hans/features/nvme_offload.md +++ b/docs/source/zh-Hans/features/nvme_offload.md @@ -53,7 +53,7 @@ optimizer = HybridAdam(model.parameters(), lr=1e-3, nvme_offload_fraction=1.0, n > ⚠ 它只会卸载在 CPU 上的优化器状态。这意味着它只会影响 CPU 训练或者使用卸载的 Zero/Gemini。 -## Exampls +## Examples Let's start from two simple examples -- training GPT with different methods. These examples relies on `transformers`. 首先让我们从两个简单的例子开始 -- 用不同的方法训练 GPT。这些例子依赖`transformers`。
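The NVMe offload hunks above reference `HybridAdam` with `nvme_offload_fraction`; a rough, hedged sketch of how such an optimizer is typically constructed is shown below. The toy model, the offload directory, and the learning rate are illustrative assumptions, not values from this patch.

```python
import torch.nn as nn
from colossalai.nn.optimizer import HybridAdam

# A toy model stands in for the GPT examples mentioned in the docs.
model = nn.Sequential(nn.Linear(128, 512), nn.GELU(), nn.Linear(512, 128))

# Offload all optimizer states to NVMe; the directory below is an assumed path,
# and should point at a real NVMe mount in practice.
optimizer = HybridAdam(
    model.parameters(),
    lr=1e-3,
    nvme_offload_fraction=1.0,    # fraction of optimizer states placed on NVMe
    nvme_offload_dir='./offload', # assumed location for the offloaded states
)
```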