# Build your engine & Customize your trainer

## Build your engine

To understand how the `Engine` class works, let's start from the concept of the process function found in common engines. The process function
controls what happens for one batch of a dataset, and the `Engine` class in turn controls the process function. A standard process
function looks like the following code block.

```python
def process_function(dataloader, model, criterion, optim):
    # `dataloader` must be an iterator, e.g. iter(DataLoader(...)), for next() to work
    optim.zero_grad()
    data, label = next(dataloader)
    output = model(data)
    loss = criterion(output, label)
    loss.backward()
    optim.step()
```

In `ignite.engine` or `keras.engine`, the process function is always provided by users. However, writing a process function for pipeline
parallelism is tricky. To make hybrid parallelism accessible, we provide the `Engine` class. It enables pipeline parallelism and offers a
one-forward-one-backward non-interleaving strategy. You can also use a pre-defined learning rate scheduler in the `Engine` class to adjust
the learning rate during training.
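
To make the relationship between an engine and its process function concrete, here is a minimal pure-Python sketch. `ToyEngine`, `ScalarModel` and `SGD` are hypothetical stand-ins invented for this illustration. They are not Colossal-AI or PyTorch classes, and the real `Engine` also handles schedules and parallelism, which this sketch leaves out.

```python
class ScalarModel:
    """Hypothetical one-parameter model: y = w * x."""
    def __init__(self, w=0.0):
        self.w = w
        self.grad = 0.0

    def __call__(self, x):
        return self.w * x


class SGD:
    """Hypothetical optimizer with the usual zero_grad/step interface."""
    def __init__(self, model, lr):
        self.model = model
        self.lr = lr

    def zero_grad(self):
        self.model.grad = 0.0

    def step(self):
        self.model.w -= self.lr * self.model.grad


class ToyEngine:
    """Owns the model and optimizer and runs the per-batch process function."""
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer

    def step(self, data_iter):
        # Same order as the process function above: zero the gradients,
        # forward pass, loss, backward pass, optimizer step.
        self.optimizer.zero_grad()
        x, y = next(data_iter)
        output = self.model(x)
        loss = (output - y) ** 2
        # Manual "backward" for squared error: d(loss)/dw = 2 * (w*x - y) * x
        self.model.grad = 2.0 * (output - y) * x
        self.optimizer.step()
        return loss


model = ScalarModel()
engine = ToyEngine(model, SGD(model, lr=0.1))
batches = iter([(1.0, 2.0)] * 50)  # every batch follows y = 2 * x
losses = [engine.step(batches) for _ in range(50)]
```

The caller only ever invokes `engine.step`, so the order of operations inside the process function can change (for example, to a pipelined schedule) without touching the training loop.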

To build your engine, set the variables `model`, `criterion`, `optimizer`, `lr_scheduler` and `schedule`. The following code block provides
an example.

```python
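import torch
import torch.nn as nn
import torchvision.models as models
import colossalai
from colossalai.engine import Engine  # import path assumed; check the API references

model = models.resnet18()
criterion = nn.CrossEntropyLoss()
# Optimizers take the model's parameters, not the model itself
optimizer = torch.optim.Adam(model.parameters())
lr_scheduler = colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, 1000)
schedule = colossalai.engine.schedule.NoPipelineSchedule()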

MyEngine = Engine(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    schedule=schedule
)
```

More information regarding the class can be found in the API references.

## Customize your trainer

### Overview

To learn how to customize a trainer that meets your needs, let's first take a look at the `Trainer` class. We highly recommend that you read the *Get Started*
section and *Build your engine* first.

The `Trainer` class lets researchers and engineers use our system more conveniently. Instead of writing your own training scripts, you can
construct a trainer by instantiating the `Trainer` class, as in the following code block.

```python
MyTrainer = Trainer(MyEngine)
```

After that, you can use the `fit` method to train or evaluate your model. To make the `Trainer` class more powerful, we incorporate a set of
handy tools into it. For example, you can monitor or record the running states and the metrics that indicate the current performance of the model. These
functions are implemented as hooks. The `BaseHook` class allows you to execute your hook functions at specified times. We have already created some practical
hooks for you, as listed below. All you need to do is pick the ones that suit your needs. Detailed descriptions of these classes can be found
in the API references.

```python
hooks = [
    dict(type='LogMetricByEpochHook'),
    dict(type='LogTimingByEpochHook'),
    dict(type='LogMemoryByEpochHook'),
    dict(type='AccuracyHook'),
    dict(type='LossHook'),
    dict(type='TensorboardHook', log_dir='./tfb_logs'),
    dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
    dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
]
```

These hooks record metrics, elapsed time and memory usage and write them to the log after each epoch. They also print the current loss and
accuracy so that users can monitor the performance of the model.
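
One way a list of `dict(type=...)` entries could be turned into hook objects is a name-to-class registry. The sketch below is an assumption about the mechanism, not Colossal-AI's actual implementation. `HOOK_REGISTRY`, `register_hook`, `build_hooks` and the two stand-in hook classes are hypothetical names that only borrow the hook names from the list above.

```python
HOOK_REGISTRY = {}


def register_hook(cls):
    """Class decorator that records a hook class under its own name."""
    HOOK_REGISTRY[cls.__name__] = cls
    return cls


@register_hook
class LossHook:
    """Stand-in hook that takes no configuration."""


@register_hook
class SaveCheckpointHook:
    """Stand-in hook configured through keyword arguments."""
    def __init__(self, interval, checkpoint_dir):
        self.interval = interval
        self.checkpoint_dir = checkpoint_dir


def build_hooks(configs):
    """Instantiate one hook per config dict; 'type' selects the class."""
    hooks = []
    for cfg in configs:
        kwargs = dict(cfg)  # copy so the caller's config is left untouched
        name = kwargs.pop('type')
        if name not in HOOK_REGISTRY:
            raise KeyError(f"unknown hook type: {name}")
        hooks.append(HOOK_REGISTRY[name](**kwargs))
    return hooks


hook_configs = [
    dict(type='LossHook'),
    dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
]
built = build_hooks(hook_configs)
```

With this design, every key other than `type` is passed to the hook's constructor, which is why entries such as `interval=5` can sit directly in the config dict.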

### Hook

If you have specific needs, feel free to extend our `BaseHook` class to add your own functions, or our `MetricHook` class to write a metric collector.
These hook functions can be called at twelve points in the trainer's life cycle. You can also define the priority of each hook to arrange their execution order.
More information can be found in the API references.
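
To illustrate how priorities can order hooks at a life-cycle point, here is a small self-contained sketch. `ToyBaseHook` and `ToyTrainer` are hypothetical stand-ins, not the real `BaseHook` and `Trainer`. The two callback names and the rule that smaller priority values run first are assumptions made for this example, so consult the API references for the actual callback names and ordering.

```python
class ToyBaseHook:
    """Hypothetical base class: subclasses override the callbacks they need."""
    priority = 10  # assumption: smaller values run earlier

    def before_train_epoch(self, trainer):
        pass

    def after_train_epoch(self, trainer):
        pass


class EpochCounterHook(ToyBaseHook):
    priority = 1

    def after_train_epoch(self, trainer):
        trainer.states['epochs_done'] = trainer.states.get('epochs_done', 0) + 1


class EpochReportHook(ToyBaseHook):
    priority = 5  # runs after EpochCounterHook, so it sees the updated count

    def after_train_epoch(self, trainer):
        trainer.log.append(f"finished epoch {trainer.states['epochs_done']}")


class ToyTrainer:
    def __init__(self, hooks):
        # sorted() is stable, so hooks with equal priority keep their list order.
        self.hooks = sorted(hooks, key=lambda h: h.priority)
        self.states = {}
        self.log = []

    def _call_hooks(self, point):
        for hook in self.hooks:
            getattr(hook, point)(self)

    def fit(self, epochs):
        for _ in range(epochs):
            self._call_hooks('before_train_epoch')
            # ... a real trainer would run the engine over the dataset here ...
            self._call_hooks('after_train_epoch')


# The hooks are passed in the "wrong" order on purpose; priority fixes it.
trainer = ToyTrainer([EpochReportHook(), EpochCounterHook()])
trainer.fit(epochs=2)
```

Because the report hook reads a value that the counter hook writes, giving the counter the smaller priority value keeps the result correct regardless of the order in which the hooks were registered.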

### Metric

You can write your own metrics by extending our `Metric` class. It should be used with the `MetricHook` class. When you write your own metric hooks, please set
the priority carefully and make sure the hook is called before other hooks that might require the results of the metric hook.

We have already provided some metric hooks, and we store metric objects in `runner.states['metrics']`. It is a dictionary, and metrics can be accessed by their names.
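
As a rough illustration of a custom metric stored in a name-keyed dictionary, consider the sketch below. `ToyMetric` and `ToyAccuracy` are hypothetical classes, and the `reset`/`update`/`get_value` method names are assumptions made for this example rather than the documented `Metric` interface. The plain `states` dictionary only mimics the shape of `states['metrics']`.

```python
class ToyMetric:
    """Hypothetical metric interface: reset, update per batch, read a value."""
    def reset(self):
        raise NotImplementedError

    def update(self, outputs, labels):
        raise NotImplementedError

    def get_value(self):
        raise NotImplementedError


class ToyAccuracy(ToyMetric):
    def __init__(self):
        self.reset()

    def reset(self):
        self.correct = 0
        self.total = 0

    def update(self, outputs, labels):
        # outputs: predicted class ids, labels: ground-truth class ids
        self.correct += sum(int(o == l) for o, l in zip(outputs, labels))
        self.total += len(labels)

    def get_value(self):
        # Guard against division by zero before any batch has been seen.
        return self.correct / self.total if self.total else 0.0


# Metrics live in a dictionary keyed by name.
states = {'metrics': {'Accuracy': ToyAccuracy()}}

accuracy = states['metrics']['Accuracy']
accuracy.update(outputs=[1, 0, 2, 2], labels=[1, 1, 2, 0])  # 2 of 4 correct
accuracy.update(outputs=[3, 3], labels=[3, 3])              # 2 of 2 correct
value = accuracy.get_value()                                # 4 of 6 overall
```

Accumulating counts across batches, instead of averaging per-batch accuracies, keeps the result correct when batches have different sizes.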