
# Tensor Detector

This tool helps you detect tensors on both CPU and GPU. Note that some unexpected tensors will always be present on the CPU, such as PyTorch's RNG state.

## Example

An example is worth a thousand words.

The code below defines a simple MLP module, with which we will show you how to use the tool.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        # two linear layers with a ReLU in between: 64 -> 8 -> 32
        self.mlp = nn.Sequential(nn.Linear(64, 8),
                                 nn.ReLU(),
                                 nn.Linear(8, 32))

    def forward(self, x):
        return self.mlp(x)
```

And here is how to use the tool.

```python
import torch
from colossalai.utils import TensorDetector

# create random data
data = torch.rand(64, requires_grad=True).cuda()
data.retain_grad()
# create the module
model = MLP().cuda()
# create the detector
# by passing the model to the detector, it can distinguish module parameters from common tensors
detector = TensorDetector(include_cpu=False, module=model)
detector.detect()

out = model(data)

detector.detect()

loss = out.sum()
loss.backward()

detector.detect()
```

The comments on the right-hand side of the output below are annotations added to aid understanding.

Note that the sum of the `Mem` column over all tensors and parameters does not equal the total GPU memory allocated (printed by the tool as `Totle GPU Memery Allocated`). PyTorch's memory management is complicated, and for large models it is impractical to account for every byte.
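As a rough illustration of that gap, the sketch below assumes that PyTorch's CUDA caching allocator rounds each small allocation up to a multiple of 512 bytes. This is an assumption about allocator internals, not something this tool documents. The names `ALLOC_GRANULARITY` and `rounded_size` are hypothetical, and the byte counts are the `Mem` values of the five tensors in the first report of the output below.

```python
# Assumption: small CUDA allocations are rounded up to a multiple of 512 bytes.
ALLOC_GRANULARITY = 512


def rounded_size(nbytes, granularity=ALLOC_GRANULARITY):
    """Round nbytes up to the next multiple of granularity."""
    return -(-nbytes // granularity) * granularity


# Mem column of the first report, in bytes:
# data, mlp.0.weight, mlp.0.bias, mlp.2.weight, mlp.2.bias
reported = [256, 2048, 32, 1024, 128]

column_sum = sum(reported)
allocated = sum(rounded_size(n) for n in reported)

print(column_sum)        # 3488
print(allocated / 1024)  # 4.5
```

Under this assumption the rounded sum is 4.5 KB, which matches the total printed in the first report, while the plain column sum is only about 3.4 KB.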

**The order in which tensors are printed is not exactly the order in which they were created, but it is very close.**

```bash
------------------------------------------------------------------------------------------------------------
   Tensor                            device               shape      grad               dtype            Mem
------------------------------------------------------------------------------------------------------------
+  Tensor                            cuda:0               (64,)      True       torch.float32          256 B    # data
+  mlp.0.weight                      cuda:0             (8, 64)      True       torch.float32         2.0 KB
+  mlp.0.bias                        cuda:0                (8,)      True       torch.float32           32 B
+  mlp.2.weight                      cuda:0             (32, 8)      True       torch.float32         1.0 KB
+  mlp.2.bias                        cuda:0               (32,)      True       torch.float32          128 B
------------------------------------------------------------------------------------------------------------
Detect Location: "test_tensor_detector.py" line 27
Totle GPU Memery Allocated on cuda:0 is 4.5 KB
------------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------------
   Tensor                            device               shape      grad               dtype            Mem
------------------------------------------------------------------------------------------------------------
+  Tensor                            cuda:0                (8,)      True       torch.float32           32 B    # activation
+  Tensor                            cuda:0               (32,)      True       torch.float32          128 B    # output
------------------------------------------------------------------------------------------------------------
Detect Location: "test_tensor_detector.py" line 30
Totle GPU Memery Allocated on cuda:0 is 5.5 KB
------------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------------
   Tensor                            device               shape      grad               dtype            Mem
------------------------------------------------------------------------------------------------------------
+  Tensor                            cuda:0                  ()      True       torch.float32            4 B    # loss
------------------------------------------------------------------------------------------------------------
Detect Location: "test_tensor_detector.py" line 32
Totle GPU Memery Allocated on cuda:0 is 6.0 KB
------------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------------
   Tensor                            device               shape      grad               dtype            Mem
------------------------------------------------------------------------------------------------------------
+  Tensor (with grad)                cuda:0               (64,)      True       torch.float32          512 B    # data with grad (kept by data.retain_grad())
+  mlp.0.weight (with grad)          cuda:0             (8, 64)      True       torch.float32         4.0 KB
+  mlp.0.bias (with grad)            cuda:0                (8,)      True       torch.float32           64 B
+  mlp.2.weight (with grad)          cuda:0             (32, 8)      True       torch.float32         2.0 KB
+  mlp.2.bias (with grad)            cuda:0               (32,)      True       torch.float32          256 B

-  mlp.0.weight                      cuda:0             (8, 64)      True       torch.float32         2.0 KB
-  mlp.0.bias                        cuda:0                (8,)      True       torch.float32           32 B
-  mlp.2.weight                      cuda:0             (32, 8)      True       torch.float32         1.0 KB
-  mlp.2.bias                        cuda:0               (32,)      True       torch.float32          128 B
-  Tensor                            cuda:0               (64,)      True       torch.float32          256 B
-  Tensor                            cuda:0                (8,)      True       torch.float32           32 B    # deleted activation
------------------------------------------------------------------------------------------------------------
Detect Location: "test_tensor_detector.py" line 34
Totle GPU Memery Allocated on cuda:0 is 10.0 KB
------------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------------
   Tensor                            device               shape      grad               dtype            Mem
------------------------------------------------------------------------------------------------------------
+  Tensor                            cuda:0               (64,)     False       torch.float32          256 B
+  Tensor                            cuda:0             (8, 64)     False       torch.float32         2.0 KB
+  Tensor                            cuda:0                (8,)     False       torch.float32           32 B
+  Tensor                            cuda:0             (32, 8)     False       torch.float32         1.0 KB
+  Tensor                            cuda:0               (32,)     False       torch.float32          128 B
------------------------------------------------------------------------------------------------------------
Detect Location: "test_tensor_detector.py" line 36
Totle GPU Memery Allocated on cuda:0 is 14.0 KB
------------------------------------------------------------------------------------------------------------
```
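The `Mem` values in the reports above are consistent with element count times element size, with entries marked `(with grad)` counting a same-shaped gradient as well. A minimal sketch of that arithmetic, assuming 4 bytes per `torch.float32` element; `tensor_bytes` is a hypothetical helper for illustration, not part of the tool:

```python
from math import prod

FLOAT32_BYTES = 4  # size of one torch.float32 element


def tensor_bytes(shape, with_grad=False, itemsize=FLOAT32_BYTES):
    """Bytes used by a dense tensor of the given shape.

    The size is doubled when a same-shaped gradient is stored alongside it.
    """
    nbytes = prod(shape) * itemsize
    return nbytes * 2 if with_grad else nbytes


print(tensor_bytes((8, 64)))                  # 2048, shown as 2.0 KB for mlp.0.weight
print(tensor_bytes((8, 64), with_grad=True))  # 4096, shown as 4.0 KB for mlp.0.weight (with grad)
print(tensor_bytes(()))                       # 4, the scalar loss
```

This also explains why, after `loss.backward()`, each parameter is removed and re-added with twice its original size.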

## Reference

This tool was inspired by https://github.com/Stonesjtu/pytorch_memlab/blob/master/pytorch_memlab/mem_reporter.py
and https://github.com/Oldpan/Pytorch-Memory-Utils