# Introduction
This example shows how to pretrain RoBERTa from scratch, covering preprocessing, pretraining, and finetuning. It can help you quickly train a high-quality RoBERTa model.

## 0. Prerequisite
- Install Colossal-AI
- Edit the SSH port in `/etc/ssh/sshd_config` (server) and `/etc/ssh/ssh_config` (client) so that every host exposes the same SSH port for both server and client. If you are the root user, also set **PermitRootLogin** to "yes" in `/etc/ssh/sshd_config`.
- Ensure that every host can log in to every other host without a password. If you have n hosts, the key must be copied n<sup>2</sup> times in total (from each host to every host).

```
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub ip_destination
```
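
The key-copy step can be repeated for every host with a small loop. The following is a minimal sketch, not part of this example's scripts: `HOSTS`, `DRY_RUN`, and `copy_key_to_all_hosts` are hypothetical names, and the default host names are only placeholders. By default it prints the commands instead of running them.

```bash
# Hypothetical helper, not part of this example's scripts.
# HOSTS: space-separated host names (assumed to be resolvable on this machine).
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to run them.
HOSTS="${HOSTS:-GPU001 GPU002 GPU003}"
DRY_RUN="${DRY_RUN:-1}"

copy_key_to_all_hosts() {
    local host
    for host in $HOSTS; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh-copy-id -i ~/.ssh/id_rsa.pub $host"
        else
            ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"
        fi
    done
}

copy_key_to_all_hosts
```

Running this once on each of the n hosts gives the n<sup>2</sup> copies mentioned above.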

- On all hosts, edit `/etc/hosts` to record the name and IP of every host. An example is shown below.

```bash
192.168.2.1   GPU001
192.168.2.2   GPU002
192.168.2.3   GPU003
192.168.2.4   GPU004
192.168.2.5   GPU005
192.168.2.6   GPU006
192.168.2.7   GPU007
...
```
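
Because the same entries have to be present on every host, it may help to check the file after editing it. The function below is a sketch under assumptions: `check_hosts_file` is a hypothetical name, and it only checks that each host name appears as a field on a non-comment line.

```bash
# Hypothetical helper, not part of this example's scripts.
# check_hosts_file FILE NAME...
#   Prints "missing: NAME" for every NAME not found in FILE and
#   returns non-zero if at least one name is missing.
check_hosts_file() {
    local file="$1"
    shift
    local name missing=0
    for name in "$@"; do
        # Look for the name in any field after the IP, skipping comment lines.
        if ! awk -v n="$name" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$file"; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_hosts_file /etc/hosts GPU001 GPU002 GPU003
```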

- Restart the SSH service
```
service ssh restart
```
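
Once SSH has been restarted, passwordless login can be verified from each host. This is a minimal sketch rather than part of the example: `check_passwordless_login` and the `SSH` override variable are hypothetical names. It relies on the OpenSSH `BatchMode` option, which makes `ssh` fail instead of prompting for a password.

```bash
# Hypothetical helper, not part of this example's scripts.
# SSH can be overridden (for example with a stub) for testing; defaults to ssh.
SSH="${SSH:-ssh}"

# check_passwordless_login HOST...
#   Prints "ok: HOST" or "FAILED: HOST" and returns non-zero on any failure.
check_passwordless_login() {
    local host failed=0
    for host in "$@"; do
        # Run the no-op command "true" remotely; never read from stdin.
        if $SSH -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
            </dev/null >/dev/null 2>&1; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (run on every host): check_passwordless_login GPU001 GPU002 GPU003
```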

## 1. Corpus Preprocessing
```bash
cd preprocessing
```
Follow the `README.md` in that directory to preprocess the original corpus into h5py and numpy format.

## 2. Pretrain

```bash
cd pretraining
```
Follow the `README.md` in that directory to load the h5py files generated by the preprocessing in step 1 and pretrain the model.

## 3. Finetune

The checkpoint produced by this repo can directly replace `pytorch_model.bin` from [hfl/chinese-roberta-wwm-ext-large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/tree/main). You can then use Hugging Face transformers to finetune on downstream applications.
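
The swap itself can be scripted. The following is a sketch under assumptions: `install_checkpoint` is a hypothetical helper, the checkpoint path depends on your pretraining setup, and `MODEL_DIR` is assumed to be a local copy of the files from the model page linked above.

```bash
# Hypothetical helper, not part of this example's scripts.
# install_checkpoint CKPT MODEL_DIR
#   CKPT:      checkpoint file produced by pretraining (path is an assumption)
#   MODEL_DIR: local copy of the hfl/chinese-roberta-wwm-ext-large files
install_checkpoint() {
    local ckpt="$1" model_dir="$2"
    local target="$model_dir/pytorch_model.bin"
    if [ ! -f "$ckpt" ]; then
        echo "checkpoint not found: $ckpt" >&2
        return 1
    fi
    # Keep the original weights once so the swap can be undone.
    if [ -f "$target" ] && [ ! -f "$target.orig" ]; then
        mv "$target" "$target.orig"
    fi
    cp "$ckpt" "$target"
}

# Usage: install_checkpoint /path/to/your_checkpoint ./chinese-roberta-wwm-ext-large
```

The resulting directory can then be given to transformers as a local model path when loading the model for finetuning.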

## Contributors
This example is contributed by the AI team from [Moore Threads](https://www.mthreads.com/). If you find any problems with pretraining, please file an issue or send an email to yehua.zhang@mthreads.com. Finally, contributions of any form are welcome!