# Introduction

This repo introduces how to pretrain a Chinese RoBERTa-large from scratch, covering preprocessing, pretraining, and finetuning. It can help you quickly train a high-quality BERT-style model.

## 0. Prerequisite

- Install Colossal-AI.
- Edit the port in /etc/ssh/sshd_config and /etc/ssh/ssh_config so that every host exposes the same SSH port for both server and client. If you are a root user, also set **PermitRootLogin** in /etc/ssh/sshd_config to "yes".
- Ensure that each host can log in to every other host without a password. If you have n hosts, the following needs to be executed n² times:
```
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub ip_destination
```
- On all hosts, edit /etc/hosts to record every host's name and IP. An example is shown below.
```bash
192.168.2.1 GPU001
192.168.2.2 GPU002
192.168.2.3 GPU003
192.168.2.4 GPU004
192.168.2.5 GPU005
192.168.2.6 GPU006
192.168.2.7 GPU007
...
```
- Restart ssh:
```
service ssh restart
```

## 1. Corpus Preprocessing

```bash
cd preprocessing
```

Following the `README.md` there, preprocess the original corpus into h5py + numpy format.

## 2. Pretrain

```bash
cd pretraining
```

Following the `README.md` there, load the h5py files generated in step 1 to pretrain the model.

## 3. Finetune

The checkpoint produced by this repo can directly replace `pytorch_model.bin` from [hfl/chinese-roberta-wwm-ext-large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/tree/main). Then use HuggingFace Transformers to finetune downstream applications; a minimal sketch is given at the end of this README.

## Contributors

The repo is contributed by the AI team from [Moore Threads](https://www.mthreads.com/). If you find any problems during pretraining, please file an issue or send an email to yehua.zhang@mthreads.com. Finally, any form of contribution is welcome!

```
@misc{
title={A simple Chinese RoBERTa Example for Whole Word Masked},
author={Yehua Zhang, Chen Zhang},
year={2022}
}
```
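
## Appendix: Loading the Checkpoint with HuggingFace Transformers

As a quick illustration of the checkpoint swap described in step 3, the sketch below assumes a local directory (the name `my_roberta_large/` is hypothetical, not part of this repo) containing the config and tokenizer files downloaded from `hfl/chinese-roberta-wwm-ext-large` plus the `pytorch_model.bin` produced by this repo. It then loads the model with HuggingFace Transformers for a downstream classification task; this is only a minimal example, not the repo's official finetuning script.

```python
# Minimal sketch: finetune the pretrained checkpoint with HuggingFace Transformers.
# "my_roberta_large/" is an assumed local directory holding config.json and vocab.txt
# copied from hfl/chinese-roberta-wwm-ext-large, plus this repo's pytorch_model.bin.
from transformers import BertTokenizer, BertForSequenceClassification

model_dir = "my_roberta_large/"  # hypothetical path, adjust to your setup
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForSequenceClassification.from_pretrained(model_dir, num_labels=2)

# Quick sanity check that the weights load and produce logits of the expected shape.
inputs = tokenizer("这是一个测试句子。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]); train with Trainer or a custom loop
```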