
Introduction

This example shows how to pretrain a Chinese RoBERTa-large model from scratch, covering preprocessing, pretraining, and finetuning. It can help you quickly train a high-quality BERT-style model.

0. Prerequisite

  • Install Colossal-AI
  • Edit the port in /etc/ssh/sshd_config and /etc/ssh/ssh_config so that every host uses the same SSH port for both server and client. If you log in as root, also set PermitRootLogin in /etc/ssh/sshd_config to "yes".
  • Ensure that every host can log in to every other host without a password. With n hosts, the commands below must be run n² times in total.
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub ip_destination
  • On every host, edit /etc/hosts to record the name and IP address of all hosts. An example is shown below.
192.168.2.1   GPU001
192.168.2.2   GPU002
192.168.2.3   GPU003
192.168.2.4   GPU004
192.168.2.5   GPU005
192.168.2.6   GPU006
192.168.2.7   GPU007
...
  • Restart the SSH service so the changes take effect.
service ssh restart
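
The key-distribution step above has to be repeated from every host to every destination. As a rough sketch, the loop below prints one ssh-copy-id command per destination for the current host (or runs them when DRY_RUN=0); running it on each of the n hosts gives the n² copies mentioned above. The host names reuse the /etc/hosts example, and the function names, the HOSTS variable, and the DRY_RUN switch are assumptions made for illustration, not part of this repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper, assuming ssh-keygen has already been run on this host.
# HOSTS mirrors the example /etc/hosts entries; adjust it to your cluster.
HOSTS="${HOSTS:-GPU001 GPU002 GPU003 GPU004}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Copy this host's public key to every host in HOSTS.
copy_keys() {
    for host in $HOSTS; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh-copy-id -i ~/.ssh/id_rsa.pub $host"
        else
            ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"
        fi
    done
}

# BatchMode makes ssh fail instead of prompting for a password, so a
# non-zero exit status means passwordless login is not set up yet.
check_login() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

copy_keys
```

Once the keys are copied, something like `check_login GPU002` can confirm that a given host accepts the login without a password prompt.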

1. Corpus Preprocessing

cd preprocessing

Following the README.md in that directory, preprocess the original corpus into h5py + numpy format.

2. Pretrain

cd pretraining

Following the README.md in that directory, load the h5py files generated by the preprocessing in step 1 to pretrain the model.

3. Finetune

The checkpoint produced by this repo can directly replace pytorch_model.bin from hfl/chinese-roberta-wwm-ext-large. Then use transformers from HuggingFace to finetune on downstream tasks.
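
As a minimal sketch of the replacement step, assuming the pretrained checkpoint is a single file and hfl/chinese-roberta-wwm-ext-large has already been downloaded to a local directory, the function below backs up the original weights and copies the new checkpoint into place. The function name and both paths are hypothetical placeholders, not names used by this repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: swap in a pretrained checkpoint as pytorch_model.bin.
# $1 = checkpoint file produced by pretraining (path is an assumption)
# $2 = local copy of hfl/chinese-roberta-wwm-ext-large (path is an assumption)
replace_weights() {
    ckpt="$1"
    model_dir="$2"
    target="$model_dir/pytorch_model.bin"

    [ -f "$ckpt" ] || { echo "checkpoint not found: $ckpt" >&2; return 1; }
    [ -d "$model_dir" ] || { echo "model directory not found: $model_dir" >&2; return 1; }

    # Keep the original weights once so the swap can be undone.
    if [ -f "$target" ] && [ ! -f "$target.orig" ]; then
        cp "$target" "$target.orig"
    fi
    cp "$ckpt" "$target"
}

# Example call with placeholder paths:
# replace_weights ./my_checkpoint.bin ./chinese-roberta-wwm-ext-large
```

The resulting directory can then be given to transformers as a local model path in place of the hub model name.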

Contributors

This example was contributed by the AI team at Moore Threads. If you run into any problems with pretraining, please file an issue or send an email to yehua.zhang@mthreads.com. Contributions of any form are welcome!

@misc{
  title={A simple Chinese RoBERTa Example for Whole Word Masked},
  author={Yehua Zhang and Chen Zhang},
  year={2022}
}