## Introduction
This example introduces how to pretrain RoBERTa from scratch, covering preprocessing, pretraining, and finetuning. It can help you quickly train a high-quality RoBERTa model.
## 0. Prerequisite
- Install Colossal-AI.
- Edit the SSH port in `/etc/ssh/sshd_config` and `/etc/ssh/ssh_config` so that every host exposes the same SSH port for both server and client. If you are a root user, also set `PermitRootLogin yes` in `/etc/ssh/sshd_config`.
- Ensure that every host can log in to every other host without a password. With n hosts, `ssh-copy-id` must be run n² times in total (a minimal sketch is shown after this list): first generate a key with `ssh-keygen`, then copy it with `ssh-copy-id -i ~/.ssh/id_rsa.pub ip_destination`.
- On every host, edit `/etc/hosts` to record the names and IPs of all hosts. An example is shown below.
```text
192.168.2.1 GPU001
192.168.2.2 GPU002
192.168.2.3 GPU003
192.168.2.4 GPU004
192.168.2.5 GPU005
192.168.2.6 GPU006
192.168.2.7 GPU007
...
```
- Restart the SSH service on every host: `service ssh restart`.
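For the passwordless-login step above, the sketch below is only an illustration (it is not a script shipped with this repo) of copying the local public key to every other host; the host names come from the `/etc/hosts` example, and running the loop on each of the n hosts is where the n² total comes from.

```python
# Hypothetical helper: copy this host's public key to every other host so that
# passwordless SSH works. Run it on each of the n hosts (n x n = n^2 copies).
# The host names below are just the ones from the /etc/hosts example above.
import os
import subprocess

pubkey = os.path.expanduser("~/.ssh/id_rsa.pub")  # generated beforehand with ssh-keygen
hosts = ["GPU001", "GPU002", "GPU003", "GPU004", "GPU005", "GPU006", "GPU007"]

for host in hosts:
    subprocess.run(["ssh-copy-id", "-i", pubkey, host], check=True)
```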
## 1. Corpus Preprocessing
Run `cd preprocessing` and, following the `README.md` there, preprocess the original corpus into h5py and numpy files.
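To sanity-check the preprocessed output, a quick inspection of an h5py file can help. The sketch below is only an illustration: the file name is a placeholder, and the actual dataset names are defined by the scripts in `preprocessing/`, not by this snippet.

```python
# Hypothetical sanity check of a preprocessed file; "pretrain_data.h5" is a
# placeholder -- substitute whatever file the preprocessing step produced.
import h5py

def show(name, obj):
    # Print every dataset in the file with its shape and dtype.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File("pretrain_data.h5", "r") as f:
    f.visititems(show)
```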
## 2. Pretrain
Run `cd pretraining` and, following the `README.md` there, load the h5py files generated in step 1 to pretrain the model.
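For intuition only, the sketch below shows one way the h5py output from step 1 could be wrapped as a PyTorch dataset; the dataset names `input_ids` and `attention_mask` are assumptions, and the real data loader used for pretraining lives under `pretraining/`.

```python
# A hypothetical wrapper around the preprocessed h5py file; the dataset names
# below are assumptions, not the ones guaranteed by the preprocessing step.
import h5py
import torch
from torch.utils.data import Dataset

class PretrainH5Dataset(Dataset):
    def __init__(self, path: str):
        self.file = h5py.File(path, "r")
        self.input_ids = self.file["input_ids"]            # assumed dataset name
        self.attention_mask = self.file["attention_mask"]  # assumed dataset name

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            "input_ids": torch.as_tensor(self.input_ids[idx], dtype=torch.long),
            "attention_mask": torch.as_tensor(self.attention_mask[idx], dtype=torch.long),
        }
```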
## 3. Finetune
The checkpoint produced by this repo can directly replace `pytorch_model.bin` of `hfl/chinese-roberta-wwm-ext-large`. You can then use Hugging Face `transformers` to finetune downstream applications.
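As a minimal sketch of that workflow (the local directory name and the label count are assumptions, not fixed by this repo), loading the swapped-in checkpoint for a downstream classification task could look like this:

```python
# Minimal sketch: load a local copy of hfl/chinese-roberta-wwm-ext-large whose
# pytorch_model.bin has been replaced by the checkpoint from this repo.
# "./chinese-roberta-wwm-ext-large" and num_labels=2 are assumptions.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "./chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir, num_labels=2)

inputs = tokenizer("这是一个测试句子。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```

From there, finetuning can proceed with the standard `transformers` training utilities or a custom PyTorch training loop.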
## Contributors
This example is contributed by the AI team from Moore Threads. If you encounter any problems with pretraining, please file an issue or send an email to yehua.zhang@mthreads.com. Any form of contribution is welcome!