diff --git a/colossalai/shardformer/README.md b/colossalai/shardformer/README.md
index 3ce4baa64..c8670affb 100644
--- a/colossalai/shardformer/README.md
+++ b/colossalai/shardformer/README.md
@@ -116,18 +116,18 @@ We will follow this roadmap to develop Shardformer:

 | model | tensor parallel | pipeline parallel | lazy initialization | xformer | flash attn2 | jit fused operator | fused layernorm | sequence parallel | overlap |
 | :------: | :-----: | :-----: | :--------: | :---------: | :------: | :-----: | :-----: | :--------: | :---------: |
-| bert | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
-| t5 | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [ ] | [ ] |
-| llama V1/V2 | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [ ] | [ ] |
-| gpt2 | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
-| opt | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [ ] | [ ] |
-| bloom | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
-| chatglm2 | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
-| vit | [x] | [x] | [ ] | [x] | [x] | [x] | [x] | [ ] | [ ] |
-| whisper | [x] | [x] | [x] | [x] | [x] | [ ] | [x] | [ ] | [ ] |
-| sam | [x] | [ ] | [ ] | [x] | [x] | [x] | [x] | [ ] | [ ] |
-| blip2 | [x] | [ ] | [ ] | [x] | [x] | [x] | [x] | [ ] | [ ] |
-| falcon | [x] | [x] | [x] | [x] | [x] | [ ] | [x] | [ ] | [ ] |
+| bert | [√] | [√] | [√] | [√] | [√] | [√] | [√] | [√] | [√] |
+| t5 | [√] | [√] | [√] | [√] | [√] | [√] | [√] | [ ] | [ ] |
+| llama V1/V2 | [√] | [√] | [√] | [√] | [√] | [√] | [√] | [ ] | [ ] |
+| gpt2 | [√] | [√] | [√] | [√] | [√] | [√] | [√] | [√] | [√] |
+| opt | [√] | [√] | [√] | [√] | [√] | [√] | [√] | [ ] | [ ] |
+| bloom | [√] | [√] | [√] | [√] | [√] | [√] | [√] | [√] | [√] |
+| chatglm2 | [√] | [√] | [√] | [√] | [√] | [√] | [√] | [√] | [√] |
+| vit | [√] | [√] | [ ] | [√] | [√] | [√] | [√] | [ ] | [ ] |
+| whisper | [√] | [√] | [√] | [√] | [√] | [ ] | [√] | [ ] | [ ] |
+| sam | [√] | [ ] | [ ] | [√] | [√] | [√] | [√] | [ ] | [ ] |
+| blip2 | [√] | [ ] | [ ] | [√] | [√] | [√] | [√] | [ ] | [ ] |
+| falcon | [√] | [√] | [√] | [√] | [√] | [ ] | [√] | [ ] | [ ] |
 | roberta | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] |
 | albert | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] |
 | ernie | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] |
@@ -137,7 +137,7 @@ We will follow this roadmap to develop Shardformer:
 | swin | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] |
 | swin V2 | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] |
 | qwen | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] | [ ] |
-| mistral | [x] | [ ] | [ ] | [x] | [x] | [x] | [x] | [ ] | [ ] |
+| mistral | [√] | [ ] | [ ] | [√] | [√] | [√] | [√] | [ ] | [ ] |

 ## 💡 API Design

diff --git a/examples/language/llama2/README.md b/examples/language/llama2/README.md
index f29b9dcdd..752453b5a 100644
--- a/examples/language/llama2/README.md
+++ b/examples/language/llama2/README.md
@@ -6,7 +6,6 @@

 - 70 billion parameter LLaMA2 model training accelerated by 195%
-[[code]](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/llama2)
 [[blog]](https://www.hpc-ai.tech/blog/70b-llama2-training)

 ### LLaMA1
@@ -15,7 +14,6 @@

 - 65-billion-parameter large model pretraining accelerated by 38%
-[[code]](https://github.com/hpcaitech/ColossalAI/tree/example/llama/examples/language/llama)
 [[blog]](https://www.hpc-ai.tech/blog/large-model-pretraining)

 ## Dataset
@@ -123,7 +121,7 @@ Here we will show an example of how to run training llama pretraining with
 `gemini, batch_size=16, sequence_length=4096, gradient_checkpoint=True, flash_attn=True`.

 #### a. Running environment
-This experiment was performed on 4 computing nodes with 32 A800 GPUs in total for LLaMA-1 65B. The nodes are
+This experiment was performed on 4 computing nodes with 32 A800/H800 80GB GPUs in total for LLaMA-1 65B or LLaMA-2 70B. The nodes are
 connected with RDMA and GPUs within one node are fully connected with NVLink.

 #### b. Running command
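Note (illustrative, not part of the patch): the last hunk above references the benchmark configuration `gemini, batch_size=16, sequence_length=4096, gradient_checkpoint=True, flash_attn=True`. The sketch below shows roughly how such a setup maps onto ColossalAI's `Booster`/`GeminiPlugin` API with a Hugging Face LLaMA model. It is an assumption-laden outline rather than the example's actual benchmark script; the learning rate, dataloader, and flash-attention step are placeholders.

```python
# A minimal sketch, assuming ColossalAI's Booster/GeminiPlugin API and a
# Hugging Face LLaMA model; the example's real script and argument names may differ.
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam
from transformers import LlamaConfig, LlamaForCausalLM

colossalai.launch_from_torch(config={})  # launch under torchrun / colossalai run

# "gemini": ZeRO-style sharding with chunk-based heterogeneous memory management.
plugin = GeminiPlugin()
booster = Booster(plugin=plugin)

# sequence_length=4096 -> size the model's position embeddings accordingly.
config = LlamaConfig(max_position_embeddings=4096)
model = LlamaForCausalLM(config)
model.gradient_checkpointing_enable()  # gradient_checkpoint=True
# flash_attn=True is applied in the example via its own attention patch (omitted here).

optimizer = HybridAdam(model.parameters(), lr=3e-4)  # lr is a placeholder value
model, optimizer, *_ = booster.boost(model, optimizer)
# batch_size=16 is the batch fed per step; the dataloader and training loop are omitted.
```

With `GeminiPlugin`, `booster.boost` wraps the model and optimizer so that parameters, gradients, and optimizer states are sharded across the GPUs described in the updated running-environment paragraph.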