Commit Graph

3073 Commits (868afdb31191ef7b3fa48d6fa71e7758c8707786)

Wang Binluo 868afdb311
Dev/zero offload (#5858)
* fix llama

* fix llama
2024-06-26 16:07:06 +08:00
Wang Binluo de3f67d128
fix llama (#5856) 2024-06-26 10:15:13 +08:00
Wang Binluo 4c06215dce
Merge pull request #5844 from wangbluo/offload
Update Qwen2 model
2024-06-20 17:07:57 +08:00
Wang Binluo e893f88a4f
Merge branch 'dev/zero-offload' into offload 2024-06-20 17:07:24 +08:00
wangbluo d4ff644ef3 update qwen model 2024-06-20 09:04:57 +00:00
wangbluo dba59354d7 remove vocab_size args 2024-06-20 08:06:39 +00:00
Wang Binluo 35ef72bfd1
Merge pull request #5842 from wangbluo/dev/zero-offload
update llama model
2024-06-20 15:37:03 +08:00
pre-commit-ci[bot] 351a1c269b [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2024-06-20 06:50:40 +00:00
wangbluo b12e9a3275 update llama model 2024-06-20 06:46:25 +00:00
wangbluo 52ea64824e remove 4d attention mask 2024-06-19 09:28:08 +00:00
pre-commit-ci[bot] df612434c9 [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2024-06-14 16:27:46 +08:00
Wang Binluo 4c69e2dc91 support qwen model 2024-06-14 16:27:46 +08:00
Wenhao Chen 32e642bf40 revert: enable return_outputs when necessary 2024-06-14 16:27:46 +08:00
Wenhao Chen 856b39f69d to: add qwen2 auto policy 2024-06-14 16:27:46 +08:00
Wenhao Chen 6fa181ebef feat: add qwen2 to model_zoo 2024-06-14 16:27:46 +08:00
Wenhao Chen 14305c9449 test: add qwen2 shard test 2024-06-14 16:27:46 +08:00
Wenhao Chen 5512bdf1fc fix: modify model config and add Qwen2RMSNorm 2024-06-14 16:27:46 +08:00
Wenhao Chen 5c2a47a667 feat: support qwen2 model 2024-06-14 16:27:46 +08:00
Wenhao Chen 61545fcfee feat: add `sub_dp_size` in plugin 2024-04-01 16:02:12 +08:00
Wenhao Chen 6ceaf4f1f8 tests: add `sub_dp_group` test 2024-04-01 16:02:12 +08:00
Wenhao Chen 9291f07964 feat: add `sub_dp_group` 2024-04-01 16:02:12 +08:00
Wenhao Chen 1aaa453706 perf: use async copy to accelerate memcpy 2024-04-01 16:02:12 +08:00
Wenhao Chen a53c8c1ade to: remove MoE temporarily 2024-04-01 16:02:12 +08:00
Wenhao Chen 93aaa21d4a feat: add `DataPrefetcher` 2024-04-01 16:02:12 +08:00
Wenhao Chen a1ab2d374e misc: add offload warning 2024-04-01 16:02:12 +08:00
Wenhao Chen e614aa34f3
[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogeneous shard policy for llama (#5508)
* feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig`

* feat: apply `GradientCheckpointConfig` to policy and llama_forward

* feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager

* fix: add optional args for `distribute_layer` and `get_stage_index`

* fix: fix changed API calls

* test: update llama tests

* style: polish `GradientCheckpointConfig`

* fix: fix pipeline utils tests
2024-04-01 11:34:58 +08:00
YeAnbang df5e9c53cf
[ColossalChat] Update RLHF V2 (#5286)
* Add dpo. Fix sft, ppo, lora. Refactor all

* fix and tested ppo

* 2nd round refactor

* add ci tests

* fix ci

* fix ci

* fix readme, style

* fix readme style

* fix style, fix benchmark

* reproduce benchmark result, remove useless files

* rename to ColossalChat

* use new image

* fix ci workflow

* fix ci

* use local model/tokenizer for ci tests

* fix ci

* fix ci

* fix ci

* fix ci timeout

* fix rm progress bar. fix ci timeout

* fix ci

* fix ci typo

* remove 3d plugin from ci temporary

* test environment

* cannot save optimizer

* support chat template

* fix readme

* fix path

* test ci locally

* restore build_or_pr

* fix ci data path

* fix benchmark

* fix ci, move ci tests to 3080, disable fast tokenizer

* move ci to 85

* support flash attention 2

* add all-in-one data preparation script. Fix colossal-llama2-chat chat template

* add hardware requirements

* move ci test data

* fix save_model, add unwrap

* fix missing bos

* fix missing bos; support grad accumulation with gemini

* fix ci

* fix ci

* fix ci

* fix llama2 chat template config

* debug sft

* debug sft

* fix colossalai version requirement

* fix ci

* add sanity check to prevent NaN loss

* fix requirements

* add dummy data generation script

* add dummy data generation script

* add dummy data generation script

* add dummy data generation script

* update readme

* update readme

* update readme and ignore

* fix logger bug

* support parallel_output

* modify data preparation logic

* fix tokenization

* update lr

* fix inference

* run pre-commit

---------

Co-authored-by: Tong Li <tong.li352711588@gmail.com>
2024-03-29 14:12:29 +08:00
Yuanheng Zhao 36c4bb2893
[Fix] Grok-1 use tokenizer from the same pretrained path (#5532)
* [fix] use tokenizer from the same pretrained path

* trust remote code
2024-03-28 16:30:04 +08:00
Insu Jang 00525f7772
[shardformer] fix pipeline forward error if custom layer distribution is used (#5189)
* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution

* Change static methods for t5 layer distribution to member functions

* Change static methods for whisper layer distribution to member functions

* Replace whisper policy usage with self one

* Fix test case to use non-static layer distribution methods

* fix: fix typo

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-03-27 13:57:00 +08:00
github-actions[bot] e6707a6e8d
[format] applied code formatting on changed files in pull request 5510 (#5517)
Co-authored-by: github-actions <github-actions@github.com>
2024-03-27 11:21:03 +08:00
Hongxin Liu 19e1a5cf16
[shardformer] update colo attention to support custom mask (#5510)
* [feature] refactor colo attention (#5462)

* [extension] update api

* [feature] add colo attention

* [feature] update sdpa

* [feature] update npu attention

* [feature] update flash-attn

* [test] add flash attn test

* [test] update flash attn test

* [shardformer] update modeling to fit colo attention (#5465)

* [misc] refactor folder structure

* [shardformer] update llama flash-attn

* [shardformer] fix llama policy

* [devops] update tensornvme install

* [test] update llama test

* [shardformer] update colo attn kernel dispatch

* [shardformer] update blip2

* [shardformer] update chatglm

* [shardformer] update gpt2

* [shardformer] update gptj

* [shardformer] update opt

* [shardformer] update vit

* [shardformer] update colo attention mask prep

* [shardformer] update whisper

* [test] fix shardformer tests (#5514)

* [test] fix shardformer tests

* [test] fix shardformer tests
2024-03-27 11:19:32 +08:00
Edenzzzz 9a3321e9f4
Merge pull request #5515 from Edenzzzz/fix_layout_convert
Fix layout convertor caching
2024-03-26 19:51:02 +08:00
Edenzzzz 18edcd5368 Empty-Commit 2024-03-26 19:50:41 +08:00
Edenzzzz 61da3fbc52 fixed layout converter caching and updated tester 2024-03-26 17:22:27 +08:00
Rocky Duan cbe34c557c
Fix ColoTensorSpec for py11 (#5440) 2024-03-26 15:56:49 +08:00
Hongxin Liu a7790a92e8
[devops] fix example test ci (#5504) 2024-03-26 15:09:05 +08:00
Yuanheng Zhao 131f32a076
[fix] fix grok-1 example typo (#5506) 2024-03-26 10:19:42 +08:00
flybird11111 0688d92e2d
[shardformer]Fix lm parallel. (#5480)
* fix

* padding vocab_size when using pipeline parallelism

padding vocab_size when using pipeline parallelism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert

* fix lm forward distribution

* fix

* test ci

* fix
2024-03-25 17:21:51 +08:00
binmakeswell 34e909256c
[release] grok-1 inference benchmark (#5500)
* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark
2024-03-25 14:42:51 +08:00
Wenhao Chen bb0a668fee
[hotfix] set return_outputs=False in examples and polish code (#5404)
* fix: simplify merge_batch

* fix: use return_outputs=False to eliminate extra memory consumption

* feat: add return_outputs warning

* style: remove `return_outputs=False` as it is the default value
2024-03-25 12:31:09 +08:00
Yuanheng Zhao 5fcd7795cd
[example] update Grok-1 inference (#5495)
* revise grok-1 example

* remove unused arg in scripts

* prevent re-installing torch

* update readme

* revert modifying colossalai requirements

* add perf

* trivial

* add tokenizer url
2024-03-24 20:24:11 +08:00
binmakeswell 6df844b8c4
[release] grok-1 314b inference (#5490)
* [release] grok-1 inference

* [release] grok-1 inference

* [release] grok-1 inference
2024-03-22 15:48:12 +08:00
Hongxin Liu 848a574c26
[example] add grok-1 inference (#5485)
* [misc] add submodule

* remove submodule

* [example] support grok-1 tp inference

* [example] add grok-1 inference script

* [example] refactor code

* [example] add grok-1 readme

* [example] add test ci

* [example] update readme
2024-03-21 18:07:22 +08:00
binmakeswell d158fc0e64
[doc] update open-sora demo (#5479)
* [doc] update open-sora demo

* [doc] update open-sora demo

* [doc] update open-sora demo
2024-03-20 16:08:41 +08:00
binmakeswell bd998ced03
[doc] release Open-Sora 1.0 with model weights (#5468)
* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights
2024-03-18 18:31:18 +08:00
flybird11111 5e16bf7980
[shardformer] fix gathering output when using tensor parallelism (#5431)
* fix

* padding vocab_size when using pipeline parallelism

padding vocab_size when using pipeline parallelism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert
2024-03-18 15:55:11 +08:00
Hongxin Liu f2e8b9ef9f
[devops] fix compatibility (#5444)
* [devops] fix compatibility

* [hotfix] update compatibility test on pr

* [devops] fix compatibility

* [devops] record duration during comp test

* [test] decrease test duration

* fix falcon
2024-03-13 15:24:13 +08:00
digger yu 385e85afd4
[hotfix] fix typo s/keywrods/keywords etc. (#5429) 2024-03-12 11:25:16 +08:00
Camille Zhong da885ed540
fix tensor data update for gemini loss calculation (#5442) 2024-03-11 13:49:58 +08:00
Hongxin Liu 8020f42630
[release] update version (#5411) 2024-03-07 23:36:07 +08:00