Commit Graph

1659 Commits (11009103beb7298f9f50423fe1de702a6bbab5d6)

Author SHA1 Message Date
digger yu 11009103be
[nfc] fix some typo with colossalai/ docs/ etc. (#4920) 2023-10-18 15:44:04 +08:00
github-actions[bot] 486d06a2d5
[format] applied code formatting on changed files in pull request 4820 (#4886)
Co-authored-by: github-actions <github-actions@github.com>
2023-10-18 11:46:37 +08:00
Zhongkai Zhao c7aa319ba0
[test] add no master test for low level zero plugin (#4934) 2023-10-18 11:41:23 +08:00
Hongxin Liu 1f5d2e8062
[hotfix] fix torch 2.0 compatibility (#4936)
* [hotfix] fix launch

* [test] fix test gemini optim

* [shardformer] fix vit
2023-10-18 11:05:25 +08:00
Baizhou Zhang 21ba89cab6
[gemini] support gradient accumulation (#4869)
* add test

* fix no_sync bug in low level zero plugin

* fix test

* add argument for grad accum

* add grad accum in backward hook for gemini

* finish implementation, rewrite tests

* fix test

* skip stuck model in low level zero test

* update doc

* optimize communication & fix gradient checkpoint

* modify doc

* cleaning codes

* update cpu adam fp16 case
2023-10-17 14:07:21 +08:00
Hongxin Liu 4f68b3f10c
[kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921)
* [kernel] support pure fp16 for cpu adam (#4896)

* [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919)

* [kernel] fix cpu adam

* [test] update gemini optim test
2023-10-16 21:56:53 +08:00
Xu Kai 611a5a80ca
[inference] Add smmoothquant for llama (#4904)
* [inference] add int8 rotary embedding kernel for smoothquant (#4843)

* [inference] add smoothquant llama attention (#4850)

* add smoothquant llama attention

* remove uselss code

* remove useless code

* fix import error

* rename file name

* [inference] add silu linear fusion for smoothquant llama mlp  (#4853)

* add silu linear

* update skip condition

* catch smoothquant cuda lib exception

* prcocess exception for tests

* [inference] add llama mlp for smoothquant (#4854)

* add llama mlp for smoothquant

* fix down out scale

* remove duplicate lines

* add llama mlp check

* delete useless code

* [inference] add smoothquant llama (#4861)

* add smoothquant llama

* fix attention accuracy

* fix accuracy

* add kv cache and save pretrained

* refactor example

* delete smooth

* refactor code

* [inference] add smooth function and delete useless code for smoothquant (#4895)

* add smooth function and delete useless code

* update datasets

* remove duplicate import

* delete useless file

* refactor codes (#4902)

* rafactor code

* add license

* add torch-int and smoothquant license
2023-10-16 11:28:44 +08:00
Zhongkai Zhao a0684e7bd6
[feature] support no master weights option for low level zero plugin (#4816)
* [feature] support no master weights for low level zero plugin

* [feature] support no master weights for low level zero plugin, remove data copy when no master weights

* remove data copy and typecasting when no master weights

* not load weights to cpu when using no master weights

* fix grad: use fp16 grad when no master weights

* only do not update working param when no master weights

* fix: only do not update working param when no master weights

* fix: passing params in dict format in hybrid plugin

* fix: remove extra params (tp_process_group) in hybrid_parallel_plugin
2023-10-13 07:57:45 +00:00
Xu Kai 77a9328304
[inference] add llama2 support (#4898)
* add llama2 support

* fix multi group bug
2023-10-13 13:09:23 +08:00
Baizhou Zhang 39f2582e98
[hotfix] fix lr scheduler bug in torch 2.0 (#4864) 2023-10-12 14:04:24 +08:00
littsk 83b52c56cd
[feature] Add clip_grad_norm for hybrid_parallel_plugin (#4837)
* Add clip_grad_norm for hibrid_parallel_plugin

* polish code

* add unittests

* Move tp to a higher-level optimizer interface.

* bug fix

* polish code
2023-10-12 11:32:37 +08:00
Hongxin Liu df63564184
[gemini] support amp o3 for gemini (#4872)
* [gemini] support no reuse fp16 chunk

* [gemini] support no master weight for optim

* [gemini] support no master weight for gemini ddp

* [test] update gemini tests

* [test] update gemini tests

* [plugin] update gemini plugin

* [test] fix gemini checkpointio test

* [test] fix gemini checkpoint io
2023-10-12 10:39:08 +08:00
ppt0011 1dcaf249bd [doc] add reminder for issue encountered with hybrid adam 2023-10-11 17:51:14 +08:00
Bin Jia 08a9f76b2f
[Pipeline Inference] Sync pipeline inference branch to main (#4820)
* [pipeline inference] pipeline inference (#4492)

* add pp stage manager as circle stage

* fix a bug when create process group

* add ppinfer basic framework

* add micro batch manager and support kvcache-pp gpt2 fwd

* add generate schedule

* use mb size to control mb number

* support generate with kv cache

* add output, remove unused code

* add test

* reuse shardformer to build model

* refactor some code and use the same attribute name of hf

* fix review and add test for generation

* remove unused file

* fix CI

* add cache clear

* fix code error

* fix typo

* [Pipeline inference] Modify to tieweight (#4599)

* add pp stage manager as circle stage

* fix a bug when create process group

* add ppinfer basic framework

* add micro batch manager and support kvcache-pp gpt2 fwd

* add generate schedule

* use mb size to control mb number

* support generate with kv cache

* add output, remove unused code

* add test

* reuse shardformer to build model

* refactor some code and use the same attribute name of hf

* fix review and add test for generation

* remove unused file

* modify the way of saving newtokens

* modify to tieweight

* modify test

* remove unused file

* solve review

* add docstring

* [Pipeline inference] support llama pipeline inference (#4647)

* support llama pipeline inference

* remove tie weight operation

* [pipeline inference] Fix the blocking of communication when ppsize is 2 (#4708)

* add benchmark verbose

* fix export tokens

* fix benchmark verbose

* add P2POp style to do p2p communication

* modify schedule as p2p type when ppsize is 2

* remove unused code and add docstring

* [Pipeline inference] Refactor code, add docsting, fix bug (#4790)

* add benchmark script

* update argparse

* fix fp16 load

* refactor code style

* add docstring

* polish code

* fix test bug

* [Pipeline inference] Add pipeline inference docs (#4817)

* add readme doc

* add a ico

* Add performance

* update table of contents

* refactor code (#4873)
2023-10-11 11:40:06 +08:00
Camille Zhong cd6a962e66 [NFC] polish code style (#4799) 2023-10-07 13:36:52 +08:00
Michelle 07ed155e86 [NFC] polish colossalai/inference/quant/gptq/cai_gptq/__init__.py code style (#4792) 2023-10-07 13:36:52 +08:00
littsk eef96e0877 polish code for gptq (#4793) 2023-10-07 13:36:52 +08:00
Hongxin Liu cb3a25a062
[checkpointio] hotfix torch 2.0 compatibility (#4824) 2023-10-07 10:45:52 +08:00
shaoyuw c97a3523db fix: typo in comment of low_level_zero plugin 2023-10-05 16:30:34 +00:00
Xu Kai d1fcc0fa4d
[infer] fix test bug (#4838)
* fix test bug

* delete useless code

* fix typo
2023-10-04 10:01:03 +08:00
Jianghai 013a4bedf0
[inference]fix import bug and delete down useless init (#4830)
* fix import bug and release useless init

* fix

* fix

* fix
2023-10-04 09:18:45 +08:00
Xu Kai c3bef20478
add autotune (#4822) 2023-09-28 13:47:35 +08:00
binmakeswell 822051d888
[doc] update slack link (#4823) 2023-09-27 17:37:39 +08:00
littsk 11f1e426fe
[hotfix] Correct several erroneous code comments (#4794) 2023-09-27 10:43:03 +08:00
littsk 54b3ad8924
[hotfix] fix norm type error in zero optimizer (#4795) 2023-09-27 10:35:24 +08:00
Hongxin Liu da15fdb9ca
[doc] add lazy init docs (#4808) 2023-09-27 10:24:04 +08:00
Yan haixu a22706337a
[misc] add last_epoch in CosineAnnealingWarmupLR (#4778) 2023-09-26 14:43:46 +08:00
Hongxin Liu 4965c0dabd
[lazy] support from_pretrained (#4801)
* [lazy] patch from pretrained

* [lazy] fix from pretrained and add tests

* [devops] update ci
2023-09-26 11:04:11 +08:00
Baizhou Zhang 64a08b2dc3
[checkpointio] support unsharded checkpointIO for hybrid parallel (#4774)
* support unsharded saving/loading for model

* support optimizer unsharded saving

* update doc

* support unsharded loading for optimizer

* small fix
2023-09-26 10:58:03 +08:00
Baizhou Zhang a2db75546d
[doc] polish shardformer doc (#4779)
* fix example format in docstring

* polish shardformer doc
2023-09-26 10:57:47 +08:00
Jianghai ce7ade3882
[inference] chatglm2 infer demo (#4724)
* add chatglm2

* add

* gather needed kernels

* fix some bugs

* finish context forward

* finish context stage

* fix

* add

* pause

* add

* fix bugs

* finish chatglm

* fix bug

* change some logic

* fix bugs

* change some logics

* add

* add

* add

* fix

* fix tests

* fix
2023-09-22 11:12:50 +08:00
Xu Kai 946ab56c48
[feature] add gptq for inference (#4754)
* [gptq] add gptq kernel (#4416)

* add gptq

* refactor code

* fix tests

* replace auto-gptq

* rname inferance/quant

* refactor test

* add auto-gptq as an option

* reset requirements

* change assert and check auto-gptq

* add import warnings

* change test flash attn version

* remove example

* change requirements of flash_attn

* modify tests

* [skip ci] change requirements-test

* [gptq] faster gptq cuda kernel (#4494)

* [skip ci] add cuda kernels

* add license

* [skip ci] fix max_input_len

* format files & change test size

* [skip ci]

* [gptq] add gptq tensor parallel (#4538)

* add gptq tensor parallel

* add gptq tp

* delete print

* add test gptq check

* add test auto gptq check

* [gptq] combine gptq and kv cache manager (#4706)

* combine gptq and kv cache manager

* add init bits

* delete useless code

* add model path

* delete usless print and update test

* delete usless import

* move option gptq to shard config

* change replace linear to shardformer

* update bloom policy

* delete useless code

* fix import bug and delete uselss code

* change colossalai/gptq to colossalai/quant/gptq

* update import linear for tests

* delete useless code and mv gptq_kernel to kernel directory

* fix triton kernel

* add triton import
2023-09-22 11:02:50 +08:00
littsk 1e0e080837
[bug] Fix the version check bug in colossalai run when generating the cmd. (#4713)
* Fix the version check bug in colossalai run when generating the cmd.

* polish code
2023-09-22 10:50:47 +08:00
Hongxin Liu 3e05c07bb8
[lazy] support torch 2.0 (#4763)
* [lazy] support _like methods and clamp

* [lazy] pass transformers models

* [lazy] fix device move and requires grad

* [lazy] fix requires grad and refactor api

* [lazy] fix requires grad
2023-09-21 16:30:23 +08:00
Baizhou Zhang df66741f77
[bug] fix get_default_parser in examples (#4764) 2023-09-21 10:42:25 +08:00
Baizhou Zhang c0a033700c
[shardformer] fix master param sync for hybrid plugin/rewrite unwrapping logic (#4758)
* fix master param sync for hybrid plugin

* rewrite unwrap for ddp/fsdp

* rewrite unwrap for zero/gemini

* rewrite unwrap for hybrid plugin

* fix geemini unwrap

* fix bugs
2023-09-20 18:29:37 +08:00
Hongxin Liu 079bf3cb26
[misc] update pre-commit and run all files (#4752)
* [misc] update pre-commit

* [misc] run pre-commit

* [misc] remove useless configuration files

* [misc] ignore cuda for clang-format
2023-09-19 14:20:26 +08:00
Hongxin Liu b5f9e37c70
[legacy] clean up legacy code (#4743)
* [legacy] remove outdated codes of pipeline (#4692)

* [legacy] remove cli of benchmark and update optim (#4690)

* [legacy] remove cli of benchmark and update optim

* [doc] fix cli doc test

* [legacy] fix engine clip grad norm

* [legacy] remove outdated colo tensor (#4694)

* [legacy] remove outdated colo tensor

* [test] fix test import

* [legacy] move outdated zero to legacy (#4696)

* [legacy] clean up utils (#4700)

* [legacy] clean up utils

* [example] update examples

* [legacy] clean up amp

* [legacy] fix amp module

* [legacy] clean up gpc (#4742)

* [legacy] clean up context

* [legacy] clean core, constants and global vars

* [legacy] refactor initialize

* [example] fix examples ci

* [example] fix examples ci

* [legacy] fix tests

* [example] fix gpt example

* [example] fix examples ci

* [devops] fix ci installation

* [example] fix examples ci
2023-09-18 16:31:06 +08:00
Xuanlei Zhao 32e7f99416
[kernel] update triton init #4740 (#4740) 2023-09-18 09:44:27 +08:00
flybird11111 4c4482f3ad
[example] llama2 add fine-tune example (#4673)
* [shardformer] update shardformer readme

[shardformer] update shardformer readme

[shardformer] update shardformer readme

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] change dataset

* [shardformer] change dataset

* [shardformer] fix CI

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

[example] update opt example

[example] resolve comments

fix

fix

* [example] llama2 add finetune example

* [example] llama2 add finetune example

* [example] llama2 add finetune example

* [example] llama2 add finetune example

* fix

* update llama2 example

* update llama2 example

* fix

* update llama2 example

* update llama2 example

* update llama2 example

* update llama2 example

* update llama2 example

* update llama2 example

* Update requirements.txt

* update llama2 example

* update llama2 example

* update llama2 example
2023-09-15 18:45:44 +08:00
Xuanlei Zhao ac2797996b
[shardformer] add custom policy in hybrid parallel plugin (#4718)
* add custom policy

* update assert
2023-09-15 17:53:13 +08:00
Baizhou Zhang f911d5b09d
[doc] Add user document for Shardformer (#4702)
* create shardformer doc files

* add docstring for seq-parallel

* update ShardConfig docstring

* add links to llama example

* add outdated massage

* finish introduction & supporting information

* finish 'how shardformer works'

* finish shardformer.md English doc

* fix doctest fail

* add Chinese document
2023-09-15 10:56:39 +08:00
flybird11111 20190b49a5
[shardformer] to fix whisper test failed due to significant accuracy differences. (#4710)
* [shardformer] fix whisper test failed

* [shardformer] fix whisper test failed

* [shardformer] fix whisper test failed

* [shardformer] fix whisper test failed
2023-09-14 21:34:20 +08:00
Yuanheng Zhao e2c0e7f92a
[hotfix] Fix import error: colossal.kernel without triton installed (#4722)
* [hotfix] remove triton kernels from kernel init

* revise bloom/llama kernel imports for infer
2023-09-14 18:03:55 +08:00
flybird11111 c7d6975d29
[shardformer] fix GPT2DoubleHeadsModel (#4703) 2023-09-13 15:57:16 +08:00
digger yu 9c2feb2f0b
fix some typo with colossalai/device colossalai/tensor/ etc. (#4171)
Co-authored-by: flybird11111 <1829166702@qq.com>
2023-09-12 17:41:52 +08:00
Baizhou Zhang d8ceeac14e
[hotfix] fix typo in hybrid parallel io (#4697) 2023-09-12 17:32:19 +08:00
flybird11111 8844691f4b
[shardformer] update shardformer readme (#4689)
* [shardformer] update shardformer readme

* [shardformer] update shardformer readme

* [shardformer] update shardformer readme

* [shardformer] update shardformer readme

* [shardformer] update shardformer readme
2023-09-12 15:14:24 +08:00
Baizhou Zhang 1d454733c4
[doc] Update booster user documents. (#4669)
* update booster_api.md

* update booster_checkpoint.md

* update booster_plugins.md

* move transformers importing inside function

* fix Dict typing

* fix autodoc bug

* small fix
2023-09-12 10:47:23 +08:00
Cuiqing Li bce0f16702
[Feature] The first PR to Add TP inference engine, kv-cache manager and related kernels for our inference system (#4577)
* [infer] Infer/llama demo (#4503)

* add

* add infer example

* finish

* finish

* stash

* fix

* [Kernels]  add inference token attention kernel (#4505)

* add token forward

* fix tests

* fix comments

* add try import triton

* add adapted license

* add tests check

* [Kernels] add necessary kernels (llama & bloom) for attention forward and kv-cache manager  (#4485)

* added _vllm_rms_norm

* change place

* added tests

* added tests

* modify

* adding kernels

* added tests:

* adding kernels

* modify

* added

* updating kernels

* adding tests

* added tests

* kernel change

* submit

* modify

* added

* edit comments

* change name

* change commnets and fix import

* add

* added

* combine codes (#4509)

* [feature] add KV cache manager for llama & bloom inference (#4495)

* add kv cache memory manager

* add stateinfo during inference

* format

* format

* rename file

* add kv cache test

* revise on BatchInferState

* file dir change

* [Bug FIx] import llama context ops fix (#4524)

* added _vllm_rms_norm

* change place

* added tests

* added tests

* modify

* adding kernels

* added tests:

* adding kernels

* modify

* added

* updating kernels

* adding tests

* added tests

* kernel change

* submit

* modify

* added

* edit comments

* change name

* change commnets and fix import

* add

* added

* fix

* add ops into init.py

* add

* [Infer] Add TPInferEngine and fix file path (#4532)

* add engine for TP inference

* move file path

* update path

* fix TPInferEngine

* remove unused file

* add engine test demo

* revise TPInferEngine

* fix TPInferEngine, add test

* fix

* Add Inference test for llama (#4508)

* add kv cache memory manager

* add stateinfo during inference

* add

* add infer example

* finish

* finish

* format

* format

* rename file

* add kv cache test

* revise on BatchInferState

* add inference test for llama

* fix conflict

* feature: add some new features for llama engine

* adapt colossalai triton interface

* Change the parent class of llama  policy

* add nvtx

* move llama inference code to tensor_parallel

* fix __init__.py

* rm tensor_parallel

* fix: fix bugs in auto_policy.py

* fix:rm some unused codes

* mv colossalai/tpinference to colossalai/inference/tensor_parallel

* change __init__.py

* save change

* fix engine

* Bug fix: Fix hang

* remove llama_infer_engine.py

---------

Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [infer] Add Bloom inference policy and replaced methods (#4512)

* add bloom inference methods and policy

* enable pass BatchInferState from model forward

* revise bloom infer layers/policies

* add engine for inference (draft)

* add test for bloom infer

* fix bloom infer policy and flow

* revise bloom test

* fix bloom file path

* remove unused codes

* fix bloom modeling

* fix dir typo

* fix trivial

* fix policy

* clean pr

* trivial fix

* Revert "[infer] Add Bloom inference policy and replaced methods (#4512)" (#4552)

This reverts commit 17cfa57140.

* [Doc] Add colossal inference doc (#4549)

* create readme

* add readme.md

* fix typos

* [infer] Add Bloom inference policy and replaced methods (#4553)

* add bloom inference methods and policy

* enable pass BatchInferState from model forward

* revise bloom infer layers/policies

* add engine for inference (draft)

* add test for bloom infer

* fix bloom infer policy and flow

* revise bloom test

* fix bloom file path

* remove unused codes

* fix bloom modeling

* fix dir typo

* fix trivial

* fix policy

* clean pr

* trivial fix

* trivial

* Fix Bugs In Llama Model Forward (#4550)

* add kv cache memory manager

* add stateinfo during inference

* add

* add infer example

* finish

* finish

* format

* format

* rename file

* add kv cache test

* revise on BatchInferState

* add inference test for llama

* fix conflict

* feature: add some new features for llama engine

* adapt colossalai triton interface

* Change the parent class of llama  policy

* add nvtx

* move llama inference code to tensor_parallel

* fix __init__.py

* rm tensor_parallel

* fix: fix bugs in auto_policy.py

* fix:rm some unused codes

* mv colossalai/tpinference to colossalai/inference/tensor_parallel

* change __init__.py

* save change

* fix engine

* Bug fix: Fix hang

* remove llama_infer_engine.py

* bug fix: fix bugs about infer_state.is_context_stage

* remove pollcies

* fix: delete unused code

* fix: delete unused code

* remove unused coda

* fix conflict

---------

Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [doc] add colossal inference fig (#4554)

* create readme

* add readme.md

* fix typos

* upload fig

* [NFC] fix docstring for colossal inference (#4555)

Fix docstring and comments in kv cache manager and bloom modeling

* fix docstring in llama modeling (#4557)

* [Infer] check import vllm (#4559)

* change import vllm

* import apply_rotary_pos_emb

* change import location

* [DOC] add installation req (#4561)

* add installation req

* fix

* slight change

* remove empty

* [Feature] rms-norm transfer into inference llama.py  (#4563)

* add installation req

* fix

* slight change

* remove empty

* add rmsnorm polciy

* add

* clean codes

* [infer] Fix tp inference engine (#4564)

* fix engine prepare data

* add engine test

* use bloom for testing

* revise on test

* revise on test

* reset shardformer llama (#4569)

* [infer] Fix engine - tensors on different devices (#4570)


* fix diff device in engine

* [codefactor] Feature/colossal inference (#4579)

* code factors

* remove

* change coding (#4581)

* [doc] complete README of colossal inference (#4585)

* complete fig

* Update README.md

* [doc]update readme (#4586)

* update readme

* Update README.md

* bug fix: fix bus in llama and bloom (#4588)

* [BUG FIX]Fix test engine in CI and non-vllm kernels llama forward  (#4592)

* fix tests

* clean

* clean

* fix bugs

* add

* fix llama non-vllm kernels bug

* modify

* clean codes

* [Kernel]Rmsnorm fix (#4598)

* fix tests

* clean

* clean

* fix bugs

* add

* fix llama non-vllm kernels bug

* modify

* clean codes

* add triton rmsnorm

* delete vllm kernel flag

* [Bug Fix]Fix bugs in llama (#4601)

* fix tests

* clean

* clean

* fix bugs

* add

* fix llama non-vllm kernels bug

* modify

* clean codes

* bug fix: remove rotary_positions_ids

---------

Co-authored-by: cuiqing.li <lixx3527@gmail.com>

* [kernel] Add triton layer norm & replace norm for bloom (#4609)

* add layernorm for inference

* add test for layernorm kernel

* add bloom layernorm replacement policy

* trivial: path

* [Infer] Bug fix rotary embedding in llama (#4608)

* fix rotary embedding

* delete print

* fix init seq len bug

* rename pytest

* add benchmark for llama

* refactor codes

* delete useless code

* [bench] Add bloom inference benchmark (#4621)

* add bloom benchmark

* readme - update benchmark res

* trivial - uncomment for testing (#4622)

* [Infer] add check triton and cuda version for tests (#4627)

* fix rotary embedding

* delete print

* fix init seq len bug

* rename pytest

* add benchmark for llama

* refactor codes

* delete useless code

* add check triton and cuda

* Update sharder.py (#4629)

* [Inference] Hot fix some bugs and typos (#4632)

* fix

* fix test

* fix conflicts

* [typo]Comments fix (#4633)

* fallback

* fix commnets

* bug fix: fix some bugs in test_llama and test_bloom (#4635)

* [Infer] delete benchmark in tests and fix bug for llama and bloom (#4636)

* fix rotary embedding

* delete print

* fix init seq len bug

* rename pytest

* add benchmark for llama

* refactor codes

* delete useless code

* add check triton and cuda

* delete benchmark and fix infer bugs

* delete benchmark for tests

* delete useless code

* delete bechmark function in utils

* [Fix] Revise TPInferEngine, inference tests and benchmarks (#4642)

* [Fix] revise TPInferEngine methods and inference tests

* fix llama/bloom infer benchmarks

* fix infer tests

* trivial fix: benchmakrs

* trivial

* trivial: rm print

* modify utils filename for infer ops test (#4657)

* [Infer] Fix TPInferEngine init & inference tests, benchmarks (#4670)

* fix engine funcs

* TPInferEngine: receive shard config in init

* benchmarks: revise TPInferEngine init

* benchmarks: remove pytest decorator

* trivial fix

* use small model for tests

* [NFC] use args for infer benchmarks (#4674)

* revise infer default (#4683)

* [Fix] optimize/shard model in TPInferEngine init (#4684)

* remove using orig model in engine

* revise inference tests

* trivial: rename

---------

Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
2023-09-12 01:22:56 +08:00