char-1ee | 5f398fc000 | 2024-06-07 08:33:52 +00:00
Pass inference model shard configs for module init
Signed-off-by: char-1ee <xingjianli59@gmail.com>

Runyu Lu | 18d67d0e8e | 2024-05-14 10:00:55 +08:00
[Feat] Inference RPC Server Support (#5705)
* RPC support source
* KV cache logical/physical disaggregation
* sampler refactor
* colossalai launch built in
* unit test
* RPyC support
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
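
The RPC server above is built on RPyC, splitting the engine frontend from the model worker and disaggregating the KV cache into client-side logical handles and worker-side physical memory. A minimal sketch of the RPyC service pattern, assuming a hypothetical generate() entry point (the actual ColossalAI service interface may differ):

```python
# Sketch of an RPyC-based inference worker. The service name, port, and
# generate() signature are illustrative assumptions, not ColossalAI's API.
import rpyc
from rpyc.utils.server import ThreadedServer

class InferenceWorkerService(rpyc.Service):
    # Methods prefixed with exposed_ become callable from the client side.
    def exposed_generate(self, prompt: str, max_new_tokens: int = 16) -> str:
        # A real worker would run the sharded model and its physical KV
        # cache here; this stub just echoes to keep the sketch runnable.
        return f"{prompt} [+{max_new_tokens} generated tokens]"

if __name__ == "__main__":
    ThreadedServer(InferenceWorkerService, port=18861).start()
```

A client would then call rpyc.connect("localhost", 18861).root.generate("hello", 8); RPyC strips the exposed_ prefix on the connection's root object.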

Yuanheng Zhao | 04863a9b14 | 2024-04-23 22:23:07 +08:00
[example] Update Llama Inference example (#5629)
* [example] add inference benchmark for Llama 3
* revise inference config - arg
* remove unused args
* add Llama generation demo script
* fix RoPE init in Llama policy (see the sketch after this entry)
* add benchmark-llama3 - cleanup
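
For background on the RoPE init fix: Llama-family models precompute rotary-embedding tables from per-dimension inverse frequencies. A minimal sketch of that standard setup (base 10000.0 is the usual Llama default; this is illustrative background, not the patched policy code):

```python
import torch

def build_rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    """Standard rotary-embedding tables for Llama-style attention."""
    # One inverse frequency per pair of head dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    # Outer product: one rotation angle per (position, frequency) pair.
    angles = torch.outer(positions, inv_freq)
    return torch.cos(angles), torch.sin(angles)

cos, sin = build_rope_cache(seq_len=2048, head_dim=128)
```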

Runyu Lu | e37ee2fb65 | 2024-04-18 16:56:46 +08:00
[Feat] Tensor Model Parallel Support For Inference (#5563)
* tensor parallel support, naive source
* [fix] precision, model load, and framework refactor
* add TP unit test
* docstring
* fix do_sample
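
Tensor model parallelism shards each weight matrix across ranks; in a column-parallel linear, every GPU holds a slice of the output features and the partial results are concatenated. A minimal single-process sketch of the sharding math (illustrative only, not ColossalAI's implementation):

```python
import torch

def column_parallel_split(weight: torch.Tensor, world_size: int, rank: int):
    """Slice a [out_features, in_features] weight along its output dim."""
    out_features = weight.shape[0]
    assert out_features % world_size == 0, "out_features must divide evenly"
    shard = out_features // world_size
    return weight[rank * shard:(rank + 1) * shard]

# Two simulated ranks each compute a slice of the projection; concatenating
# the partial outputs reproduces the full, unsharded result.
w = torch.randn(8, 4)
x = torch.randn(2, 4)
parts = [x @ column_parallel_split(w, 2, r).T for r in range(2)]
assert torch.allclose(torch.cat(parts, dim=-1), x @ w.T)
```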

Yuanheng Zhao | 4bb5d8923a | 2024-04-02 14:16:59 +08:00
[Fix/Inference] Remove unused and non-functional functions (#5543)
* [fix] remove unused func
* rm non-functional partial

Steve Luo | f7aecc0c6b | 2024-03-08 16:21:12 +08:00
feat: add RMSNorm CUDA kernel with unit test and benchmark script (#5417)
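
For reference, the computation such a kernel implements: RMSNorm scales each hidden vector by the reciprocal of its root mean square, with no mean subtraction. A minimal PyTorch equivalent (the CUDA kernel fuses this into one pass; the sketch only shows the math):

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    """Reference RMSNorm: y = x / rms(x) * weight."""
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

x = torch.randn(2, 4096)
w = torch.ones(4096)
y = rms_norm(x, w)
```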

yuehuayingxueluo | 2a718c8be8 | 2024-02-21 13:23:57 +08:00
Optimize the execution gaps between CUDA kernels caused by view and memcopy (#5390)
* optimize view and memcopy
* fix bugs in CI
* fix CI bugs
* update benchmark scripts
* fix CI bugs
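
The kind of gap this targets: an unnecessary copy (for example, growing a buffer with torch.cat every decoding step) launches an extra memcopy kernel between compute kernels. A generic illustrative sketch of removing one such copy by preallocating (not the patched ColossalAI code):

```python
import torch

steps, dim = 4, 8
new_entries = [torch.randn(1, dim) for _ in range(steps)]

# Before: torch.cat reallocates and copies the whole buffer every step,
# inserting an extra memcopy kernel between the compute kernels.
cache = torch.empty(0, dim)
for t in new_entries:
    cache = torch.cat([cache, t], dim=0)

# After: preallocate once and write in place; each step is a small copy
# into an existing buffer, with no reallocation inside the loop.
prealloc = torch.empty(steps, dim)
for i, t in enumerate(new_entries):
    prealloc[i] = t

assert torch.equal(cache, prealloc)
```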

yuehuayingxueluo | 21ad4a27f9 | 2024-02-02 15:06:01 +08:00
[Inference/opt] Optimize the intermediate tensor of RMS Norm (#5350)
* optimize rms_norm
* fix bugs in rms_layernorm
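
The "intermediate tensor" here is the temporary a naive RMSNorm allocates for its normalized values on every call; writing into a caller-provided buffer removes that per-call allocation. A generic sketch of the pattern (illustrative, not the ColossalAI kernel):

```python
import torch

def rms_norm_out(x, weight, out, eps: float = 1e-6):
    """RMSNorm that writes into a reusable buffer instead of allocating
    a fresh intermediate tensor on every call."""
    torch.mul(x, torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps), out=out)
    out.mul_(weight)
    return out

x = torch.randn(2, 4096)
w = torch.ones(4096)
buf = torch.empty_like(x)  # allocated once, reused across decoding steps
rms_norm_out(x, w, buf)
```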

yuehuayingxueluo | 249644c23b | 2024-02-01 15:49:39 +08:00
[Inference] Replace the Attention and MLP layers via Shardformer to optimize the weight transpose operation; add fused_qkv and fused linear_add (#5340)
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor code
* fix CI bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* remove the dependency on LlamaFlashAttention2
* roll back test_inference_engine.py
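
Fused QKV replaces three separate projections with one matmul whose output is split into Q, K, and V, cutting kernel launches and reading the activations once. A minimal sketch of the equivalence (generic PyTorch, not the ColossalAI layer):

```python
import torch

hidden = 64
x = torch.randn(2, hidden)

# Unfused: three separate projections, three matmul kernels per layer.
wq, wk, wv = (torch.randn(hidden, hidden) for _ in range(3))
q, k, v = x @ wq, x @ wk, x @ wv

# Fused: one concatenated weight, one matmul, then a cheap split.
w_qkv = torch.cat([wq, wk, wv], dim=1)          # [hidden, 3 * hidden]
q2, k2, v2 = (x @ w_qkv).split(hidden, dim=-1)

assert torch.allclose(q, q2) and torch.allclose(k, k2) and torch.allclose(v, v2)
```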

yuehuayingxueluo | e8f0642f28 | 2024-01-30 10:31:46 +08:00
[Inference] Add Nopadding Llama Modeling (#5327)
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused code
* fix bugs in test_xine_copy.py
* fix code style
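
No-padding modeling packs variable-length sequences into one flat token stream tracked by cumulative sequence lengths, instead of padding every sequence to the longest one and wasting compute on pad positions. A minimal sketch of the packing idea (illustrative; the ColossalAI modeling code is more involved):

```python
import torch

# Three prompts of different lengths, as token id tensors.
seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6, 7, 8, 9])]

# Padded batch: shape [3, 4], with wasted work on the pad positions.
padded = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)
assert padded.shape == (3, 4)

# No-padding batch: one flat [9] stream plus cumulative lengths that let
# attention recover each sequence's boundaries.
packed = torch.cat(seqs)                  # [9], no pad tokens at all
cu_seqlens = torch.tensor([0, 3, 5, 9])   # sequence boundaries

# Sequence i is packed[cu_seqlens[i]:cu_seqlens[i + 1]].
assert torch.equal(packed[cu_seqlens[1]:cu_seqlens[2]], seqs[1])
```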