Jianghai
|
730103819d
|
[Inference] Fused kv copy into rotary calculation (#5383)
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del
|
2024-02-21 11:31:48 +08:00 |
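The fused-rotary commit above folds the copy of the new key/value tokens into the rotary-embedding kernel, so a single Triton launch (colossalai/kernel/triton/no_pad_rotary_embedding.py) both rotates q/k and writes k/v into the blocked KV cache, removing a separate memcpy pass. Below is only a plain-PyTorch sketch of the fused decode-step semantics; the function name, cache layout, and block_tables shape are illustrative assumptions, not the project's actual kernel signature.

```python
import torch

def rotary_embed_and_cache(q, k, v, cos, sin, k_cache, v_cache,
                           block_tables, seq_lens, block_size):
    """Reference semantics of a fused rotary + kv-cache-copy decode step
    (one new token per sequence). All shapes/names are illustrative:
      q, k, v:           [bsz, num_heads, head_dim] new-token projections
      cos, sin:          [bsz, head_dim // 2] rotary tables for current positions
      k_cache, v_cache:  [num_blocks, num_heads, block_size, head_dim] blocked cache
      block_tables:      [bsz, max_blocks_per_seq] logical -> physical block ids
      seq_lens:          [bsz] sequence lengths *including* the new token
    """
    half = q.shape[-1] // 2

    def rotate(x):
        # LLaMA-style half-split rotary embedding.
        x1, x2 = x[..., :half], x[..., half:]
        c, s = cos.unsqueeze(1), sin.unsqueeze(1)   # broadcast over heads
        return torch.cat((x1 * c - x2 * s, x1 * s + x2 * c), dim=-1)

    q_rot, k_rot = rotate(q), rotate(k)

    # The fused part: the rotated k and the untouched v go straight into the
    # blocked cache here, instead of being returned for a separate copy kernel.
    for i in range(q.shape[0]):
        pos = int(seq_lens[i]) - 1                  # slot of the new token
        block = int(block_tables[i, pos // block_size])
        offset = pos % block_size
        k_cache[block, :, offset, :] = k_rot[i]
        v_cache[block, :, offset, :] = v[i]
    return q_rot
```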
yuehuayingxueluo
|
8c69debdc7
|
[Inference] Support vllm testing in benchmark scripts (#5379)
* add vllm benchmark scripts
* fix code style
* update run_benchmark.sh
* fix code style
|
2024-02-08 15:27:26 +08:00 |
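The benchmark commit above adds vLLM as a comparison backend. A minimal sketch of a vLLM throughput measurement using its offline generate API is shown below; the model path, batch size, and generation length are placeholders, and the repository's run_benchmark.sh wires these options up through its own scripts.

```python
# Minimal sketch of measuring generation throughput with vLLM's offline API.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")                  # placeholder model path
params = SamplingParams(max_tokens=128, temperature=0.0, ignore_eos=True)
prompts = ["Introduce some landmarks in Beijing."] * 16      # toy batch

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across the batch and report tokens per second.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tokens/s over {elapsed:.2f}s")
```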
Frank Lee
|
8106ede07f
|
Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)
This reverts commit 9f4ab2eb92.
|
2024-02-07 14:27:04 +08:00 |
Jianghai
|
9f4ab2eb92
|
[Inference] Adapt to Fused rotary (#5348)
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
|
2024-02-07 11:36:04 +08:00 |
yuehuayingxueluo
|
21ad4a27f9
|
[Inference/opt] Optimize the mid tensor of RMS Norm (#5350)
* opt rms_norm
* fix bugs in rms_layernorm
|
2024-02-02 15:06:01 +08:00 |
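The RMS-norm commit above targets the intermediate ("mid") tensors created during normalization. The real change is the fused Triton rms_layernorm kernel; the sketch below only contrasts the naive formulation, which materializes several temporaries, with a lower-allocation variant, assuming the usual LLaMA-style RMSNorm formula (rms_norm_low_mem and its out buffer are illustrative names, not the project's API).

```python
import torch

def rms_norm_naive(x, weight, eps=1e-6):
    # Naive version: materializes several intermediate ("mid") tensors
    # (x*x, the rsqrt factor, the normalized activations) in global memory.
    variance = x.pow(2).mean(-1, keepdim=True)
    x_normed = x * torch.rsqrt(variance + eps)
    return weight * x_normed

def rms_norm_low_mem(x, weight, out, eps=1e-6):
    # Sketch of the optimization direction: reuse one pre-allocated output
    # buffer and in-place ops so only a tiny per-row statistic is created.
    # (The commit itself does this inside a fused Triton kernel.)
    torch.mul(x, x, out=out)                                  # out = x * x
    inv_rms = out.mean(-1, keepdim=True).add_(eps).rsqrt_()   # per-row 1/rms
    torch.mul(x, inv_rms, out=out)                            # out = x / rms
    return out.mul_(weight)                                   # scale in place
```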
yuehuayingxueluo
|
249644c23b
|
[Inference] Replace Attention layer and MLP layer with shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340)
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
|
2024-02-01 15:49:39 +08:00 |
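The fused_qkv and fused linear_add changes above can be pictured in plain PyTorch: one GEMM against a concatenated q/k/v weight replaces three separate projections (and the repeated weight transposes they implied), and a residual add can be folded into a projection via addmm. The sketch below uses illustrative names and shapes, not the actual shardformer module code; in the real model the concatenated weight would be built once at load time, not per forward call.

```python
import torch

def fused_qkv_proj(hidden, wq, wk, wv):
    """One GEMM against a concatenated weight, then a split, instead of three
    separate q/k/v projections. Illustrative shapes:
    hidden [tokens, hidden_size], wq/wk/wv [hidden_size, proj_size]."""
    w_qkv = torch.cat((wq, wk, wv), dim=-1)   # [hidden_size, 3 * proj_size]
    qkv = hidden @ w_qkv                      # single matmul covers q, k and v
    return qkv.chunk(3, dim=-1)

def fused_linear_add(x, weight, residual):
    # Fused linear + residual add in a single addmm call:
    # residual + x @ weight, with no separate add pass.
    return torch.addmm(residual, x, weight)

# Usage: q, k, v = fused_qkv_proj(hidden_states, w_q, w_k, w_v)
```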
yuehuayingxueluo
|
4f28cb43c0
|
[inference] Optimize the usage of the mid tensor space in flash attn (#5304)
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
|
2024-01-26 14:00:10 +08:00 |
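The flash-attention commit above reduces per-step allocations by pre-allocating and reusing the intermediate buffers of the decode path and threading an explicit output tensor through the kernels. A rough sketch of the buffer-reuse idea, with illustrative class and field names rather than the engine's actual config:

```python
import torch

class DecodeWorkspace:
    """Sketch of reusing pre-allocated intermediate buffers across decoding
    steps instead of allocating fresh 'mid' tensors in the attention path
    every step. Names and sizes are illustrative."""
    def __init__(self, max_batch, num_heads, head_dim, dtype, device):
        # Sized once for the largest batch the engine will ever see.
        self.attn_out = torch.empty(max_batch, num_heads, head_dim,
                                    dtype=dtype, device=device)

    def output_slice(self, batch_size):
        # Hand the kernel a view into the persistent buffer; no new allocation.
        return self.attn_out[:batch_size]
```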
yuehuayingxueluo
|
86b63f720c
|
[Inference] Adapted to the triton attn kernels (#5264)
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print
|
2024-01-17 16:03:10 +08:00 |
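The commit above moves the engine onto the Triton attention kernels and a blocked (paged) KV cache addressed through per-sequence block tables, with copy_kv_to_blocked_cache as the write side. As a reference for the layout, the sketch below shows how the read side can gather one sequence's keys back out of such a cache; the tensor layout and function name are illustrative assumptions, not the kernels' signatures.

```python
import torch

def gather_k_from_blocked_cache(k_cache, block_tables, seq_len, batch_idx, block_size):
    """Reference of reading a blocked (paged) KV cache: the per-sequence row of
    block_tables maps logical blocks to physical blocks. Illustrative layout:
    k_cache [num_blocks, num_heads, block_size, head_dim]."""
    num_blocks_needed = (seq_len + block_size - 1) // block_size
    blocks = block_tables[batch_idx, :num_blocks_needed]       # physical block ids
    k = k_cache[blocks]                                        # [nblk, heads, blk, dim]
    k = k.transpose(0, 1).reshape(k_cache.shape[1], -1, k_cache.shape[-1])
    return k[:, :seq_len, :]                                   # [heads, seq_len, head_dim]
```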
Hongxin Liu
|
1cd7efc520
|
[inference] refactor examples and fix schedule (#5077)
* [setup] refactor infer setup
* [hotfix] fix inference behavior on 1 1 gpu
* [example] refactor inference examples
|
2023-11-21 10:46:03 +08:00 |
Bin Jia
|
4e3959d316
|
[hotfix/hybridengine] Fix init model with random parameters in benchmark (#5074)
* fix init model with random parameters
* fix example
|
2023-11-20 20:15:25 +08:00 |
Xu Kai
|
fd6482ad8c
|
[inference] Refactor inference architecture (#5057)
* [inference] support only TP (#4998)
* support only tp
* enable tp
* add support for bloom (#5008)
* [refactor] refactor gptq and smoothquant llama (#5012)
* refactor gptq and smoothquant llama
* fix import error
* fix linear import torch-int
* fix smoothquant llama import error
* fix import accelerate error
* fix bug
* fix import smooth cuda
* fix smoothcuda
* [Inference Refactor] Merge chatglm2 with pp and tp (#5023)
merge chatglm with pp and tp
* [Refactor] remove useless inference code (#5022)
* remove useless code
* fix quant model
* fix test import bug
* mv original inference legacy
* fix chatglm2
* [Refactor] refactor policy search and quant type controlling in inference (#5035)
* [Refactor] refactor policy search and quant type controlling in inference
* [inference] update readme (#5051)
* update readme
* update readme
* fix architecture
* fix table
* fix table
* [inference] update example (#5053)
* update example
* fix run.sh
* fix rebase bug
* fix some errors
* update readme
* add some features
* update interface
* update readme
* update benchmark
* add requirements-infer
---------
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
|
2023-11-19 21:05:05 +08:00 |