yuehuayingxueluo | 87079cffe8 | 2024-03-25 13:40:34 +08:00
[Inference] Support FP16/BF16 Flash Attention 2 and add high_precision flag to rotary embedding (#5461)
* Support FP16/BF16 Flash Attention 2
* fix bugs in test_kv_cache_memcpy.py
* add context_kv_cache_memcpy_kernel.cu
* rm typename MT
* add tail process
* add high_precision
* add high_precision to config.py
* rm unused code
* change the comment for the high_precision parameter
* update test_rotary_embdding_unpad.py
* fix vector_copy_utils.h
* add comment for self.high_precision when using float32

yuehuayingxueluo | 2a718c8be8 | 2024-02-21 13:23:57 +08:00
Optimize the execution interval between CUDA kernels caused by view and memcpy (#5390)
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs

Jianghai | 730103819d | 2024-02-21 11:31:48 +08:00
[Inference] Fused KV copy into rotary calculation (#5383)
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del

yuehuayingxueluo | 8c69debdc7 | 2024-02-08 15:27:26 +08:00
[Inference] Support vLLM testing in benchmark scripts (#5379)
* add vllm benchmark scripts
* fix code style
* update run_benchmark.sh
* fix code style

yuehuayingxueluo | 631862f339 | 2024-02-02 15:38:21 +08:00
[Inference] Optimize the generation process of the inference engine (#5356)
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* roll back test_inference_engine.py

yuehuayingxueluo | 21ad4a27f9 | 2024-02-02 15:06:01 +08:00
[Inference/opt] Optimize the intermediate tensor of RMS Norm (#5350)
* opt rms_norm
* fix bugs in rms_layernorm

FrankLeeeee | c565519913 | 2024-01-31 10:41:47 +08:00
merge commit

yuehuayingxueluo | 4f28cb43c0 | 2024-01-26 14:00:10 +08:00
[inference] Optimize the usage of intermediate tensor space in flash attn (#5304)
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some code
* rm duplicated code
* rm duplicated code
* fix code style
* add _get_dtype in config.py

yuehuayingxueluo | bfff9254ac | 2024-01-22 10:55:34 +08:00
[inference] Adapted to Rotary Embedding and RMS Norm (#5283)
* adapted to rotary_embedding
* adapted to nopad rms norm
* fix bugs in benchmark
* fix flash_decoding.py

Jianghai | 9e2342bde2 | 2024-01-18 16:31:14 +08:00
[Hotfix] Fix bugs in testing continuous batching (#5270)
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func

yuehuayingxueluo | 86b63f720c | 2024-01-17 16:03:10 +08:00
[Inference] Adapted to the Triton attn kernels (#5264)
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print

Hongxin Liu | d202cc28c0 | 2024-01-09 10:20:05 +08:00
[npu] change device to accelerator api (#5239)
* update accelerator
* fix timer
* fix amp
* update
* fix
* update bug
* add error raise
* fix autocast
* fix set device
* remove doc accelerator
* update doc
* update doc
* update doc
* use nullcontext
* update cpu
* update null context
* change time limit for example
* update
* update
* update
* update
* [npu] polish accelerator code
---------
Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>

Hongxin Liu | 1cd7efc520 | 2023-11-21 10:46:03 +08:00
[inference] refactor examples and fix schedule (#5077)
* [setup] refactor infer setup
* [hotfix] fix inference behavior on 1 1 gpu
* [example] refactor inference examples