yuehuayingxueluo | 4f28cb43c0 | 2024-01-26 14:00:10 +08:00
[Inference] Optimize the usage of the mid tensors space in flash attn (#5304)
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adaptation to get_xine_cache
* add comment
* fix CI bugs
* fix some codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
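The mid-tensor optimization named in the PR above amounts to reusing one preallocated scratch buffer across attention calls instead of allocating a fresh tensor each time. A minimal sketch in plain Python, assuming a hypothetical `MidTensorPool` (the class name, `get_buffer`, and the list-based stand-in for a tensor are illustrative, not ColossalAI's actual API):

```python
class MidTensorPool:
    """Cache one scratch buffer per (shape, dtype) key and hand the same
    buffer back on every request, so repeated attention calls reuse the
    same storage rather than allocating a new intermediate tensor."""

    def __init__(self):
        self._buffers = {}

    def get_buffer(self, shape, dtype="float16"):
        key = (tuple(shape), dtype)
        if key not in self._buffers:
            # Stand-in for torch.empty(shape, dtype=...): a flat zeroed list.
            size = 1
            for dim in shape:
                size *= dim
            self._buffers[key] = [0.0] * size
        return self._buffers[key]

pool = MidTensorPool()
a = pool.get_buffer((4, 8))
b = pool.get_buffer((4, 8))  # same object as `a`: no new allocation
```

The design choice is the usual time/space trade: the pool holds peak-sized buffers for the lifetime of the model, in exchange for removing per-step allocation from the decoding loop.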
yuehuayingxueluo | fab294c7f4 | 2024-01-11 13:46:14 +00:00
fix CI bugs
Jianghai | e545a871b8 | 2024-01-11 13:46:14 +00:00
[Hotfix] Fix accuracy and align attention method API with Triton kernel (#5229)
* fix accuracy
* alignment in attention
* fix attention
* fix bugs
Jianghai | 0e616462a7 | 2024-01-11 13:39:56 +00:00
[Inference] Add logit processor and request handler (#5166)
* add logit processor and request handler
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix bugs
* add copy fun
* del useless attn
* fix request status
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
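A logit processor of the kind named in the PR above rescales or masks the next-token logits before sampling. A minimal sketch, assuming generic temperature and top-k processors (the function names and the pure-Python softmax are illustrative, not the PR's actual implementation):

```python
import math

def temperature_processor(logits, temperature=1.0):
    """Scale logits by temperature; values below 1 sharpen the distribution."""
    return [l / temperature for l in logits]

def top_k_processor(logits, k):
    """Keep the k largest logits and mask the rest to -inf."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [l if l >= threshold else float("-inf") for l in logits]

def softmax(logits):
    """Numerically stable softmax; exp(-inf) contributes exactly 0."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Processors are applied in sequence, then the result is normalized.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(top_k_processor(temperature_processor(logits, 0.7), 2))
```

In serving frameworks such processors are typically chained per request, with the request handler holding each sequence's sampling parameters and applying its processor chain at every decoding step.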