Steve Luo | 7806842f2d | add paged-attention v2: support seq length split across thread blocks (#5707) | 2024-05-14 12:46:54 +08:00
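The split this commit names is the flash-decoding style partitioning of the KV sequence, where each thread block reduces one chunk and the per-chunk partials are merged in a second pass. Below is a minimal NumPy sketch of that two-pass reduction; the actual change is a CUDA kernel, and every name here is illustrative rather than taken from the PR.

```python
import numpy as np

def split_kv_attention(q, k, v, block_size=128):
    """Single-query attention with the KV sequence split into chunks.

    Each chunk (one thread block's share, in the kernel analogy) keeps a
    local max, exp-sum, and weighted-value accumulator; a second pass
    rescales the partials to a global max and merges them exactly.
    """
    scale = 1.0 / np.sqrt(q.shape[0])
    partials = []
    for start in range(0, k.shape[0], block_size):
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        s = (kb @ q) * scale          # chunk-local attention scores
        m = s.max()                   # chunk-local max for stability
        e = np.exp(s - m)
        partials.append((m, e.sum(), e @ vb))
    m_g = max(m for m, _, _ in partials)
    denom = sum(l * np.exp(m - m_g) for m, l, _ in partials)
    acc = sum(o * np.exp(m - m_g) for m, _, o in partials)
    return acc / denom

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=64)
    k, v = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
    s = (k @ q) / np.sqrt(64)
    ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ v
    assert np.allclose(split_kv_attention(q, k, v), ref)
```

The chunk-local max and exp-sum bookkeeping is what makes the merge exact, so splitting the sequence changes only the parallelism, not the result.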
yuehuayingxueluo | 3c91e3f176 | [Inference] Adapt to Baichuan2 13B (#5614) | 2024-04-25 23:11:30 +08:00
* adapt to Baichuan2 13B
* adapt to Baichuan2 13B
* change BAICHUAN_MODEL_NAME_OR_PATH
* fix test_decoding_attn.py
* modifications based on review comments
* change BAICHUAN_MODEL_NAME_OR_PATH
* move attn mask processing to test flash decoding
* move get_alibi_slopes to baichuan modeling
* fix bugs in test_baichuan.py
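The get_alibi_slopes mentioned above computes the per-head bias slopes for ALiBi, which the 13B Baichuan2 variant uses in place of rotary embeddings. A standalone sketch of the standard slope schedule from the ALiBi paper follows; it is the reference recipe, not necessarily the exact code moved in this PR.

```python
import math

def get_alibi_slopes(num_heads):
    """Per-head ALiBi slopes, following the schedule from the ALiBi paper.

    For a power-of-two head count n the slopes are (2^(-8/n))^(1..n);
    otherwise the list is padded with interleaved slopes drawn from the
    schedule for twice the closest power of two.
    """
    closest = 2 ** math.floor(math.log2(num_heads))
    base = 2.0 ** (-8.0 / closest)
    slopes = [base ** (i + 1) for i in range(closest)]
    if closest != num_heads:
        extra_base = 2.0 ** (-4.0 / closest)
        slopes += [extra_base ** (2 * i + 1) for i in range(num_heads - closest)]
    return slopes

# Baichuan2 13B has 40 attention heads, so the non-power-of-two branch
# is exercised.
slopes = get_alibi_slopes(40)
```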
Jianghai | 1f8c7e7046 | [Inference] User Experience: update the logic of default tokenizer and generation config. (#5337) | 2024-02-07 17:55:48 +08:00
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
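A plausible shape for the default-fallback logic this title describes, sketched with the Hugging Face transformers API; the helper name and the exact fallback order are assumptions, not the PR's code.

```python
from transformers import AutoTokenizer, GenerationConfig

def resolve_defaults(model_name_or_path, tokenizer=None, generation_config=None):
    # Hypothetical helper: fall back to the checkpoint's own tokenizer
    # when the user does not supply one.
    if tokenizer is None:
        tokenizer = AutoTokenizer.from_pretrained(
            model_name_or_path, trust_remote_code=True
        )
    # Decoder-only checkpoints often ship without a pad token; reuse EOS.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # Prefer the generation config bundled with the checkpoint, falling
    # back to library defaults when the repo does not include one.
    if generation_config is None:
        try:
            generation_config = GenerationConfig.from_pretrained(model_name_or_path)
        except OSError:
            generation_config = GenerationConfig()
        generation_config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, generation_config
```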
yuehuayingxueluo | 4f28cb43c0 | [Inference] Optimize the usage of the mid tensor space in flash attn (#5304) | 2024-01-26 14:00:10 +08:00
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adaptation to get_xine_cache
* add comment
* fix CI bugs
* fix some code
* remove duplicated code
* remove duplicated code
* fix code style
* add _get_dtype in config.py
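The mid tensors here are the per-split partial outputs and log-sum-exp values that flash-decoding produces before its final reduction. A common way to optimize their usage is to preallocate them once and reuse sliced views across decoding steps; below is a hedged PyTorch sketch of that pattern, with class and attribute names that are illustrative, not taken from the PR.

```python
import torch

class MidTensorWorkspace:
    """Preallocated intermediate buffers for flash-decoding.

    The per-split partial outputs and their log-sum-exp terms are sized
    once for the maximum batch, then handed out as sliced views, so
    decoding steps reuse the same storage instead of reallocating.
    """
    def __init__(self, max_batch, num_heads, kv_splits, head_dim,
                 device="cuda", dtype=torch.float16):
        self.mid_output = torch.empty(
            max_batch, num_heads, kv_splits, head_dim,
            device=device, dtype=dtype,
        )
        self.mid_output_lse = torch.empty(
            max_batch, num_heads, kv_splits, device=device, dtype=dtype
        )

    def get(self, batch_size):
        # Views into preallocated storage; no new device allocation.
        return self.mid_output[:batch_size], self.mid_output_lse[:batch_size]
```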