Xu Kai | fd6482ad8c | [inference] Refactor inference architecture (#5057)
* [inference] support only TP (#4998)
* support only tp
* enable tp
* add support for bloom (#5008)
* [refactor] refactor gptq and smoothquant llama (#5012)
* refactor gptq and smoothquant llama
* fix import error
* fix linear import torch-int
* fix smoothquant llama import error
* fix import accelerate error
* fix bug
* fix import smooth cuda
* fix smoothcuda
* [Inference Refactor] Merge chatglm2 with pp and tp (#5023)
merge chatglm with pp and tp
* [Refactor] remove useless inference code (#5022)
* remove useless code
* fix quant model
* fix test import bug
* mv original inference legacy
* fix chatglm2
* [Refactor] refactor policy search and quant type controlling in inference (#5035)
* [Refactor] refactor policy search and quant type controlling in inference
* [inference] update readme (#5051)
* update readme
* update readme
* fix architecture
* fix table
* fix table
* [inference] update example (#5053)
* update example
* fix run.sh
* fix rebase bug
* fix some errors
* update readme
* add some features
* update interface
* update readme
* update benchmark
* add requirements-infer
---------
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
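The "support only TP" commits above add tensor parallelism to the inference path. As a rough illustration (not this repo's actual code), a column-parallel linear layer splits the weight's output dimension across ranks, each rank computes its slice locally, and an all-gather reassembles the full output; the sketch below simulates the ranks with numpy, and all names in it are hypothetical:

```python
import numpy as np

def column_parallel_linear(x, w, world_size=2):
    """Simulated column-parallel linear: x is (batch, in_dim), w is (in_dim, out_dim).

    Each of `world_size` simulated ranks holds one column shard of w and
    computes its slice of the output; concatenation stands in for all-gather.
    """
    shards = np.split(w, world_size, axis=1)   # one weight shard per rank
    partials = [x @ s for s in shards]         # each rank's local matmul
    return np.concatenate(partials, axis=1)    # "all-gather" the output slices

# The sharded computation matches the unsharded x @ w exactly.
x = np.random.randn(4, 8)
w = np.random.randn(8, 16)
assert np.allclose(column_parallel_linear(x, w), x @ w)
```

Only the matmul is sharded here; a real implementation also shards attention heads and inserts communication collectives between layers.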
2023-11-19 21:05:05 +08:00
Xu Kai | 785802e809 | [inference] add reference and fix some bugs (#4937)
* add reference and fix some bugs
* update gptq init
---------
Co-authored-by: Xu Kai <xukai16@foxamil.com>
2023-10-20 13:39:34 +08:00
Xu Kai | 611a5a80ca | [inference] Add smoothquant for llama (#4904)
* [inference] add int8 rotary embedding kernel for smoothquant (#4843)
* [inference] add smoothquant llama attention (#4850)
* add smoothquant llama attention
* remove useless code
* remove useless code
* fix import error
* rename file name
* [inference] add silu linear fusion for smoothquant llama mlp (#4853)
* add silu linear
* update skip condition
* catch smoothquant cuda lib exception
* process exceptions for tests
* [inference] add llama mlp for smoothquant (#4854)
* add llama mlp for smoothquant
* fix down out scale
* remove duplicate lines
* add llama mlp check
* delete useless code
* [inference] add smoothquant llama (#4861)
* add smoothquant llama
* fix attention accuracy
* fix accuracy
* add kv cache and save pretrained
* refactor example
* delete smooth
* refactor code
* [inference] add smooth function and delete useless code for smoothquant (#4895)
* add smooth function and delete useless code
* update datasets
* remove duplicate import
* delete useless file
* refactor codes (#4902)
* refactor code
* add license
* add torch-int and smoothquant license
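The "add smooth function" commit above refers to SmoothQuant's core trick: per-channel scales migrate activation outliers into the weights so both tensors quantize well to int8. A minimal sketch of that smoothing step, based on the SmoothQuant technique rather than this repo's exact implementation (the function and variable names are assumptions):

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    """SmoothQuant-style smoothing (sketch): x is (tokens, in_dim), w is (in_dim, out_dim).

    Per-input-channel scales s shrink activation outliers and grow the
    corresponding weight rows, so (x / s) @ (s * w) == x @ w exactly;
    quantization error only appears once int8 rounding is applied afterwards.
    """
    act_scale = np.abs(x).max(axis=0)             # per-channel activation range
    w_scale = np.abs(w).max(axis=1)               # per-channel weight range
    s = act_scale**alpha / w_scale**(1 - alpha)   # alpha controls migration strength
    return x / s, w * s[:, None]

# Smoothing is numerically equivalent before quantization.
x = np.random.randn(32, 8) * 10.0   # activations with large dynamic range
w = np.random.randn(8, 4)
xs, ws = smooth(x, w)
assert np.allclose(xs @ ws, x @ w)
```

In the actual kernels referenced by these commits, the smoothed activations and weights are then quantized to int8 and executed with fused CUDA kernels (rotary embedding, SiLU-gated MLP, attention) via torch-int.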
2023-10-16 11:28:44 +08:00