Camille Zhong | a5756a8720 | [eval] update llama npu eval (#5366) | 2024-02-06 10:53:03 +08:00
Camille Zhong | 44ca61a22b | [llama] fix neftune & pbar with start_step (#5364) | 2024-02-05 18:04:23 +08:00
Hongxin Liu | a4cec1715b | [llama] add flash attn patch for npu (#5362) | 2024-02-05 16:48:34 +08:00
Hongxin Liu | 73f9f23fc6 | [llama] update training script (#5360) | 2024-02-05 16:33:18 +08:00
    * [llama] update training script
    * [doc] polish docstr
Hongxin Liu | 6c0fa7b9a8 | [llama] fix dataloader for hybrid parallel (#5358) | 2024-02-05 15:14:56 +08:00
    * [plugin] refactor prepare dataloader
    * [plugin] update train script
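The dataloader fix above hinges on a plugin-level "prepare dataloader" step (see the commit bullets). A minimal sketch of the underlying idea in plain PyTorch, with a hypothetical helper name; this is not ColossalAI's actual plugin API:

```python
# Sketch of a "prepare dataloader" helper for hybrid parallelism: only the data-parallel
# dimension shards the sampler; tensor/pipeline ranks reuse the same shard.
# Generic illustration with hypothetical names, not the repository's plugin code.
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

class ToyDataset(Dataset):
    def __len__(self):
        return 1024
    def __getitem__(self, idx):
        return torch.randn(16), idx % 2

def prepare_dataloader(dataset, batch_size, dp_rank, dp_world_size, shuffle=True, drop_last=True):
    # Passing num_replicas/rank explicitly avoids depending on an initialized process group.
    sampler = DistributedSampler(
        dataset, num_replicas=dp_world_size, rank=dp_rank, shuffle=shuffle, drop_last=drop_last
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=drop_last)

loader = prepare_dataloader(ToyDataset(), batch_size=8, dp_rank=0, dp_world_size=2)
print(len(loader))  # 64 batches: 1024 samples / 2 dp ranks / batch size 8
```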
Hongxin Liu | 2dd01e3a14 | [gemini] fix param op hook when output is tuple (#5355) | 2024-02-04 11:58:26 +08:00
    * [gemini] fix param op hook when output is tuple
    * [gemini] fix param op hook
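The Gemini fix above concerns a hook that previously assumed an op returns a single tensor. A generic sketch of tuple-safe output handling, with hypothetical names; this is not ColossalAI's actual param op hook code:

```python
# Generic illustration of the pitfall named above: a hook that post-processes an op's
# output must accept both a single tensor and a tuple of tensors.
import torch

def _post_process(t: torch.Tensor) -> torch.Tensor:
    return t  # placeholder for whatever bookkeeping the hook performs per tensor

def handle_op_output(output):
    if isinstance(output, tuple):
        # Rewrap element-wise so non-tensor members (e.g. None) pass through unchanged.
        return tuple(_post_process(o) if isinstance(o, torch.Tensor) else o for o in output)
    return _post_process(output)

print(handle_op_output(torch.ones(2)))
print(handle_op_output((torch.ones(2), None)))
```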
yuehuayingxueluo | 631862f339 | [Inference]Optimize generation process of inference engine (#5356) | 2024-02-02 15:38:21 +08:00
    * opt inference engine
    * fix run_benchmark.sh
    * fix generate in engine.py
    * rollback test_inference_engine.py
yuehuayingxueluo | 21ad4a27f9 | [Inference/opt]Optimize the mid tensor of RMS Norm (#5350) | 2024-02-02 15:06:01 +08:00
    * opt rms_norm
    * fix bugs in rms_layernorm
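For context on the RMS norm optimization above, a naive PyTorch reference makes the intermediate ("mid") tensors explicit; a fused kernel produces the same result without materializing them. Generic illustration, not the repository's Triton kernel:

```python
import torch

def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    variance = x.pow(2).mean(dim=-1, keepdim=True)  # mid tensor 1: per-token mean of squares
    x_hat = x * torch.rsqrt(variance + eps)         # mid tensor 2: normalized activations
    return weight * x_hat

x = torch.randn(4, 128, 4096)
w = torch.ones(4096)
assert rms_norm_reference(x, w).shape == x.shape
```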
Wenhao Chen | 1c790c0877 | [fix] remove unnecessary dp_size assert (#5351) | 2024-02-02 14:40:20 +08:00
    * fix: remove unnecessary assert
    * test: add more 3d plugin tests
    * fix: add warning
Frank Lee | 027aa1043f | [doc] updated inference readme (#5343) | 2024-02-02 14:31:10 +08:00
Frank Lee | e76acbb076 | [inference] moved ops tests to test_infer (#5354) | 2024-02-02 13:51:22 +08:00
Frank Lee | db1a763307 | [inference] removed redundancy init_batch (#5353) | 2024-02-02 11:44:15 +08:00
Hongxin Liu | ffffc32dc7 | [checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347) | 2024-02-01 16:13:06 +08:00
    * [checkpointio] fix hybrid parallel optim checkpoint
    * [extension] fix cuda extension
    * [checkpointio] fix gemini optimizer checkpoint
    * polish code
yuehuayingxueluo | 249644c23b | [Inference]Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340) | 2024-02-01 15:49:39 +08:00
    * add fused qkv
    * replace attn and mlp by shardformer
    * fix bugs in mlp
    * add docstrings
    * fix test_inference_engine.py
    * add optimize unbind
    * add fused_addmm
    * rm squeeze(1)
    * refactor codes
    * fix ci bugs
    * rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
    * Removed the dependency on LlamaFlashAttention2
    * rollback test_inference_engine.py
Frank Lee | f8e456d202 | [inference] simplified config verification (#5346) | 2024-02-01 15:31:01 +08:00
    * [inference] simplified config verification
    * polish
    * polish
YeAnbang | c5239840e6 | [Chat] fix sft loss nan (#5345) | 2024-02-01 14:25:16 +08:00
    * fix script
    * fix script
    * fix chat nan
    * fix chat nan
Frank Lee | abd8e77ad8 | [extension] fixed exception catch (#5342) | 2024-01-31 18:09:49 +08:00
Jianghai | df0aa49585 | [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336) | 2024-01-31 16:31:29 +08:00
    * revise rotary embedding
    * remove useless print
    * adapt
Frank Lee | 1336838a91 | Merge pull request #5339 from FrankLeeeee/sync/merge-main | 2024-01-31 16:29:26 +08:00
    Sync/merge main
FrankLeeeee | c565519913 | merge commit | 2024-01-31 10:41:47 +08:00
Yuanheng Zhao | 5f98a9d68a | [Infer] Optimize Blocked KVCache And Kernels Using It (#5325) | 2024-01-30 16:06:09 +08:00
    * revise shape of kvcache (context attn kernel)
    * revise shape of kvcache (flash decoding kernel)
    * revise shape of kvcache (kvcache copy) and attn func
    * init of kvcache in kvcache manager
    * revise llama modeling
    * revise block size retrieval
    * use torch for rms_norm benchmarking
    * revise block size retrieval
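As background for the blocked KV cache changes above, a rough sketch of what a blocked ("paged") KV cache write looks like for one decoding step. The (num_blocks, num_heads, block_size, head_dim) layout and the block-table indexing are assumptions for illustration, not necessarily the shapes this PR settled on:

```python
import torch

num_blocks, num_heads, block_size, head_dim = 16, 8, 32, 64
k_cache = torch.zeros(num_blocks, num_heads, block_size, head_dim)

def copy_k_to_blocked_cache(k_cache, k_new, block_tables, seq_lens):
    # k_new: (batch, num_heads, head_dim), the key of the token produced this step.
    for i, seq_len in enumerate(seq_lens.tolist()):
        block_id = block_tables[i, seq_len // block_size]  # physical block for this position
        slot = seq_len % block_size                        # offset inside the block
        k_cache[block_id, :, slot, :] = k_new[i]

batch = 2
block_tables = torch.zeros(batch, 4, dtype=torch.long)
block_tables[1, 0] = 1  # the second sequence owns physical block 1
copy_k_to_blocked_cache(k_cache, torch.randn(batch, num_heads, head_dim),
                        block_tables, torch.tensor([0, 5]))
```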
yuehuayingxueluo | e8f0642f28 | [Inference]Add Nopadding Llama Modeling (#5327) | 2024-01-30 10:31:46 +08:00
    * add nopadding llama modeling
    * add nopadding_llama.py
    * rm unused codes
    * fix bugs in test_xine_copy.py
    * fix code style
digger yu | 71321a07cf | fix typo change dosen't to doesn't (#5308) | 2024-01-30 09:57:38 +08:00
digger yu | 6a3086a505 | fix typo under extensions/ (#5330) | 2024-01-30 09:55:16 +08:00
Frank Lee | febed23288 | [doc] added docs for extensions (#5324) | 2024-01-29 17:39:23 +08:00
    * [doc] added docs for extensions
    * polish
    * polish
flybird11111 | 388179f966 | [tests] fix t5 test. (#5322) | 2024-01-29 17:38:46 +08:00
    * [ci] fix shardformer tests. (#5255)
    * fix ci
      fix
    * revert: revert p2p
    * feat: add enable_metadata_cache option
    * revert: enable t5 tests
    ---------
    Co-authored-by: Wenhao Chen <cwher@outlook.com>
    * fix t5 test
    ---------
    Co-authored-by: Wenhao Chen <cwher@outlook.com>
Jianghai | c7c104cb7c | [DOC] Update inference readme (#5280) | 2024-01-29 16:21:06 +08:00
    * add readme
    * add readme
    * 1
    * update engine
    * finish readme
    * add readme
Frank Lee | a6709afe66 | Merge pull request #5321 from FrankLeeeee/hotfix/accelerator-api | 2024-01-29 14:29:58 +08:00
    [accelerator] fixed npu api
FrankLeeeee | 087d0cb1fc | [accelerator] fixed npu api | 2024-01-29 14:27:52 +08:00
Frank Lee | 8823cc4831 | Merge pull request #5310 from hpcaitech/feature/npu | 2024-01-29 13:49:39 +08:00
    Feature/npu
Frank Lee | 73f4dc578e | [workflow] updated CI image (#5318) | 2024-01-29 11:53:07 +08:00
Jianghai | 1f8a75d470 | [Inference] Update rms norm kernel, benchmark with vLLM (#5315) | 2024-01-29 10:22:33 +08:00
    * add
    * xi
    * del
    * del
    * fix
Jianghai | 7ddd8b37f0 | fix (#5311) | 2024-01-26 15:02:12 +08:00
yuehuayingxueluo | 4f28cb43c0 | [inference]Optimize the usage of the mid tensors space in flash attn (#5304) | 2024-01-26 14:00:10 +08:00
    * opt flash attn
    * opt tmp tensor
    * fix benchmark_llama
    * fix code style
    * fix None logic for output tensor
    * fix adapted to get_xine_cache
    * add comment
    * fix ci bugs
    * fix some codes
    * rm duplicated codes
    * rm duplicated codes
    * fix code style
    * add _get_dtype in config.py
Frank Lee | 7cfed5f076 | [feat] refactored extension module (#5298) | 2024-01-25 17:01:48 +08:00
    * [feat] refactored extension module
    * polish
    * polish
    * polish
    * polish
    * polish
    * polish
    * polish
    * polish
    * polish
    * polish
digger yu | bce9499ed3 | fix some typo (#5307) | 2024-01-25 13:56:27 +08:00
李文军 | ec912b1ba9 | [NFC] polish applications/Colossal-LLaMA-2/colossal_llama2/tokenizer/init_tokenizer.py code style (#5228) | 2024-01-25 13:14:48 +08:00
Yuanheng Zhao | af8359c430 | [hotfix] fix boundary check in batch (#5306) | 2024-01-25 10:23:12 +08:00
Jianghai | c647e00e3c | [Inference]Add fused rotary kernel and get cos cache kernel (#5302) | 2024-01-24 16:20:42 +08:00
    * add fused rotary and get cos cache func
    * staged
    * fix bugs
    * fix bugs
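As background for the fused rotary kernel above, an unfused PyTorch reference of rotary position embedding plus a simple cos/sin cache builder. This is a generic illustration of the operation being fused, not the repository's Triton kernel:

```python
import torch

def build_cos_sin_cache(max_len: int, head_dim: int, base: float = 10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = torch.outer(torch.arange(max_len).float(), inv_freq)  # (max_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

def apply_rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq, num_heads, head_dim); cos/sin: (seq, head_dim // 2)
    x1, x2 = x.chunk(2, dim=-1)
    cos, sin = cos[None, :, None, :], sin[None, :, None, :]
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)

cos, sin = build_cos_sin_cache(max_len=2048, head_dim=64)
q = torch.randn(2, 16, 8, 64)  # (batch, seq, heads, head_dim)
q_rot = apply_rotary(q, cos[:16], sin[:16])
```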
Yuanheng Zhao | 3da9993b0d | [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301) | 2024-01-23 17:16:02 +08:00
    * fix decoding kernel pytest
    * revise and add triton context attn benchmark
Jianghai | 8e606ecc7e | [Inference] Benchmarking rotary embedding and add a fetch function (#5277) | 2024-01-23 12:11:53 +08:00
    * fix bugs and add a cos/sin cache fetch func
    * add docstring
    * fix bug
    * fix
Desperado-Jia | ddf879e2db | fix bug for mefture (#5299) | 2024-01-22 22:17:54 +08:00
yuehuayingxueluo | b7853196a0 | Merge pull request #5297 from yuehuayingxueluo/fix_rotary_embedding | 2024-01-22 17:07:14 +08:00
    [Inference/fix]Add utils.py for Rotary Embedding
yuehuayingxueluo | cea9c86e45 | add utils.py | 2024-01-22 16:06:27 +08:00
Hongxin Liu | d7f8db8e21 | [hotfix] fix 3d plugin test (#5292) | 2024-01-22 15:19:04 +08:00
yuehuayingxueluo | bfff9254ac | [inference] Adapted to Rotary Embedding and RMS Norm (#5283) | 2024-01-22 10:55:34 +08:00
    * adapted to rotary_embedding
    * adapted to nopad rms norm
    * fix bugs in benchmark
    * fix flash_decoding.py
flybird11111 | f7e3f82a7e | fix llama pretrain (#5287) | 2024-01-19 17:49:02 +08:00
Desperado-Jia | 6a56967855 | [doc] add llama2-13B display (#5285) | 2024-01-19 16:04:08 +08:00
    * Update README.md
    * fix 13b typo
    ---------
    Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Yuanheng Zhao | 6e487e7d3c | [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274) | 2024-01-19 15:47:16 +08:00
    * prevent re-creating intermediate tensors
    * add singleton class holding intermediate values
    * fix triton kernel api
    * add benchmark in pytest
    * fix kernel api and add benchmark
    * revise flash decoding triton kernel in/out shapes
    * fix calling of triton kernel in modeling
    * fix pytest: extract to util functions
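One bullet above mentions a singleton class holding intermediate values. A minimal sketch of that pattern, with hypothetical names and keying scheme, showing how scratch buffers can be allocated once and reused across decoding steps instead of being re-created each call:

```python
import torch

class DecodingScratch:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._buffers = {}
        return cls._instance

    def get(self, name, shape, dtype=torch.float32, device="cpu"):
        # Allocate once per (name, shape, dtype); later calls return the cached buffer.
        buf = self._buffers.get(name)
        if buf is None or buf.shape != torch.Size(shape) or buf.dtype != dtype:
            buf = torch.empty(*shape, dtype=dtype, device=device)
            self._buffers[name] = buf
        return buf

a = DecodingScratch().get("mid_output", (4, 8, 64))
b = DecodingScratch().get("mid_output", (4, 8, 64))
assert a.data_ptr() == b.data_ptr()  # same storage, no per-step allocation
```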
Jianghai | 9e2342bde2 | [Hotfix] Fix bugs in testing continuous batching (#5270) | 2024-01-18 16:31:14 +08:00
    * fix bug
    * fix bugs
    * fix bugs
    * fix bugs and add padding
    * add funcs and fix bugs
    * fix typos
    * fix bugs
    * add func