ColossalAI/tests/test_infer/test_inference_engine.py

import random

import numpy as np
import pytest
import torch
import transformers
from transformers import AutoTokenizer, GenerationConfig

import colossalai
from colossalai.inference.config import InferenceConfig
from colossalai.inference.core.engine import InferenceEngine
from colossalai.testing import rerun_if_address_is_in_use, spawn


def setup_seed(seed):
    """Seed the torch (CPU and CUDA), NumPy, and Python RNGs for reproducible runs."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)


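def check_inference_engine(test_cai=False):
    """Generate text for two fixed prompts and return the decoded outputs.

    With test_cai=True the Colossal-AI InferenceEngine is used; otherwise the
    Hugging Face model.generate() baseline is used.
    """
    # Reseed on every call so both paths start from the same RNG state.
    setup_seed(20)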
    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
    model = transformers.LlamaForCausalLM(
        transformers.LlamaConfig(
            vocab_size=50000, hidden_size=512, intermediate_size=1536, num_attention_heads=4, num_hidden_layers=16
        )
    ).cuda()

    model = model.eval()

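    # The prompts are kept in Chinese because they are tokenizer test data.
    # English: "Introduce today's Beijing, for example the Forbidden City,
    # Tiananmen, the Great Wall, or some other sights," and "Introduce Wuhan,".
    inputs = [
        "介绍一下今天的北京,比如故宫，天安门，长城或者其他的一些景点,",
        "介绍一下武汉,",
    ]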

    output_len = 128
    do_sample = True
    top_p = 0.5
    top_k = 50

    if test_cai:
        inference_config = InferenceConfig(max_output_len=output_len)
        inference_engine = InferenceEngine(model, tokenizer, inference_config, verbose=True)
        inference_engine.add_request(prompts=inputs)
        assert inference_engine.request_handler._has_waiting()
        generation_config = GenerationConfig(do_sample=do_sample, top_p=top_p, top_k=top_k)
        outputs = inference_engine.generate(generation_config)
    else:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
        inputs = tokenizer.batch_encode_plus(inputs, padding=True, return_tensors="pt")["input_ids"]
        inputs = inputs.cuda()
        generation_config = GenerationConfig(
            do_sample=do_sample,
            top_p=top_p,
            top_k=top_k,
            pad_token_id=tokenizer.pad_token_id,
            max_new_tokens=output_len,
        )
        outputs = model.generate(inputs, generation_config=generation_config)
        outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return outputs


def run_dist(rank, world_size, port):
    colossalai.launch(config={}, rank=rank, world_size=world_size, port=port, host="localhost")
    cai_outputs = check_inference_engine(True)
    transformer_outputs = check_inference_engine(False)

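    # zip() stops at the shorter list, so check the lengths before comparing pairwise.
    assert len(cai_outputs) == len(transformer_outputs)
    for s1, s2 in zip(cai_outputs, transformer_outputs):
        assert s1 == s2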


@pytest.mark.dist
@rerun_if_address_is_in_use()
def test_inference_engine():
    spawn(run_dist, 1)


if __name__ == "__main__":
    test_inference_engine()
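
The pairwise comparison in run_dist reports only that two strings differ, not which prompt diverged. A minimal sketch of a more informative comparison follows. The names first_mismatch and assert_same_outputs are hypothetical helpers introduced here for illustration; they are not part of ColossalAI or its test utilities.

```python
from typing import Optional, Sequence


def first_mismatch(left: Sequence[str], right: Sequence[str]) -> Optional[int]:
    """Return the index of the first position where the sequences differ.

    A length difference counts as a mismatch at the index where the shorter
    sequence ends. Returns None when both sequences are identical.
    (Hypothetical helper, not part of ColossalAI.)
    """
    for index, (a, b) in enumerate(zip(left, right)):
        if a != b:
            return index
    if len(left) != len(right):
        return min(len(left), len(right))
    return None


def assert_same_outputs(left: Sequence[str], right: Sequence[str]) -> None:
    """Fail with the index of the first diverging output, if any."""
    index = first_mismatch(left, right)
    assert index is None, f"outputs diverge at index {index}"
```

Under these assumptions, run_dist could call assert_same_outputs(cai_outputs, transformer_outputs) in place of the loop, so a failure names the prompt index that diverged.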