diff --git a/inference/lmdeploy.md b/inference/lmdeploy.md
new file mode 100644
index 0000000..d376bd4
--- /dev/null
+++ b/inference/lmdeploy.md
@@ -0,0 +1,76 @@
+# Inference by LMDeploy
+
+English | [简体中文](lmdeploy_zh_cn.md)
+
+[LMDeploy](https://github.com/InternLM/lmdeploy) is an efficient, user-friendly toolkit for compressing, deploying, and serving LLMs.
+
+This article primarily highlights the basic usage of LMDeploy. For a comprehensive understanding of the toolkit, please refer to [the tutorials](https://lmdeploy.readthedocs.io/en/latest/).
+
+
+## Installation
+
+Install lmdeploy with pip (python 3.8+):
+
+```shell
+pip install lmdeploy
+```
+
+## Offline batch inference
+
+With just 4 lines of code, you can run batch inference over a list of prompts:
+
+```python
+from lmdeploy import pipeline
+pipe = pipeline("internlm/internlm2-chat-7b")
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
+
+With dynamic NTK scaling, LMDeploy can handle a context length of 200K for `InternLM2`:
+
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+# Enlarge the session length and apply RoPE scaling for long-context inference
+engine_config = TurbomindEngineConfig(session_len=200000,
+                                      rope_scaling_factor=2.0)
+pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)
+prompt = 'Please offer a long prompt here'
+response = pipe(prompt)
+print(response)
+```
+
+For more information about LMDeploy pipeline usage, please refer to [here](https://lmdeploy.readthedocs.io/en/latest/inference/pipeline.html).
+
+## Serving
+
+LMDeploy's `api_server` packs a model into a service with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of starting the service:
+
+```shell
+lmdeploy serve api_server internlm/internlm2-chat-7b
+```
+
+The default port of `api_server` is `23333`. After the server is launched, you can communicate with the server from the terminal through `api_client`:
+
+```shell
+lmdeploy serve api_client http://0.0.0.0:23333
+```
+
+Alternatively, you can test the server's APIs online through the Swagger UI at `http://0.0.0.0:23333`. A detailed overview of the API specification is available [here](https://lmdeploy.readthedocs.io/en/latest/serving/restful_api.html).
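+
+Since the RESTful APIs are OpenAI-compatible, the server can also be queried programmatically. The snippet below is a minimal sketch using the official `openai` Python client; it assumes the package is installed, that the server above is running on the default port `23333`, and that no API key is enforced (the key is a placeholder).
+
+```python
+# Minimal sketch: query the OpenAI-compatible endpoints of `api_server`.
+# Assumptions: `pip install openai` (v1+) and the server is on port 23333.
+from openai import OpenAI
+
+client = OpenAI(api_key="none", base_url="http://0.0.0.0:23333/v1")
+# Discover the served model name instead of hard-coding it
+model_name = client.models.list().data[0].id
+response = client.chat.completions.create(
+    model=model_name,
+    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
+    temperature=0.8,
+)
+print(response.choices[0].message.content)
+```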
diff --git a/inference/lmdeploy_zh_cn.md b/inference/lmdeploy_zh_cn.md
new file mode 100644
index 0000000..835aedd
--- /dev/null
+++ b/inference/lmdeploy_zh_cn.md
@@ -0,0 +1,78 @@
+# Inference with LMDeploy
+
+[English](lmdeploy.md) | 简体中文
+
+[LMDeploy](https://github.com/InternLM/lmdeploy) is an efficient and user-friendly toolkit for deploying LLMs, covering quantization, inference, and serving.
+
+This article introduces the basic usage of LMDeploy, including [installation](#installation), [offline batch inference](#offline-batch-inference), and [serving](#serving). For a more comprehensive introduction, please refer to the [LMDeploy user guide](https://lmdeploy.readthedocs.io/zh-cn/latest/).
+
+
+## Installation
+
+Install LMDeploy with pip (python 3.8+):
+
+```shell
+pip install lmdeploy
+```
+
+## Offline batch inference
+
+With just the following 4 lines of code, you can run batch inference over a list of prompts:
+
+```python
+from lmdeploy import pipeline
+pipe = pipeline("internlm/internlm2-chat-7b")
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
+
+LMDeploy implements dynamic NTK scaling for long-context extrapolation. With the following code, InternLM2 can handle a context length of 200K:
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+# Enlarge the session length and apply RoPE scaling for long-context inference
+engine_config = TurbomindEngineConfig(session_len=200000,
+                                      rope_scaling_factor=2.0)
+pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)
+prompt = 'Please offer a long prompt here'
+response = pipe(prompt)
+print(response)
+```
+
+For more information about the pipeline, please refer to [here](https://lmdeploy.readthedocs.io/zh-cn/latest/inference/pipeline.html).
+
+## Serving
+
+LMDeploy's `api_server` packs a model into a service with a single command, exposing RESTful APIs that are compatible with OpenAI's interfaces. Below is an example of starting the service:
+
+```shell
+lmdeploy serve api_server internlm/internlm2-chat-7b
+```
+
+The default port of the service is `23333`. After the server is launched, you can chat with the server from the terminal through `api_client`:
+
+```shell
+lmdeploy serve api_client http://0.0.0.0:23333
+```
+
+In addition, you can browse and try out the `api_server` endpoints online through the Swagger UI at `http://0.0.0.0:23333`, or consult the [documentation](https://lmdeploy.readthedocs.io/zh-cn/latest/serving/restful_api.html) for the definition and usage of each endpoint.
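+
+Because the RESTful APIs follow the OpenAI schema, they can also be called over plain HTTP. The snippet below is a minimal sketch using the `requests` package; it assumes the server above is running on the default port `23333` and that the endpoint paths follow the standard OpenAI layout (`/v1/models`, `/v1/chat/completions`).
+
+```python
+# Minimal sketch: call the OpenAI-compatible REST endpoints with plain HTTP.
+# Assumptions: `pip install requests` and the server above is on port 23333.
+import requests
+
+base_url = "http://0.0.0.0:23333"
+
+# List the served models to discover the model name
+model_name = requests.get(f"{base_url}/v1/models").json()["data"][0]["id"]
+
+payload = {
+    "model": model_name,
+    "messages": [{"role": "user", "content": "Hi, pls intro yourself"}],
+    "temperature": 0.8,
+}
+resp = requests.post(f"{base_url}/v1/chat/completions", json=payload)
+print(resp.json()["choices"][0]["message"]["content"])
+```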