mirror of https://github.com/THUDM/ChatGLM-6B
Add a streaming API endpoint; move the demo and API code into a new directory; update the README.
parent aeced3619b
commit e2039a8b87

README.md (62 lines changed)

@@ -90,47 +90,56 @@ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/THUDM/chatglm-6b

## Demo & API

### Web Demo (Gradio)

![web-demo](resources/web-demo.gif)

First install Gradio: `pip install gradio mdtex2html`, then run [web_demo_gradio.py](demo_and_api/web_demo_gradio.py) from the repository:

```shell
cd demo_and_api
python web_demo_gradio.py
```

The program starts a web server and prints its address; open that address in a browser to use the demo. The latest demo adds a typewriter (streaming) effect, which greatly improves the perceived speed. Note that access to Gradio's network is slow from inside China: when launching with `demo.queue().launch(share=True, inbrowser=True)`, all traffic is relayed through Gradio's servers, which badly degrades the typewriter experience. The default launch mode has therefore been changed to `share=False`; if you need public access, change it back to `share=True`.
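
As a point of reference, here is a minimal, self-contained Gradio sketch (not the actual web_demo_gradio.py; the echo function stands in for the model) showing where the `share` flag goes in the launch call described above:

```python
import gradio as gr

# Illustrative stand-in for the real demo: a trivial echo app launched the same way.
# Set share=True only if you need a public gradio.live link; all traffic is then
# relayed through Gradio's servers, which slows down the streaming output.
with gr.Blocks() as demo:
    prompt = gr.Textbox(label="prompt")
    reply = gr.Textbox(label="response")
    prompt.submit(lambda text: text, inputs=prompt, outputs=reply)

demo.queue().launch(share=False, inbrowser=True)
```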

### Web Demo (Streamlit)

First install Streamlit: `pip install streamlit streamlit-chat`, then run [web_demo_streamlit.py](demo_and_api/web_demo_streamlit.py) from the repository:

```shell
cd demo_and_api
streamlit run web_demo_streamlit.py --server.port 6006
```

*Thanks to [@AdamBear](https://github.com/AdamBear) for contributing this implementation; see [#117](https://github.com/THUDM/ChatGLM-6B/pull/117).*

### Command-line Demo

|

|
||||||
|
|
||||||
运行仓库中 [cli_demo.py](cli_demo.py):
|
运行仓库中 [cli_demo.py](cli_demo.py):
|
||||||
|
|
||||||
```shell
|
```shell
|
||||||
|
cd demo_and_api
|
||||||
python cli_demo.py
|
python cli_demo.py
|
||||||
```
|
```
|
||||||
|
|
||||||
程序会在命令行中进行交互式的对话,在命令行中输入指示并回车即可生成回复,输入 `clear` 可以清空对话历史,输入 `stop` 终止程序。
|
程序会在命令行中进行交互式的对话,在命令行中输入指示并回车即可生成回复,输入 `clear` 可以清空对话历史,输入 `stop` 终止程序。

### API Deployment

First install the extra dependencies: `pip install fastapi uvicorn pydantic`, then run [api.py](demo_and_api/api.py) from the repository:

```shell
cd demo_and_api
python api.py
```

The API provides a regular endpoint (`/chat`) and a streaming endpoint (`/stream_chat`). The streaming endpoint enables a typewriter effect; for an example of calling it, see [web_demo_streamlit_with_api.py](demo_and_api/web_demo_streamlit_with_api.py).

By default the API is served locally on port 8000 and is called via POST:

```shell
curl -X POST "http://127.0.0.1:8000/chat" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'
```

@@ -144,6 +153,23 @@ curl -X POST "http://127.0.0.1:8000" \
}
```
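
For the streaming endpoint, a minimal client sketch using `requests` might look like the following (it mirrors the parsing in web_demo_streamlit_with_api.py; the server double-encodes each chunk, hence the nested `json.loads`):

```python
import json
import requests

# Sketch of a /stream_chat client; field names follow the Params model in api.py.
req = {"prompt": "你好", "history": [], "max_length": 2048, "top_p": 0.7, "temperature": 0.95}
res = requests.post("http://127.0.0.1:8000/stream_chat", json=req, stream=True)

for chunk in res.iter_lines(delimiter=b"\ndata: "):
    chunk = chunk.decode("utf-8").strip()
    if not chunk:
        continue
    # Each chunk is json.dumps() of answer.json(), i.e. a JSON-encoded JSON string,
    # so it has to be decoded twice.
    answer = json.loads(json.loads(chunk))
    print(answer["response"])
```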
|
|
||||||
|
### 网页版(基于streamlit和API) Demo
|
||||||
|
streamlit作为前端,api作为后端,使用了api的流式接口
|
||||||
|
|
||||||
|
启动后端
|
||||||
|
api依赖安装参照前文
|
||||||
|
```shell
|
||||||
|
cd demo_and_api
|
||||||
|
python api.py
|
||||||
|
```
|
||||||
|
|
||||||
|
启动前端
|
||||||
|
streamlit依赖安装参照前文
|
||||||
|
```shell
|
||||||
|
cd demo_and_api
|
||||||
|
streamlit run web_demo_streamlit_with_api.py --server.port 6006
|
||||||
|
```
|
||||||
|
|
||||||

## Low-Cost Deployment

### Model Quantization

By default the model is loaded in FP16 precision, and running the code above requires roughly 13 GB of GPU memory. If your GPU memory is limited, you can try loading the model in quantized form; usage is as follows:
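
A rough, illustrative sketch of what quantized loading typically looks like for chatglm-6b (not necessarily the README's exact snippet; `quantize(8)` is assumed to select INT8 and `quantize(4)` INT4):

```python
from transformers import AutoModel, AutoTokenizer

# Sketch of quantized loading: quantize(8) for INT8, quantize(4) for INT4 (assumed API of the
# remote chatglm-6b code); memory use drops accordingly compared with plain FP16 loading.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).quantize(8).half().cuda()
model = model.eval()
```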

@@ -211,7 +237,7 @@ model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=2)

## ChatGLM-6B Examples

The following screenshots were produced with `web_demo_gradio.py`. More of what ChatGLM-6B can do is waiting for you to explore!

<details><summary><b>Self Cognition</b></summary>

@@ -85,7 +85,7 @@ cd ChatGLM-6B

#### Web Demo

![web-demo](resources/web-demo.gif)

Install Gradio with `pip install gradio`, and run [web_demo.py](web_demo.py):

api.py (56 lines removed)

@@ -1,56 +0,0 @@
from fastapi import FastAPI, Request
from transformers import AutoTokenizer, AutoModel
import uvicorn, json, datetime
import torch

DEVICE = "cuda"
DEVICE_ID = "0"
CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE


def torch_gc():
    if torch.cuda.is_available():
        with torch.cuda.device(CUDA_DEVICE):
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()


app = FastAPI()


@app.post("/")
async def create_item(request: Request):
    global model, tokenizer
    json_post_raw = await request.json()
    json_post = json.dumps(json_post_raw)
    json_post_list = json.loads(json_post)
    prompt = json_post_list.get('prompt')
    history = json_post_list.get('history')
    max_length = json_post_list.get('max_length')
    top_p = json_post_list.get('top_p')
    temperature = json_post_list.get('temperature')
    response, history = model.chat(tokenizer,
                                   prompt,
                                   history=history,
                                   max_length=max_length if max_length else 2048,
                                   top_p=top_p if top_p else 0.7,
                                   temperature=temperature if temperature else 0.95)
    now = datetime.datetime.now()
    time = now.strftime("%Y-%m-%d %H:%M:%S")
    answer = {
        "response": response,
        "history": history,
        "status": 200,
        "time": time
    }
    log = "[" + time + "] " + '", prompt:"' + prompt + '", response:"' + repr(response) + '"'
    print(log)
    torch_gc()
    return answer


if __name__ == '__main__':
    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
    model.eval()
    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)

demo_and_api/api.py (73 lines added)

@@ -0,0 +1,73 @@
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from transformers import AutoTokenizer, AutoModel
from pydantic import BaseModel, Field
import uvicorn, json, datetime
import torch

DEVICE = "cuda"
DEVICE_ID = "0"
CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE

app = FastAPI()


class Params(BaseModel):
    # Request body shared by /chat and /stream_chat.
    prompt: str = 'hello'
    history: list[list[str]] = []
    max_length: int = 2048
    top_p: float = 0.7
    temperature: float = 0.95


class Answer(BaseModel):
    status: int = 200
    # default_factory so the timestamp is generated per response rather than once at import time
    time: str = Field(default_factory=lambda: datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    response: str
    history: list[list[str]] = []


def torch_gc():
    # Free cached GPU memory after each request.
    if torch.cuda.is_available():
        with torch.cuda.device(CUDA_DEVICE):
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()


async def create_chat(params: Params):
    # Non-streaming generation: returns the complete response in a single Answer.
    global model, tokenizer
    response, history = model.chat(tokenizer,
                                   params.prompt,
                                   history=params.history,
                                   max_length=params.max_length,
                                   top_p=params.top_p,
                                   temperature=params.temperature)
    answer_ok = Answer(response=response, history=history)
    print(answer_ok.json())
    torch_gc()
    return answer_ok


async def create_stream_chat(params: Params):
    # Streaming generation: yields one chunk per partial response, each prefixed with
    # "\ndata: ". The chunk body is json.dumps() of answer_ok.json(), i.e. a
    # JSON-encoded JSON string, so clients must call json.loads() twice
    # (see web_demo_streamlit_with_api.py).
    global model, tokenizer
    for response, history in model.stream_chat(tokenizer,
                                               params.prompt,
                                               history=params.history,
                                               max_length=params.max_length,
                                               top_p=params.top_p,
                                               temperature=params.temperature):
        answer_ok = Answer(response=response, history=history)
        # print(answer_ok.json())
        yield "\ndata: " + json.dumps(answer_ok.json())

    torch_gc()


@app.post("/chat")
async def post_chat(params: Params):
    answer = await create_chat(params)
    return answer


@app.post("/stream_chat")
async def post_stream_chat(params: Params):
    return StreamingResponse(create_stream_chat(params))


if __name__ == '__main__':
    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
    model.eval()
    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)

demo_and_api/requirements.txt (7 lines added)

@@ -0,0 +1,7 @@
pydantic
fastapi
uvicorn
gradio
mdtex2html
streamlit
streamlit-chat

demo_and_api/web_demo_streamlit_with_api.py (72 lines added)

@@ -0,0 +1,72 @@
import streamlit as st
from streamlit_chat import message
import requests
import json

st.set_page_config(
    page_title="ChatGLM-6b 演示",
    page_icon=":robot:"
)

MAX_TURNS = 20
MAX_BOXES = MAX_TURNS * 2
url = "http://localhost:8000/stream_chat"


def predict(input, max_length, top_p, temperature, history=None):
    if history is None:
        history = []

    with container:
        # Replay the previous turns of the conversation.
        if len(history) > 0:
            for i, (query, response) in enumerate(history):
                message(query, avatar_style="big-smile", key=str(i) + "_user")
                message(response, avatar_style="bottts", key=str(i))

        message(input, avatar_style="big-smile", key=str(len(history)) + "_user")
        st.write("AI正在回复:")
        with st.empty():
            req = {
                "prompt": input,
                "history": history,
                "max_length": max_length,
                "top_p": top_p,
                "temperature": temperature
            }
            # Call the streaming endpoint of api.py and render partial responses as they arrive.
            res = requests.post(url=url, json=req, stream=True)
            for line in res.iter_lines(delimiter=b'\ndata: '):
                line = line.decode(encoding='utf-8')
                if line.strip() == '':
                    continue
                # api.py yields json.dumps(answer.json()), a JSON-encoded JSON string,
                # so the payload has to be decoded twice.
                response_json = json.loads(json.loads(line))
                response = response_json['response']
                history = response_json['history']
                st.write(response)

    return history


container = st.container()

# create a prompt text for the text generation
prompt_text = st.text_area(label="用户命令输入",
                           height=100,
                           placeholder="请在这儿输入您的命令")

max_length = st.sidebar.slider(
    'max_length', 0, 4096, 2048, step=1
)
top_p = st.sidebar.slider(
    'top_p', 0.0, 1.0, 0.6, step=0.01
)
temperature = st.sidebar.slider(
    'temperature', 0.0, 1.0, 0.95, step=0.01
)

if 'state' not in st.session_state:
    st.session_state['state'] = []

if st.button("发送", key="predict"):
    with st.spinner("AI正在思考,请稍等........"):
        # text generation
        st.session_state["state"] = predict(prompt_text, max_length, top_p, temperature, st.session_state["state"])

requirements.txt (gradio and mdtex2html removed)

@@ -2,7 +2,5 @@ protobuf
transformers==4.27.1
cpm_kernels
torch>=1.10
sentencepiece
accelerate