# InternLM Ecosystem

## Training

### [XTuner](https://github.com/InternLM/xtuner)

XTuner is an efficient, flexible and full-featured toolkit for fine-tuning large models.

You can find the best practice of fine-tuning the InternLM2 model in the [README](https://github.com/InternLM/InternLM/tree/main/finetune#xtuner).
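A minimal command-line sketch of that workflow, assuming a recent XTuner release and one of its shipped InternLM2 QLoRA configs (the config name below is illustrative; list the available ones with `xtuner list-cfg`):

```shell
# Install XTuner; see its README for optional extras such as DeepSpeed.
pip install -U xtuner

# List the built-in configs and pick an InternLM2 recipe.
xtuner list-cfg | grep internlm2

# Fine-tune with a shipped QLoRA config (config name assumed; adjust to your version).
xtuner train internlm2_chat_7b_qlora_oasst1_e3 --deepspeed deepspeed_zero2
```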

### [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)

LLaMA-Factory is an open-source, easy-to-use fine-tuning and training framework for LLMs.
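As a rough sketch, fine-tuning an InternLM2 model with its CLI looks like the following; the YAML file name is a placeholder for a config that points `model_name_or_path` at an InternLM2 checkpoint, and details may differ across LLaMA-Factory versions:

```shell
# Install LLaMA-Factory (from PyPI or from source, per its README).
pip install llamafactory

# Launch the web UI and configure a fine-tune interactively ...
llamafactory-cli webui

# ... or run a config-driven fine-tune from the command line
# (internlm2_lora_sft.yaml is a placeholder config file).
llamafactory-cli train internlm2_lora_sft.yaml
```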

### [SWIFT](https://github.com/modelscope/swift)

SWIFT supports training, inference, evaluation, and deployment of LLMs and MLLMs (multimodal large language models).
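A hedged sketch of a LoRA fine-tune with the `swift` CLI is shown below; flag and identifier names change across ms-swift releases, so treat them as placeholders and consult `swift sft --help` for your installed version:

```shell
# Install ms-swift, which provides the `swift` CLI.
pip install ms-swift

# LoRA fine-tune an InternLM2 chat model on a built-in dataset
# (model and dataset identifiers are illustrative).
swift sft \
    --model_type internlm2-7b-chat \
    --dataset alpaca-en \
    --output_dir ./output
```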

## Inference

### [LMDeploy](https://github.com/InternLM/lmdeploy)

LMDeploy is an efficient toolkit for compressing, deploying, and serving LLMs and VLMs.

With only four lines of code, you can perform `internlm2-chat-7b` inference after `pip install lmdeploy`:

```python
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```

### [vLLM](https://github.com/vllm-project/vllm)

`vLLM` is a high-throughput and memory-efficient inference and serving engine for LLMs.

After installing it via `pip install vllm`, you can run `internlm2-chat-7b` inference as follows:

```python
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM. trust_remote_code is required to load InternLM2's remote code.
llm = LLM(model="internlm/internlm2-chat-7b", trust_remote_code=True)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### [TGI](https://github.com/huggingface/text-generation-inference)

TGI is a toolkit for deploying and serving Large Language Models (LLMs). The easiest way to deploy an LLM is to use the official Docker container:

```shell
model=internlm/internlm2-chat-7b
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
```

Then you can make requests like this:

```shell
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```

### [llama.cpp](https://github.com/ggerganov/llama.cpp)

`llama.cpp` is an LLM inference framework developed in C/C++. Its goal is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

`InternLM2` can be deployed with `llama.cpp` by following the instructions below:

- Refer to [this](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build) guide to build llama.cpp from source.
- Convert the InternLM model to a GGUF model and run it according to the [guide](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize); a command-line sketch follows this list.
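The sketch below illustrates that flow once llama.cpp has been built; the script and binary names (`convert_hf_to_gguf.py`, `llama-quantize`, `llama-cli`) follow recent llama.cpp releases and may differ in older checkouts:

```shell
# Convert the Hugging Face checkpoint to GGUF (run from the llama.cpp source tree).
python convert_hf_to_gguf.py /path/to/internlm2-chat-7b --outfile internlm2-chat-7b.gguf

# Optionally quantize to 4-bit to cut memory usage.
./llama-quantize internlm2-chat-7b.gguf internlm2-chat-7b-q4_k_m.gguf q4_k_m

# Run an interactive chat with the converted model.
./llama-cli -m internlm2-chat-7b-q4_k_m.gguf -cnv -p "You are a helpful assistant."
```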

### [ollama](https://github.com/ollama/ollama)

Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile. It optimizes setup and configuration details, enabling users to easily set up and execute LLMs locally (in CPU and GPU modes).

The following snippet presents the Modelfile of InternLM2, using `internlm2-chat-7b` as an example. Note that the model has to be converted to a GGUF model first.

```shell
echo 'FROM ./internlm2-chat-7b.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>"""

PARAMETER stop "<|action_end|>"
PARAMETER stop "<|im_end|>"

SYSTEM """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
"""
' > ./Modelfile
```

Then, create a model from the above `Modelfile` like this:

```shell
ollama create internlm2:chat-7b -f ./Modelfile
```
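Once the model is created, you can chat with it directly from the command line, for example:

```shell
ollama run internlm2:chat-7b "Please introduce yourself."
```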

Regarding the usage of `ollama`, please refer to [its documentation](https://github.com/ollama/ollama/tree/main/docs).

### [llamafile](https://github.com/Mozilla-Ocho/llamafile)

llamafile lets you turn large language model (LLM) weights into executables. It combines [llama.cpp](https://github.com/ggerganov/llama.cpp) with [Cosmopolitan Libc](https://github.com/jart/cosmopolitan).

The best practice for deploying InternLM2 with llamafile is shown below:

- Convert the InternLM2 model into a GGUF model with `llama.cpp`. Suppose we get `internlm2-chat-7b.gguf` in this step.
- Create the llamafile:

```shell
wget https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.6/llamafile-0.8.6.zip
unzip llamafile-0.8.6.zip

cp llamafile-0.8.6/bin/llamafile internlm2.llamafile

echo "-m
internlm2-chat-7b.gguf
--host
0.0.0.0
-ngl
999
..." > .args

zipalign -j0 \
  internlm2.llamafile \
  internlm2-chat-7b.gguf \
  .args

rm -rf .args
```

- Run the llamafile:

```shell
./internlm2.llamafile
```

Your browser should open automatically and display a chat interface. (If it doesn't, just open your browser and point it at http://localhost:8080.)

### [mlx](https://github.com/ml-explore/mlx)

MLX is an array framework for machine learning research on Apple silicon, brought to you by Apple machine learning research.

With the following steps, you can perform InternLM2 inference on Apple devices.

- Installation

```shell
pip install mlx mlx-lm
```

- Inference

```python
from mlx_lm import load, generate

tokenizer_config = {"trust_remote_code": True}
model, tokenizer = load("internlm/internlm2-chat-1_8b", tokenizer_config=tokenizer_config)
response = generate(model, tokenizer, prompt="write a story", verbose=True)
```

## Application

### [Langchain](https://github.com/langchain-ai/langchain)

LangChain is a framework for developing applications powered by large language models (LLMs).

You can build an [LLM chain](https://python.langchain.com/v0.1/docs/get_started/quickstart/#llm-chain) through the OpenAI API, serving the model with LMDeploy, vLLM, or any other engine that exposes an OpenAI-compatible server.
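The snippet below assumes such a server is already running locally. With LMDeploy, for instance, the following command starts an OpenAI-compatible endpoint on port `23333`, which matches the `base_url` used in the example:

```shell
pip install lmdeploy
lmdeploy serve api_server internlm/internlm2-chat-7b --server-port 23333
```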

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(
    api_key="a dummy key",
    base_url="http://0.0.0.0:23333/v1")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a world class technical documentation writer."),
    ("user", "{input}")
])

chain = prompt | llm

chain.invoke({"input": "how can langsmith help with testing?"})
```

Or you can follow the guide [here](https://python.langchain.com/v0.1/docs/get_started/quickstart/#llm-chain) and run an ollama model locally.

For other use cases, please refer to the [LangChain documentation](https://python.langchain.com/v0.1/docs/get_started/introduction/).

### [LlamaIndex](https://github.com/run-llama/llama_index)

LlamaIndex is a framework for building context-augmented LLM applications.

Its [Starter Tutorial (Local Models)](https://docs.llamaindex.ai/en/stable/getting_started/starter_example_local/) uses ollama as the local LLM inference engine.

Therefore, you can integrate InternLM2 into LlamaIndex smoothly by deploying InternLM2 with `ollama`, as guided in the [ollama section](#ollama).
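As a minimal sketch along the lines of that tutorial, assuming the `internlm2:chat-7b` model created in the [ollama section](#ollama) and the `llama-index-llms-ollama` integration package:

```python
# pip install llama-index llama-index-llms-ollama
from llama_index.llms.ollama import Ollama

# Assumes `ollama create internlm2:chat-7b -f ./Modelfile` has been run and
# the ollama server is listening on its default local port.
llm = Ollama(model="internlm2:chat-7b", request_timeout=120.0)
print(llm.complete("Briefly introduce the InternLM ecosystem."))
```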