# InternLM Ecosystem

## Training

### [XTuner](https://github.com/InternLM/xtuner)

XTuner is an efficient, flexible, and full-featured toolkit for fine-tuning large models.

You can find the best practice for fine-tuning the InternLM2 model in the [README](https://github.com/InternLM/InternLM/tree/main/finetune#xtuner).

### [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)

LLaMA-Factory is an open-source, easy-to-use fine-tuning and training framework for LLMs.

### [swift](https://github.com/modelscope/swift)

SWIFT supports training, inference, evaluation, and deployment of LLMs and MLLMs (multimodal large models).

## Inference

### [LMDeploy](https://github.com/InternLM/lmdeploy)

LMDeploy is an efficient toolkit for compressing, deploying, and serving LLMs and VLMs.

With only 4 lines of code, you can perform `internlm2-chat-7b` inference after `pip install lmdeploy`:

```python
from lmdeploy import pipeline
pipe = pipeline("internlm/internlm2-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```

### [vLLM](https://github.com/vllm-project/vllm)

`vLLM` is a high-throughput and memory-efficient inference and serving engine for LLMs.

After installation via `pip install vllm`, you can run `internlm2-chat-7b` model inference as follows:

```python
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="internlm/internlm2-chat-7b")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### [TGI](https://github.com/huggingface/text-generation-inference)

TGI is a toolkit for deploying and serving Large Language Models (LLMs). The easiest way to deploy an LLM is to use the official Docker container:

```shell
model=internlm/internlm2-chat-7b
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
```

Then you can make requests like this:

```shell
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
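
If you would rather issue the request from Python than from curl, here is a minimal sketch against TGI's non-streaming `/generate` endpoint, assuming the container above is listening on port 8080:

```python
import requests

# Non-streaming request to the TGI container started above (port 8080 as mapped by docker run).
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}},
    headers={"Content-Type": "application/json"},
    timeout=60,
)
print(resp.json()["generated_text"])
```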

### [llama.cpp](https://github.com/ggerganov/llama.cpp)

`llama.cpp` is an LLM inference framework developed in C/C++. Its goal is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, both locally and in the cloud.

`InternLM2` can be deployed with `llama.cpp` by following the instructions below:

- Refer to [this](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build) guide to build llama.cpp from source
- Convert the InternLM model to a GGUF model and run it according to the [guide](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize)
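
If you prefer to drive the converted model from Python rather than the llama.cpp CLI, one option is the `llama-cpp-python` bindings (an extra dependency, not part of the steps above). A minimal sketch, assuming the conversion produced `internlm2-chat-7b.gguf`:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load the converted GGUF model; n_gpu_layers=-1 offloads all layers to the GPU when one is available.
llm = Llama(model_path="./internlm2-chat-7b.gguf", n_ctx=4096, n_gpu_layers=-1)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hi, please introduce yourself."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```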

### [ollama](https://github.com/ollama/ollama)

Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile. It optimizes setup and configuration details, enabling users to easily set up and run LLMs locally (in CPU and GPU modes).

The following snippet presents the Modelfile of InternLM2, with `internlm2-chat-7b` as an example. Note that the InternLM2 model has to be converted to GGUF format first.

```shell
echo 'FROM ./internlm2-chat-7b.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>"""

PARAMETER stop "<|action_end|>"
PARAMETER stop "<|im_end|>"

SYSTEM """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
"""
' > ./Modelfile
```

Then, create a model from the above `Modelfile` like this:

```shell
ollama create internlm2:chat-7b -f ./Modelfile
```

Regarding the usage of `ollama`, please refer to the documentation [here](https://github.com/ollama/ollama/tree/main/docs).
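
As a quick sanity check, the created model can also be queried from Python through ollama's local REST API. A minimal sketch, assuming the ollama server is running on its default port 11434:

```python
import requests

# Chat with the locally served internlm2:chat-7b model via ollama's REST API.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "internlm2:chat-7b",
        "messages": [{"role": "user", "content": "Hi, please introduce yourself."}],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```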

### [llamafile](https://github.com/Mozilla-Ocho/llamafile)

llamafile lets you turn large language model (LLM) weights into executables. It combines [llama.cpp](https://github.com/ggerganov/llama.cpp) with [Cosmopolitan Libc](https://github.com/jart/cosmopolitan).

The best practice for deploying InternLM2 with llamafile is shown below:

- Convert the InternLM2 model into a GGUF model with `llama.cpp`. Suppose we get `internlm2-chat-7b.gguf` in this step.
- Create the llamafile

```shell
wget https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.6/llamafile-0.8.6.zip
unzip llamafile-0.8.6.zip

cp llamafile-0.8.6/bin/llamafile internlm2.llamafile

echo "-m
internlm2-chat-7b.gguf
--host
0.0.0.0
-ngl
999
..." > .args

zipalign -j0 \
  internlm2.llamafile \
  internlm2-chat-7b.gguf \
  .args

rm -rf .args
```

- Run the llamafile

```shell
./internlm2.llamafile
```

Your browser should open automatically and display a chat interface. (If it doesn't, just open your browser and point it at http://localhost:8080.)
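
Besides the browser UI, the server started by the llamafile also exposes an OpenAI-compatible endpoint, so you can script against it. A minimal sketch with the `openai` Python client, assuming the default address http://localhost:8080 and a placeholder API key:

```python
from openai import OpenAI

# The llamafile server speaks the OpenAI chat-completions protocol on port 8080.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
    model="internlm2-chat-7b",  # informational for a single-model server
    messages=[{"role": "user", "content": "Hi, please introduce yourself."}],
)
print(completion.choices[0].message.content)
```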

### [mlx](https://github.com/ml-explore/mlx)

MLX is an array framework for machine learning research on Apple silicon, brought to you by Apple machine learning research.

With the following steps, you can perform InternLM2 inference on Apple devices.

- Installation

```shell
pip install mlx mlx-lm
```

- Inference

```python
from mlx_lm import load, generate
tokenizer_config = {"trust_remote_code": True}
model, tokenizer = load("internlm/internlm2-chat-1_8b", tokenizer_config=tokenizer_config)
response = generate(model, tokenizer, prompt="write a story", verbose=True)
```

## Application

### [LangChain](https://github.com/langchain-ai/langchain)

LangChain is a framework for developing applications powered by large language models (LLMs).

You can build an [LLM chain](https://python.langchain.com/v0.1/docs/get_started/quickstart/#llm-chain) via the OpenAI API. It is recommended to launch the server with LMDeploy, vLLM, or another backend that provides an OpenAI-compatible server.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(
    api_key="a dummy key",
    base_url='http://0.0.0.0:23333/v1')
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a world class technical documentation writer."),
    ("user", "{input}")
])

chain = prompt | llm

chain.invoke({"input": "how can langsmith help with testing?"})
```

Or you can follow the guide [here](https://python.langchain.com/v0.1/docs/get_started/quickstart/#llm-chain) and run an ollama model locally, as sketched below.
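
For the local route, here is a minimal sketch using LangChain's community ollama integration with the `internlm2:chat-7b` model created in the [ollama section](#ollama); the package layout follows the v0.1 docs linked above and may differ in newer releases:

```python
from langchain_community.llms import Ollama

# Point LangChain at the locally served ollama model created earlier.
llm = Ollama(model="internlm2:chat-7b")
print(llm.invoke("Hi, please introduce yourself."))
```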

As for other use cases, please look them up [here](https://python.langchain.com/v0.1/docs/get_started/introduction/).

### [LlamaIndex](https://github.com/run-llama/llama_index)

LlamaIndex is a framework for building context-augmented LLM applications.

It uses ollama as the local LLM inference engine. An example can be found in the [Starter Tutorial (Local Models)](https://docs.llamaindex.ai/en/stable/getting_started/starter_example_local/).

Therefore, you can integrate InternLM2 into LlamaIndex smoothly if you deploy InternLM2 with `ollama` as guided in the [ollama section](#ollama).
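
Following that tutorial, here is a minimal sketch that plugs the `internlm2:chat-7b` model created above into LlamaIndex through its ollama integration (the import path assumes the `llama-index-llms-ollama` package and may change between LlamaIndex releases):

```python
# pip install llama-index llama-index-llms-ollama
from llama_index.llms.ollama import Ollama

# Use the locally served ollama model as the LLM backend for LlamaIndex.
llm = Ollama(model="internlm2:chat-7b", request_timeout=120.0)
print(llm.complete("Hi, please introduce yourself."))
```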