# Chat
English | [简体中文](./README_zh-CN.md)
This document briefly shows how to use [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and [Web demos](#dialogue) to conduct inference with InternLM3-Instruct.
You can also learn more about the [chatml format](./chat_format.md) and how to use [LMDeploy for inference and model serving](./lmdeploy.md).
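
If you want a quick look at the prompt string that the tokenizer builds from a list of messages, you can render the chat template without tokenizing. This is a minimal sketch (the example messages are illustrative); see [chat_format.md](./chat_format.md) for the authoritative description of the format:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm3-8b-instruct", trust_remote_code=True)
messages = [
    {"role": "system", "content": "You are an AI assistant whose name is InternLM."},
    {"role": "user", "content": "Hello"},
]
# Render the prompt as plain text instead of token ids to inspect the chat markup.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```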
## Import from Transformers
To load the InternLM3-8B-Instruct model using Transformers, use the following code:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm3-8b-instruct", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it is loaded in float32 and may cause an OOM error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm3-8b-instruct", trust_remote_code=True, torch_dtype=torch.float16)
# (Optional) On low-resource devices, you can load the model in 4-bit or 8-bit via bitsandbytes to further save GPU memory.
# InternLM3-8B in 4-bit takes roughly 8GB of GPU memory.
# pip install -U bitsandbytes
# 8-bit: model = AutoModelForCausalLM.from_pretrained("internlm/internlm3-8b-instruct", device_map="auto", trust_remote_code=True, load_in_8bit=True)
# 4-bit: model = AutoModelForCausalLM.from_pretrained("internlm/internlm3-8b-instruct", device_map="auto", trust_remote_code=True, load_in_4bit=True)
model = model.eval()

messages = [
    {"role": "system", "content": "You are an AI assistant whose name is InternLM."},
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
generated_ids = model.generate(tokenized_chat, max_new_tokens=512)
# Strip the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
]
response = tokenizer.batch_decode(generated_ids)[0]
```
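
For interactive use, you may prefer to stream the reply token by token instead of waiting for `generate` to finish. The following sketch continues from the example above (reusing `model`, `tokenizer`, and `tokenized_chat`) and uses Transformers' `TextStreamer`; the sampling parameters shown are illustrative, not official defaults:

```python
from transformers import TextStreamer

# Prints decoded text to stdout as tokens are generated, skipping the prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    tokenized_chat,
    streamer=streamer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.8,
    top_p=0.8,
)
```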
## Import from ModelScope
To load the InternLM3-8B-Instruct model using ModelScope, use the following code:
```python
import torch
from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm3-8b-instruct')
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it is loaded in float32 and may cause an OOM error.
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
# (Optional) On low-resource devices, you can load the model in 4-bit or 8-bit via bitsandbytes to further save GPU memory.
# InternLM3-8B in 4-bit takes roughly 8GB of GPU memory.
# pip install -U bitsandbytes
# 8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True)
# 4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True)
model = model.eval()

messages = [
    {"role": "system", "content": "You are an AI assistant whose name is InternLM."},
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
generated_ids = model.generate(tokenized_chat, max_new_tokens=512)
# Strip the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
]
response = tokenizer.batch_decode(generated_ids)[0]
```
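
The `load_in_8bit` / `load_in_4bit` flags shown in the comments above are the older shorthand; recent Transformers releases express the same options through a `BitsAndBytesConfig` passed as `quantization_config`. A minimal 4-bit loading sketch (assuming `bitsandbytes` is installed and a CUDA GPU is available; the same call works with the ModelScope `model_dir` in place of the Hub ID) might look like:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in float16 on top of the 4-bit weights
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm3-8b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm3-8b-instruct",
    trust_remote_code=True,
    device_map="auto",
    quantization_config=quant_config,
).eval()
```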
## Dialogue
You can interact with the InternLM3-8B-Instruct model through a frontend interface by running the following code:
```bash
pip install streamlit
pip install "transformers>=4.48"
streamlit run ./chat/web_demo.py
```
The web demo supports switching between different inference modes and comparing their responses.
![demo](https://github.com/user-attachments/assets/4953befa-343f-499d-b289-048d982439f3)