diff --git a/README_zh-CN.md b/README_zh-CN.md
index 8dcd7c4..42700e2 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -73,7 +73,7 @@ InternLM2.5 系列模型在本仓库正式发布,具有如下特性:
 目前 InternLM 2.5 系列只发布了 7B 大小的模型,我们接下来将开源 1.8B 和 20B 的版本。7B 为轻量级的研究和应用提供了一个轻便但性能不俗的模型,20B 模型的综合性能更为强劲,可以有效支持更加复杂的实用场景。每个规格不同模型关系如下所示:
 
 1. **InternLM2.5**:经历了大规模预训练的基座模型,是我们推荐的在大部分应用中考虑选用的优秀基座。
-2. **InternLM2.5-Chat**: 对话模型,在 InternLM2.5 基座上经历了有监督微调和 online RLHF。InternLM2。5-Chat 面向对话交互进行了优化,具有较好的指令遵循、共情聊天和调用工具等的能力,是我们推荐直接用于下游应用的模型。
+2. **InternLM2.5-Chat**: 对话模型,在 InternLM2.5 基座上经历了有监督微调和 online RLHF。InternLM2.5-Chat 面向对话交互进行了优化,具有较好的指令遵循、共情聊天和调用工具等的能力,是我们推荐直接用于下游应用的模型。
 3. **InternLM2.5-Chat-1M**: InternLM2.5-Chat-1M 支持一百万字超长上下文,并具有和 InternLM2.5-Chat 相当的综合性能表现。
 
 **局限性:** 尽管在训练过程中我们非常注重模型的安全性,尽力促使模型输出符合伦理和法律要求的文本,但受限于模型大小以及概率生成范式,模型可能会产生各种不符合预期的输出,例如回复内容包含偏见、歧视等有害内容,请勿传播这些内容。由于传播不良信息导致的任何后果,本项目不承担责任。
diff --git a/agent/README.md b/agent/README.md
index d135b97..73ddc89 100644
--- a/agent/README.md
+++ b/agent/README.md
@@ -4,77 +4,80 @@
 English | [简体中文](README_zh-CN.md)
 
 ## Introduction
 
-InternLM-Chat-7B v1.1 has been released as the first open-source model with code interpreter capabilities, supporting external tools such as Python code interpreter and search engine.
+InternLM2.5-Chat, open sourced on June 30, 2024, further enhances its capabilities in code interpreter and general tool utilization. With improved and more generalized instruction understanding, tool selection, and reflection abilities, InternLM2.5-Chat can more reliably support complex agents and multi-step tool calling for more intricate tasks. When combined with a code interpreter, InternLM2.5-Chat obtains results comparable to GPT-4 on MATH. Leveraging strong foundational capabilities in mathematics and tools, InternLM2.5-Chat provides practical data analysis capabilities.
 
-InternLM2-Chat, open sourced on January 17, 2024, further enhances its capabilities in code interpreter and general tool utilization. With improved and more generalized instruction understanding, tool selection, and reflection abilities, InternLM2-Chat can more reliably support complex agents and multi-step tool calling for more intricate tasks. InternLM2-Chat exhibits decent computational and reasoning abilities even without external tools, surpassing ChatGPT in mathematical performance. When combined with a code interpreter, InternLM2-Chat-20B obtains comparable results to GPT-4 on GSM8K and MATH. Leveraging strong foundational capabilities in mathematics and tools, InternLM2-Chat provides practical data analysis capabilities.
+The results of InternLM2.5-Chat with the math code interpreter are shown below:
 
-The results of InternLM2-Chat-20B on math code interpreter is as below:
-
-|                                          | GSM8K | MATH |
-| :--------------------------------------: | :---: | :---: |
-|            InternLM2-Chat-20B            | 79.6  | 32.5  |
-| InternLM2-Chat-20B with Code Interpreter | 84.5  | 51.2  |
-|            ChatGPT (GPT-3.5)             | 78.2  | 28.0  |
-|                  GPT-4                   | 91.4  | 45.8  |
+| Models              | Tool-Integrated | MATH |
+| :-----------------: | :-------------: | :--: |
+| InternLM2-Chat-7B   | w/              | 45.1 |
+| InternLM2-Chat-20B  | w/              | 51.2 |
+| InternLM2.5-7B-Chat | w/              | 63.0 |
+| gpt-4-0125-preview  | w/o             | 64.2 |
 
 ## Usages
 
-We offer an example using [Lagent](lagent.md) to build agents based on InternLM2-Chat to call the code interpreter. Firstly install the extra dependencies:
+We offer an example using [Lagent](lagent.md) to build agents based on InternLM2.5-Chat to call the code interpreter. First, install the extra dependencies:
 
 ```bash
 pip install -r requirements.txt
 ```
 
-Run the following script to perform inference and evaluation on GSM8K and MATH test.
+Run the following script to perform inference and evaluation on the MATH test set.
 
 ```bash
 python streaming_inference.py \
   --backend=lmdeploy \  # For HuggingFace models: hf
-  --model_path=internlm/internlm2-chat-20b \
-  --tp=2 \
+  --model_path=internlm/internlm2_5-7b-chat \
+  --tp=1 \
   --temperature=1.0 \
   --top_k=1 \
-  --dataset=math \
   --output_path=math_lmdeploy.jsonl \
   --do_eval
 ```
 
 `output_path` is a jsonl format file to save the inference results. Each line is like
 
-```json
+````json
 {
-  "idx": 41,
-  "query": "The point $(a, b)$ lies on the line with the equation $3x + 2y = 12.$ When $a = 4$, what is the value of $b$?",
-  "gt": "0",
-  "pred": ["0"],
+  "problem": "The midpoint of the line segment between $(x,y)$ and $(2,4)$ is $(-7,0)$. Find $(x,y)$.",
+  "level": "Level 3",
+  "type": "Algebra",
+  "solution": "Applying the midpoint formula gives $$\\left(\\frac{2+x}{2},\\frac{4+y}{2}\\right)=(-7,0).$$Solving $\\frac{2+x}{2}=-7$ for $x$ and $\\frac{4+y}{2} = 0$ for $y$ we find $(x,y)$ to be $\\boxed{(-16,-4)}$.",
+  "idx": 306,
+  "query": "The midpoint of the line segment between $(x,y)$ and $(2,4)$ is $(-7,0)$. Find $(x,y)$.",
+  "gt": "(-16,-4)",
+  "pred": [
+    "(-16,-4)"
+  ],
   "steps": [
     {
       "role": "language",
-      "content": ""
+      "content": "To find the coordinates of the point $(x,y)$, we can use the midpoint formula. The midpoint $M$ of a line segment with endpoints $(x_1, y_1)$ and $(x_2, y_2)$ is given by:\n\n\\[M = \\left(\\frac{x_1 + x_2}{2}, \\frac{y_1 + y_2}{2}\\right)\\]\n\nGiven that the midpoint is $(-7,0)$ and one endpoint is $(2,4)$, we can set up equations to solve for $x$ and $y$.\n\nLet's start by setting up the equations based on the midpoint formula:\n\n\\[\\frac{x + 2}{2} = -7\\]\n\\[\\frac{y + 4}{2} = 0\\]\n\nNow, let's solve these equations using Python to find the values of $x$ and $y$."
     },
     {
       "role": "tool",
       "content": {
-        "name": "IPythonInteractive",
+        "name": "AsyncIPythonInteractiveManager",
         "parameters": {
-          "command": "```python\nfrom sympy import symbols, solve\n\ndef find_b():\n    x, y = symbols('x y')\n    equation = 3*x + 2*y - 12\n    b = solve(equation.subs(x, 4), y)[0]\n\n    return b\n\nresult = find_b()\nprint(result)\n```"
+          "command": "```python\nfrom sympy import symbols, Eq, solve\n\n# Define symbols\nx, y = symbols('x y')\n\n# Define equations based on the midpoint formula\neq1 = Eq((x + 2)/2, -7)\neq2 = Eq((y + 4)/2, 0)\n\n# Solve equations\nsolution = solve((eq1, eq2), (x, y))\n\nsolution\n```"
         }
       },
       "name": "interpreter"
     },
     {
       "role": "environment",
-      "content": "0",
+      "content": "{x: -16, y: -4}",
       "name": "interpreter"
     },
     {
       "role": "language",
-      "content": "The value of $b$ when $a = 4$ is $\\boxed{0}$."
+      "content": "After solving the equations, we find that the coordinates of the point $(x,y)$ are $(-16, -4)$. Therefore, the solution to the problem is:\n\n\\[\\boxed{(-16, -4)}\\]"
     }
   ],
   "error": null
 }
-```
+````
 
 Once it is prepared, just skip the inference stage as follows.
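The command that "as follows" refers to lies beyond this hunk's context. Separately, once `math_lmdeploy.jsonl` exists, the saved records can be sanity-checked without rerunning anything. A minimal sketch, assuming only the `gt` and `pred` fields shown in the sample record above (plain string matching is a rough proxy here; `--do_eval` performs normalized, symbolic comparison):

```python
import json

# Count records whose ground-truth string appears verbatim among the
# predictions. This understates the real score, since the evaluation in
# streaming_inference.py normalizes answers before comparing them.
correct = total = 0
with open('math_lmdeploy.jsonl') as f:
    for line in f:
        item = json.loads(line)
        total += 1
        correct += item['gt'] in item['pred']

print(f'{correct}/{total} exact-match predictions')
```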
diff --git a/agent/README_zh-CN.md b/agent/README_zh-CN.md
index 10981e3..17d531f 100644
--- a/agent/README_zh-CN.md
+++ b/agent/README_zh-CN.md
@@ -4,77 +4,80 @@
 ## 简介
 
-InternLM-Chat-7B v1.1 是首个具有代码解释能力的开源对话模型,支持 Python 解释器和搜索引擎等外部工具。
+InternLM2.5-Chat 在代码解释和通用工具调用方面的能力得到进一步提升。基于更强和更具有泛化性的指令理解、工具筛选与结果反思等能力,新版模型可以更可靠地支持复杂智能体的搭建,支持对工具进行有效的多轮调用,完成较复杂的任务。在配合代码解释器(code-interpreter)的条件下,InternLM2.5-Chat 在 MATH 上可以达到和 GPT-4 相仿的水平。基于在数理和工具方面强大的基础能力,InternLM2.5-Chat 提供了实用的数据分析能力。
 
-InternLM2-Chat 进一步提高了它在代码解释和通用工具调用方面的能力。基于更强和更具有泛化性的指令理解、工具筛选与结果反思等能力,新版模型可以更可靠地支持复杂智能体的搭建,支持对工具进行有效的多轮调用,完成较复杂的任务。模型在不使用外部工具的条件下已具备不错的计算能力和推理能力,数理表现超过 ChatGPT;在配合代码解释器(code-interpreter)的条件下,InternLM2-Chat-20B 在 GSM8K 和 MATH 上可以达到和 GPT-4 相仿的水平。基于在数理和工具方面强大的基础能力,InternLM2-Chat 提供了实用的数据分析能力。
+以下是 InternLM2.5-Chat 在数学代码解释器上的结果。
 
-以下是 InternLM2-Chat-20B 在数学代码解释器上的结果。
-
-|                                     | GSM8K | MATH |
-| :---------------------------------: | :---: | :---: |
-| InternLM2-Chat-20B 单纯依靠内在能力 | 79.6  | 32.5  |
-| InternLM2-Chat-20B 配合代码解释器   | 84.5  | 51.2  |
-| ChatGPT (GPT-3.5)                   | 78.2  | 28.0  |
-| GPT-4                               | 91.4  | 45.8  |
+| 模型                | 是否集成工具 | MATH |
+| :-----------------: | :----------: | :--: |
+| InternLM2-Chat-7B   | w/           | 45.1 |
+| InternLM2-Chat-20B  | w/           | 51.2 |
+| InternLM2.5-7B-Chat | w/           | 63.0 |
+| gpt-4-0125-preview  | w/o          | 64.2 |
 
 ## 体验
 
-我们提供了使用 [Lagent](lagent_zh-CN.md) 来基于 InternLM2-Chat 构建智能体调用代码解释器的例子。首先安装额外依赖:
+我们提供了使用 [Lagent](lagent_zh-CN.md) 来基于 InternLM2.5-Chat 构建智能体调用代码解释器的例子。首先安装额外依赖:
 
 ```bash
 pip install -r requirements.txt
 ```
 
-运行以下脚本在 GSM8K 和 MATH 测试集上进行推理和评估:
+运行以下脚本在 MATH 测试集上进行推理和评估:
 
 ```bash
 python streaming_inference.py \
   --backend=lmdeploy \  # For HuggingFace models: hf
-  --model_path=internlm/internlm2-chat-20b \
-  --tp=2 \
+  --model_path=internlm/internlm2_5-7b-chat \
+  --tp=1 \
   --temperature=1.0 \
   --top_k=1 \
-  --dataset=math \
   --output_path=math_lmdeploy.jsonl \
   --do_eval
 ```
 
 `output_path` 是一个存储推理结果的 jsonl 格式文件,每行形如:
 
-```json
+````json
 {
-  "idx": 41,
-  "query": "The point $(a, b)$ lies on the line with the equation $3x + 2y = 12.$ When $a = 4$, what is the value of $b$?",
-  "gt": "0",
-  "pred": ["0"],
+  "problem": "The midpoint of the line segment between $(x,y)$ and $(2,4)$ is $(-7,0)$. Find $(x,y)$.",
+  "level": "Level 3",
+  "type": "Algebra",
+  "solution": "Applying the midpoint formula gives $$\\left(\\frac{2+x}{2},\\frac{4+y}{2}\\right)=(-7,0).$$Solving $\\frac{2+x}{2}=-7$ for $x$ and $\\frac{4+y}{2} = 0$ for $y$ we find $(x,y)$ to be $\\boxed{(-16,-4)}$.",
+  "idx": 306,
+  "query": "The midpoint of the line segment between $(x,y)$ and $(2,4)$ is $(-7,0)$. Find $(x,y)$.",
+  "gt": "(-16,-4)",
+  "pred": [
+    "(-16,-4)"
+  ],
   "steps": [
     {
       "role": "language",
-      "content": ""
+      "content": "To find the coordinates of the point $(x,y)$, we can use the midpoint formula. The midpoint $M$ of a line segment with endpoints $(x_1, y_1)$ and $(x_2, y_2)$ is given by:\n\n\\[M = \\left(\\frac{x_1 + x_2}{2}, \\frac{y_1 + y_2}{2}\\right)\\]\n\nGiven that the midpoint is $(-7,0)$ and one endpoint is $(2,4)$, we can set up equations to solve for $x$ and $y$.\n\nLet's start by setting up the equations based on the midpoint formula:\n\n\\[\\frac{x + 2}{2} = -7\\]\n\\[\\frac{y + 4}{2} = 0\\]\n\nNow, let's solve these equations using Python to find the values of $x$ and $y$."
     },
     {
       "role": "tool",
       "content": {
-        "name": "IPythonInteractive",
+        "name": "AsyncIPythonInteractiveManager",
         "parameters": {
-          "command": "```python\nfrom sympy import symbols, solve\n\ndef find_b():\n    x, y = symbols('x y')\n    equation = 3*x + 2*y - 12\n    b = solve(equation.subs(x, 4), y)[0]\n\n    return b\n\nresult = find_b()\nprint(result)\n```"
+          "command": "```python\nfrom sympy import symbols, Eq, solve\n\n# Define symbols\nx, y = symbols('x y')\n\n# Define equations based on the midpoint formula\neq1 = Eq((x + 2)/2, -7)\neq2 = Eq((y + 4)/2, 0)\n\n# Solve equations\nsolution = solve((eq1, eq2), (x, y))\n\nsolution\n```"
         }
       },
       "name": "interpreter"
     },
     {
       "role": "environment",
-      "content": "0",
+      "content": "{x: -16, y: -4}",
       "name": "interpreter"
     },
     {
       "role": "language",
-      "content": "The value of $b$ when $a = 4$ is $\\boxed{0}$."
+      "content": "After solving the equations, we find that the coordinates of the point $(x,y)$ are $(-16, -4)$. Therefore, the solution to the problem is:\n\n\\[\\boxed{(-16, -4)}\\]"
     }
   ],
   "error": null
 }
-```
+````
 
 如果已经准备好了该文件,可直接跳过推理阶段进行评估:
diff --git a/agent/lagent.md b/agent/lagent.md
index 09e5c7a..dafe62a 100644
--- a/agent/lagent.md
+++ b/agent/lagent.md
@@ -38,7 +38,7 @@ Then you can chat through the UI shown as below
 
 ![image](https://github.com/InternLM/lagent/assets/24622904/3aebb8b4-07d1-42a2-9da3-46080c556f68)
 
-## Run a ReAct agent with InternLM2-Chat
+## Run a ReAct agent with InternLM2.5-Chat
 
 **NOTE:** If you want to run a HuggingFace model, please run `pip install -e .[all]` first.
diff --git a/agent/lagent_zh-CN.md b/agent/lagent_zh-CN.md
index 141a782..d3501be 100644
--- a/agent/lagent_zh-CN.md
+++ b/agent/lagent_zh-CN.md
@@ -38,7 +38,7 @@ streamlit run examples/react_web_demo.py
 
 ![image](https://github.com/InternLM/lagent/assets/24622904/3aebb8b4-07d1-42a2-9da3-46080c556f68)
 
-## 用 InternLM-Chat 构建一个 ReAct 智能体
+## 用 InternLM2.5-Chat 构建一个 ReAct 智能体
 
 \*\*注意:\*\*如果你想要启动一个 HuggingFace 的模型,请先运行 pip install -e .\[all\]。
 
@@ -49,7 +49,7 @@ from lagent.actions import ActionExecutor, GoogleSearch, PythonInterpreter
 from lagent.llms import HFTransformer
 
 # Initialize the HFTransformer-based Language Model (llm) and provide the model name.
-llm = HFTransformer('internlm/internlm-chat-7b-v1_1')
+llm = HFTransformer('internlm/internlm2_5-7b-chat')
 
 # Initialize the Google Search tool and provide your API key.
 search_tool = GoogleSearch(api_key='Your SERPER_API_KEY')
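For completeness, here is how the pieces above are typically wired into a working agent. This sketch follows the Lagent v0.x examples rather than anything in this patch, so treat the `ReAct` constructor and the shape of its return value as assumptions to verify against the installed Lagent version:

```python
from lagent.actions import ActionExecutor, GoogleSearch, PythonInterpreter
from lagent.agents import ReAct
from lagent.llms import HFTransformer

# Model and tools, initialized exactly as in the snippet above.
llm = HFTransformer('internlm/internlm2_5-7b-chat')
search_tool = GoogleSearch(api_key='Your SERPER_API_KEY')

# Wire the LLM and tools into a ReAct-style agent: the model decides when to
# search and when to run Python, and the executor dispatches the tool calls.
chatbot = ReAct(
    llm=llm,
    action_executor=ActionExecutor(actions=[search_tool, PythonInterpreter()]),
)

agent_return = chatbot.chat('What is the sum of the first 100 positive integers?')
print(agent_return.response)
```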
diff --git a/agent/requirements.txt b/agent/requirements.txt
index 6dbeb86..772f585 100644
--- a/agent/requirements.txt
+++ b/agent/requirements.txt
@@ -1,10 +1,10 @@
-lmdeploy>=0.2.2
+antlr4-python3-runtime==4.11.0
 datasets
-tqdm
+einops
+jsonlines
+lagent @ git+https://github.com/InternLM/lagent@main
+lmdeploy>=0.2.2
 numpy
 pebble
-jsonlines
 sympy==1.12
-antlr4-python3-runtime==4.11.0
-lagent
-einops
+tqdm
diff --git a/agent/streaming_inference.py b/agent/streaming_inference.py
index e97d27f..d091152 100644
--- a/agent/streaming_inference.py
+++ b/agent/streaming_inference.py
@@ -46,13 +46,6 @@ from sympy.parsing.sympy_parser import parse_expr
 from tqdm import tqdm
 
 # --------------------- modify the system prompt as needed ---------------------
-# DEFAULT_PROMPT = (
-#     'Integrate step-by-step reasoning and Python code to solve math problems '
-#     'using the following guidelines:\n'
-#     '- Just write jupyter code to solve the problem without giving your thought;\n'
-#     r"- Present the final result in LaTeX using a '\boxed{{}}' without any "
-#     'units. \n')
-
 DEFAULT_PROMPT = (
     'Integrate step-by-step reasoning and Python code to solve math problems '
     'using the following guidelines:\n'
@@ -64,16 +57,15 @@ DEFAULT_PROMPT = (
 
 
 def parse_args():
     parser = argparse.ArgumentParser(description='Math Code Interpreter')
-    parser.add_argument(
-        '--backend',
-        type=str,
-        default='lmdeploy',
-        help='Which inference framework to use.',
-        choices=['lmdeploy', 'hf'])
+    parser.add_argument('--backend',
+                        type=str,
+                        default='lmdeploy',
+                        help='Which inference framework to use.',
+                        choices=['lmdeploy', 'hf'])
     parser.add_argument(
         '--model_path',
         type=str,
-        default='internlm/internlm2-chat-7b',
+        default='internlm/internlm2_5-7b-chat',
         help='Path or name to the model, could be HuggingFace model specifier.'
     )
@@ -81,21 +73,14 @@
         type=str,
         required=True,
         help='Path to save inference results to, should be a `jsonl` file')
-    parser.add_argument(
-        '--dataset',
-        type=str,
-        default='math',
-        choices=['gsm8k', 'math'],
-        help='Dataset for inference')
-    parser.add_argument(
-        '--batch_size',
-        type=int,
-        default=100,
-        help='Agent inference batch size')
+    parser.add_argument('--batch_size',
+                        type=int,
+                        default=100,
+                        help='Agent inference batch size')
     parser.add_argument(
         '--max_turn',
         type=int,
-        default=3,
+        default=5,
        help=
         'Maximum number of interaction rounds between the agent and environment'
     )
@@ -104,29 +89,27 @@
         type=int,
         default=1,
         help='Number of tensor parallelism. It may be required in LMDelpoy.')
-    parser.add_argument(
-        '--temperature',
-        type=float,
-        default=0.1,
-        help='Temperature in next token prediction')
-    parser.add_argument(
-        '--top_p',
-        type=float,
-        default=0.8,
-        help='Parameter for Top-P Sampling.')
-    parser.add_argument(
-        '--top_k', type=int, default=40, help='Parameter for Top-K Sampling.')
-    parser.add_argument(
-        '--stop_words',
-        type=str,
-        default=['<|action_end|>', '<|im_end|>'],
-        action='append',
-        help='Stop words')
-    parser.add_argument(
-        '--max_new_tokens',
-        type=int,
-        default=512,
-        help='Number of maximum generated tokens.')
+    parser.add_argument('--temperature',
+                        type=float,
+                        default=0.1,
+                        help='Temperature in next token prediction')
+    parser.add_argument('--top_p',
+                        type=float,
+                        default=0.8,
+                        help='Parameter for Top-P Sampling.')
+    parser.add_argument('--top_k',
+                        type=int,
+                        default=40,
+                        help='Parameter for Top-K Sampling.')
+    parser.add_argument('--stop_words',
+                        type=str,
+                        default=['<|action_end|>', '<|im_end|>'],
+                        action='append',
+                        help='Stop words')
+    parser.add_argument('--max_new_tokens',
+                        type=int,
+                        default=512,
+                        help='Number of maximum generated tokens.')
     parser.add_argument(
         '--do_infer',
         default=True,
@@ -138,21 +121,14 @@
     #     action='store_false',
     #     help='Disable the inference.'
     # )
-    parser.add_argument(
-        '--do_eval',
-        default=False,
-        action='store_true',
-        help='Whether to evaluate the inference results.')
-    parser.add_argument(
-        '--overwrite',
-        default=False,
-        action='store_true',
-        help='Whether to overwrite the existing result file')
-    # parser.add_argument(
-    #     '--debug',
-    #     default=False,
-    #     action='store_true',
-    #     help='Only infer the first 50 samples')
+    parser.add_argument('--do_eval',
+                        default=False,
+                        action='store_true',
+                        help='Whether to evaluate the inference results.')
+    parser.add_argument('--overwrite',
+                        default=False,
+                        action='store_true',
+                        help='Whether to overwrite the existing result file')
     return parser.parse_args()
@@ -339,28 +315,41 @@ def last_boxed_only_string(string):
     return retval
 
 
-def extract_answer(pred_str):
-    if 'boxed' not in pred_str:
-        return ''
-    answer = pred_str.split('boxed')[-1]
-    if len(answer) == 0:
-        return ''
-    elif (answer[0] == '{'):
-        stack = 1
-        a = ''
-        for c in answer[1:]:
-            if (c == '{'):
-                stack += 1
-                a += c
-            elif (c == '}'):
-                stack -= 1
-                if (stack == 0): break
-                a += c
-            else:
-                a += c
-    else:
-        a = answer.split('$')[0].strip()
+def extract_answer(pred_str: str, execute: bool = False) -> str:
+    if re.search(r'\\boxed|boxed', pred_str):
+        answer = re.split(r'\\boxed|boxed', pred_str)[-1]
+        if len(answer) == 0:
+            return ''
+        elif (answer[0] == '{'):
+            stack = 1
+            a = ''
+            for c in answer[1:]:
+                if (c == '{'):
+                    stack += 1
+                    a += c
+                elif (c == '}'):
+                    stack -= 1
+                    if (stack == 0): break
+                    a += c
+                else:
+                    a += c
+        else:
+            a = answer.split('$')[0].strip()
+    elif re.search('[Tt]he (final )?answer is:?', pred_str):
+        a = re.split('[Tt]he (final )?answer is:?',
+                     pred_str)[-1].strip().rstrip('.')
+    elif pred_str.startswith('```python') and execute:
+        # fall back to program
+        from lagent import get_tool
+        a = get_tool('IPythonInteractive').exec(pred_str).value or ''
+    else:  # use the last number
+        pred = re.findall(r'-?\d*\.?\d+', pred_str.replace(',', ''))
+        if len(pred) >= 1:
+            a = pred[-1]
+        else:
+            a = ''
+    # multiple lines
     pred = a.split('\n')[0]
     if pred != '' and pred[0] == ':':
         pred = pred[1:]
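A quick feel for the text fallbacks in the rewritten extractor (run from `agent/`, assuming the module imports cleanly; the expected values in the comments follow the branches shown above, though the function's unshown tail may normalize them further):

```python
from streaming_inference import extract_answer

# Boxed branch: walk the braces of the last \boxed{...} occurrence.
print(extract_answer(r'... so we get $\boxed{(-16,-4)}$.'))  # (-16,-4)

# "The answer is" branch: take the text after the phrase.
print(extract_answer('The final answer is: 42.'))  # 42

# Last-number branch: no box and no phrase, so grab the final number.
print(extract_answer('roughly 2.5, perhaps closer to 3.14'))  # 3.14
```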
@@ -501,8 +490,9 @@ def symbolic_equal_process(a, b, output_queue):
 
 def call_with_timeout(func, *args, timeout=1, **kwargs):
     output_queue = multiprocessing.Queue()
     process_args = args + (output_queue, )
-    process = multiprocessing.Process(
-        target=func, args=process_args, kwargs=kwargs)
+    process = multiprocessing.Process(target=func,
+                                      args=process_args,
+                                      kwargs=kwargs)
     process.start()
     process.join(timeout)
@@ -525,65 +515,45 @@ def init_agent(backend: str, max_turn: int, model_path: str, tp: int,
             pipeline_cfg=dict(backend_config=TurbomindEngineConfig(tp=tp)),
             **kwargs)
     elif backend == 'hf':
-        model = HFTransformer(
-            path=model_path, meta_template=INTERNLM2_META, **kwargs)
+        model = HFTransformer(path=model_path,
+                              meta_template=INTERNLM2_META,
+                              **kwargs)
     else:
         raise NotImplementedError
     agent = Internlm2Agent(
         llm=model,
-        protocol=Internlm2Protocol(
-            meta_prompt=None, interpreter_prompt=DEFAULT_PROMPT),
+        protocol=Internlm2Protocol(meta_prompt=None,
+                                   interpreter_prompt=DEFAULT_PROMPT),
         interpreter_executor=ActionExecutor(actions=[
-            IPythonInteractiveManager(
-                max_workers=200,
-                ci_lock=os.path.join(
-                    os.path.dirname(__file__), '.ipython.lock'))
+            IPythonInteractiveManager(max_workers=200,
+                                      ci_lock=os.path.join(
+                                          os.path.dirname(__file__),
+                                          '.ipython.lock'))
         ]),
         max_turn=max_turn)
     return agent
 
 
 def predict(args):
-    if args.dataset == 'gsm8k':
-
-        def process(d, k):
-            d['answer'] = re.sub(r'#### (.+)', r'The answer is \1',
-                                 re.sub(r'<<.*?>>', '',
-                                        d['answer'])).replace('$', '')
-            d['idx'] = k
-            d['query'] = d['question'].replace('$', '')
-            d['gt'] = re.search('The answer is (.+)', d['answer'])[1]
-            d['pred'], d['steps'], d['error'] = [], [], None
-            return d
-
-        dataset = load_dataset(
-            'gsm8k', 'main', split='test').map(process, True)
-
-    elif args.dataset == 'math':
-
-        def process(d, k):
-            d['idx'] = k
-            d['query'] = d['problem']
-            gt = extract_answer(d['solution'])
-            if '\\boxed{90\\text{ square\nunits}}' in d['solution']:
-                gt = '90'
-            elif '$6$ is our answer' in d['solution']:
-                gt = '6'
-            elif gt.startswith('x\\in'):
-                gt = gt[len('x\\in'):]
-            gt = strip_string(gt)
-            d['gt'] = gt
-            d['pred'], d['steps'] = [], []
-            d['error'] = None
-            return d
-
-        dataset = load_dataset(
-            'lighteval/MATH', split='test').map(process, True)
-
-    else:
-        raise NotImplementedError
+    def process(d, k):
+        d['idx'] = k
+        d['query'] = d['problem']
+        gt = extract_answer(d['solution'])
+        if '\\boxed{90\\text{ square\nunits}}' in d['solution']:
+            gt = '90'
+        elif '$6$ is our answer' in d['solution']:
+            gt = '6'
+        elif gt.startswith('x\\in'):
+            gt = gt[len('x\\in'):]
+        gt = strip_string(gt)
+        d['gt'] = gt
+        d['pred'], d['steps'] = [], []
+        d['error'] = None
+        return d
+
+    dataset = load_dataset('lighteval/MATH', split='test').map(process, True)
 
     agent = init_agent(
         backend=args.backend,
         max_turn=args.max_turn,
@@ -601,19 +571,14 @@
         batch = dataset.select(
             range(i * args.batch_size,
                   min((i + 1) * args.batch_size, len(dataset))))
-        # for item in tqdm(
-        #         dataset if not args.debug else dataset.select(range(50))):
         try:
             rets = agent.batch_chat(batch['query'])
             for item, ret in zip(batch, rets):
                 item['steps'] = ret.inner_steps
-
-                lang = [
-                    step for step in item['steps']
-                    if step['role'] == 'language'
-                ]
-                item['pred'].append('😭' if not lang else extract_answer(
-                    lang[-1]['content']) or '😭')
+                last = item['steps'][-1]
+                item['pred'].append(
+                    extract_answer(last['content']) if last['role'] ==
+                    'language' else '😭')
                 f.write(item)
         except Exception as e:
             err = str(traceback.format_exc())
@@ -651,6 +616,7 @@ def evaluate(args):
                     timeout_cnt += 1
             except Exception as error:
                 print(error.__traceback__)
+                scores.append(False)
                 # sys.exit()
             progress_bar.update(1)
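One added line in `evaluate` is worth calling out: `scores.append(False)`. Previously, a sample whose answer check raised an exception was dropped from `scores` entirely, shrinking the denominator and quietly inflating accuracy; now it counts as incorrect. A toy illustration with invented numbers:

```python
# 100 samples: 80 check out correct, 15 check out wrong, 5 raise exceptions.
scores = [True] * 80 + [False] * 15

before_fix = sum(scores) / len(scores)  # 80/95 ~ 0.842 (failures dropped)

scores += [False] * 5                   # with the fix, failures are appended
after_fix = sum(scores) / len(scores)   # 80/100 = 0.800
print(f'{before_fix:.3f} -> {after_fix:.3f}')
```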