# Inference on GSM8K with PAL in InternLM-Chat
English | [简体中文](pal_inference_zh-CN.md)
Use the [PAL](https://github.com/reasoning-machines/pal) paradigm for inference on the [GSM8K](https://huggingface.co/datasets/gsm8k) dataset: the model writes Python code for each problem and executes it through the Python interpreter to obtain the answer. The usage is as follows:
```bash
python pal_inference.py \
    <model> \
    <out_dir> \
    [--dataset <dataset>] \
    [--max_length <length>] \
    [--top_p <threshold>] \
    [--eoh <end token>] \
    [--eoa <end token>] \
    [--eos <end token>] \
    [--temperature <temp>] \
    [--time_out <time>] \
    [--verbose, -v] \
    [--append, -a]
```
Parameter explanation:
| Parameter | Description |
| :-----------------------: | :----------------------------------------------------------------------: |
| \<model> | Path to the model used for inference |
| \<out_dir> | Generated code will be saved in the specified output folder |
| --dataset \<dataset> | Name of the dataset used for code generation (defaults to gsm8k) |
| --max_length \<length> | Maximum input token length for the model (defaults to 2048) |
| --top_p \<threshold> | Probability threshold for the sum of candidate tokens (defaults to 0.8) |
| --eoh \<end token> | User input end identifier (defaults to "") |
| --eoa \<end token> | Model output end identifier (defaults to "") |
| --eos \<end token> | System input end identifier (defaults to "") |
| --temperature, -t \<temp> | Sampling temperature during generation (defaults to 1.0) |
| --time_out \<time> | Maximum time (in seconds) for executing generated code (defaults to 100; see the sketch below) |
| --verbose, -v | Print code error messages (optional) |
| --append, -a | Append output to historical results (optional) |
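The `--time_out` and `--verbose` options govern how the generated code is executed: each code block runs under a time limit, and any error it raises can be printed. The following is a minimal sketch of such an execution step, assuming a child process is used to enforce the time limit; the helper names are illustrative and do not reflect the script's actual internals.
```python
# Illustrative sketch (not the script's actual internals): extract the
# generated ```python``` block, run it in a child process, and enforce a
# time limit in the spirit of --time_out; surface errors as --verbose would.
import multiprocessing
import re
import traceback


def _run(code, queue):
    """Exec the generated code, call solution(), and report the result or error."""
    try:
        namespace = {}
        exec(code, namespace)
        queue.put(("ok", namespace["solution"]()))
    except Exception:
        queue.put(("error", traceback.format_exc()))


def execute_with_timeout(generation, time_out=100.0, verbose=False):
    """Run the fenced Python block found in `generation`; return its result or None."""
    match = re.search(r"```python\n(.*?)```", generation, re.DOTALL)
    if match is None:
        return None
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run, args=(match.group(1), queue))
    proc.start()
    proc.join(time_out)
    if proc.is_alive():  # generated code exceeded the time limit
        proc.terminate()
        return None
    if queue.empty():
        return None
    status, payload = queue.get()
    if status == "error":
        if verbose:
            print(payload)  # print the code's error message, like --verbose
        return None
    return payload
```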
A simple usage example is as follows:
```bash
python tools/pal_inference.py internlm/internlm-chat-7b ./output -v
```
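For context, the core PAL step is to ask the chat model for a Python function rather than a final number, and then run that function to obtain the answer. Below is a minimal sketch of the generation side, assuming the standard `transformers` interface of `internlm/internlm-chat-7b`; the prompt wording and sampling settings are illustrative rather than the script's actual implementation.
```python
# Illustrative sketch of the generation step, not the actual pal_inference.py code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "internlm/internlm-chat-7b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

question = (
    "Janet's ducks lay 16 eggs per day. She eats three for breakfast and bakes "
    "muffins with four. She sells the rest at $2 per egg. How much does she make daily?"
)
# Ask for a Python function rather than a final number -- this is the PAL idea.
prompt = (
    "Write a Python function named `solution()` that computes and returns the "
    f"numeric answer to this problem:\n{question}"
)

# `chat` is the conversational helper bundled with the InternLM-Chat weights
# (loaded via trust_remote_code); it returns the reply and the updated history.
response, history = model.chat(tokenizer, prompt, history=[])
print(response)  # expected to contain a ```python ... ``` block defining solution()
```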
Each line in the output file contains the input question, the correct answer, the answer obtained by executing the generated code, the score, and the Python code block generated by the model:
````json
{
    "question": "Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
    "target": 18.0,
    "answer": 18.0,
    "score": 1,
    "generation": ["```python\ndef solution():\n    eggs_per_day = 16\n    eggs_per_breakfast = 3\n    eggs_per_muffin = 4\n    eggs_used = eggs_per_day - eggs_per_breakfast - eggs_per_muffin\n    eggs_sold = eggs_used\n    price_per_egg = 2\n    eggs_made = eggs_sold * price_per_egg\n    result = eggs_made\n    return result\n```"]
}
````
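Because each line of the output file is a standalone JSON record, the overall accuracy can be recovered by averaging the `score` field. A small sketch follows; the output path used here is an assumption for illustration.
```python
# Average the `score` field over a JSON-lines output file.
# The path below is an assumption for illustration.
import json

scores = []
with open("./output/gsm8k.json", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            scores.append(json.loads(line)["score"])

print(f"Samples: {len(scores)}, accuracy: {sum(scores) / len(scores):.2%}")
```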
The performance of InternLM-Chat-7B on the GSM8K dataset with and without the tool is shown in the table below.
| Method | **InternLM-Chat-7B** |
| -------- | -------------------- |
| w/o tool | 34.5 |
| w/ tool | 39.2 |