# Evaluation

In this directory we introduce how you can evaluate your model with GPT-4.

## Evaluation Pipeline

The whole evaluation process consists of two steps.

1. Generate answers from different models: use `generate_gpt35_answers.py` to generate answers from GPT-3.5 and use `generate_answers.py` to generate answers from your own models.
2. Evaluate models using GPT-4: use `evaluate.py` to evaluate model answers with GPT-4.

### Generate Answers

To generate answers, you should first format [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl) `question.jsonl` file. We do this formatting because we would like to add more questions later, and the pipeline for generating new questions may follow that of Self-Instruct and Stanford Alpaca. An example script is given as follows.

```shell
python format_questions.py \
    --questions_path "path to FastChat's question.jsonl" \
    --save_path "path to the formatted file"
```

In `generate_answers.py`, the model generates answers in batches, and different GPU processes run inference on different shards of the given questions. Once all GPU processes have generated their answers, `merge.py` merges the answer shards and outputs a single answer file. Finally, the script removes the answer shards. An example script is given as follows.

```shell
device_number=number of your devices
model_name="name of your model"
model_path="path to your model"
dataset="path to the question dataset"
answer_path="path to save the model answers"

torchrun --standalone --nproc_per_node=$device_number generate_answers.py \
    --model 'llama' \
    --strategy ddp \
    --model_path $model_path \
    --model_name $model_name \
    --dataset $dataset \
    --batch_size 8 \
    --max_datasets_size 80 \
    --answer_path $answer_path \
    --max_length 512

python merge.py \
    --model_name $model_name \
    --shards $device_number \
    --answer_path $answer_path

# Remove the per-rank answer shards after merging.
# NOTE: the shard file name below is an assumption; adjust it to match the shard
# files actually written by generate_answers.py.
for (( i=0; i<device_number; i++ )); do
    rm -rf "${answer_path}/${model_name}_answers_rank${i}.json"
done
```
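For reference, the sketch below illustrates the kind of conversion `format_questions.py` performs on FastChat's `question.jsonl`. It is only a minimal sketch: the input fields (`question_id`, `text`, `category`) follow FastChat's published file, while the Alpaca-style output fields (`instruction`, `input`, `output`) are an assumption here, not the exact schema produced by the script.

```python
# A minimal sketch of the question-formatting step described above.
# Assumptions: question.jsonl stores one JSON object per line with
# "question_id", "text" and "category" fields, and the formatted file uses an
# Alpaca-style layout. The real format_questions.py may differ.
import json


def format_questions(questions_path: str, save_path: str) -> None:
    formatted = []
    with open(questions_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            formatted.append({
                "id": record["question_id"],
                "category": record["category"],
                "instruction": record["text"],  # the question becomes the instruction
                "input": "",
                "output": "",
            })
    with open(save_path, "w", encoding="utf-8") as f:
        json.dump(formatted, f, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    format_questions("question.jsonl", "questions_formatted.json")
```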
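Similarly, the merge step can be pictured as follows. This is a rough sketch, assuming each rank saves its answers as a JSON list named `{model_name}_answers_rank{i}.json` under `answer_path`; the real `merge.py` may use a different naming scheme or output format.

```python
# A minimal sketch of merging per-rank answer shards into a single answer file.
# Assumption: each shard is a JSON list saved as "<model_name>_answers_rank<i>.json".
# This illustrates the idea only, not the actual merge.py implementation.
import json
import os


def merge_answers(model_name: str, shards: int, answer_path: str) -> str:
    merged = []
    for rank in range(shards):
        shard_file = os.path.join(answer_path, f"{model_name}_answers_rank{rank}.json")
        with open(shard_file, "r", encoding="utf-8") as f:
            merged.extend(json.load(f))  # collect this rank's answers
    output_file = os.path.join(answer_path, f"{model_name}_answers.json")
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=4)
    return output_file


if __name__ == "__main__":
    merge_answers("my_model", shards=8, answer_path="./answers")
```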