# Evaluation

In this directory, we introduce how you can evaluate your model with GPT-4.

## Evaluation Pipeline

The whole evaluation process consists of three steps:

1. Prepare the questions following the internal data structure in the data format section (described below).
2. Generate answers from different models: use `generate_gpt35_answers.py` to generate answers of GPT-3.5 and use `generate_answers.py` to generate answers of your own models.
3. Evaluate models using GPT-4: use `evaluate.py` to evaluate model answers with GPT-4.

### Generate Answers

In `generate_answers.py`, the model generates answers in batches, and different GPU processes run inference on different shards of the given questions. Once all GPU processes have generated their answers, `merge.py` merges the answer shards and outputs a single answer file. Finally, the script removes the answer shards. An example script is given below.

```shell
device_number=number of your devices
model_name="name of your model"
model_path="path to your model"
dataset="path to the question dataset"
answer_path="path to save the model answers"

torchrun --standalone --nproc_per_node=$device_number generate_answers.py \
    --model 'llama' \
    --strategy ddp \
    --model_path $model_path \
    --model_name $model_name \
    --dataset $dataset \
    --batch_size 8 \
    --max_datasets_size 80 \
    --answer_path $answer_path \
    --max_length 512

python merge.py \
    --model_name $model_name \
    --shards $device_number \
    --answer_path $answer_path

# Remove the per-rank answer shards once they have been merged.
# Note: the shard filename pattern below is an assumption; adjust it to match
# the files that generate_answers.py writes on your setup.
for (( i=0; i<device_number; i++ )) do
    rm -rf ${answer_path}/${model_name}_answers_rank${i}.json
done
```
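For reference, the merge step is conceptually just a concatenation of the per-rank answer shards into one file. The sketch below is a minimal illustration of that idea, not the actual `merge.py`; the shard filename pattern (`{model_name}_answers_rank{i}.json`), the merged filename, and the JSON layout are assumptions made for this example.

```python
import argparse
import json
import os


def merge_shards(model_name: str, shards: int, answer_path: str) -> None:
    """Concatenate per-rank answer shards into a single answer file.

    The filenames and JSON structure here are illustrative assumptions;
    consult merge.py for the actual behavior.
    """
    answers = []
    for rank in range(shards):
        shard_file = os.path.join(answer_path, f"{model_name}_answers_rank{rank}.json")
        with open(shard_file, "r", encoding="utf-8") as f:
            # Each shard is assumed to hold a JSON list of answer records.
            answers.extend(json.load(f))

    merged_file = os.path.join(answer_path, f"{model_name}_answers.json")
    with open(merged_file, "w", encoding="utf-8") as f:
        json.dump(answers, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, required=True)
    parser.add_argument("--shards", type=int, required=True)
    parser.add_argument("--answer_path", type=str, required=True)
    args = parser.parse_args()
    merge_shards(args.model_name, args.shards, args.answer_path)
```

Because each rank only writes its own shard, the merge can run as a single plain Python process after the distributed `torchrun` job finishes.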