# Evaluation

In this directory, we introduce how you can evaluate your model with GPT-4.

## Evaluation Pipeline

The whole evaluation process undergoes the following three steps:

1. Prepare the questions following the internal data structure in the data format section (described below).
2. Generate answers from different models:
    * Generate answers using GPT-3.5: [`generate_gpt35_answers.py`](generate_gpt35_answers.py).
    * Generate answers using your own models: [`generate_answers.py`](generate_answers.py).
3. Evaluate models using GPT-4: [`evaluate.py`](evaluate.py).

### Generate Answers

#### Generate Answers Using GPT-3.5

You can provide your own OpenAI key to generate answers from GPT-3.5 using [`generate_gpt35_answers.py`](./generate_gpt35_answers.py).

An example script is provided as follows:

```shell
python generate_gpt35_answers.py \
    --dataset "path to the question dataset" \
    --answer_path "path to answer folder" \
    --num_workers 4 \
    --openai_key "your openai key" \
    --max_tokens 512
```

#### Generate Answers Using Your Own Model

You can also generate answers using your own models. The generation process is divided into two stages:

1. Generate answers in batches, optionally across multiple GPUs: [`generate_answers.py`](./generate_answers.py).
2. Merge the answer shards from all devices into a single file: [`merge.py`](./merge.py).

An example script is given as follows:

```shell
device_number=number of your devices
model_name="name of your model"
model_path="path to your model"
dataset="path to the question dataset"
answer_path="path to save the model answers"

torchrun --standalone --nproc_per_node=$device_number generate_answers.py \
    --model 'llama' \
    --strategy ddp \
    --model_path $model_path \
    --model_name $model_name \
    --dataset $dataset \
    --batch_size 8 \
    --max_datasets_size 80 \
    --answer_path $answer_path \
    --max_length 512

python merge.py \
    --model_name $model_name \
    --shards $device_number \
    --answer_path $answer_path

for (( i=0; i<device_number; i++ )); do
    # Remove the per-device answer shards once they have been merged
    # (the shard path layout here is assumed).
    rm -rf $answer_path/$model_name/shard_$i
done
```
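To make step 1 of the pipeline concrete, the snippet below shows one plausible shape for a single question record. This is only an illustration: the field names are assumptions on our part, and the authoritative schema is the internal data structure described in the data format section.

```python
# Illustrative only: one plausible shape for a question record.
# The field names below are assumptions; defer to the data format section.
question_record = {
    "id": 1,                       # unique id, used to align answers with questions
    "category": "brainstorming",   # category label used during grading
    "instruction": "Suggest some fun weekend activities.",
    "input": "",                   # optional extra context for the instruction
}
```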
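For intuition, here is a minimal sketch of the core call that a GPT-3.5 answer-generation script such as [`generate_gpt35_answers.py`](./generate_gpt35_answers.py) wraps. It is not the script itself: the file paths and record fields are assumptions, and the real script adds worker parallelism and error handling.

```python
# Minimal sketch of generating answers with GPT-3.5 via the OpenAI API
# (openai<1.0 style). File paths and record fields are assumptions.
import json

import openai

openai.api_key = "your openai key"  # passed via --openai_key in the real script

def answer_question(question: str, max_tokens: int = 512) -> str:
    # Ask gpt-3.5-turbo to answer a single question.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        max_tokens=max_tokens,
    )
    return response["choices"][0]["message"]["content"]

with open("questions.json", encoding="utf-8") as f:  # hypothetical dataset path
    questions = json.load(f)

answers = [
    {"id": q["id"], "answer": answer_question(q["instruction"])} for q in questions
]

with open("answers.json", "w", encoding="utf-8") as f:  # hypothetical output path
    json.dump(answers, f, ensure_ascii=False, indent=2)
```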
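The merge stage can likewise be pictured as concatenating the per-device shard files into one answer file. The sketch below assumes a `shard_<i>.json` naming scheme and an `id` field per record; the actual layout used by [`merge.py`](./merge.py) may differ.

```python
# Minimal sketch of merging per-device answer shards into a single file.
# The shard naming scheme and the "id" field are assumptions.
import json
import os

def merge_shards(answer_path: str, model_name: str, shards: int) -> None:
    merged = []
    for i in range(shards):
        shard_file = os.path.join(answer_path, model_name, f"shard_{i}.json")
        with open(shard_file, encoding="utf-8") as f:
            merged.extend(json.load(f))
    # Restore the original question order before writing the merged file.
    merged.sort(key=lambda record: record["id"])
    out_file = os.path.join(answer_path, f"{model_name}.json")
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)

merge_shards("path to save the model answers", "name of your model", shards=4)
```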