## Model Performance Evaluation (`evaluate.py`)

This script evaluates the model on a test set. It supports two modes:

-   **`pair`**: Calculates pairwise accuracy on pairs of images generated for the same prompt.
-   **`ranking`**: Calculates ranking accuracy on prompts that have several ranked generations.

**Pair-wise Sample**

For simplicity, the image at `path1` is always assumed to be better than the image at `path2`.

```json
[
    {
        "prompt": ".....",
        "path1": ".....",
        "path2": "....."
    },
    {
        "prompt": ".....",
        "path1": ".....",
        "path2": "....."
    },
  ...
]
```
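
As a rough illustration of what pairwise accuracy means for this format, the sketch below counts the pairs in which the `path1` image receives the higher score. It is not the code in `evaluate.py`: `pairwise_accuracy`, `toy_score`, and the file names are hypothetical, and a real run would score images with the reward model instead of the toy function.

```python
from typing import Callable, Dict, List


def pairwise_accuracy(
    pairs: List[Dict[str, str]],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the `path1` image scores higher than the `path2` image."""
    if not pairs:
        raise ValueError("no pairs to evaluate")
    correct = sum(
        1
        for pair in pairs
        if score_fn(pair["prompt"], pair["path1"]) > score_fn(pair["prompt"], pair["path2"])
    )
    return correct / len(pairs)


def toy_score(prompt: str, image_path: str) -> float:
    """Hypothetical stand-in for a reward model: longer file names score higher."""
    return float(len(image_path))


pairs = [
    {"prompt": "a cat", "path1": "good_cat.png", "path2": "cat.png"},
    {"prompt": "a dog", "path1": "dog.png", "path2": "blurry_dog.png"},
]
print(pairwise_accuracy(pairs, toy_score))  # 0.5: only the first pair is ordered correctly
```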

**Rank-wise Sample**

```json
[
    {
        "id": "005658-0040",
        "prompt": ".....",
        "generations": [
            "path to image1",
            "path to image2",
            "path to image3",
            "path to image4"
        ],
        "ranking": [
            1,
            2,
            5,
            3
        ]
    },
  ...
]
```
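
The README does not define ranking accuracy, so the sketch below shows only one plausible reading: the fraction of image pairs within an entry whose order under the model scores agrees with the human `ranking`. It assumes that a smaller rank number means a better image, and `concordant_pair_fraction` and the scores are hypothetical; the metric in `evaluate.py` may differ.

```python
from itertools import combinations
from typing import List


def concordant_pair_fraction(ranking: List[int], scores: List[float]) -> float:
    """Share of image pairs ordered the same way by the human ranking and the scores.

    Assumption: a smaller rank number means a better image.
    """
    if len(ranking) != len(scores):
        raise ValueError("ranking and scores must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(ranking)), 2):
        if ranking[i] == ranking[j]:
            continue  # skip pairs the annotators ranked equally
        total += 1
        human_prefers_i = ranking[i] < ranking[j]
        # Equal scores count as the model preferring image j.
        model_prefers_i = scores[i] > scores[j]
        if human_prefers_i == model_prefers_i:
            agree += 1
    if total == 0:
        raise ValueError("no comparable pairs")
    return agree / total


ranking = [1, 2, 5, 3]          # ranks from the sample entry above
scores = [0.9, 0.7, 0.1, 0.8]   # hypothetical model scores, one per generation
fraction = concordant_pair_fraction(ranking, scores)
print(round(fraction, 3))  # 0.833: five of the six pairs agree
```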

### Usage

```bash
python evaluate/evaluate.py \
  --test_json /path/to/your/test_data.json \
  --config_path config/HPSv3_7B.yaml \
  --checkpoint_path checkpoints/HPSv3_7B/model.pth \
  --mode pair \
  --batch_size 8 \
  --num_processes 8
```

**Arguments:**

-   `--test_json`: (Required) Path to the JSON file containing evaluation data.
-   `--config_path`: (Required) Path to the model's configuration file.
-   `--checkpoint_path`: (Required) Path to the model checkpoint.
-   `--mode`: The evaluation mode. Can be `pair` or `ranking`. (Default: `pair`)
-   `--batch_size`: Batch size for inference. (Default: 8)
-   `--num_processes`: Number of parallel processes to use. (Default: 8)
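
Before starting a long evaluation run, it can help to check that the file passed to `--test_json` matches the selected mode. The sketch below is a hypothetical helper, not part of `evaluate.py`; the required keys are taken from the two sample formats above, and the length check for `ranking` entries is an assumption.

```python
import json
from typing import Any, List

# Keys taken from the pair-wise and rank-wise samples in this README.
REQUIRED_KEYS = {
    "pair": ("prompt", "path1", "path2"),
    "ranking": ("prompt", "generations", "ranking"),
}


def find_format_problems(entries: List[Any], mode: str) -> List[str]:
    """Return a human-readable problem description for each malformed entry."""
    if mode not in REQUIRED_KEYS:
        raise ValueError(f"unknown mode: {mode}")
    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected an object")
            continue
        missing = [key for key in REQUIRED_KEYS[mode] if key not in entry]
        if missing:
            problems.append(f"entry {index}: missing keys {missing}")
            continue
        # Assumption: every generation has exactly one rank.
        if mode == "ranking" and len(entry["generations"]) != len(entry["ranking"]):
            problems.append(f"entry {index}: generations and ranking differ in length")
    return problems


sample = json.loads('[{"prompt": "a cat", "path1": "a.png"}]')
problems = find_format_problems(sample, "pair")
print(problems)  # ["entry 0: missing keys ['path2']"]
```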

---

## Reward Benchmarking (`benchmark.py`)

This script runs inference with a reward model over one or more folders of images. It computes a reward score for each image from its text prompt, which is expected in a `.txt` file with the same base name as the image. It then outputs statistics (mean, std, min, max) for each folder and saves the detailed results to a JSON file.

It supports multiple reward models through the `--model_type` argument.
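
The folder layout and per-folder statistics described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `benchmark.py`: the helper names, the accepted image extensions, and the use of the population standard deviation are all hypothetical choices.

```python
import statistics
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

# Assumed set of image extensions; the real script may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def collect_image_prompt_pairs(folder: Path) -> List[Tuple[Path, str]]:
    """Pair each image with the prompt in the `.txt` file that shares its base name."""
    pairs = []
    for image_path in sorted(folder.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        prompt_path = image_path.with_suffix(".txt")
        if prompt_path.is_file():  # images without a prompt file are skipped
            pairs.append((image_path, prompt_path.read_text(encoding="utf-8").strip()))
    return pairs


def summarize(scores: List[float]) -> Dict[str, float]:
    """Per-folder statistics; population std is an assumption."""
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Small demo on a temporary folder with two images, one of which has a prompt.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "0001.png").write_bytes(b"")
    (folder / "0001.txt").write_text("a red bicycle\n", encoding="utf-8")
    (folder / "0002.png").write_bytes(b"")
    found = [(path.name, prompt) for path, prompt in collect_image_prompt_pairs(folder)]

stats = summarize([1.0, 2.0, 3.0])  # hypothetical reward scores
print(found)
print(stats)
```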

### Usage

The script is configured through command-line arguments (parsed with `argparse`). For example:

```bash
python evaluate/benchmark.py \
  --config_path config/HPSv3_7B.yaml \
  --checkpoint_path checkpoints/HPSv3_7B/model.pth \
  --model_type hpsv3 \
  --image_folders /path/to/images/folder1 /path/to/images/folder2 \
  --output_path ./benchmark_results.json \
  --batch_size 16 \
  --num_processes 8
```

**Arguments:**

-   `--config_path`: (Required) Path to the model's configuration file.
-   `--checkpoint_path`: (Required) Path to the model checkpoint.
-   `--model_type`: The reward model to use. Choices: `hpsv3`, `hpsv2`, `imagereward`. (Default: `hpsv3`)
-   `--image_folders`: (Required) One or more paths to folders containing the images to benchmark.
-   `--output_path`: (Required) Path to save the output JSON file with results.
-   `--batch_size`: Batch size for processing. (Default: 16)
-   `--num_processes`: Number of parallel processes to use. (Default: 8)
-   `--num_machines`: For distributed inference, the total number of machines. (Default: 1)
-   `--machine_id`: For distributed inference, the ID of the current machine. (Default: 0)
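
One common way to use a machine count and a machine ID is to give each machine a disjoint slice of the work. The sketch below shows a strided split; it is an assumption about how `--num_machines` and `--machine_id` might be applied, not a description of `benchmark.py`, and `shard_for_machine` is a hypothetical name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_machine(items: Sequence[T], num_machines: int, machine_id: int) -> List[T]:
    """Return every `num_machines`-th item, starting at index `machine_id`."""
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    if not 0 <= machine_id < num_machines:
        raise ValueError("machine_id must be in [0, num_machines)")
    return list(items[machine_id::num_machines])


images = [f"img_{i}.png" for i in range(7)]
shards = [shard_for_machine(images, 3, machine_id) for machine_id in range(3)]
for machine_id, shard in enumerate(shards):
    print(machine_id, shard)
```

With this split, every image is assigned to exactly one machine and shard sizes differ by at most one.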