Model Performance Evaluation (`evaluate.py`)

This script evaluates the model's performance on a test set. It operates in one of two modes:

pair: Calculates pairwise accuracy.
ranking: Calculates ranking accuracy.

Pair-wise Sample

For simplicity, the image at path1 is always assumed to be better than the image at path2.

[
    {
        "prompt": ".....",
        "path1": ".....",
        "path2": "....."
    },
    {
        "prompt": ".....",
        "path1": ".....",
        "path2": "....."
    },
  ...
]
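Because path1 is always the preferred image, pairwise accuracy reduces to counting how often the model scores the path1 image above the path2 image. The sketch below illustrates that idea only; `pairwise_accuracy` is a hypothetical helper, not a function taken from `evaluate.py`, and treating ties as incorrect is an assumption.

```python
def pairwise_accuracy(scores1, scores2):
    """Fraction of pairs where the path1 image outscores the path2 image.

    scores1[i] and scores2[i] are the model's scores for path1 and path2
    of the i-th entry. Ties count as incorrect (an assumption).
    """
    if len(scores1) != len(scores2):
        raise ValueError("score lists must have the same length")
    if not scores1:
        return 0.0
    correct = sum(1 for s1, s2 in zip(scores1, scores2) if s1 > s2)
    return correct / len(scores1)


# Made-up scores for three pairs; the model agrees with the label on two.
print(pairwise_accuracy([0.9, 0.2, 0.7], [0.1, 0.5, 0.3]))
```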

Rank-wise Sample

[
    {
        "id": "005658-0040",
        "prompt": ".....",
        "generations": [
            "path to image1",
            "path to image2",
            "path to image3",
            "path to image4"
        ],
        "ranking": [
            1,
            2,
            5,
            3
        ]
    },
  ...
]
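This README does not define how ranking accuracy is computed. One plausible reading, sketched below, is the fraction of image pairs within an entry whose predicted score order agrees with the annotated ranking. Both that definition and the convention that a smaller rank value means a better image are assumptions, and `ranking_pair_agreement` is a hypothetical name.

```python
from itertools import combinations


def ranking_pair_agreement(ranking, scores):
    """Fraction of image pairs ordered the same way by ranking and scores.

    ranking[i] is the annotated rank of generations[i]; scores[i] is the
    model's score for it. Assumption: a smaller rank value is better.
    """
    if len(ranking) != len(scores):
        raise ValueError("ranking and scores must have the same length")
    # Skip pairs with equal ranks, since they carry no preference.
    pairs = [
        (i, j)
        for i, j in combinations(range(len(ranking)), 2)
        if ranking[i] != ranking[j]
    ]
    if not pairs:
        return 0.0
    agree = 0
    for i, j in pairs:
        better, worse = (i, j) if ranking[i] < ranking[j] else (j, i)
        if scores[better] > scores[worse]:
            agree += 1
    return agree / len(pairs)


# Ranking from the sample entry above, with made-up model scores.
print(ranking_pair_agreement([1, 2, 5, 3], [0.9, 0.8, 0.5, 0.1]))
```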

Usage

python evaluate/evaluate.py \
  --test_json /path/to/your/test_data.json \
  --config_path config/HPSv3_7B.yaml \
  --checkpoint_path checkpoints/HPSv3_7B/model.pth \
  --mode pair \
  --batch_size 8 \
  --num_processes 8

Arguments:

--test_json: (Required) Path to the JSON file containing evaluation data.
--config_path: (Required) Path to the model's configuration file.
--checkpoint_path: (Required) Path to the model checkpoint.
--mode: The evaluation mode. Can be pair or ranking. (Default: pair)
--batch_size: Batch size for inference. (Default: 8)
--num_processes: Number of parallel processes to use. (Default: 8)
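Since --mode decides which JSON layout --test_json must follow, it can help to check the file before starting a long run. The following is a minimal sketch of such a check, assuming the keys shown in the two samples above are the required ones; `validate_test_data` and `REQUIRED_KEYS` are hypothetical and not part of `evaluate.py`.

```python
# Keys each mode is assumed to need, based on the sample files in this README.
REQUIRED_KEYS = {
    "pair": {"prompt", "path1", "path2"},
    "ranking": {"prompt", "generations", "ranking"},
}


def validate_test_data(entries, mode):
    """Return a list of human-readable problems; an empty list means OK."""
    if mode not in REQUIRED_KEYS:
        raise ValueError(f"unknown mode: {mode!r}")
    problems = []
    for idx, entry in enumerate(entries):
        missing = REQUIRED_KEYS[mode] - set(entry)
        if missing:
            problems.append(f"entry {idx}: missing keys {sorted(missing)}")
        elif mode == "ranking" and len(entry["generations"]) != len(entry["ranking"]):
            problems.append(f"entry {idx}: generations and ranking differ in length")
    return problems


# Placeholder entries; in practice they would come from json.load on --test_json.
entries = [
    {"prompt": "a red bicycle", "path1": "a.png", "path2": "b.png"},
    {"prompt": "a snowy street", "path1": "c.png"},
]
for problem in validate_test_data(entries, "pair"):
    print(problem)
```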

Reward Benchmarking (`benchmark.py`)

This script runs inference with a reward model over one or more folders of images. It computes a reward score for each image from its corresponding text prompt (expected in a .txt file with the same base name as the image). It then reports statistics (mean, std, min, max) for each folder and saves the detailed results to a JSON file.

It supports multiple reward models through the --model_type argument.
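The folder layout and the per-folder statistics described above can be sketched as follows. This is an illustration under assumptions, not the code in `benchmark.py`: the helper names, the set of image extensions, and the choice of population standard deviation are all hypothetical.

```python
import statistics
from pathlib import Path

# Assumption: which file extensions count as images.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def collect_image_prompt_pairs(folder):
    """Pair each image with the prompt stored in the same-named .txt file."""
    pairs = []
    for image_path in sorted(Path(folder).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        prompt_path = image_path.with_suffix(".txt")
        if prompt_path.is_file():  # images without a prompt file are skipped
            prompt = prompt_path.read_text(encoding="utf-8").strip()
            pairs.append((image_path, prompt))
    return pairs


def summarize(scores):
    """Per-folder statistics; population std is an assumption."""
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Made-up reward scores for one folder.
print(summarize([1.0, 2.0, 3.0]))
```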

Usage

The script takes its options as command-line arguments, parsed with argparse. For example:

python evaluate/benchmark.py \
  --config_path config/HPSv3_7B.yaml \
  --checkpoint_path checkpoints/HPSv3_7B/model.pth \
  --model_type hpsv3 \
  --image_folders /path/to/images/folder1 /path/to/images/folder2 \
  --output_path ./benchmark_results.json \
  --batch_size 16 \
  --num_processes 8

Arguments:

--config_path: (Required) Path to the model's configuration file.
--checkpoint_path: (Required) Path to the model checkpoint.
--model_type: The reward model to use. Choices: hpsv3, hpsv2, imagereward. (Default: hpsv3)
--image_folders: (Required) One or more paths to folders containing the images to benchmark.
--output_path: (Required) Path to save the output JSON file with results.
--batch_size: Batch size for processing. (Default: 16)
--num_processes: Number of parallel processes to use. (Default: 8)
--num_machines: For distributed inference, the total number of machines. (Default: 1)
--machine_id: For distributed inference, the ID of the current machine. (Default: 0)
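How --num_machines and --machine_id divide the work is not described here. A common scheme, shown below purely as an assumption, gives each machine every num_machines-th item starting at its own machine_id, so the shards are disjoint and together cover the full list. `shard_for_machine` is a hypothetical helper.

```python
def shard_for_machine(items, num_machines=1, machine_id=0):
    """Return the slice of items one machine would process.

    Assumption: a strided split, where machine k takes items k,
    k + num_machines, k + 2 * num_machines, and so on.
    """
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    if not 0 <= machine_id < num_machines:
        raise ValueError("machine_id must be in [0, num_machines)")
    return items[machine_id::num_machines]


# Five placeholder file names split across two machines.
files = [f"img_{i}.png" for i in range(5)]
print(shard_for_machine(files, num_machines=2, machine_id=0))
print(shard_for_machine(files, num_machines=2, machine_id=1))
```

With the defaults (one machine, ID 0), the function returns the full list unchanged.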
