Evaluation Experiments
This section provides instructions for setting up the evaluation environments and running the evaluation scripts.
Expected Directory Structure
To ensure the scripts can locate the environments, please organize your files as follows:
SkillNet/
├── experiments/
│   ├── alfworld/            # git clone here
│   ├── ScienceWorld/        # git clone here
│   ├── WebShop/             # git clone here
│   ├── src/
│   ├── requirements.txt
│   ├── alfworld_run.py
│   ├── scienceworld_run.py
│   └── webshop_run.py
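As a quick sanity check before running anything, the layout above can be verified with a few lines of Python. This is only a sketch: the function name `missing_entries` is hypothetical and not part of the repository, and it assumes you run it from the SkillNet/ root after cloning the three environments.

```python
from pathlib import Path

# Entries listed in the directory tree above, relative to experiments/.
EXPECTED_DIRS = ["alfworld", "ScienceWorld", "WebShop", "src"]
EXPECTED_FILES = [
    "requirements.txt",
    "alfworld_run.py",
    "scienceworld_run.py",
    "webshop_run.py",
]


def missing_entries(experiments_dir):
    """Return the expected entries that are absent under experiments_dir."""
    root = Path(experiments_dir)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Assumption: the current working directory is the SkillNet/ root.
    problems = missing_entries("experiments")
    if problems:
        print("Missing under experiments/:", ", ".join(problems))
    else:
        print("Layout matches the expected structure.")
```

An empty result means every directory and file named in the tree was found; it does not check that the cloned repositories are themselves installed correctly.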
Quick Start
We suggest configuring a separate conda environment for each of the three benchmarks to avoid dependency conflicts.
ALFWorld
- Clone & Setup:
cd experiments
git clone https://github.com/alfworld/alfworld.git
cd alfworld
# Follow the official installation steps from the repo (https://github.com/alfworld/alfworld)
- Environment Variable:
Set ALFWORLD_DATA to the dataset root, or edit src/alfworld/base_config.yaml to point to your local paths:
export ALFWORLD_DATA=/path/to/alfworld_data
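A misconfigured ALFWORLD_DATA tends to surface only once an evaluation run has started, so it can help to validate it up front. The sketch below is an illustration, not code from the repository: `resolve_alfworld_data` is a hypothetical helper, and it only checks that the variable is set and names an existing directory, not that the dataset inside is complete.

```python
import os
from pathlib import Path


def resolve_alfworld_data(env=None):
    """Return the ALFWORLD_DATA directory as a Path, or raise with a hint."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    value = env.get("ALFWORLD_DATA")
    if not value:
        raise RuntimeError(
            "ALFWORLD_DATA is not set; export it or edit "
            "src/alfworld/base_config.yaml."
        )
    path = Path(value).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"ALFWORLD_DATA points to {path}, which is not a directory.")
    return path


if __name__ == "__main__":
    print("Using ALFWorld data at:", resolve_alfworld_data())
```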
ScienceWorld
- Clone & Setup:
cd experiments
git clone https://github.com/allenai/ScienceWorld.git
cd ScienceWorld
# Refer to the ScienceWorld repository for environment setup (https://github.com/allenai/ScienceWorld)
WebShop
- Clone & Setup:
cd experiments
git clone https://github.com/princeton-nlp/WebShop.git
cd WebShop
# Refer to the WebShop repository for environment setup (https://github.com/princeton-nlp/WebShop)
Then, inside each conda environment, install the common dependencies:
cd experiments
pip install -r requirements.txt
Running
Step 1: Initialize Environment Variables
Before running the scripts, configure your API credentials:
export API_KEY=YOUR_API_KEY
export BASE_URL=YOUR_API_BASE_URL
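To confirm that both variables are visible to Python before launching a run, a small check like the following can be used. It is a minimal sketch: `missing_credentials` is a hypothetical name, and the only assumption taken from this README is that the scripts read API_KEY and BASE_URL from the environment.

```python
import os

# The two variables exported in Step 1.
REQUIRED_VARS = ("API_KEY", "BASE_URL")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        raise SystemExit("Please export: " + ", ".join(missing))
    print("API credentials are set.")
```

The check deliberately never prints the values themselves, so it is safe to run in shared logs.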
Step 2: Execution
Run the corresponding evaluation script from the experiments/ directory.
cd experiments
# ALFWorld
python alfworld_run.py --model o4-mini --split dev --max_workers 10 --exp_name alf_test --use_skill
# ScienceWorld
python scienceworld_run.py --model o4-mini --split test --max_workers 5 --exp_name sci_test --use_skill
# WebShop
python webshop_run.py --model o4-mini --max_workers 3 --exp_name web_test --use_skill
Argument Descriptions
- --model: The name of the LLM to evaluate.
- --split: Data split to use (dev or test).
- --max_workers: Number of parallel workers for evaluation.
- --exp_name: Name under which the results are saved.
- --use_skill: Enable the skill-augmented module.
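When sweeping over models or comparing runs with and without skills, it can be convenient to assemble the example commands programmatically. The helper below is a hypothetical sketch, not part of the repository: `build_command` only reproduces the flags used in the commands from Step 2, and `--split` is omitted when not given, as in the WebShop example.

```python
import subprocess
import sys


def build_command(script, model, exp_name, max_workers, split=None, use_skill=False):
    """Build the argument list for one evaluation script."""
    cmd = [sys.executable, script, "--model", model]
    if split is not None:
        cmd += ["--split", split]
    cmd += ["--max_workers", str(max_workers), "--exp_name", exp_name]
    if use_skill:
        cmd.append("--use_skill")
    return cmd


if __name__ == "__main__":
    # Same settings as the ALFWorld example command from Step 2.
    cmd = build_command(
        "alfworld_run.py", "o4-mini", "alf_test", 10, split="dev", use_skill=True
    )
    # Assumption: the current working directory is the SkillNet/ root.
    subprocess.run(cmd, cwd="experiments", check=True)
```

Passing the command as a list avoids shell quoting issues, and `check=True` makes a failed run raise instead of passing silently.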