154 MB

345 files

Updated 2 months ago

Evaluation Experiments

This section explains how to set up the three evaluation environments (ALFWorld, ScienceWorld, and WebShop) and how to run the evaluation scripts against them.

📂 Expected Directory Structure

To ensure the scripts can locate the environments, please organize your files as follows:

SkillNet/
├── experiments/
│   ├── alfworld/          # git clone here
│   ├── ScienceWorld/      # git clone here
│   ├── WebShop/           # git clone here
│   ├── src/
│   ├── requirements.txt
│   ├── alfworld_run.py
│   ├── scienceworld_run.py
│   └── webshop_run.py
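Before running anything, it can help to confirm that the layout matches the tree above. The following is a minimal sketch and not part of the repository: `EXPECTED_ENTRIES` and `missing_entries` are hypothetical names, and the list simply mirrors the entries shown in the tree.

```python
from pathlib import Path

# Entries copied from the directory tree above.
# The helper itself is hypothetical and not shipped with the repository.
EXPECTED_ENTRIES = [
    "alfworld",
    "ScienceWorld",
    "WebShop",
    "src",
    "requirements.txt",
    "alfworld_run.py",
    "scienceworld_run.py",
    "webshop_run.py",
]


def missing_entries(experiments_dir):
    """Return the expected entries that do not exist under experiments_dir."""
    root = Path(experiments_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries("experiments")
    if missing:
        print("Missing under experiments/:", ", ".join(missing))
    else:
        print("Layout looks complete.")
```

Run it from the `SkillNet/` root; an empty result means every cloned environment and script is where the run scripts are expected to look.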

🚀 Quick Start

We suggest creating a separate conda environment for each of the three benchmarks to avoid dependency conflicts.

ALFWorld

Clone & Setup:

cd experiments
git clone https://github.com/alfworld/alfworld.git
cd alfworld
# Follow the official installation steps from the repo (https://github.com/alfworld/alfworld)

Environment Variable:

Set ALFWORLD_DATA to the dataset root or edit src/alfworld/base_config.yaml to point to your local paths:
```
export ALFWORLD_DATA=/path/to/alfworld_data
```
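A missing or mistyped `ALFWORLD_DATA` is easier to diagnose before a run starts. The sketch below is one way to check it up front; `resolve_alfworld_data` is a hypothetical helper, and how `alfworld_run.py` itself reads the variable is not described here.

```python
import os
from pathlib import Path


def resolve_alfworld_data(env=None):
    """Return ALFWORLD_DATA as a Path, or raise if it is unset or not a directory.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    value = env.get("ALFWORLD_DATA")
    if not value:
        raise RuntimeError("ALFWORLD_DATA is not set")
    path = Path(value).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"ALFWORLD_DATA does not point to a directory: {path}")
    return path


if __name__ == "__main__":
    print("Using ALFWorld data at:", resolve_alfworld_data())
```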

ScienceWorld

Clone & Setup:

cd experiments
git clone https://github.com/allenai/ScienceWorld.git
cd ScienceWorld
# Refer to the ScienceWorld repository for environment setup (https://github.com/allenai/ScienceWorld)

WebShop

Clone & Setup:

cd experiments
git clone https://github.com/princeton-nlp/WebShop.git
cd WebShop
# Refer to the WebShop repository for environment setup (https://github.com/princeton-nlp/WebShop)

In each of the three conda environments, install the shared dependencies:

cd experiments
pip install -r requirements.txt

Running

Step 1: Initialize Environment Variables

Before running the scripts, configure your API credentials:

export API_KEY=YOUR_API_KEY
export BASE_URL=YOUR_API_BASE_URL
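Since both variables are needed, a fail-fast check avoids starting a long evaluation with incomplete credentials. This is a sketch under assumptions: `load_api_settings` is a hypothetical helper, and the way the run scripts actually consume `API_KEY` and `BASE_URL` is not shown in this README.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("API_KEY", "BASE_URL")


def load_api_settings(env=None):
    """Return the required settings as a dict, or raise listing what is missing.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


if __name__ == "__main__":
    settings = load_api_settings()
    # Avoid printing the key itself; only confirm which endpoint is configured.
    print("API endpoint:", settings["BASE_URL"])
```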

Step 2: Execution

Run the corresponding evaluation script from the experiments/ directory.

cd experiments

# ALFWorld
python alfworld_run.py --model o4-mini --split dev --max_workers 10 --exp_name alf_test --use_skill

# ScienceWorld
python scienceworld_run.py --model o4-mini --split test --max_workers 5 --exp_name sci_test --use_skill

# WebShop
python webshop_run.py --model o4-mini --max_workers 3 --exp_name web_test --use_skill
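When comparing several models, or runs with and without `--use_skill`, it may be convenient to generate these command lines rather than type them. The sketch below only assembles the flags shown above; `build_command` is a hypothetical helper, and it does not launch anything, since each benchmark is meant to run in its own conda environment.

```python
import shlex


def build_command(script, model, exp_name, max_workers, split=None, use_skill=True):
    """Assemble an evaluation command as an argument list.

    Hypothetical helper: it uses only flags that appear in the examples above.
    `split` is optional because the WebShop example does not pass it.
    """
    cmd = ["python", script, "--model", model]
    if split is not None:
        cmd += ["--split", split]
    cmd += ["--max_workers", str(max_workers), "--exp_name", exp_name]
    if use_skill:
        cmd.append("--use_skill")
    return cmd


if __name__ == "__main__":
    # Print a skill-augmented run and a baseline run for the same model.
    for use_skill, name in ((True, "alf_skill"), (False, "alf_baseline")):
        cmd = build_command("alfworld_run.py", "o4-mini", name, 10,
                            split="dev", use_skill=use_skill)
        print(shlex.join(cmd))
```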

🛠️ Argument Descriptions

--model: The name of the LLM to evaluate.
--split: Data split to use (dev or test). The WebShop example above does not pass this flag.
--max_workers: Number of parallel workers for evaluation.
--exp_name: Name under which the results of the run are saved.
--use_skill: Enable the skill-augmented module.
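The flags above map naturally onto an `argparse` interface. The sketch below is an illustration only, not the parser used by the run scripts: `build_parser` is a hypothetical name, and the defaults, the required flags, and the restriction of `--split` to `dev` or `test` are assumptions based on the descriptions above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Illustrative mirror of the evaluation flags"
    )
    parser.add_argument("--model", required=True,
                        help="Name of the LLM to evaluate")
    # Assumed default and choices; the real scripts may differ.
    parser.add_argument("--split", choices=["dev", "test"], default="test",
                        help="Data split to use")
    # Assumed default of a single worker.
    parser.add_argument("--max_workers", type=int, default=1,
                        help="Number of parallel workers for evaluation")
    parser.add_argument("--exp_name", required=True,
                        help="Name under which results are saved")
    # Boolean switch: present means enabled, absent means disabled.
    parser.add_argument("--use_skill", action="store_true",
                        help="Enable the skill-augmented module")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Because `--use_skill` is a switch, omitting it is the way to run a baseline without the skill-augmented module.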

Last updated: Mar 16
