BioDSBench Adapter
Source-native evaluation harness for running Claude Code agents on imaging-101, BioDSBench, and BioMniBench tasks. It also supports a true-serial mode that propagates prior-subtask context to later subtasks.
Features
- Source-native agent evaluation: Run tasks with the local Claude Code source, preserving tool state and conversation context across judge feedback rounds.
- True-serial mode: Pass prior subtask results (code, status, judge feedback) to subsequent tasks in a multi-subtask workflow, enabling models to learn from earlier attempts.
- Multiple benchmark adapters:
  - imaging-101 tasks (e.g., conventional_ptychography, ct_dual_energy, mri_grappa)
  - BioDSBench Python data-science tasks (118 biomedical analysis scenarios)
  - BioMniBench Docker-style da-* tasks
- Pipelined batch runner: Execute task sets with fixed concurrency and non-blocking queue management.
- TypeScript + Bun runtime: Fast, modern TypeScript tooling.
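The fixed-concurrency idea behind the batch runner can be illustrated with a short Python sketch. This is not the repository's runner (which is driven by scripts/run-task-batches.ps1 and config/task-batch-runner.json); the names `run_task`, `run_batch`, and the `runner` parameter are hypothetical, and only the documented `--task` flag is used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id, runner=subprocess.run):
    """Launch one evaluation and return (task_id, exit code).

    `runner` is a hypothetical injection point so the launcher can be
    swapped out in tests; by default it is subprocess.run.
    """
    argv = ["bun", "src/harness/evaluation/cli.ts", "--task", task_id]
    completed = runner(argv)
    return task_id, completed.returncode


def run_batch(task_ids, concurrency=2, runner=subprocess.run):
    """Run every task with at most `concurrency` evaluations in flight.

    All tasks are queued immediately (submission never blocks); a free
    worker picks up the next queued task as soon as it finishes one.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(run_task, t, runner) for t in task_ids]
        return dict(f.result() for f in futures)
```

Calling `run_batch(["mri_grappa", "ct_dual_energy"], concurrency=2)` would start both evaluations in parallel, assuming Bun and the repository are available in the working directory.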
Quick Start
Prerequisites
- Bun 1.0+: Install Bun
- Node.js 18+ (for some dependencies)
- Python 3.10+ (for BioDSBench/BioMniBench task execution)
- LLM API access: Anthropic API key or compatible endpoint
Installation
git clone https://github.com/starpacker/biodsbench-adapter.git
cd biodsbench-adapter
bun install
Configuration
Set up API credentials:
export ANTHROPIC_API_KEY="your-api-key-here"
export ANTHROPIC_BASE_URL="https://api.anthropic.com"  # or your proxy
export ANTHROPIC_MODEL="[REDACTED]"

Optional: Copy config/llm-config.sh.example to config/llm-config.sh and customize.
Run a Single Task
bun src/harness/evaluation/cli.ts \
--task mri_grappa \
--runs-dir output/runs \
--max-rounds 5 \
--timeout-seconds 2400
Run BioDSBench Tasks
Point --tasks-dir to the BioDSBench-imaging101-format dataset:
bun src/harness/evaluation/cli.ts \
--task 25303977_0 \
--tasks-dir /path/to/BioDSBench-imaging101-format/tasks \
--runs-dir output/biodsbench_runs \
--max-rounds 2
True-Serial Mode (Advanced)
When multiple subtasks share a common context, use true-serial mode to pass prior results to subsequent tasks.
Python Orchestrator Example
See examples/run_imaging101_true_serial.py:
export LLM_API_KEY="your-api-key"
python3 examples/run_imaging101_true_serial.py \
--study-id 25303977 \
--start 0 \
--end 7 \
--max-rounds 2
What it does:
- Each subtask receives a --prior-context JSON file with descriptions, code, and judge feedback from earlier subtasks.
- The LLM can learn from earlier mistakes and reuse successful patterns.
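As a rough illustration of this flow, the sketch below plans one CLI invocation per subtask and writes a context file for every subtask after the first. It is not the code in examples/run_imaging101_true_serial.py. The JSON field names, the helper names, the `prior_context_<n>.json` file naming, and the assumption that `--end` is inclusive are all hypothetical; only the `--task`, `--max-rounds`, and `--prior-context` flags come from this README.

```python
import json
from pathlib import Path


def build_prior_context(results):
    """Bundle earlier subtask results into one JSON-serialisable dict.

    The key names are hypothetical, not the harness's actual schema.
    """
    return {
        "prior_subtasks": [
            {
                "task_id": r["task_id"],
                "description": r["description"],
                "code": r["code"],
                "status": r["status"],
                "judge_feedback": r["judge_feedback"],
            }
            for r in results
        ]
    }


def build_command(task_id, context_path=None, max_rounds=2):
    """Return the argv list for one subtask, using documented flags only."""
    argv = [
        "bun", "src/harness/evaluation/cli.ts",
        "--task", task_id,
        "--max-rounds", str(max_rounds),
    ]
    if context_path is not None:
        argv += ["--prior-context", str(context_path)]
    return argv


def plan_serial_run(study_id, start, end, results_by_task, workdir):
    """Plan one command per subtask; each one sees all earlier results.

    `results_by_task` stands in for the output a real run would collect
    from the harness after each subtask finishes.
    """
    commands = []
    finished = []
    for index in range(start, end + 1):  # assumes the end index is inclusive
        task_id = f"{study_id}_{index}"
        context_path = None
        if finished:
            context_path = Path(workdir) / f"prior_context_{index}.json"
            payload = build_prior_context(finished)
            context_path.write_text(json.dumps(payload, indent=2))
        commands.append(build_command(task_id, context_path))
        finished.append(results_by_task[task_id])
    return commands
```

The first subtask runs without `--prior-context`; each later one receives a file that accumulates every earlier subtask, so subtask 2 would see the results of subtasks 0 and 1.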
Documentation:
- examples/ARCHITECTURE.md: Serial vs. single-task design
- examples/EFFECTIVENESS_REPORT.md: Case study on PMID 25303977
Project Structure
biodsbench-adapter/
├── src/harness/evaluation/           # Core evaluation CLI
│   ├── cli.ts                        # Main entry point
│   ├── sourceTaskLoop.ts             # Task orchestration
│   ├── sourceContextBuilder.ts       # Prompt + prior-context injection
│   └── types.ts                      # TypeScript interfaces
├── config/
│   ├── llm-config.sh.example         # API config template
│   └── task-batch-runner.json        # Batch runner config
├── scripts/
│   └── run-task-batches.ps1          # PowerShell batch orchestrator
├── examples/
│   ├── run_imaging101_true_serial.py # True-serial orchestrator
│   ├── ARCHITECTURE.md               # Design docs
│   └── EFFECTIVENESS_REPORT.md       # Effectiveness study
└── tests/                            # Unit tests
Data Requirements
- BioDSBench tasks: Clone BioDSBench-imaging101-format
git clone https://github.com/starpacker/BioDSBench-imaging101-format.git
CLI Options
| Flag | Description | Default |
|---|---|---|
| `--task <id>` | Task ID | (required) |
| `--tasks-dir <path>` | Task definitions root | ./tasks |
| `--runs-dir <path>` | Output directory | ./output/runs |
| `--max-rounds <n>` | Judge feedback rounds | 3 |
| `--timeout-seconds <n>` | Per-round timeout | 1800 |
| `--prior-context <path>` | Prior-subtask context JSON (true-serial) | (none) |
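When driving the CLI from a script, it can help to keep the flags and their defaults in one place. The following is a minimal sketch of that idea; the `CliOptions` class and its `to_argv` method are hypothetical helpers, not part of the harness, and the defaults are copied from the table above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CliOptions:
    """Hypothetical mirror of the documented flags and their defaults."""

    task: str                             # --task (required)
    tasks_dir: str = "./tasks"            # --tasks-dir
    runs_dir: str = "./output/runs"       # --runs-dir
    max_rounds: int = 3                   # --max-rounds
    timeout_seconds: int = 1800           # --timeout-seconds
    prior_context: Optional[str] = None   # --prior-context (true-serial)

    def to_argv(self) -> List[str]:
        """Build the argument list to pass to subprocess.run."""
        argv = [
            "bun", "src/harness/evaluation/cli.ts",
            "--task", self.task,
            "--tasks-dir", self.tasks_dir,
            "--runs-dir", self.runs_dir,
            "--max-rounds", str(self.max_rounds),
            "--timeout-seconds", str(self.timeout_seconds),
        ]
        # The prior-context flag is only emitted in true-serial runs.
        if self.prior_context is not None:
            argv += ["--prior-context", self.prior_context]
        return argv
```

For example, `CliOptions(task="mri_grappa", max_rounds=5).to_argv()` yields an argument list that spells out every default explicitly, which keeps run logs self-describing.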
Development
bun test # Run tests
bun run build # Build TypeScript
Citation
If you use this framework, please cite:
- BioDSBench: Hou et al., "BioDSBench: A Benchmark for Data Science Code Generation in Biology"
Related Repositories:
- BioDSBench-imaging101-format: Dataset with 118 tasks
License
MIT License (see LICENSE file)
Contributing
Contributions welcome! Fork, branch, and submit a PR.
Support
Open an issue for questions or bug reports.