Examples

This directory contains sample input files for prompt-prix.

File Format

See the Battery File Formats section for complete format documentation, including:

Required and optional fields
JSON vs JSONL vs BFCL formats
Validation rules

Minimal Example

{
  "prompts": [
    {"id": "test-1", "user": "What is 2 + 2?"}
  ]
}

Only id and user are required. All other fields have sensible defaults.
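As a quick pre-flight check before loading a file, the rule above can be sketched in a few lines of Python. This is a minimal sketch, not prompt-prix's own validator: the function name `check_battery` and the error messages are hypothetical, and it only checks what this README states (a top-level `prompts` list whose entries each carry `id` and `user`). The full validation rules live in the Battery File Formats section.

```python
import json

# Per this README, only these two fields are required on each prompt.
REQUIRED_FIELDS = ("id", "user")


def check_battery(text: str) -> list[str]:
    """Return a list of problems found in a JSON battery document.

    Hypothetical helper for illustration; an empty list means the
    document passed these basic checks.
    """
    data = json.loads(text)
    prompts = data.get("prompts") if isinstance(data, dict) else None
    if not isinstance(prompts, list):
        return ['top-level "prompts" must be a list']

    problems = []
    for index, prompt in enumerate(prompts):
        if not isinstance(prompt, dict):
            problems.append(f"prompt {index}: expected an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in prompt:
                problems.append(
                    f"prompt {index}: missing required field {field!r}"
                )
    return problems


if __name__ == "__main__":
    minimal = '{"prompts": [{"id": "test-1", "user": "What is 2 + 2?"}]}'
    print(check_battery(minimal))
```

Returning a list of problems rather than raising on the first one makes it easier to fix a hand-written battery file in a single pass.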

tool_competence_tests.json

An illustrative set of 15 test cases for evaluating LLM tool-calling competence. This is not a specification; it is a sample of the kind of prompts you might fan out across models.

Categories covered:

Basic tool invocation
Tool selection (choosing the right tool)
Constraint compliance (respecting forbidden tools)
Schema compliance (enums, nested objects, required params)
Tool judgment (knowing when NOT to use tools)
Semantic understanding (ambiguous routing)
Error handling (missing info)
Advanced (parallel calls, chained dependencies)

Usage

Load this file in prompt-prix's Battery tab to compare how different models handle tool-calling scenarios.

Recommended Upstream Benchmarks

For rigorous evaluation, consider these established benchmarks:

| Benchmark | Focus | Install |
|-----------|-------|---------|
| BFCL | Function calling | `pip install bfcl-eval` |
| Inspect AI | Safety evaluation | `pip install inspect-ai` |

See ADR-001 for rationale on using existing benchmarks rather than custom formats.