prompt-prix / examples /README.md
3v324v23's picture
Sync README from main, fix server input race condition
81f2fff

A newer version of the Gradio SDK is available: 6.13.0

Upgrade

Examples

This directory contains sample input files for prompt-prix.

File Format

See the Battery File Formats section for complete format documentation, including:

  • Required and optional fields
  • JSON vs JSONL vs BFCL formats
  • Validation rules

Minimal Example

{
  "prompts": [
    {"id": "test-1", "user": "What is 2 + 2?"}
  ]
}

Only id and user are required. All other fields have sensible defaults.

tool_competence_tests.json

An illustrative set of 15 test cases for evaluating LLM tool-calling competence. This is not a specification—it's a sample showing the kind of prompts you might fan-out across models.

Categories covered:

  • Basic tool invocation
  • Tool selection (choosing the right tool)
  • Constraint compliance (respecting forbidden tools)
  • Schema compliance (enums, nested objects, required params)
  • Tool judgment (knowing when NOT to use tools)
  • Semantic understanding (ambiguous routing)
  • Error handling (missing info)
  • Advanced (parallel calls, chained dependencies)

Usage

Load this file in prompt-prix's Battery tab to compare how different models handle tool-calling scenarios.

Recommended Upstream Benchmarks

For rigorous evaluation, consider these established benchmarks:

Benchmark Focus Install
BFCL Function calling pip install bfcl-eval
Inspect AI Safety evaluation pip install inspect-ai

See ADR-001 for rationale on using existing benchmarks rather than custom formats.