---
license: mit
tags:
- instruction-following
- llm-evaluation
- benchmark
- reproducibility
- openrouter
language:
- en
pretty_name: LLM Instruction-Following Evaluation Code
---

# LLM Instruction-Following Evaluation Framework - Code Repository

[Paper (arXiv:2510.18892)](http://arxiv.org/abs/2510.18892)
[Dataset](https://huggingface.co/datasets/richardyoung/llm-instruction-following-eval)
[Python](https://www.python.org/)
[License: MIT](https://opensource.org/licenses/MIT)

This repository contains the complete evaluation framework used in our paper **"When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs"** (arXiv:2510.18892).

## What's Included

This code repository provides everything needed to:
- Reproduce our evaluation of 256 models across 20 diagnostic tests
- Run the evaluation on new models
- Add your own custom instruction-following tests
- Generate publication-quality visualizations
- Export results to multiple formats (Excel, JSON, LaTeX)

## Quick Start

### Installation

```bash
# Clone the repository or download files
pip install pandas openpyxl requests matplotlib seaborn numpy

# Set your OpenRouter API key
export OPENROUTER_API_KEY="your_api_key_here"
```

### Run Evaluation

```bash
# Run comprehensive evaluation (256 models × 20 tests)
python test_comprehensive_20_verified.py

# Generate analysis and visualizations
python analyze_comprehensive_final.py
```

## Key Files

### Core Evaluation
- **`test_comprehensive_20_verified.py`** - Main test runner
  - Evaluates models across all 20 diagnostic tests
  - Exact-match evaluation with normalized whitespace
  - Exports results to Excel with multiple sheets
  - ~6-8 hours for full 256-model evaluation

- **`questions.json`** - Complete test bank (20 diagnostic prompts)
  - Each test includes: prompt, expected output, category, difficulty
  - Covers 5 categories: String Manipulation, Constraint Compliance, Text Processing, Structured Data, Complex Operations
  - Frozen version used for paper evaluation

- **`models_verified_working_v2_20251014_091649.py`** - Model configuration
  - 256 verified working models from OpenRouter
  - Pre-verified for basic functionality
  - Includes provider information

### Analysis & Visualization
- **`analyze_comprehensive_final.py`** - Comprehensive analysis pipeline
  - Generates 4 publication-quality PDF figures
  - Creates LaTeX tables for paper integration
  - Computes statistical summaries
  - Category and provider performance breakdowns

### Supporting Files
- **`requirements.txt`** - Python dependencies
- **`README.md`** - This file (setup and usage instructions)

## Test Categories

Our 20 diagnostic tests cover five categories:

### 1. String Manipulation (Tests 1, 3, 5, 17, 20) - HARDEST
- Multi-step text transformations
- Average pass rate: 12.0%
- Example: Test 5 (Complex String Transformation) - only 2.7% pass rate

### 2. Constraint Compliance (Tests 2, 9, 15) - EASIEST
- Following exact output specifications
- Average pass rate: 66.9%
- Example: Test 2 (Exact Output Compliance) - 96.1% pass rate

### 3. Text Processing (Test 13)
- Targeted text manipulation tasks
- Average pass rate: 50.5%

### 4. Structured Data (Tests 4, 6, 10, 12, 14)
- JSON, Markdown, CSV generation
- Average pass rate: 41.1%

### 5. Complex Operations (Tests 7, 8, 11, 16, 18, 19)
- Multi-step reasoning and computation
- Average pass rate: 35.0%

## Evaluation Methodology

### Exact Match Evaluation
- **Binary Pass/Fail**: No partial credit
- **Whitespace Normalized**: Leading/trailing spaces ignored
- **Case Sensitive**: Preserves intentional capitalization
- **Format Strict**: JSON, tables, special characters must be exact

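
The rules above can be read as a small comparison function. The sketch below is one possible reading, not the code from `test_comprehensive_20_verified.py`: the name `exact_match` and its `case_sensitive` parameter are hypothetical, and it assumes that whitespace normalization means stripping leading and trailing whitespace only.

```python
def exact_match(response: str, expected: str, case_sensitive: bool = True) -> bool:
    """Return True only when the response equals the expected output.

    Hypothetical helper: leading/trailing whitespace is ignored; everything
    else (case, punctuation, internal spacing) must match exactly.
    """
    got, want = response.strip(), expected.strip()
    if not case_sensitive:
        got, want = got.lower(), want.lower()
    return got == want


print(exact_match("  HELLO\n", "HELLO"))   # surrounding whitespace is ignored
print(exact_match("hello", "HELLO"))       # case differs, so this fails
print(exact_match('{"a": 1}', '{"a":1}'))  # internal spacing differs, so this fails
```

Because the result is a plain boolean, each model/test pair contributes exactly one pass or fail, which matches the no-partial-credit rule.
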
### Why Exact Match?
1. **Objectivity** - Eliminates subjective judgment
2. **Reproducibility** - Deterministic, verifiable results
3. **Clarity** - Binary success/failure (no ambiguity)
4. **Efficiency** - No manual review needed
5. **Diagnostic Power** - Reveals specific failure modes

## Results Summary

From our October 14, 2025 evaluation of 256 models:

- **Overall Pass Rate**: 43.7%
- **Best Model**: qwen/qwen-plus-2025-07-28:thinking (100%)
- **Most Difficult Test**: Test 5 - Complex String Transformation (2.7%)
- **Top Provider**: x-ai (79.3% average across 15 models)

## Customization

### Adding New Tests

Edit `questions.json` to add new diagnostic tests:

```json
{
  "id": 21,
  "test_name": "Your New Test",
  "category": "Custom Category",
  "difficulty": "medium",
  "prompt": "Your instruction prompt here",
  "expected_output": "Exact expected response",
  "exact_match": true,
  "case_sensitive": false
}
```

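
Before running a long evaluation it can help to check that a new entry carries every field shown in the example above. The following is a minimal sketch: the field names and types are taken from that example, while `REQUIRED_FIELDS` and `validate_test` are hypothetical names that are not part of the repository.

```python
import json

# Field names and types copied from the example entry; the validator is hypothetical.
REQUIRED_FIELDS = {
    "id": int,
    "test_name": str,
    "category": str,
    "difficulty": str,
    "prompt": str,
    "expected_output": str,
    "exact_match": bool,
    "case_sensitive": bool,
}


def validate_test(entry: dict) -> list:
    """Return a list of problems found in one test entry (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


new_test = json.loads("""
{
  "id": 21,
  "test_name": "Your New Test",
  "category": "Custom Category",
  "difficulty": "medium",
  "prompt": "Your instruction prompt here",
  "expected_output": "Exact expected response",
  "exact_match": true,
  "case_sensitive": false
}
""")
print(validate_test(new_test))  # → []
```
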
### Testing Custom Models

Modify `models_verified_working_v2_20251014_091649.py` or create your own model list:

```python
MODELS = [
    {
        "name": "provider/model-name",
        "provider": "provider",
        "verified": True
    },
    # Add more models...
]
```

### Adjusting Analysis

Customize `analyze_comprehensive_final.py` to:
- Change visualization styles
- Add new analysis metrics
- Modify export formats
- Create custom reports

## Output Files

The evaluation produces:

1. **Excel Workbook** (`comprehensive_20_tests_results_YYYYMMDD_HHMMSS.xlsx`)
   - Overview sheet with summary statistics
   - Model rankings (sorted by performance)
   - Test difficulty analysis
   - Category performance breakdown
   - Complete raw results (all 5,120 evaluations)
   - Test descriptions

2. **JSON Export** (`comprehensive_20_tests_results_YYYYMMDD_HHMMSS.json`)
   - Machine-readable format
   - Includes metadata and timestamps
   - All test results with responses

3. **PDF Visualizations**
   - `fig1_heatmap.pdf` - Performance matrix
   - `fig2_provider.pdf` - Provider comparison
   - `fig3_difficulty.pdf` - Test difficulty
   - `fig4_category.pdf` - Category performance

4. **LaTeX Tables** (`paper_tables.tex`)
   - Ready for paper integration
   - Formatted with booktabs package

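
The `YYYYMMDD_HHMMSS` suffix in the workbook and JSON file names corresponds to a standard `strftime` pattern. As a rough sketch of how such a name could be built (the helper `results_filename` is hypothetical and not taken from the test runner):

```python
from datetime import datetime
from typing import Optional


def results_filename(extension: str, now: Optional[datetime] = None) -> str:
    """Build a results file name in the YYYYMMDD_HHMMSS pattern shown above."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"comprehensive_20_tests_results_{stamp}.{extension}"


# Fixed timestamp so the result is predictable
print(results_filename("xlsx", datetime(2025, 10, 14, 9, 16, 49)))
# → comprehensive_20_tests_results_20251014_091649.xlsx
```

Zero-padded timestamps of this form sort chronologically when the file names are sorted as strings, which makes it easy to pick the latest run.
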
## Reproducibility

To exactly reproduce our paper results:

```bash
# Use the frozen model list from October 14, 2025
python test_comprehensive_20_verified.py

# Use the frozen test bank
# (questions.json is already frozen at 20 tests)

# Generate analysis with same parameters
python analyze_comprehensive_final.py
```

**Note**: Model outputs may vary over time as providers update their models. For exact reproducibility, use the snapshot from our evaluation date.

## Usage Examples

### Quick Test (5 models)

```python
# Edit test_comprehensive_20_verified.py
# Change MODELS to a subset:
MODELS = [
    "openai/gpt-4o",
    "anthropic/claude-3.7-sonnet",
    "google/gemini-2.0-flash-exp:free",
    "meta-llama/llama-3.3-70b-instruct",
    "qwen/qwen-plus-2025-07-28:thinking"
]
```

### Single Model Test

```python
import json
import os

import requests

# Read the API key from the environment (see Installation)
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]

# Load questions
with open('questions.json', 'r') as f:
    questions = json.load(f)

# Test a single model
model = "openai/gpt-4o"
for q in questions:
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": q["prompt"]}]
        },
        timeout=120,  # avoid hanging indefinitely on a stalled request
    )
    response.raise_for_status()  # surface HTTP errors instead of scoring them
    # Evaluate response...
```

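
The evaluation step left open in the single-model loop could look roughly like the sketch below. It assumes the response body follows the OpenAI-compatible chat-completions shape (`choices[0].message.content`) and that scoring is a whitespace-trimmed exact comparison; `extract_reply` and `passed` are hypothetical helper names, not functions from this repository.

```python
def extract_reply(payload: dict) -> str:
    """Pull the assistant text out of a chat-completions style response body."""
    choices = payload.get("choices") or []
    if not choices:
        # Error responses carry no choices; treat them as an empty reply
        return ""
    return choices[0].get("message", {}).get("content") or ""


def passed(payload: dict, expected_output: str) -> bool:
    """Whitespace-trimmed exact comparison against the expected output."""
    return extract_reply(payload).strip() == expected_output.strip()


# Hand-written payload standing in for response.json()
sample = {"choices": [{"message": {"role": "assistant", "content": " 42\n"}}]}
print(passed(sample, "42"))                                  # trimmed reply matches
print(passed({"error": {"message": "rate limited"}}, "42"))  # no reply, so it fails
```

Inside the loop, the call would be `passed(response.json(), q["expected_output"])`.
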
### Custom Analysis

```python
import pandas as pd

# Load results (replace 'results.xlsx' with the path to your results workbook)
df = pd.read_excel('results.xlsx', sheet_name='All Results')

# Custom analysis: ten models with the highest pass rate
top_models = df.groupby('model')['passed'].mean().sort_values(ascending=False).head(10)
print(top_models)

# Category performance
category_perf = df.groupby('category')['passed'].mean()
print(category_perf)
```

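
The same two aggregations can be computed without pandas, for example on records loaded from the JSON export. This is only a sketch: the record layout below is hypothetical (it reuses the `model`, `category`, and `passed` column names from the pandas example), and the actual JSON structure may differ.

```python
from collections import defaultdict


def pass_rates(records, key):
    """Mean of the 'passed' flag per distinct value of row[key]."""
    totals = defaultdict(lambda: [0, 0])  # key value -> [passes, attempts]
    for row in records:
        totals[row[key]][0] += int(bool(row["passed"]))
        totals[row[key]][1] += 1
    return {k: passes / attempts for k, (passes, attempts) in totals.items()}


# Made-up records for illustration only
records = [
    {"model": "a/model-1", "category": "Structured Data", "passed": True},
    {"model": "a/model-1", "category": "String Manipulation", "passed": False},
    {"model": "b/model-2", "category": "Structured Data", "passed": False},
    {"model": "b/model-2", "category": "String Manipulation", "passed": False},
]
print(pass_rates(records, "model"))     # {'a/model-1': 0.5, 'b/model-2': 0.0}
print(pass_rates(records, "category"))  # {'Structured Data': 0.5, 'String Manipulation': 0.0}
```
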
## Troubleshooting

### Common Issues

**1. API Rate Limiting**
```python
# OpenRouter may rate limit. Add delays between requests:
import time

time.sleep(1)  # Add to test_comprehensive_20_verified.py
```

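
A fixed delay slows every request, including the ones that would have succeeded. One alternative is to wait only after a rate-limit response and to double the wait each time. The sketch below assumes the API signals rate limiting with HTTP status 429; `call_with_backoff` and its parameters are hypothetical and not part of the test runner.

```python
import time


def call_with_backoff(make_request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call make_request() until it stops signalling a rate limit.

    make_request must return an object with a status_code attribute, as
    requests.Response does. HTTP 429 triggers a wait of base_delay,
    2*base_delay, 4*base_delay, ... before the next attempt.
    """
    response = None
    for attempt in range(max_attempts):
        response = make_request()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return response  # still rate limited after the final attempt


class FakeResponse:
    """Stand-in for requests.Response so the example runs offline."""

    def __init__(self, status_code):
        self.status_code = status_code


codes = iter([429, 429, 200])
waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(lambda: FakeResponse(next(codes)), sleep=waits.append)
print(result.status_code, waits)  # → 200 [1.0, 2.0]
```

With `requests`, the callable would wrap the real call, for example `lambda: requests.post(url, headers=headers, json=body, timeout=120)`.
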
**2. JSON Serialization Errors**
```bash
# Use export_json_from_excel.py to convert numpy types
python export_json_from_excel.py
```

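
These errors typically arise because `json.dumps` does not accept NumPy integer, boolean, or array values. When writing your own export code, one way around this is a `default` hook that converts such values to built-in types. The sketch below is not the contents of `export_json_from_excel.py`; `to_builtin` is a hypothetical helper, and it relies on the fact that NumPy scalars and arrays expose `tolist()` and `item()`.

```python
import json


def to_builtin(obj):
    """Fallback for json.dumps: convert NumPy-style values to plain Python.

    Checking for the conversion methods avoids importing NumPy here.
    """
    if hasattr(obj, "tolist"):
        return obj.tolist()
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


class FakeInt64:
    """Stand-in for a NumPy integer so the example runs without NumPy."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


print(json.dumps({"passed": FakeInt64(1)}, default=to_builtin))  # → {"passed": 1}
```
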
**3. Missing Packages**
```bash
pip install pandas openpyxl requests matplotlib seaborn numpy
```

**4. API Key Not Set**
```bash
export OPENROUTER_API_KEY="your_key_here"
# Or set in Python: os.environ['OPENROUTER_API_KEY'] = "your_key"
```

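
A missing key otherwise shows up only as an authentication failure on the first request. Checking for it up front gives a clearer message. This is a minimal sketch; `get_api_key` is a hypothetical helper, not a function from this repository.

```python
import os


def get_api_key(env=os.environ) -> str:
    """Return the OpenRouter key or raise a readable error if it is unset."""
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; run "
            'export OPENROUTER_API_KEY="your_key_here" first'
        )
    return key


# A plain dict stands in for os.environ in this example
print(get_api_key({"OPENROUTER_API_KEY": "sk-example"}))  # → sk-example
```
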
## Citation

If you use this code in your research, please cite:

```bibtex
@article{young2025instruction,
  title={When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs},
  author={Young, Richard J. and Gillins, Brandon and Matthews, Alice M.},
  journal={arXiv preprint arXiv:2510.18892},
  year={2025}
}
```

## Related Resources

- **Paper**: http://arxiv.org/abs/2510.18892
- **Dataset**: https://huggingface.co/datasets/richardyoung/llm-instruction-following-eval
- **Paper Repository**: https://huggingface.co/richardyoung/llm-instruction-following-paper

## Contact

**Research Team:**
- Richard J. Young - ryoung@unlv.edu
- Brandon Gillins - bgillins@unlv.edu
- Alice M. Matthews - amatthews@unlv.edu

**Affiliation:** University of Nevada, Las Vegas

## Acknowledgments

- **OpenRouter** for unified API access to 256+ models
- **Model Providers** (OpenAI, Anthropic, Google, Meta, Qwen, DeepSeek, x-ai, and others)
- Open source community for evaluation tools and frameworks

## License

This code is released under the **MIT License**.

```
MIT License

Copyright (c) 2025 Richard J. Young, Brandon Gillins, Alice M. Matthews

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

---

**Repository Version:** 1.0
**Last Updated:** October 23, 2025
**Evaluation Date:** October 14, 2025