---
title: RuSimulBench Arena
emoji: 📊
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---
# Model Response Evaluator

This application evaluates model responses on both creativity metrics (scored with Gemini) and stability metrics (based on semantic similarity).
## Features

- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability
## Installation

1. Clone this repository (see the sketch after this list)
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Get a Gemini API key from Google AI Studio (https://makersuite.google.com/)
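
A typical setup might look like the following sketch; the repository URL is a placeholder and should be replaced with this Space's actual URL:

```bash
# Placeholder URL: substitute the real repository/Space URL
git clone https://huggingface.co/spaces/<namespace>/<space-name>
cd <space-name>
pip install -r requirements.txt
```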
## Usage

### Web Interface

```bash
python app.py --web
```
This starts a Gradio web interface where you can:

- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results
### Command Line

For batch evaluation of models from a CSV file:

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```

Optional arguments:

- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing prompts (default: "rus_prompt")
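
For example, to restrict the evaluation to specific models and use an explicit prompt column (the model names and file name below are placeholders, to be adapted to your CSV):

```bash
# Hypothetical invocation combining the optional arguments above
python app.py --gemini_api_key YOUR_API_KEY \
  --input_file your_responses.csv \
  --models "gpt-4,claude-3" \
  --prompt_col rus_prompt
```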
## CSV Format

Your CSV file should have these columns:

- A prompt column (default: "rus_prompt")
- One or more response columns whose names end in "_answers" (e.g., "gpt4_answers", "claude_answers")
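
A minimal input file might look like this (the prompt and response cells are placeholders):

```csv
rus_prompt,gpt4_answers,claude_answers
"<prompt in Russian>","<response from GPT-4>","<response from Claude>"
```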
## Evaluation Metrics

### Creativity Metrics

- **Креативность (Creativity)**: Uniqueness and originality of the response
- **Разнообразие (Diversity)**: Use of varied linguistic features
- **Релевантность (Relevance)**: How well the response addresses the prompt

### Stability Metrics

- **Stability Score**: Semantic similarity between prompts and responses

### Combined Score

- Average of the creativity and stability scores
## Output

The evaluation produces:

- CSV files with detailed per-response evaluations for each model
- A benchmark_results.csv file with aggregated metrics for all models
## Environment Variables

You can set the `GEMINI_API_KEY` environment variable instead of passing the key as a command-line argument.
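
For example, assuming a POSIX shell (the key value and file name are placeholders):

```bash
# Provide the key via the environment instead of --gemini_api_key
export GEMINI_API_KEY="your-api-key"
python app.py --input_file your_responses.csv
```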