sdk_version: 5.21.0
app_file: app.py
pinned: false
---

# Model Response Evaluator

This application evaluates model responses based on both creativity metrics (using Gemini) and stability metrics (using semantic similarity).

## Features

- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability

## Installation

1. Clone this repository
2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Get a Gemini API key from Google AI Studio (https://makersuite.google.com/)

## Usage

### Web Interface

```bash
python app.py --web
```

This will start a Gradio web interface where you can:
- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results

### Command Line

For batch evaluation of models from a CSV file:

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```

Optional arguments (combined in the example below):
- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing prompts (default: "rus_prompt")
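
For instance, to score only two specific models and read prompts from the default column:

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv \
  --models "gpt-4,claude-3" --prompt_col rus_prompt
```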

## CSV Format

Your CSV file should have these columns:
- A prompt column (default: "rus_prompt")
- One or more response columns with names ending in "_answers" (e.g., "gpt4_answers", "claude_answers")
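
A minimal example file (the prompt and response texts below are placeholders):

```csv
rus_prompt,gpt4_answers,claude_answers
"First prompt text","gpt-4 reply to the first prompt","claude reply to the first prompt"
"Second prompt text","gpt-4 reply to the second prompt","claude reply to the second prompt"
```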

## Evaluation Metrics

### Creativity Metrics
- **Креативность (Creativity)**: Uniqueness and originality of the response
- **Разнообразие (Diversity)**: Use of varied linguistic features
- **Релевантность (Relevance)**: How well the response addresses the prompt
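
These scores come from asking Gemini to grade each response. A rough sketch of the idea, where the model name and rubric wording are illustrative rather than the exact ones used in app.py:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
gemini = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

def rate_creativity(prompt: str, response: str) -> str:
    """Ask Gemini to grade a response on the three rubric criteria."""
    rubric = (
        "Rate the response from 1 to 10 on creativity, diversity, and "
        f"relevance to the prompt.\n\nPrompt: {prompt}\n\nResponse: {response}"
    )
    return gemini.generate_content(rubric).text
```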
| 72 |
+
### Stability Metrics
|
| 73 |
+
- **Stability Score**: Semantic similarity between prompts and responses
|
| 74 |
+
|
| 75 |
+
### Combined Score
|
| 76 |
+
- Average of creativity and stability scores
|
| 77 |
+
|
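
As a sketch of how these two numbers might be produced (the embedding model below is an assumption; app.py may use a different one):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def stability_score(prompt: str, response: str) -> float:
    """Cosine similarity between prompt and response embeddings."""
    emb = encoder.encode([prompt, response], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def combined_score(creativity: float, stability: float) -> float:
    """Plain average, assuming both scores share the same scale."""
    return (creativity + stability) / 2
```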

## Output

The evaluation produces:
- CSV files with detailed per-response evaluations for each model
- A benchmark_results.csv file with aggregated metrics for all models
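
The aggregated file can be inspected with pandas (the `combined_score` column name below is a guess; check the actual file header first):

```python
import pandas as pd

results = pd.read_csv("benchmark_results.csv")
print(results.head())

# Hypothetical column name; adjust to match the real output.
if "combined_score" in results.columns:
    print(results.sort_values("combined_score", ascending=False))
```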

## Environment Variables

You can set the `GEMINI_API_KEY` environment variable instead of passing it as an argument.
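
For example, in a POSIX shell:

```bash
export GEMINI_API_KEY="YOUR_API_KEY"
python app.py --input_file your_responses.csv
```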