eigentom committed on
Commit 90c099b · 1 Parent(s): 37d42f7

Initial Update

This view is limited to 50 files because the commit contains too many changes.
Files changed (50)
  1. README copy.md +313 -0
  2. README.md +4 -6
  3. app.py +470 -0
  4. example.py +54 -0
  5. gradio_app/__init__.py +9 -0
  6. gradio_app/app.py +466 -0
  7. gradio_app/components/__init__.py +73 -0
  8. gradio_app/components/formatters.py +504 -0
  9. gradio_app/components/header.py +39 -0
  10. gradio_app/components/results_panel.py +193 -0
  11. gradio_app/components/settings.py +82 -0
  12. gradio_app/components/styles.py +592 -0
  13. gradio_app/components/upload_section.py +117 -0
  14. gradio_app/utils_single_paper_inference.py +276 -0
  15. requirements.txt +26 -0
  16. scripts/gpt_oss_start_vllm_service.sh +47 -0
  17. scripts/start_load_balancer.sh +87 -0
  18. scripts/start_reranker_service.sh +116 -0
  19. scripts/start_vllm_with_balancer.sh +216 -0
  20. scripts/stop_reranker_services.sh +106 -0
  21. scripts/stop_vllm_services.sh +267 -0
  22. shared/configs/config.yaml +97 -0
  23. shared/configs/llm_service_config.yaml +57 -0
  24. shared/configs/prompts.yaml +580 -0
  25. shared/configs/reranker_endpoint_pool.txt +8 -0
  26. shared/configs/vllm_endpoint_pool.txt +7 -0
  27. shared/utils/__init__.py +113 -0
  28. shared/utils/asta_api_key_pool.py +205 -0
  29. shared/utils/gpt_service.py +210 -0
  30. shared/utils/json_parser.py +428 -0
  31. shared/utils/llm_service.py +64 -0
  32. shared/utils/llm_service_factory.py +191 -0
  33. shared/utils/load_balancer.py +382 -0
  34. shared/utils/mock_llm_service.py +280 -0
  35. shared/utils/prompt_loader.py +220 -0
  36. shared/utils/reranker.py +275 -0
  37. shared/utils/reranker_api_service.py +221 -0
  38. shared/utils/reranker_endpoint_pool.py +160 -0
  39. shared/utils/reranker_pool.py +78 -0
  40. shared/utils/review_logger.py +306 -0
  41. shared/utils/vllm_endpoint_pool.py +257 -0
  42. shared/utils/vllm_service.py +314 -0
  43. shared/utils/vllm_service_simple.py +314 -0
  44. src/__init__.py +6 -0
  45. src/evaluator/1_get_rubrics.py +601 -0
  46. src/evaluator/2_evaluate.py +1730 -0
  47. src/evaluator/2_evaluate_agenticreview.py +1866 -0
  48. src/evaluator/2_evaluate_aiscientist.py +1866 -0
  49. src/evaluator/2_evaluate_cyclereviewer.py +1837 -0
  50. src/evaluator/configs.yaml +38 -0
README copy.md ADDED
@@ -0,0 +1,313 @@
+ # ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
+
+ This repository accompanies the paper *"ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents"*. It contains the implementation of **ReviewGrounder**, a rubric-guided, tool-integrated multi-agent framework for generating substantive, evidence-grounded academic paper reviews.
+
+ ReviewGrounder addresses a key limitation of existing LLM-based reviewers, namely their tendency to produce superficial, formulaic comments lacking substantive feedback, by explicitly leveraging reviewer rubrics and contextual grounding in existing work.
+
+ ## System Architecture
+
+ ReviewGrounder implements a multi-agent framework with clear role separation:
+
+ ### Drafting Agent (`paper_reviewer.py`)
+ The **drafter** generates an initial review draft based solely on the paper content. This stage produces a structured review with strengths, weaknesses, suggestions, and questions, but may lack deep contextual grounding.
+
+ ### Grounding Agents
+
+ 1. **Related Work Searcher** (`related_work_searcher.py`):
+    - Generates search keywords from paper content
+    - Retrieves relevant papers via academic APIs
+    - Summarizes and analyzes related work
+    - Provides context for novelty assessment
+
+ 2. **Paper Results Analyzer** (`paper_results_analyzer.py`):
+    - Extracts and analyzes experimental sections
+    - Summarizes experimental setup, results, and findings
+    - Identifies limitations and gaps
+
+ 3. **Paper Insight Miner** (`paper_insight_miner.py`):
+    - Extracts key insights and contributions
+    - Identifies technical strengths and weaknesses
+
+ 4. **Review Refiner** (`review_refiner.py`):
+    - Synthesizes information from all grounding agents
+    - Refines the initial draft with evidence-based critiques
+    - Ensures suggestions are actionable and well-justified
+    - Maintains consistency across review sections
+
+ ### Evaluation System (`src/evaluator/`)
+ The **ReviewBench** evaluation framework:
+ - **Rubric Generation**: Creates paper-specific rubrics from venue guidelines, paper content, and human reviews
+ - **LLM-based Evaluation**: Deep qualitative assessment aligned with rubrics
+ - **Rule-based Metrics**: Quantitative metrics (MSE, MAE, Spearman correlation)
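The rule-based metrics named above (MSE, MAE, Spearman correlation) can be sketched as follows. This is an illustrative, numpy-only implementation for comparing predicted review scores against human scores; the function name and the tie-free ranking are assumptions, not the repository's actual `src/evaluator` code.

```python
import numpy as np

def _ranks(x: np.ndarray) -> np.ndarray:
    # Rank values 1..n (no tie handling; adequate for distinct scores)
    order = np.argsort(x)
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def rule_based_metrics(predicted, human) -> dict:
    """MSE, MAE, and Spearman correlation between predicted and human scores."""
    p = np.asarray(predicted, dtype=float)
    h = np.asarray(human, dtype=float)
    mse = float(np.mean((p - h) ** 2))
    mae = float(np.mean(np.abs(p - h)))
    # Spearman correlation = Pearson correlation of the rank vectors
    spearman = float(np.corrcoef(_ranks(p), _ranks(h))[0, 1])
    return {"mse": mse, "mae": mae, "spearman": spearman}

metrics = rule_based_metrics([6, 4, 8, 5], [7, 3, 8, 5])
```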
+
+ ## Installation
+
+ ### Prerequisites
+
+ - Python >= 3.8
+ - CUDA-capable GPU (for local vLLM deployment; optional if using the OpenAI API)
+ - Sufficient GPU memory for your chosen model (if using vLLM)
+
+ ### Setup
+
+ 1. Clone the repository:
+    ```bash
+    git clone <repository-url>
+    cd ReviewGrounder
+    ```
+
+ 2. Install dependencies:
+    ```bash
+    uv venv
+    source .venv/bin/activate
+    uv pip install -r requirements.txt
+    ```
+
+ 3. Configure your API keys and settings:
+    - Copy `shared/configs/config.yaml` and customize as needed
+    - Set environment variables:
+      - `ASTA_API_KEY`: For paper search via the Asta API (recommended)
+      - `OPENAI_API_KEY`: If using the OpenAI API instead of vLLM
+      - `S2_API_KEY`: Alternative paper search API (optional)
+
+ 4. (Optional) If using local vLLM, start your vLLM service:
+    ```bash
+    # Start a vLLM service on a single port
+    bash scripts/gpt_oss_start_vllm_service.sh
+
+    # Or start multiple services with load balancing
+    bash scripts/start_vllm_with_balancer.sh
+    ```
+
+ ## Quick Start
+
+ ### Basic Usage
+
+ Generate a review using the command-line interface:
+
+ ```bash
+ python -m src.reviewer_agent.cli --paper paper.json --output review.json
+ ```
+
+ Here `paper.json` contains your paper data in JSON format, with fields like `title`, `abstract`, `text`, etc.
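For reference, a minimal `paper.json` could be built like this. Only `title`, `abstract`, and `text` are named in this README; the field values below (and the absence of other fields) are illustrative assumptions, not the pipeline's full schema.

```python
import json

# Minimal input file for the CLI above; field values are placeholders.
paper = {
    "title": "A Study of Example Systems",
    "abstract": "We study example systems and report findings.",
    "text": "1. Introduction\nExample systems are widely used...",
}

with open("paper.json", "w", encoding="utf-8") as f:
    json.dump(paper, f, indent=2)
```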
+
+ ### Using the Python API
+
+ For programmatic access:
+
+ ```python
+ from src.reviewer_agent import review_paper_with_refiner
+
+ # Load your paper data
+ paper_data = {
+     "title": "Your Paper Title",
+     "abstract": "Paper abstract...",
+     "text": "Full paper text...",
+     # ... other fields
+ }
+
+ # Generate review (drafting + grounding stages)
+ review = review_paper_with_refiner(paper_data=paper_data)
+ print(review)
+ ```
+
+ The `review_paper_with_refiner` function implements the full ReviewGrounder pipeline:
+ 1. **Drafting**: Generates the initial review draft
+ 2. **Grounding**: Retrieves related work, analyzes results, extracts insights
+ 3. **Refinement**: Synthesizes all information into a refined, evidence-grounded review
+
+ ## Usage Examples
+
+ ### Generate a Review with Related Work Context
+
+ ```bash
+ python -m src.reviewer_agent.cli \
+     --paper paper.json \
+     --max-related-papers 15 \
+     --review-format detailed \
+     --output review.json
+ ```
+
+ ### Filter Related Work by Date and Venue
+
+ ```bash
+ python -m src.reviewer_agent.cli \
+     --paper paper.json \
+     --publication-date-range "2020:" \
+     --venues "ICLR,NeurIPS,ICML" \
+     --output review.json
+ ```
+
+ ### Use a Custom vLLM Endpoint
+
+ ```bash
+ python -m src.reviewer_agent.cli \
+     --paper paper.json \
+     --vllm-url "http://your-server:8000/v1" \
+     --output review.json
+ ```
+
+ ### Evaluate Reviews on ReviewBench
+
+ ```python
+ # 1. Generate reviews
+ from src.reviewer_agent import review_paper_with_refiner
+ review = review_paper_with_refiner(paper_data={...})
+
+ # 2. Evaluate reviews using ReviewBench
+ from src.evaluator import evaluate_reviews
+ results = evaluate_reviews(parquet_path="reviews.parquet")
+ ```
+
+ ## Directory Structure
+
+ ```
+ anonymize_codebase/
+ ├── src/
+ │   ├── reviewer_agent/                  # ReviewGrounder implementation
+ │   │   ├── __init__.py
+ │   │   ├── paper_reviewer.py            # Drafting agent
+ │   │   ├── review_refiner.py            # Grounding agent: review refinement
+ │   │   ├── related_work_searcher.py     # Grounding agent: literature search
+ │   │   ├── paper_results_summarizer.py  # Grounding agent: results analysis
+ │   │   ├── paper_insight_miner.py       # Grounding agent: insight extraction
+ │   │   ├── main_pipeline.py             # Full pipeline orchestration
+ │   │   ├── cli.py                       # Command-line interface
+ │   │   └── paper_search/                # Paper search APIs
+ │   │       ├── asta_api.py
+ │   │       ├── semantic_scholar_api.py
+ │   │       └── paper_retriever.py
+ │   │
+ │   └── evaluator/                       # ReviewBench evaluation framework
+ │       ├── 1_get_rubrics.py             # Rubric generation
+ │       ├── 2_evaluate.py                # Review evaluation
+ │       └── ...
+ │
+ ├── shared/
+ │   ├── utils/                           # Shared utilities
+ │   │   ├── llm_service.py               # LLM service abstraction
+ │   │   ├── load_balancer.py             # Load balancing for vLLM
+ │   │   ├── reranker.py                  # Paper reranking
+ │   │   └── ...
+ │   │
+ │   └── configs/                         # Configuration files
+ │       ├── config.yaml                  # Main config
+ │       ├── llm_service_config.yaml      # LLM service settings
+ │       └── prompts.yaml                 # Review generation prompts
+ │
+ ├── scripts/                             # Utility scripts
+ │   ├── start_vllm_with_balancer.sh
+ │   ├── start_load_balancer.sh
+ │   └── ...
+ │
+ ├── requirements.txt                     # Python dependencies
+ └── README.md                            # This file
+ ```
+
+ ## Configuration Guide
+
+ ### LLM Service Configuration
+
+ ReviewGrounder supports two LLM backends:
+
+ 1. **vLLM** (recommended for local deployment): Fast inference on local GPUs
+    - Default: GPT-OSS-120B for the grounding stage
+    - Can use smaller models (e.g., Phi-4-14B) for the drafting stage
+
+ 2. **OpenAI API**: Cloud-based, no local GPU required
+
+ Configure in `shared/configs/llm_service_config.yaml`:
+
+ ```yaml
+ vllm:
+   base_url: "http://localhost:8000/"
+   model_name: "openai/gpt-oss-120b"
+   max_tokens: 16384
+
+ gpt:
+   enabled: false
+   api_key: "your-api-key-here"
+   model_name: "gpt-4o"
+ ```
+
+ You can also assign a different backend to each agent:
+
+ ```yaml
+ llm_assignments:
+   keyword_generator: "vllm"   # For related work search
+   paper_summarizer: "vllm"    # For results summarization
+   reviewer: "vllm"            # For the drafting stage
+   refiner: "vllm"             # For the grounding/refinement stage
+ ```
+
+ ### Paper Search Configuration
+
+ Configure paper search APIs in `shared/configs/config.yaml`:
+
+ ```yaml
+ paper_search:
+   asta:
+     api_key: null  # Set via the ASTA_API_KEY env var
+     endpoint: "https://asta-tools.allen.ai/mcp/v1"
+
+   semantic_scholar:
+     api_key: null  # Set via the S2_API_KEY env var
+ ```
+
+ ### Review Format Options
+
+ Choose from the following review formats:
+ - `detailed`: Comprehensive review with all sections (default)
+ - `summary`: Concise review summary
+ - `structured`: Structured format with specific sections
+ - `strict_detailed`: Strict adherence to the detailed format requirements
+
+ ## Load Balancing for vLLM
+
+ For production use with multiple GPUs, you can set up load balancing:
+
+ ```bash
+ # Start 4 vLLM services on ports 8000-8003
+ bash scripts/gpt_oss_start_vllm_service.sh
+
+ # Start the load balancer on port 8004
+ python -m shared.utils.load_balancer \
+     --backends http://localhost:8000/v1 http://localhost:8001/v1 http://localhost:8002/v1 http://localhost:8003/v1 \
+     --port 8004 \
+     --strategy round_robin
+ ```
+
+ Then point your config to `http://localhost:8004/v1`.
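The `round_robin` strategy can be illustrated with a minimal standalone selector. This is a sketch of the idea only; the actual balancer lives in `shared/utils/load_balancer.py`, and the class and method names below are assumptions.

```python
import itertools

class RoundRobinBalancer:
    """Cycle through backend URLs in order, returning one per request."""

    def __init__(self, backends):
        # itertools.cycle repeats the backend list indefinitely
        self._cycle = itertools.cycle(backends)

    def next_backend(self) -> str:
        return next(self._cycle)

lb = RoundRobinBalancer([
    "http://localhost:8000/v1",
    "http://localhost:8001/v1",
])
```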
+
+ ## Evaluation: ReviewBench
+
+ ReviewGrounder is evaluated on **ReviewBench**, a benchmark that:
+
+ - Leverages paper-specific rubrics derived from:
+   - Official venue guidelines (e.g., ACL, ICML, NeurIPS, ICLR)
+   - Paper content
+   - Human-written reviews
+
+ - Evaluates reviews across diverse dimensions:
+   - Evidence-based critique
+   - Constructive tone
+   - Technical depth
+   - And more
+
+ - Measures both:
+   - Alignment with human judgments (scores, decisions)
+   - Rubric-based quality (beyond just outcome prediction)
+
+ See `src/evaluator/` for the evaluation framework implementation.
+
+ ## Citation
+
+ If you use ReviewGrounder in your research, please cite:
+
+ ```bibtex
+ @inproceedings{reviewgrounder2026,
+   title={ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents},
+   author={Anonymous},
+   booktitle={Proceedings of ACL 2026},
+   year={2026}
+ }
+ ```
README.md CHANGED
@@ -1,14 +1,12 @@
  ---
- title: InteractiveDemo
- emoji: 📉
- colorFrom: pink
- colorTo: yellow
+ title: Test Reviewgrounder
+ emoji: 💻
+ colorFrom: yellow
+ colorTo: blue
  sdk: gradio
  sdk_version: 6.5.1
  app_file: app.py
  pinned: false
- license: apache-2.0
- short_description: This is the interactive demo of Review Grounder
  ---
  
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py ADDED
@@ -0,0 +1,470 @@
+ """
+ Review Grounder - Gradio App
+
+ Main entry point for the Hugging Face Space.
+ This module orchestrates the UI components and handles the review pipeline.
+
+ The app allows users to:
+ 1. Upload a research paper in PDF format
+ 2. Configure LLM settings (optional, uses OpenAI defaults)
+ 3. Generate a comprehensive AI-powered review
+ 4. View intermediate results from each pipeline stage
+
+ Components are organized in the `components/` directory for maintainability.
+ """
+
+ from __future__ import annotations
+
+ import os
+ import re
+ import tempfile
+ from datetime import datetime
+ from pathlib import Path
+ from typing import Tuple, Iterator
+
+ import gradio as gr
+
+ # Import utility for running the review pipeline
+ from gradio_app.utils_single_paper_inference import (
+     run_single_paper_review_from_pdf_stepwise,
+ )
+
+ # Import UI components
+ from gradio_app.components import (
+     get_custom_css,
+     create_header,
+     create_upload_section,
+     create_advanced_settings,
+     create_results_panel,
+     format_initial_review_html,
+     format_related_work_html,
+     format_results_html,
+     format_insights_html,
+     format_final_review,
+     format_raw_json,
+ )
+
+
+ # ============================================================================
+ # App Configuration
+ # ============================================================================
+
+ APP_TITLE = "Review Grounder"
+ APP_DESCRIPTION = "AI-Powered Research Paper Review"
+
+
+ def _raw_json_md_to_file(raw_json_md: str) -> str:
+     """
+     Extract JSON from Raw JSON markdown (```json ... ```) and write to a temp file.
+     Returns the file path for gr.DownloadButton.
+     """
+     if not raw_json_md or not raw_json_md.strip():
+         with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False, encoding="utf-8") as f:
+             f.write("{}")
+         return f.name
+     text = raw_json_md.strip()
+     match = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", text)
+     if match:
+         text = match.group(1).strip()
+     fd, path = tempfile.mkstemp(suffix=".json", prefix="review_")
+     with os.fdopen(fd, "w", encoding="utf-8") as f:
+         f.write(text)
+     return path
+
+
+ # ============================================================================
+ # Environment Check
+ # ============================================================================
+
+ def _check_env() -> Tuple[bool, str]:
+     """
+     Check for required environment variables.
+
+     Returns:
+         Tuple of (success, message)
+     """
+     missing = []
+     if not os.environ.get("ASTA_API_KEY"):
+         missing.append("ASTA_API_KEY")
+
+     if missing:
+         return False, (
+             "Missing environment variables: "
+             + ", ".join(missing)
+             + ".\nPlease configure them in your Hugging Face Space settings."
+         )
+     return True, "Environment variables detected correctly."
+
+
+ # ============================================================================
+ # Review Pipeline Handler
+ # ============================================================================
+
+ def review_pdf_file(
+     file_obj,
+     api_base_url: str,
+     api_key: str,
+     model_name: str,
+     show_log: bool,
+     show_raw_json: bool,
+ ) -> Iterator[Tuple[str, str, str, str, str, str, str, gr.update, gr.update, gr.update]]:
+     """
+     Main callback: process PDF through the review pipeline with real-time updates.
+
+     Args:
+         file_obj: Uploaded PDF file
+         api_base_url: LLM API endpoint URL
+         api_key: API key for LLM provider
+         model_name: Model identifier
+         show_log: Whether to display the execution log
+         show_raw_json: Whether to display raw JSON output
+
+     Yields:
+         Tuple of all output component updates (no overview)
+     """
+     log_lines: list[str] = []
+
+     def _log(msg: str) -> None:
+         log_lines.append(f"[{datetime.now().strftime('%H:%M:%S')}] {msg}")
+
+     def _log_text() -> str:
+         return "\n".join(log_lines) if log_lines else ""
+
+     # Validate file upload
+     if file_obj is None:
+         gr.Warning("Please upload a PDF file to start the review.")
+         _log("⚠️ Please upload a PDF file to start the review.")
+         yield (
+             _log_text(), "", "", "", "", "", "",
+             gr.update(interactive=True),
+             gr.update(visible=show_log),
+             gr.update(visible=show_raw_json),
+         )
+         return
+
+     # Check environment
+     ok, msg = _check_env()
+     if not ok:
+         gr.Error(msg)
+         _log(f"❌ {msg}")
+         yield (
+             _log_text(), "", "", "", "", "", "",
+             gr.update(interactive=True),
+             gr.update(visible=show_log),
+             gr.update(visible=show_raw_json),
+         )
+         return
+
+     # Start pipeline
+     _log("🚀 Pipeline started.")
+     yield (
+         _log_text(), "", "", "", "", "", "",
+         gr.update(interactive=False),
+         gr.update(visible=show_log),
+         gr.update(visible=show_raw_json),
+     )
+
+     try:
+         # Normalize file path
+         if isinstance(file_obj, dict) and "name" in file_obj:
+             src_path = Path(file_obj["name"])
+         else:
+             src_path = Path(getattr(file_obj, "name", "") or str(file_obj))
+
+         if not src_path or not src_path.exists():
+             with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
+                 tmp_path = Path(tmp.name)
+                 if hasattr(file_obj, "read"):
+                     tmp.write(file_obj.read())
+             src_path = tmp_path
+
+         # Initialize output variables
+         status = f"📄 Extracting text from PDF: {src_path.name}..."
+         _log(status)
+         yield (
+             _log_text(), "", "", "", "", "", "",
+             gr.update(interactive=False),
+             gr.update(visible=show_log),
+             gr.update(visible=show_raw_json),
+         )
+
+         initial = ""
+         related_html = ""
+         results_html = ""
+         insights_html = ""
+         final_md = ""
+         raw_json = ""
+
+         # Run the stepwise pipeline
+         for ev in run_single_paper_review_from_pdf_stepwise(
+             str(src_path),
+             api_base_url=api_base_url or None,
+             api_key=api_key or None,
+             model_name=model_name or None,
+             enable_logging=True,
+             verbose=True,
+         ):
+             stage = ev.get("stage")
+
+             # Handle step-level errors
+             if stage == "results_analysis_error":
+                 err = ev.get("error", "Unknown error")
+                 gr.Warning(f"Results analysis failed: {err}")
+                 _log(f"⚠️ Results analysis failed: {err}")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+                 continue
+
+             if stage == "insights_error":
+                 err = ev.get("error", "Unknown error")
+                 gr.Warning(f"Insight mining failed: {err}")
+                 _log(f"⚠️ Insight mining failed: {err}")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+                 continue
+
+             if stage == "related_work_error":
+                 err = ev.get("error", "Unknown error")
+                 gr.Warning(f"Related work search failed: {err}")
+                 _log(f"⚠️ Related work search failed: {err}")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+                 continue
+
+             # Process each pipeline stage
+             if stage == "extract_pdf":
+                 status = f"📄 Extracting text from PDF: {src_path.name}..."
+                 _log(status)
+
+             elif stage == "parsed_pdf_text":
+                 _log("✅ Step 0: Extracting PDF text — done")
+                 _log("⏳ Step 1: Initial review draft — started")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+
+             elif stage == "initial_review":
+                 tmp = {"initial_review": ev.get("initial_review", {})}
+                 tmp["title"] = ev.get("title") or tmp["initial_review"].get("title")
+                 tmp["abstract"] = ev.get("abstract") or tmp["initial_review"].get("abstract")
+                 initial = format_initial_review_html(tmp)
+                 _log("✅ Step 1: Initial review draft — done")
+                 _log("⏳ Step 2: Results analysis — started")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+
+             elif stage == "results_analysis":
+                 tmp = {"results_analyzer_json": ev.get("results_analyzer_json")}
+                 results_html = format_results_html(tmp)
+                 _log("✅ Step 2: Results analysis — done")
+                 _log("⏳ Step 3: Insight mining — started")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+
+             elif stage == "insights":
+                 tmp = {"insight_miner_json": ev.get("insight_miner_json")}
+                 insights_html = format_insights_html(tmp)
+                 _log("✅ Step 3: Insight mining — done")
+                 _log("⏳ Step 4: Related work — started")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+
+             elif stage == "related_work":
+                 tmp = {
+                     "related_work_json_list": ev.get("related_work_json_list"),
+                     "search_keywords": ev.get("search_keywords"),
+                 }
+                 related_html = format_related_work_html(tmp)
+                 _log("✅ Step 4: Related work — done")
+                 _log("⏳ Step 5: Final refinement — started")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+
+             elif stage == "final":
+                 review = ev.get("review", {}) or {}
+                 initial = format_initial_review_html(review)
+                 related_html = format_related_work_html(review) if not related_html else related_html
+                 results_html = format_results_html(review) if not results_html else results_html
+                 insights_html = format_insights_html(review) if not insights_html else insights_html
+                 final_md = format_final_review(review)
+                 raw_json = format_raw_json(review)
+                 _log("✅ Step 5: Final refinement — done")
+                 _log(f"🎉 Review complete for: {src_path.name}")
+
+             else:
+                 _log(f"⏳ Working... ({stage})")
+
+             yield (
+                 _log_text(), initial, related_html, results_html,
+                 insights_html, final_md, raw_json,
+                 gr.update(interactive=False),
+                 gr.update(visible=show_log),
+                 gr.update(visible=show_raw_json),
+             )
+
+         # Re-enable button at end
+         yield (
+             _log_text(), initial, related_html, results_html,
+             insights_html, final_md, raw_json,
+             gr.update(interactive=True),
+             gr.update(visible=show_log),
+             gr.update(visible=show_raw_json),
+         )
+
+     except Exception as e:
+         import traceback
+         error_msg = f"❌ Error during review: {str(e)}"
+         error_details = traceback.format_exc()
+         gr.Error(f"{error_msg}\n\nDetails: {error_details[:500]}")
+         _log(error_msg)
+         yield (
+             _log_text(), "", "", "", "", "", "",
+             gr.update(interactive=True),
+             gr.update(visible=show_log),
+             gr.update(visible=show_raw_json),
+         )
+
+
+ # ============================================================================
+ # Build the Gradio App
+ # ============================================================================
+
+ with gr.Blocks(
+     title=APP_TITLE,
+     css=get_custom_css(),
+     theme=gr.themes.Soft(),
+ ) as demo:
+
+     # Header section
+     create_header()
+
+     # Main content: two-column layout
+     with gr.Row():
+         # Left column: Upload and settings
+         with gr.Column(scale=2, elem_classes=["panel-card"]):
+             pdf_input, run_button = create_upload_section()
+
+             # Advanced settings (collapsed by default)
+             (
+                 api_base_url_in,
+                 api_key_in,
+                 model_name_in,
+                 show_log_toggle,
+                 show_raw_json_toggle,
+             ) = create_advanced_settings()
+
+         # Right column: Results (built from component)
+         with gr.Column(scale=3, elem_classes=["panel-card", "results-panel"]):
+             (
+                 initial_html,
+                 results_html,
+                 insights_html,
+                 related_html,
+                 final_md,
+                 status_output,
+                 raw_json_md,
+                 log_accordion,
+                 raw_json_tab,
+                 download_json_btn,
+             ) = create_results_panel(show_log=False, show_raw_json=False)
+
+     # Toggle visibility of log accordion
+     show_log_toggle.change(
+         fn=lambda x: gr.update(visible=x),
+         inputs=[show_log_toggle],
+         outputs=[log_accordion],
+     )
+
+     # Toggle visibility of raw JSON tab
+     show_raw_json_toggle.change(
+         fn=lambda x: gr.update(visible=x),
+         inputs=[show_raw_json_toggle],
+         outputs=[raw_json_tab],
+     )
+
+     # Download raw JSON as file
+     download_json_btn.click(
+         fn=_raw_json_md_to_file,
+         inputs=[raw_json_md],
+         outputs=[download_json_btn],
+     )
+
+     # Main review button click handler
+     run_button.click(
+         fn=review_pdf_file,
+         inputs=[
+             pdf_input,
+             api_base_url_in,
+             api_key_in,
+             model_name_in,
+             show_log_toggle,
+             show_raw_json_toggle,
+         ],
+         outputs=[
+             status_output,
+             initial_html,
+             related_html,
+             results_html,
+             insights_html,
+             final_md,
+             raw_json_md,
+             run_button,
+             log_accordion,
+             raw_json_tab,
+         ],
+     )
+
+     # Footer
+     gr.HTML("""
+         <div class="app-footer">
+             <p>🔬 Review Grounder · AI-Powered Research Paper Review</p>
+             <p>© 2026 ReviewGrounder. All rights reserved.</p>
+         </div>
+     """)
+
+
+ # ============================================================================
+ # Entry Point
+ # ============================================================================
+
+ if __name__ == "__main__":
+     demo.launch()
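The fenced-block regex used by `_raw_json_md_to_file` in `app.py` can be exercised in isolation. This standalone sketch reproduces only the extraction step (the helper function name here is mine, not the app's):

```python
import re

def extract_fenced_json(md: str) -> str:
    """Pull the body out of a ```json ... ``` block; fall back to the raw text."""
    match = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", md.strip())
    return match.group(1).strip() if match else md.strip()

sample = "Here is the review:\n```json\n{\"score\": 7}\n```"
extracted = extract_fenced_json(sample)
```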
example.py ADDED
@@ -0,0 +1,54 @@
+ """
+ Example script for running a single-paper review.
+
+ This example is intentionally thin and delegates to the reusable
+ `review_single_paper_from_text` helper, which is what a Hugging Face
+ Space backend would call after performing PDF-to-text conversion.
+ """
+
+ import json
+ import sys
+ from pathlib import Path
+ import logging
+
+ # Add project root to path for imports
+ project_root = Path(__file__).parent.parent
+ if str(project_root) not in sys.path:
+     sys.path.insert(0, str(project_root))
+
+ from src.reviewer_agent.single_paper_inference import review_single_paper_from_text
+
+ logging.basicConfig(level=logging.INFO)
+ logging.getLogger("httpx").setLevel(logging.WARNING)
+ logger = logging.getLogger(__name__)
+
+
+ def main() -> None:
+     """Run a minimal single-paper review using the first example paper."""
+     json_path = project_root / "examples" / "example_papers.json"
+
+     with open(json_path, "r") as f:
+         data = json.load(f)
+
+     # For demonstration we take only the first paper
+     first_paper = data[0]
+     paper_text = first_paper.get("paper_context", "")
+
+     logger.info("Running single-paper review from example_papers.json...")
+     review = review_single_paper_from_text(
+         paper_text,
+         keywords=first_paper.get("keywords"),
+         # Use config defaults for review_format and verbosity
+         enable_logging=True,
+         verbose=False,
+     )
+
+     # Save the review content to a JSON file
+     with open("review_content.json", "w") as f:
+         json.dump([review], f, indent=2)
+
+     logger.info("Review saved to review_content.json")
+
+
+ if __name__ == "__main__":
+     main()
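`example.py` reads `examples/example_papers.json` and uses only the `paper_context` and `keywords` fields of the first entry. A minimal sketch of a compatible file follows; the field names come from how the script reads the data, while the values are placeholders, not real paper content:

```python
import json
import os
import tempfile

# Build a minimal example_papers.json with the two fields example.py reads.
# The values here are placeholders, not real paper data.
papers = [
    {
        "paper_context": "Full extracted text of the paper goes here ...",
        "keywords": ["peer review", "large language models"],
    }
]

path = os.path.join(tempfile.mkdtemp(), "example_papers.json")
with open(path, "w") as f:
    json.dump(papers, f, indent=2)

# Mirror how example.py consumes the file: take only the first entry.
with open(path) as f:
    data = json.load(f)

first_paper = data[0]
print(first_paper.get("keywords"))
```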
gradio_app/__init__.py ADDED
@@ -0,0 +1,9 @@
+ """
+ Gradio integration package for the anonymized review system.
+
+ This package contains:
+ - Lightweight utilities that wrap the single-paper review pipeline for UI use
+ - The Gradio app definition used for Hugging Face Spaces deployment
+ """
+
+ __all__ = ["app", "utils_single_paper_inference"]
gradio_app/app.py ADDED
@@ -0,0 +1,466 @@
+ """
+ Review Grounder - Gradio App
+
+ Main entry point for the Hugging Face Space.
+ This module orchestrates the UI components and handles the review pipeline.
+
+ The app allows users to:
+ 1. Upload a research paper in PDF format
+ 2. Configure LLM settings (optional, uses OpenAI defaults)
+ 3. Generate a comprehensive AI-powered review
+ 4. View intermediate results from each pipeline stage
+
+ Components are organized in the `components/` directory for maintainability.
+ """
+
+ from __future__ import annotations
+
+ import os
+ import re
+ import tempfile
+ from datetime import datetime
+ from pathlib import Path
+ from typing import Tuple, Iterator
+
+ import gradio as gr
+
+ # Import utility for running the review pipeline
+ from utils_single_paper_inference import (
+     run_single_paper_review_from_pdf_stepwise,
+ )
+
+ # Import UI components
+ from components import (
+     get_custom_css,
+     create_header,
+     create_upload_section,
+     create_advanced_settings,
+     create_results_panel,
+     format_initial_review_html,
+     format_related_work_html,
+     format_results_html,
+     format_insights_html,
+     format_final_review,
+     format_raw_json,
+ )
+
+
+ # ============================================================================
+ # App Configuration
+ # ============================================================================
+
+ APP_TITLE = "Review Grounder"
+ APP_DESCRIPTION = "AI-Powered Research Paper Review"
+
+
+ def _raw_json_md_to_file(raw_json_md: str) -> str:
+     """
+     Extract JSON from Raw JSON markdown (```json ... ```) and write to a temp file.
+     Returns the file path for gr.DownloadButton.
+     """
+     if not raw_json_md or not raw_json_md.strip():
+         with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False, encoding="utf-8") as f:
+             f.write("{}")
+         return f.name
+     text = raw_json_md.strip()
+     match = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", text)
+     if match:
+         text = match.group(1).strip()
+     fd, path = tempfile.mkstemp(suffix=".json", prefix="review_")
+     with os.fdopen(fd, "w", encoding="utf-8") as f:
+         f.write(text)
+     return path
+
+
+ # ============================================================================
+ # Environment Check
+ # ============================================================================
+
+ def _check_env() -> Tuple[bool, str]:
+     """
+     Check for required environment variables.
+
+     Returns:
+         Tuple of (success, message)
+     """
+     missing = []
+     if not os.environ.get("ASTA_API_KEY"):
+         missing.append("ASTA_API_KEY")
+
+     if missing:
+         return False, (
+             "Missing environment variables: "
+             + ", ".join(missing)
+             + ".\nPlease configure them in your Hugging Face Space settings."
+         )
+     return True, "Environment variables detected correctly."
+
+
+ # ============================================================================
+ # Review Pipeline Handler
+ # ============================================================================
+
+ def review_pdf_file(
+     file_obj,
+     api_base_url: str,
+     api_key: str,
+     model_name: str,
+     show_log: bool,
+     show_raw_json: bool,
+ ) -> Iterator[Tuple[str, str, str, str, str, str, str, gr.update, gr.update, gr.update]]:
+     """
+     Main callback: process PDF through the review pipeline with real-time updates.
+
+     Args:
+         file_obj: Uploaded PDF file
+         api_base_url: LLM API endpoint URL
+         api_key: API key for LLM provider
+         model_name: Model identifier
+         show_log: Whether to display the execution log
+         show_raw_json: Whether to display raw JSON output
+
+     Yields:
+         Tuple of all output component updates (no overview)
+     """
+     log_lines: list[str] = []
+
+     def _log(msg: str) -> None:
+         log_lines.append(f"[{datetime.now().strftime('%H:%M:%S')}] {msg}")
+
+     def _log_text() -> str:
+         return "\n".join(log_lines) if log_lines else ""
+
+     # Validate file upload
+     if file_obj is None:
+         gr.Warning("Please upload a PDF file to start the review.")
+         _log("⚠️ Please upload a PDF file to start the review.")
+         yield (
+             _log_text(), "", "", "", "", "", "",
+             gr.update(interactive=True),
+             gr.update(visible=show_log),
+             gr.update(visible=show_raw_json),
+         )
+         return
+
+     # Check environment
+     ok, msg = _check_env()
+     if not ok:
+         gr.Error(msg)
+         _log(f"❌ {msg}")
+         yield (
+             _log_text(), "", "", "", "", "", "",
+             gr.update(interactive=True),
+             gr.update(visible=show_log),
+             gr.update(visible=show_raw_json),
+         )
+         return
+
+     # Start pipeline
+     _log("🚀 Pipeline started.")
+     yield (
+         _log_text(), "", "", "", "", "", "",
+         gr.update(interactive=False),
+         gr.update(visible=show_log),
+         gr.update(visible=show_raw_json),
+     )
+
+     try:
+         # Normalize file path
+         if isinstance(file_obj, dict) and "name" in file_obj:
+             src_path = Path(file_obj["name"])
+         else:
+             src_path = Path(getattr(file_obj, "name", "") or str(file_obj))
+
+         if not src_path or not src_path.exists():
+             with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
+                 tmp_path = Path(tmp.name)
+                 if hasattr(file_obj, "read"):
+                     tmp.write(file_obj.read())
+             src_path = tmp_path
+
+         # Initialize output variables
+         status = f"📄 Extracting text from PDF: {src_path.name}..."
+         _log(status)
+         yield (
+             _log_text(), "", "", "", "", "", "",
+             gr.update(interactive=False),
+             gr.update(visible=show_log),
+             gr.update(visible=show_raw_json),
+         )
+
+         initial = ""
+         related_html = ""
+         results_html = ""
+         insights_html = ""
+         final_md = ""
+         raw_json = ""
+
+         # Run the stepwise pipeline
+         for ev in run_single_paper_review_from_pdf_stepwise(
+             str(src_path),
+             api_base_url=api_base_url or None,
+             api_key=api_key or None,
+             model_name=model_name or None,
+             enable_logging=True,
+             verbose=True,
+         ):
+             stage = ev.get("stage")
+
+             # Handle step-level errors
+             if stage == "results_analysis_error":
+                 err = ev.get("error", "Unknown error")
+                 gr.Warning(f"Results analysis failed: {err}")
+                 _log(f"⚠️ Results analysis failed: {err}")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+                 continue
+
+             if stage == "insights_error":
+                 err = ev.get("error", "Unknown error")
+                 gr.Warning(f"Insight mining failed: {err}")
+                 _log(f"⚠️ Insight mining failed: {err}")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+                 continue
+
+             if stage == "related_work_error":
+                 err = ev.get("error", "Unknown error")
+                 gr.Warning(f"Related work search failed: {err}")
+                 _log(f"⚠️ Related work search failed: {err}")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+                 continue
+
+             # Process each pipeline stage
+             if stage == "extract_pdf":
+                 status = f"📄 Extracting text from PDF: {src_path.name}..."
+                 _log(status)
+
+             elif stage == "parsed_pdf_text":
+                 _log("✅ Step 0: Extracting PDF text — done")
+                 _log("⏳ Step 1: Initial review draft — started")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+
+             elif stage == "initial_review":
+                 tmp = {"initial_review": ev.get("initial_review", {})}
+                 tmp["title"] = ev.get("title") or tmp["initial_review"].get("title")
+                 tmp["abstract"] = ev.get("abstract") or tmp["initial_review"].get("abstract")
+                 initial = format_initial_review_html(tmp)
+                 _log("✅ Step 1: Initial review draft — done")
+                 _log("⏳ Step 2: Results analysis — started")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+
+             elif stage == "results_analysis":
+                 tmp = {"results_analyzer_json": ev.get("results_analyzer_json")}
+                 results_html = format_results_html(tmp)
+                 _log("✅ Step 2: Results analysis — done")
+                 _log("⏳ Step 3: Insight mining — started")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+
+             elif stage == "insights":
+                 tmp = {"insight_miner_json": ev.get("insight_miner_json")}
+                 insights_html = format_insights_html(tmp)
+                 _log("✅ Step 3: Insight mining — done")
+                 _log("⏳ Step 4: Related work — started")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+
+             elif stage == "related_work":
+                 tmp = {
+                     "related_work_json_list": ev.get("related_work_json_list"),
+                     "search_keywords": ev.get("search_keywords"),
+                 }
+                 related_html = format_related_work_html(tmp)
+                 _log("✅ Step 4: Related work — done")
+                 _log("⏳ Step 5: Final refinement — started")
+                 yield (
+                     _log_text(), initial, related_html, results_html,
+                     insights_html, final_md, raw_json,
+                     gr.update(interactive=False),
+                     gr.update(visible=show_log),
+                     gr.update(visible=show_raw_json),
+                 )
+
+             elif stage == "final":
+                 review = ev.get("review", {}) or {}
+                 initial = format_initial_review_html(review)
+                 related_html = format_related_work_html(review) if not related_html else related_html
+                 results_html = format_results_html(review) if not results_html else results_html
+                 insights_html = format_insights_html(review) if not insights_html else insights_html
+                 final_md = format_final_review(review)
+                 raw_json = format_raw_json(review)
+                 _log("✅ Step 5: Final refinement — done")
+                 _log(f"🎉 Review complete for: {src_path.name}")
+
+             else:
+                 _log(f"⏳ Working... ({stage})")
+
+             yield (
+                 _log_text(), initial, related_html, results_html,
+                 insights_html, final_md, raw_json,
+                 gr.update(interactive=False),
+                 gr.update(visible=show_log),
+                 gr.update(visible=show_raw_json),
+             )
+
+         # Re-enable button at end
+         yield (
+             _log_text(), initial, related_html, results_html,
+             insights_html, final_md, raw_json,
+             gr.update(interactive=True),
+             gr.update(visible=show_log),
+             gr.update(visible=show_raw_json),
+         )
+
+     except Exception as e:
+         import traceback
+         error_msg = f"❌ Error during review: {str(e)}"
+         error_details = traceback.format_exc()
+         gr.Error(f"{error_msg}\n\nDetails: {error_details[:500]}")
+         _log(error_msg)
+         yield (
+             _log_text(), "", "", "", "", "", "",
+             gr.update(interactive=True),
+             gr.update(visible=show_log),
+             gr.update(visible=show_raw_json),
+         )
+
+
+ # ============================================================================
+ # Build the Gradio App
+ # ============================================================================
+
+ with gr.Blocks(title=APP_TITLE, css=get_custom_css(), theme=gr.themes.Soft()) as demo:
+
+     # Header section
+     create_header()
+
+     # Main content: two-column layout
+     with gr.Row():
+         # Left column: Upload and settings
+         with gr.Column(scale=2, elem_classes=["panel-card"]):
+             pdf_input, run_button = create_upload_section()
+
+             # Advanced settings (collapsed by default)
+             (
+                 api_base_url_in,
+                 api_key_in,
+                 model_name_in,
+                 show_log_toggle,
+                 show_raw_json_toggle,
+             ) = create_advanced_settings()
+
+         # Right column: Results (built from component)
+         with gr.Column(scale=3, elem_classes=["panel-card", "results-panel"]):
+             (
+                 initial_html,
+                 results_html,
+                 insights_html,
+                 related_html,
+                 final_md,
+                 status_output,
+                 raw_json_md,
+                 log_accordion,
+                 raw_json_tab,
+                 download_json_btn,
+             ) = create_results_panel(show_log=False, show_raw_json=False)
+
+     # Toggle visibility of log accordion
+     show_log_toggle.change(
+         fn=lambda x: gr.update(visible=x),
+         inputs=[show_log_toggle],
+         outputs=[log_accordion],
+     )
+
+     # Toggle visibility of raw JSON tab
+     show_raw_json_toggle.change(
+         fn=lambda x: gr.update(visible=x),
+         inputs=[show_raw_json_toggle],
+         outputs=[raw_json_tab],
+     )
+
+     # Download raw JSON as file
+     download_json_btn.click(
+         fn=_raw_json_md_to_file,
+         inputs=[raw_json_md],
+         outputs=[download_json_btn],
+     )
+
+     # Main review button click handler
+     run_button.click(
+         fn=review_pdf_file,
+         inputs=[
+             pdf_input,
+             api_base_url_in,
+             api_key_in,
+             model_name_in,
+             show_log_toggle,
+             show_raw_json_toggle,
+         ],
+         outputs=[
+             status_output,
+             initial_html,
+             related_html,
+             results_html,
+             insights_html,
+             final_md,
+             raw_json_md,
+             run_button,
+             log_accordion,
+             raw_json_tab,
+         ],
+     )
+
+     # Footer
+     gr.HTML("""
+     <div class="app-footer">
+         <p>🔬 Review Grounder · AI-Powered Research Paper Review</p>
+         <p>© 2026 ReviewGrounder. All rights reserved.</p>
+     </div>
+     """)
+
+
+ # ============================================================================
+ # Entry Point
+ # ============================================================================
+
+ if __name__ == "__main__":
+     demo.launch()
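`_raw_json_md_to_file` above relies on a single regex to pull the JSON body out of a fenced code block before writing it to a temp file. That extraction step can be exercised on its own; the regex below is copied verbatim from the app, while the standalone helper name is just for illustration:

```python
import re

def extract_fenced_json(markdown: str) -> str:
    # Same regex as _raw_json_md_to_file: capture the body of a fenced
    # (optionally json-tagged) code block, falling back to the raw text.
    text = markdown.strip()
    match = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", text)
    return match.group(1).strip() if match else text

print(extract_fenced_json('```json\n{"rating": 6}\n```'))  # fenced input
print(extract_fenced_json('{"rating": 6}'))                # already-bare input
```

The lazy quantifier (`*?`) keeps the match inside the first fence, and the fallback means already-bare JSON passes through unchanged.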
gradio_app/components/__init__.py ADDED
@@ -0,0 +1,73 @@
+ """
+ Components package for the Review Grounder Gradio app.
+
+ This package contains modular UI components for building
+ the Review Grounder interface.
+
+ Modules:
+ - styles: Custom CSS styles
+ - formatters: Data formatting utilities
+ - header: App header component
+ - upload_section: PDF upload and instructions
+ - settings: Advanced settings panel
+ - results_panel: Review results display
+ """
+
+ from .styles import get_custom_css
+ from .formatters import (
+     safe_json_parse,
+     format_overview,
+     format_initial_review,
+     format_initial_review_html,
+     format_related_work_html,
+     format_results_html,
+     format_insights_html,
+     format_final_review,
+     format_raw_json,
+ )
+ from .header import create_header
+ from .upload_section import (
+     create_how_it_works,
+     create_upload_area,
+     create_action_buttons,
+     create_upload_section,
+ )
+ from .settings import (
+     create_advanced_settings,
+     DEFAULT_API_ENDPOINT,
+     DEFAULT_MODEL_NAME,
+ )
+ from .results_panel import (
+     create_results_placeholder,
+     create_results_panel,
+ )
+
+
+ __all__ = [
+     # Styles
+     "get_custom_css",
+     # Formatters
+     "safe_json_parse",
+     "format_overview",
+     "format_initial_review",
+     "format_initial_review_html",
+     "format_related_work_html",
+     "format_results_html",
+     "format_insights_html",
+     "format_final_review",
+     "format_raw_json",
+     # Header
+     "create_header",
+     # Upload section
+     "create_how_it_works",
+     "create_upload_area",
+     "create_action_buttons",
+     "create_upload_section",
+     # Settings
+     "create_advanced_settings",
+     "DEFAULT_API_ENDPOINT",
+     "DEFAULT_MODEL_NAME",
+     # Results panel
+     "create_results_placeholder",
+     "create_results_panel",
+ ]
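The formatters re-exported above lean on two small helpers defined in `formatters.py` (`_escape_html` and `_nl2br`): escape the HTML metacharacters first, then convert newlines to `<br>` tags. A standalone sketch of that order of operations, with the same replacement chain as the module:

```python
def escape_html(text: str) -> str:
    # Escape the four HTML metacharacters; ampersand must go first so the
    # entities produced by the later replacements are not double-escaped.
    return (
        str(text)
        .replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
    )

def nl2br(text: str) -> str:
    # Escape first, then insert <br> tags; the reverse order would
    # escape the inserted tags themselves.
    return escape_html(text).replace("\n", "<br>\n") if text else ""

print(nl2br("R&D <draft>\nline two"))
```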
gradio_app/components/formatters.py ADDED
@@ -0,0 +1,504 @@
+ """
+ Formatting utilities for the Review Grounder Gradio app.
+
+ This module contains all functions for formatting review data
+ into displayable HTML or Markdown for the UI components.
+ """
+
+ from __future__ import annotations
+
+ import json
+ from typing import Any
+
+
+ def safe_json_parse(value: Any) -> Any:
+     """
+     Safely parse a JSON string or return the value if already parsed.
+
+     Args:
+         value: A JSON string or already-parsed object
+
+     Returns:
+         Parsed JSON object or None if parsing fails
+     """
+     if value is None:
+         return None
+     try:
+         if isinstance(value, str):
+             return json.loads(value)
+         return value
+     except Exception as e:
+         # Print out the exact error
+         print(f"Error parsing JSON: {e}")
+
+     return None
+
+
+ def format_overview(review: dict) -> str:
+     """
+     Format high-level overview: scores and keywords only.
+
+     Args:
+         review: The review dictionary containing scores and metadata
+
+     Returns:
+         Formatted Markdown string with scores and search keywords
+     """
+     if not review:
+         return "No review data."
+
+     scores = review.get("scores", {}) or {}
+     rating = scores.get("rating") or review.get("rating")
+     confidence = scores.get("confidence") or review.get("confidence")
+     decision = scores.get("decision") or review.get("decision")
+
+     parts = [
+         "### Scores",
+         f"- **Rating**: {rating if rating is not None else 'N/A'}",
+         f"- **Confidence**: {confidence if confidence is not None else 'N/A'}",
+         f"- **Decision**: {decision or 'N/A'}",
+     ]
+
+     keywords = review.get("search_keywords")
+     if keywords:
+         parts.append("")
+         parts.append("### Search Keywords")
+         parts.append("".join(f"- {k}\n" for k in keywords).rstrip())
+
+     return "\n".join(parts)
+
+
+ def _escape_html(text: str) -> str:
+     """Escape HTML special characters for safe display."""
+     if not text:
+         return ""
+     return (
+         str(text)
+         .replace("&", "&amp;")
+         .replace("<", "&lt;")
+         .replace(">", "&gt;")
+         .replace('"', "&quot;")
+     )
+
+
+ def format_initial_review(review: dict) -> str:
+     """
+     Format the initial draft review as plain text (legacy).
+     Prefer format_initial_review_html for UI display.
+     """
+     initial = review.get("initial_review")
+     if not initial:
+         return "Initial draft review not available (pipeline may have failed early)."
+     text = initial.get("review") or ""
+     if not text:
+         return json.dumps(initial, indent=2, ensure_ascii=False)
+     return text
+
+
+ def format_initial_review_html(review: dict) -> str:
+     """
+     Format the initial draft review as styled HTML cards (never a raw JSON string).
+
+     Renders Summary, scores, Strengths, Weaknesses, and Questions in a
+     card/section layout. If initial_review arrives as a JSON string, it is
+     parsed first and then rendered as cards.
+
+     Args:
+         review: The review dictionary containing initial_review
+
+     Returns:
+         HTML string for display in gr.HTML
+     """
+     initial = review.get("initial_review")
+     if not initial:
+         return "<p class='review-message'>Initial draft review not available (pipeline may have failed early).</p>"
+
+     # If the backend passed a JSON string, parse it to a dict so we can render cards
+     if isinstance(initial, str):
+         initial = safe_json_parse(initial) or {}
+         if not initial:
+             return "<p class='review-message'>Initial draft data could not be parsed.</p>"
+
+     # Prefer structured HTML when we have typical draft fields;
+     # otherwise fall back to rendering the single "review" text.
+     if not _looks_like_raw_json(initial):
+         print("[WARN] Initial draft has no structured fields; treating it as a single text.")
+         text = initial.get("review") or ""
+         if text:
+             return f'<div class="review-draft-content"><div class="review-text">{_nl2br(text)}</div></div>'
+
+     # Structured format: summary, scores, strengths, weaknesses, questions, etc.
+     html = '<div class="initial-draft-card card-grid"><div class="card"><h4>📝 Initial Draft</h4>'
+
+     summary = initial.get("summary") or ""
+     if summary:
+         html += f'<div class="kv"><div class="k">Summary</div><div class="v">{_escape_html(summary)}</div></div>'
+
+     score_fields = [
+         ("soundness", "Soundness"),
+         ("presentation", "Presentation"),
+         ("contribution", "Contribution"),
+         ("rating", "Rating"),
+         ("confidence", "Confidence"),
+         ("decision", "Decision"),
+     ]
+     for key, label in score_fields:
+         val = initial.get(key)
+         if val is not None and val != "":
+             html += f'<div class="kv"><div class="k">{label}</div><div class="v">{_escape_html(str(val))}</div></div>'
+
+     strengths = initial.get("strengths") or ""
+     if strengths:
+         html += f'<div class="kv"><div class="k">Strengths</div><div class="v">{_nl2br(strengths)}</div></div>'
+
+     weaknesses = initial.get("weaknesses") or ""
+     if weaknesses:
+         html += f'<div class="kv"><div class="k">Weaknesses</div><div class="v">{_nl2br(weaknesses)}</div></div>'
+
+     questions = initial.get("questions")
+     if questions:
+         if isinstance(questions, list):
+             q_text = "\n".join(f"• {q}" for q in questions if q)
+         else:
+             q_text = str(questions)
+         if q_text:
+             html += f'<div class="kv"><div class="k">Questions</div><div class="v">{_nl2br(q_text)}</div></div>'
+
+     html += "</div></div>"
+     return html
+
+
+ def _looks_like_raw_json(obj: Any) -> bool:
+     """Heuristic: if a dict has typical review keys (summary, strengths, ...), treat it as structured."""
+     if not isinstance(obj, dict):
+         return False
+     return any(k in obj for k in ("summary", "strengths", "weaknesses", "soundness", "rating"))
+
+
+ def _nl2br(text: str) -> str:
+     """Escape HTML and convert newlines to <br> for safe display."""
+     if not text:
+         return ""
+     return _escape_html(text).replace("\n", "<br>\n")
+
+
+ def format_related_work_html(review: dict) -> str:
+     """
+     Format related work as HTML with styled cards.
+
+     Args:
+         review: The review dictionary containing related_work_json_list
+
+     Returns:
+         HTML string with related work cards
+     """
+     rw = review.get("related_work_json_list")
+     if not rw:
+         return "<p>No related work information available.</p>"
+
+     try:
+         data = json.loads(rw) if isinstance(rw, str) else rw
+     except Exception:
+         return f"<p>Error parsing related work data: {str(rw)[:200]}</p>"
+
+     if not data:
+         return "<p>No related work summaries found.</p>"
+
+     html = '<div class="related-work-container"><h3>Related Work Summaries</h3>'
+
+     for idx, item in enumerate(data, start=1):
+         summary = item.get("summary", "").strip()
+         main_methods = item.get("main_methods", "").strip()
+         key_findings = item.get("key_findings", "").strip()
+         relation = item.get("relation", "").strip()
+
+         html += '<div class="related-paper-card">'
+         html += f'<div class="paper-header">{idx}. {summary[:100] or "Related paper"}...</div>'
+
+         if summary:
+             html += f'''
+             <div class="paper-field">
+                 <div class="paper-field-label">Summary</div>
+                 <div class="paper-field-value">{summary}</div>
+             </div>
+             '''
+         if main_methods:
+             html += f'''
+             <div class="paper-field">
+                 <div class="paper-field-label">Main Methods</div>
+                 <div class="paper-field-value">{main_methods}</div>
+             </div>
+             '''
+         if key_findings:
+             html += f'''
+             <div class="paper-field">
+                 <div class="paper-field-label">Key Findings</div>
+                 <div class="paper-field-value">{key_findings}</div>
+             </div>
+             '''
+         if relation:
+             html += f'''
+             <div class="paper-field">
+                 <div class="paper-field-label">Relation</div>
+                 <div class="paper-field-value">{relation}</div>
+             </div>
+             '''
+         html += "</div>"
+
+     html += "</div>"
+     return html
+
+
+ def _render_review_issues_html(issues: dict) -> str:
+     """
+     Render review issues (incorrect, missing, needs specificity) as HTML.
+
+     Args:
+         issues: Dictionary containing issue categories
+
+     Returns:
+         HTML string with formatted issues
+     """
+     if not issues:
+         return ""
+
+     def render_issue_list(title: str, items: list) -> str:
+         if not items:
+             return ""
+         blocks = []
+         for it in items:
+             if isinstance(it, dict):
+                 head = (
+                     it.get("review_claim")
+                     or it.get("what_missing")
+                     or it.get("review_text")
+                     or "Issue"
+                 )
+                 body_parts = []
+                 for k in ["why_wrong", "why_important", "how_to_fix", "evidence"]:
+                     if it.get(k):
+                         body_parts.append(
+                             f"<div class='k'>{k.replace('_', ' ').title()}</div>"
+                             f"<div class='v'>{it.get(k)}</div>"
+                         )
+                 body = "".join(body_parts) or (
+                     f"<div class='v mono'>{json.dumps(it, indent=2, ensure_ascii=False)}</div>"
+                 )
+                 blocks.append(f"<details><summary>{head}</summary>{body}</details>")
+             else:
+                 blocks.append(f"<div class='v'>{str(it)}</div>")
+         return (
+             f'<div class="kv"><div class="k">{title}</div><div class="v">'
+             + "".join(blocks)
+             + "</div></div>"
+         )
+
+     html = ""
+     html += render_issue_list("Incorrect / Hallucinated", issues.get("incorrect_or_hallucinated", []))
+     html += render_issue_list("Missing Key Points", issues.get("missing_key_points", []))
+     html += render_issue_list("Needs Specificity", issues.get("needs_specificity", []))
+     return html
+
+
+ def _render_rewrite_suggestions_html(suggestions: list) -> str:
+     """
+     Render rewrite suggestions as collapsible HTML blocks.
+
+     Args:
+         suggestions: List of rewrite suggestion dictionaries
+
+     Returns:
+         HTML string with formatted suggestions
+     """
+     if not suggestions:
+         return ""
+
+     blocks = []
+     for s in suggestions:
+         if isinstance(s, dict):
+             head = f"{s.get('apply_to', 'Rewrite')} · {s.get('target', '')}".strip(" ·")
+             suggested = s.get("suggested_text", "")
+             evidence = s.get("evidence", "")
+             body = ""
+             if suggested:
+                 body += f"<div class='k'>Suggested Text</div><div class='v'>{suggested}</div>"
+             if evidence:
+                 body += f"<div class='k'>Evidence</div><div class='v'>{evidence}</div>"
+             blocks.append(
+                 f"<details><summary>{head or 'Rewrite suggestion'}</summary>{body}</details>"
+             )
+         else:
+             blocks.append(f"<div class='v'>{str(s)}</div>")
+
+     return (
+         '<div class="kv"><div class="k">Rewrite Suggestions</div><div class="v">'
+         + "".join(blocks)
+         + "</div></div>"
+     )
+
+
+ def format_results_html(review: dict) -> str:
+     """
+     Format results analyzer output as structured HTML cards.
+
+     Args:
+         review: The review dictionary containing results_analyzer_json
+
+     Returns:
+         HTML string with formatted results analysis
+     """
+     parsed = safe_json_parse(review.get("results_analyzer_json"))
+     if not parsed:
+         return "<p>Unable to parse the results analysis data.</p>"
+
+     facts = parsed.get("facts", {}) if isinstance(parsed, dict) else {}
+     datasets = facts.get("datasets", [])
+     metrics = facts.get("metrics", [])
+     baselines = facts.get("baselines", [])
+     key_results = facts.get("key_results", [])
+     review_issues = parsed.get("review_issues", {}) if isinstance(parsed, dict) else {}
+     rewrite_suggestions = parsed.get("rewrite_suggestions", []) if isinstance(parsed, dict) else []
+
+     html = '<div class="card-grid">'
+     html += '<div class="card"><h4>Results Analysis <span class="pill">structured</span></h4>'
+
+     if datasets:
+         html += (
+             '<div class="kv"><div class="k">Datasets</div><div class="v">'
+             + "\n".join(f"- {x}" for x in datasets)
+             + "</div></div>"
+         )
+     if metrics:
+         html += (
+             '<div class="kv"><div class="k">Metrics</div><div class="v">'
+             + "\n".join(f"- {x}" for x in metrics)
+             + "</div></div>"
+         )
+     if baselines:
+         html += (
+             '<div class="kv"><div class="k">Baselines</div><div class="v">'
+             + "\n".join(f"- {x}" for x in baselines)
+             + "</div></div>"
+         )
+     if key_results:
+         items = []
+         for kr in key_results:
+             if isinstance(kr, dict):
+                 claim = kr.get("claim", "")
+                 evidence = kr.get("evidence", "")
+                 block = (
+                     f"<details><summary>{claim or 'Key result'}</summary>"
+                     f"<div class='v'>{evidence}</div></details>"
+                 )
+                 items.append(block)
+         if items:
+             html += (
+                 '<div class="kv"><div class="k">Key Results (claim → evidence)</div>'
+                 '<div class="v">' + "".join(items) + "</div></div>"
+             )
+
+     if review_issues:
+         html += _render_review_issues_html(review_issues)
+     if rewrite_suggestions:
+         html += _render_rewrite_suggestions_html(rewrite_suggestions)
+
+     html += "</div></div>"
+     return html
+
+
+ def format_insights_html(review: dict) -> str:
+     """
+     Format insight miner output as structured HTML cards.
+
+     Args:
+         review: The review dictionary containing insight_miner_json
+
+     Returns:
+         HTML string with formatted insights
+     """
+     parsed = safe_json_parse(review.get("insight_miner_json"))
+     if not parsed:
+         return "<p>Unable to parse the insights data.</p>"
+
+     facts = parsed.get("facts", {}) if isinstance(parsed, dict) else {}
+     review_issues = parsed.get("review_issues", {}) if isinstance(parsed, dict) else {}
+     rewrite_suggestions = parsed.get("rewrite_suggestions", []) if isinstance(parsed, dict) else []
+
+     def render_list(title: str, items: list) -> str:
+         if not items:
+             return ""
+         blocks = []
+         for it in items:
+             if isinstance(it, dict):
+                 head = (
+                     it.get("claim")
+                     or it.get("point")
+                     or it.get("item")
+                     or it.get("what_missing")
+                     or "Item"
+                 )
+                 evidence = (
+                     it.get("evidence")
+                     or it.get("why_important")
+                     or it.get("why_wrong")
+                     or it.get("how_to_fix")
+                     or ""
+                 )
+ blocks.append(
451
+ f"<details><summary>{head}</summary><div class='v'>{evidence}</div></details>"
452
+ )
453
+ else:
454
+ blocks.append(f"<div class='v'>{str(it)}</div>")
455
+ return (
456
+ f'<div class="kv"><div class="k">{title}</div><div class="v">'
457
+ + "".join(blocks)
458
+ + "</div></div>"
459
+ )
460
+
461
+ html = '<div class="card-grid">'
462
+ html += '<div class="card"><h4>Paper Insights <span class="pill">structured</span></h4>'
463
+
464
+ html += render_list("Core Contributions", facts.get("core_contributions", []))
465
+ html += render_list("Method Summary", facts.get("method_summary", []))
466
+ html += render_list("Assumptions & Scope", facts.get("assumptions_and_scope", []))
467
+ html += render_list("Novelty Claims (paper)", facts.get("novelty_claims_in_paper", []))
468
+
469
+ if review_issues:
470
+ html += _render_review_issues_html(review_issues)
471
+ if rewrite_suggestions:
472
+ html += _render_rewrite_suggestions_html(rewrite_suggestions)
473
+
474
+ html += "</div></div>"
475
+ return html
476
+
477
+
478
+ def format_final_review(review: dict) -> str:
479
+ """
480
+ Extract and return the final review markdown.
481
+
482
+ Args:
483
+ review: The review dictionary containing the final review
484
+
485
+ Returns:
486
+ The final review markdown string
487
+ """
488
+ return review.get("review_markdown") or review.get("review") or "Final review markdown missing."
489
+
490
+
491
+ def format_raw_json(review: dict) -> str:
492
+ """
493
+ Format the complete review as a JSON code block.
494
+
495
+ Args:
496
+ review: The complete review dictionary
497
+
498
+ Returns:
499
+ JSON formatted as a Markdown code block
500
+ """
501
+ try:
502
+ return "```json\n" + json.dumps(review, indent=2, ensure_ascii=False) + "\n```"
503
+ except Exception:
504
+ return str(review)
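The fenced-JSON fallback in `format_raw_json` is easy to sanity-check in isolation. Below is a hypothetical standalone copy of that helper (the `FENCE` constant is only there so this snippet does not contain a literal triple-backtick; the committed code writes the fence inline):

```python
import json

FENCE = "`" * 3  # triple backtick, spelled out to keep this snippet fence-safe


def format_raw_json(review: dict) -> str:
    # Pretty-print the review dict as a fenced JSON block; fall back to
    # plain str() when the payload is not JSON-serializable.
    try:
        return FENCE + "json\n" + json.dumps(review, indent=2, ensure_ascii=False) + "\n" + FENCE
    except Exception:
        return str(review)
```

The broad `except Exception` is deliberate here: anything a caller stuffs into the review dict (sets, custom objects) should degrade to a readable string rather than crash the UI.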
gradio_app/components/header.py ADDED
@@ -0,0 +1,39 @@
+ """
+ Header component for the Review Grounder Gradio app.
+
+ This module provides the app header with gradient background,
+ title, BETA badge, and privacy notice banner.
+ """
+
+ import gradio as gr
+
+
+ def create_header() -> None:
+     """
+     Create the app header with gradient background and privacy notice.
+
+     Renders:
+         - Purple gradient header with title and BETA badge
+         - Privacy notice banner explaining the demo mode
+     """
+     # Main header with gradient background
+     gr.HTML("""
+     <div class="app-header">
+         <div class="app-header-content">
+             <h1 class="app-title" style="margin-bottom: 12px;">
+                 🔬 Review Grounder
+                 <span class="beta-badge">BETA</span>
+             </h1>
+         </div>
+
+
+         <div class="privacy-notice">
+             <span class="privacy-icon">🔒</span>
+             <div class="privacy-content">
+                 <h4 style="color: white;">Privacy Notice: Anonymous Demo</h4>
+                 <p style="color: white;">This is an anonymous demonstration. We do not save your PDF file, paper information, or any uploaded content.</p>
+             </div>
+         </div>
+
+     </div>
+     """)
gradio_app/components/results_panel.py ADDED
@@ -0,0 +1,193 @@
+ """
+ Results panel component for the Review Grounder Gradio app.
+
+ This module provides the right panel with tabs for displaying
+ different stages of the review pipeline results.
+ """
+
+ import gradio as gr
+ from typing import Tuple
+
+
+ def create_results_placeholder() -> str:
+     """
+     Return the HTML for the initial placeholder state.
+
+     Returns:
+         HTML string for the "Ready to Review" placeholder
+     """
+     return """
+     <div class="results-placeholder">
+         <div class="results-placeholder-icon">📋</div>
+         <h3>Ready to Review</h3>
+         <p>Upload your PDF and click "🚀 Generate AI Review" to get comprehensive feedback on your research.</p>
+     </div>
+     """
+
+ def create_initial_draft_placeholder() -> str:
+     """
+     Return the HTML for the initial draft placeholder state.
+
+     Returns:
+         HTML string for the initial draft placeholder
+     """
+     return """
+     <div class="results-placeholder">
+         <div class="results-placeholder-icon">📝</div>
+         <h3>Initial Draft</h3>
+         <p>The draft of the paper review will appear here.</p>
+     </div>
+     """
+
+ def create_results_analyzer_placeholder() -> str:
+     """
+     Return the HTML for the results analyzer placeholder state.
+
+     Returns:
+         HTML string for the results analyzer placeholder
+     """
+     return """
+     <div class="results-placeholder">
+         <div class="results-placeholder-icon">📈</div>
+         <h3>Results Analyzer</h3>
+         <p>The analysis of the paper's experiments and results will appear here.</p>
+     </div>
+     """
+
+ def create_insights_miner_placeholder() -> str:
+     """
+     Return the HTML for the insights miner placeholder state.
+
+     Returns:
+         HTML string for the insights miner placeholder
+     """
+     return """
+     <div class="results-placeholder">
+         <div class="results-placeholder-icon">💡</div>
+         <h3>Insight Miner</h3>
+         <p>The insights retrieved from the paper's content will appear here.</p>
+     </div>
+     """
+
+ def create_related_work_placeholder() -> str:
+     """
+     Return the HTML for the related work placeholder state.
+
+     Returns:
+         HTML string for the related work placeholder
+     """
+     return """
+     <div class="results-placeholder">
+         <div class="results-placeholder-icon">📚</div>
+         <h3>Related Work</h3>
+         <p>The curated research papers and their summaries related to your uploaded paper will appear here.</p>
+     </div>
+     """
+
+ def create_final_review_placeholder() -> str:
+     """
+     Return the HTML for the final review placeholder state.
+
+     Returns:
+         HTML string for the final review placeholder
+     """
+     return """
+     <div class="results-placeholder">
+         <div class="results-placeholder-icon">🎯</div>
+         <h3>Final Review</h3>
+         <p>The final refined review of the paper will appear here. It is the refinement of the initial draft based on the joint information from the results analyzer, insight miner, and related work.</p>
+     </div>
+     """
+
+
+ def create_results_panel(
+     show_log: bool = False,
+     show_raw_json: bool = False,
+ ) -> Tuple[gr.HTML, gr.HTML, gr.HTML, gr.HTML, gr.Markdown, gr.Textbox, gr.Markdown, gr.Accordion, gr.Tab, gr.DownloadButton]:
+     """
+     Create the results panel with tabs for different pipeline stages.
+
+     Returns:
+         Tuple including download_json_btn for downloading raw JSON.
+     """
+     gr.HTML('<div class="panel-title">📝 AI Review Results</div>')
+
+     # Status log (conditionally visible)
+     with gr.Accordion("📋 Pipeline Log", open=False, visible=show_log) as log_accordion:
+         status_output = gr.Textbox(
+             value="Ready. Upload a PDF and click '🚀 Generate AI Review' to start.",
+             lines=8,
+             max_lines=20,
+             interactive=False,
+             autoscroll=True,
+             elem_classes=["status-log"],
+             show_label=False,
+         )
+
+     # Main results tabs
+     with gr.Tabs():
+         with gr.Tab("🎯 Final Review", id="final"):
+             gr.HTML("""
+             <div class="final-review-toolbar">
+                 <button type="button" class="copy-final-btn" onclick="(function(){
+                     var el = document.getElementById('final-review-md');
+                     if (!el) el = document.querySelector('[id=\\'final-review-md\\']');
+                     if (!el) el = document.querySelector('.final-review-wrap .gr-markdown, .final-review-wrap [class*=\\'markdown\\']');
+                     var text = el ? (el.innerText || el.textContent || '') : '';
+                     if (text && navigator.clipboard && navigator.clipboard.writeText) {
+                         navigator.clipboard.writeText(text).then(function(){ alert('Copied to clipboard'); }).catch(function(){ alert('Copy failed'); });
+                     } else { alert('Nothing to copy'); }
+                 })();" title="Copy full review text">📋 Copy to clipboard</button>
+             </div>
+             """)
+             with gr.Group(elem_classes=["final-review-wrap"]):
+                 final_md = gr.Markdown(
+                     value=create_results_placeholder(),
+                     label="Final Review",
+                     elem_id="final-review-md",
+                 )
+
+         with gr.Tab("📝 Initial Draft", id="initial"):
+             initial_html = gr.HTML(
+                 value=create_initial_draft_placeholder(),
+                 label="Initial Draft",
+             )
+
+         with gr.Tab("📈 Results Analyzer", id="results"):
+             results_html = gr.HTML(
+                 value=create_results_analyzer_placeholder(),
+                 label="Results Analyzer",
+             )
+
+         with gr.Tab("💡 Insight Miner", id="insights"):
+             insights_html = gr.HTML(
+                 value=create_insights_miner_placeholder(),
+                 label="Insight Miner",
+             )
+
+         with gr.Tab("📚 Related Work", id="related"):
+             related_html = gr.HTML(
+                 value=create_related_work_placeholder(),
+                 label="Related Work",
+             )
+
+         # Raw JSON tab (conditionally visible based on toggle)
+         with gr.Tab("🔧 Raw JSON", id="raw_json", visible=show_raw_json) as raw_json_tab:
+             raw_json_md = gr.Markdown(
+                 value="Raw JSON output for debugging will appear here.",
+                 label="Raw JSON",
+             )
+             download_json_btn = gr.DownloadButton("⬇️ Download as JSON")
+
+     return (
+         initial_html,
+         results_html,
+         insights_html,
+         related_html,
+         final_md,
+         status_output,
+         raw_json_md,
+         log_accordion,
+         raw_json_tab,
+         download_json_btn,
+     )
gradio_app/components/settings.py ADDED
@@ -0,0 +1,82 @@
+ """
+ Settings component for the Review Grounder Gradio app.
+
+ This module provides the advanced settings panel with:
+ - LLM endpoint configuration
+ - API key input
+ - Model name selection
+ - Toggle for showing/hiding log and raw JSON
+ """
+
+ import gradio as gr
+ from typing import Tuple
+
+
+ # Default values for OpenAI API
+ DEFAULT_API_ENDPOINT = "https://api.openai.com/v1"
+ DEFAULT_MODEL_NAME = "gpt-4o"
+
+
+ def create_advanced_settings() -> Tuple[gr.Textbox, gr.Textbox, gr.Textbox, gr.Checkbox, gr.Checkbox]:
+     """
+     Create the advanced settings accordion with LLM configuration.
+
+     The accordion is collapsed by default to hide technical details
+     from casual users, while allowing power users to customize.
+
+     Returns:
+         Tuple containing:
+         - api_base_url_in: Textbox for LLM endpoint URL
+         - api_key_in: Textbox for API key (password masked)
+         - model_name_in: Textbox for model name
+         - show_log_toggle: Checkbox for showing/hiding log
+         - show_raw_json_toggle: Checkbox for showing/hiding raw JSON
+     """
+     with gr.Accordion(
+         "⚙️ Advanced Settings",
+         open=False,
+         elem_classes=["advanced-settings"]
+     ):
+         gr.Markdown("""
+         Configure your LLM provider and display preferences.
+         Leave fields empty to use environment variables.
+         """)
+
+         api_base_url_in = gr.Textbox(
+             label="🔗 LLM Endpoint (base_url)",
+             placeholder="https://api.openai.com/v1",
+             value=DEFAULT_API_ENDPOINT,
+             info="The base URL for your LLM API (OpenAI, OpenRouter, local, etc.)",
+         )
+
+         api_key_in = gr.Textbox(
+             label="🔑 API Key",
+             type="password",
+             placeholder="sk-...",
+             value="",
+             info="Your API key (leave empty to use OPENAI_API_KEY env var)",
+         )
+
+         model_name_in = gr.Textbox(
+             label="🤖 Model Name",
+             placeholder="gpt-4o",
+             value=DEFAULT_MODEL_NAME,
+             info="Model identifier (e.g., gpt-4o, gpt-4-turbo, claude-3-opus)",
+         )
+
+         gr.Markdown("---")
+         gr.Markdown("**Display Options**")
+
+         show_log_toggle = gr.Checkbox(
+             label="📋 Show Pipeline Log",
+             value=False,
+             info="Display detailed execution log during processing",
+         )
+
+         show_raw_json_toggle = gr.Checkbox(
+             label="📄 Show Raw JSON Output",
+             value=False,
+             info="Display raw JSON data in results (for debugging)",
+         )
+
+     return api_base_url_in, api_key_in, model_name_in, show_log_toggle, show_raw_json_toggle
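The help text above says empty fields fall back to environment variables; the fallback itself happens in the app wiring, not in this component. A minimal sketch of that resolution logic, with `resolve_setting` as a hypothetical helper name:

```python
import os


def resolve_setting(user_value: str, env_var: str, default: str = "") -> str:
    # Prefer an explicit value typed into the UI; otherwise fall back to
    # the named environment variable, then to a default.
    value = (user_value or "").strip()
    if value:
        return value
    return os.environ.get(env_var, default)
```

For example, the API key field would resolve as `resolve_setting(api_key_value, "OPENAI_API_KEY")`, matching the behavior the field's `info` string promises.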
gradio_app/components/styles.py ADDED
@@ -0,0 +1,592 @@
+ """
+ Custom CSS styles for the Review Grounder Gradio app.
+
+ This module provides all custom CSS needed for the modern UI design,
+ including the gradient header, card layouts, buttons, and other visual elements.
+ """
+
+
+ def get_custom_css() -> str:
+     """
+     Return the complete custom CSS for the app.
+
+     Includes styles for:
+     - Gradient header with purple theme
+     - Privacy notice banner
+     - How it works section with numbered steps
+     - Card-based layouts
+     - Custom button styling
+     - Results panel tabs
+     """
+     return """
+     /* ===== Global Styles ===== */
+     .gradio-container {
+         max-width: 1400px !important;
+         margin: 0 auto !important;
+         font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif !important;
+     }
+
+     /* ===== Header Styles ===== */
+     .app-header {
+         background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+         padding: 20px 30px;
+         border-radius: 12px;
+         margin-bottom: 20px;
+         position: relative;
+         overflow: hidden;
+         display: flex;
+         flex-direction: column;
+         width: calc(100% + 24px) !important;
+         margin-left: -12px;
+         margin-right: -12px;
+         align-items: stretch; /* Stretch children to fill container width */
+         justify-content: flex-start; /* Align children from top to bottom */
+     }
+
+     .app-header::before {
+         content: '';
+         position: absolute;
+         top: 0;
+         left: 0;
+         right: 0;
+         bottom: 0;
+         background-image: radial-gradient(circle at 20% 50%, rgba(255,255,255,0.1) 1px, transparent 1px),
+                           radial-gradient(circle at 80% 50%, rgba(255,255,255,0.1) 1px, transparent 1px);
+         background-size: 40px 40px;
+     }
+
+     .app-header-content {
+         position: relative;
+         z-index: 1;
+         display: flex;
+         justify-content: space-between;
+         align-items: flex-start;
+         gap: 12px;
+     }
+
+
+
+     .app-title {
+         color: white;
+         font-size: 1.8em;
+         font-weight: 700;
+         margin: 0;
+         display: flex;
+         align-items: center;
+         gap: 12px;
+     }
+
+     .beta-badge {
+         background: rgba(255,255,255,0.2);
+         color: white;
+         padding: 4px 12px;
+         border-radius: 20px;
+         font-size: 0.5em;
+         font-weight: 600;
+         text-transform: uppercase;
+         letter-spacing: 0.5px;
+     }
+
+     .back-button {
+         background: rgba(255,255,255,0.2);
+         color: white;
+         padding: 8px 16px;
+         border-radius: 8px;
+         text-decoration: none;
+         font-size: 0.9em;
+         transition: background 0.2s;
+     }
+
+     .back-button:hover {
+         background: rgba(255,255,255,0.3);
+     }
+
+     /* ===== Privacy Notice Banner ===== */
+     .privacy-notice {
+         background: linear-gradient(90deg, #5b4fa8 0%, #7c3aed 100%);
+         color: white;
+
+         padding: 12px 20px;
+         border-radius: 8px;
+         margin-top: 20px;
+         margin-bottom: 20px;
+         display: flex;
+         align-items: flex-start;
+         gap: 12px;
+     }
+
+     .privacy-icon {
+         font-size: 1.2em;
+         margin-top: 2px;
+     }
+
+     .privacy-content h4 {
+         margin: 0 0 4px 0;
+         font-size: 1em;
+         font-weight: 600;
+     }
+
+     .privacy-content p {
+         margin: 0;
+         font-size: 0.85em;
+         opacity: 0.9;
+     }
+
+     /* ===== Main Panel Cards ===== */
+     .panel-card {
+         background: white;
+         border-radius: 12px;
+         padding: 24px;
+         box-shadow: 0 2px 8px rgba(0,0,0,0.08);
+         border: 1px solid #e5e7eb;
+         overflow: hidden;
+     }
+
+     /* Unify width: Gradio theme often gives HTML/Accordion blocks a max-width.
+        The rules below would force all blocks in the panel to span full width
+        like the button; kept disabled for now.
+     .panel-card .gr-block,
+     .panel-card .block,
+     .panel-card > div {
+         max-width: none !important;
+         width: 100% !important;
+         box-sizing: border-box;
+     }
+     .panel-card .how-it-works {
+         width: 100% !important;
+         box-sizing: border-box;
+     }
+     */
+
+     .panel-title {
+         font-size: 1.1em;
+         font-weight: 600;
+         color: #1f2937;
+         margin: 0 0 20px 0;
+         display: flex;
+         align-items: center;
+         gap: 8px;
+     }
+
+     /* ===== How It Works Section ===== */
+     /* Stretch to same width as button: cancel panel padding with negative margin, then add inner padding */
+     .how-it-works {
+         background: linear-gradient(135deg, #fef3c7 0%, #fde68a 100%);
+         border-radius: 10px;
+         padding: 16px 24px;
+         margin-left: -12px;
+         margin-right: -12px;
+         margin-bottom: 20px;
+         width: calc(100% + 24px) !important;
+         max-width: none;
+         box-sizing: border-box;
+     }
+
+     .how-it-works-title {
+         color: #d97706;
+         font-weight: 600;
+         font-size: 0.95em;
+         margin-bottom: 12px;
+         display: flex;
+         align-items: center;
+         gap: 6px;
+     }
+
+     .step-item {
+         display: flex;
+         align-items: flex-start;
+         gap: 10px;
+         margin-bottom: 8px;
+         font-size: 0.9em;
+         color: #374151;
+     }
+
+     .step-number {
+         background: #f97316;
+         color: white;
+         width: 20px;
+         height: 20px;
+         border-radius: 50%;
+         display: flex;
+         align-items: center;
+         justify-content: center;
+         font-size: 0.75em;
+         font-weight: 600;
+         flex-shrink: 0;
+     }
+
+     /* ===== Upload Area ===== */
+     .upload-area {
+         border: 2px dashed #d1d5db;
+         border-radius: 12px;
+         padding: 40px 20px;
+         text-align: center;
+         background: #f9fafb;
+         transition: all 0.2s;
+         margin-bottom: 20px;
+     }
+
+     .upload-area:hover {
+         border-color: #9ca3af;
+         background: #f3f4f6;
+     }
+
+     /* Hide default Gradio file upload prompt text ("Click to upload or drag and drop", etc.) */
+     .file-upload-minimal .gr-formatted-text,
+     .file-upload-minimal .gr-box > div:not([class*="file"]):not([class*="preview"]),
+     #pdf-upload .gr-formatted-text,
+     #pdf-upload .wrap-inner .gr-formatted-text {
+         display: none !important;
+     }
+
+     .upload-icon {
+         font-size: 3em;
+         color: #9ca3af;
+         margin-bottom: 12px;
+     }
+
+     .upload-text {
+         color: #6b7280;
+         font-size: 0.95em;
+     }
+
+     .upload-link {
+         color: #6366f1;
+         font-weight: 600;
+         cursor: pointer;
+     }
+
+     .upload-hint {
+         color: #9ca3af;
+         font-size: 0.85em;
+         margin-top: 8px;
+     }
+
+     /* ===== Primary Action Button ===== */
+     .primary-btn {
+         background: linear-gradient(135deg, #fb923c 0%, #f97316 100%) !important;
+         color: white !important;
+         padding: 14px 28px !important;
+         border-radius: 10px !important;
+         font-weight: 600 !important;
+         font-size: 1em !important;
+         border: none !important;
+         cursor: pointer !important;
+         transition: all 0.2s !important;
+         width: 100% !important;
+         box-shadow: 0 4px 12px rgba(249, 115, 22, 0.3) !important;
+     }
+
+     .primary-btn:hover {
+         transform: translateY(-1px) !important;
+         box-shadow: 0 6px 16px rgba(249, 115, 22, 0.4) !important;
+     }
+
+     .primary-btn:disabled {
+         opacity: 0.6 !important;
+         cursor: not-allowed !important;
+         transform: none !important;
+     }
+
+     /* ===== Secondary Buttons (Disabled State) ===== */
+     .secondary-btn {
+         background: #f3f4f6 !important;
+         color: #9ca3af !important;
+         padding: 12px 24px !important;
+         border-radius: 10px !important;
+         font-weight: 500 !important;
+         border: 1px solid #e5e7eb !important;
+         cursor: not-allowed !important;
+         width: 100% !important;
+         margin-top: 8px !important;
+     }
+
+     .unavailable-badge {
+         background: #e5e7eb;
+         color: #9ca3af;
+         padding: 2px 8px;
+         border-radius: 4px;
+         font-size: 0.7em;
+         margin-left: 8px;
+         text-transform: uppercase;
+     }
+
+     /* ===== Advanced Settings Accordion ===== */
+     /* Stretch to same width as button: cancel panel padding, align inner padding with panel */
+     .panel-card .advanced-settings,
+     .advanced-settings {
+         margin-left: -12px !important;
+         margin-right: -12px !important;
+         width: calc(100% + 24px) !important;
+         max-width: none !important;
+         box-sizing: border-box;
+     }
+
+     .advanced-settings .label-wrap {
+         background: #f9fafb !important;
+         border-radius: 8px !important;
+         padding: 12px 24px !important;
+     }
+
+     .advanced-settings .label-wrap span {
+         font-weight: 500 !important;
+         color: #4b5563 !important;
+     }
+
+     /* Accordion content area: same horizontal padding as panel */
+     .advanced-settings .gr-group,
+     .advanced-settings .wrap {
+         padding-left: 24px !important;
+         padding-right: 24px !important;
+         box-sizing: border-box;
+     }
+
+     /* ===== Results Panel ===== */
+     .results-panel {
+         background: white;
+         border-radius: 12px;
+         padding: 24px;
+         box-shadow: 0 2px 8px rgba(0,0,0,0.08);
+         border: 1px solid #e5e7eb;
+         min-height: 500px;
+     }
+
+     .results-placeholder {
+         text-align: center;
+         padding: 60px 20px;
+         color: #9ca3af;
+     }
+
+     .results-placeholder-icon {
+         font-size: 4em;
+         margin-bottom: 16px;
+     }
+
+     .results-placeholder h3 {
+         color: #374151;
+         margin: 0 0 8px 0;
+         font-weight: 600;
+     }
+
+     .results-placeholder p {
+         margin: 0;
+         font-size: 0.95em;
+     }
+
+     /* ===== Tab Styling ===== */
+     .tabs .tab-nav {
+         border-bottom: 2px solid #e5e7eb !important;
+     }
+
+     .tabs .tab-nav button {
+         font-weight: 500 !important;
+         color: #6b7280 !important;
+         padding: 12px 20px !important;
+         border: none !important;
+         background: transparent !important;
+     }
+
+     .tabs .tab-nav button.selected {
+         color: #6366f1 !important;
+         border-bottom: 2px solid #6366f1 !important;
+     }
+
+     /* ===== Initial Draft & Final Review content ===== */
+     .review-message {
+         color: #6b7280;
+         font-style: italic;
+         padding: 1em;
+     }
+
+     .review-draft-content,
+     .initial-draft-card {
+         max-width: 100%;
+     }
+
+     .review-text {
+         white-space: pre-wrap;
+         word-break: break-word;
+         line-height: 1.6;
+         color: #1f2937;
+     }
+
+     /* Final Review: toolbar with copy button at top right */
+     .final-review-toolbar {
+         display: flex;
+         justify-content: flex-end;
+         margin-bottom: 12px;
+     }
+     .copy-final-btn {
+         padding: 8px 16px;
+         border-radius: 8px;
+         border: 1px solid #e5e7eb;
+         background: #f9fafb;
+         color: #374151;
+         font-size: 0.9em;
+         cursor: pointer;
+         transition: background 0.2s, color 0.2s;
+     }
+     .copy-final-btn:hover {
+         background: #f3f4f6;
+         color: #111827;
+     }
+
+     /* Final Review markdown area: improve readability */
+     .results-panel .gr-markdown,
+     .results-panel .prose {
+         line-height: 1.7 !important;
+         color: #1f2937 !important;
+         max-width: 100% !important;
+     }
+
+     .results-panel .gr-markdown h1,
+     .results-panel .gr-markdown h2,
+     .results-panel .gr-markdown h3 {
+         margin-top: 1em !important;
+         margin-bottom: 0.5em !important;
+         color: #111827 !important;
+     }
+
+     .results-panel .gr-markdown ul,
+     .results-panel .gr-markdown ol {
+         padding-left: 1.5em !important;
+         margin: 0.5em 0 !important;
+     }
+
+     .results-panel .gr-markdown p {
+         margin: 0.5em 0 !important;
+     }
+
+     /* ===== Card Grid for Results ===== */
+     .card-grid {
+         display: flex;
+         flex-direction: column;
+         gap: 12px;
+     }
+
+     .card {
+         background: #f8f9fa;
+         border: 1px solid #e9ecef;
+         border-radius: 10px;
+         padding: 14px 16px;
+     }
+
+     .card h4 {
+         margin: 0 0 8px 0;
+         font-size: 1.05em;
+     }
+
+     .kv {
+         margin: 8px 0;
+         padding: 10px;
+         background: #ffffff;
+         border-radius: 8px;
+         border: 1px solid #eef1f4;
+     }
+
+     .k {
+         font-weight: 650;
+         color: #495057;
+         margin-bottom: 4px;
+     }
+
+     .v {
+         color: #212529;
+         line-height: 1.55;
+         white-space: pre-wrap;
+         word-break: break-word;
+     }
+
+     details {
+         background: #ffffff;
+         border: 1px solid #eef1f4;
+         border-radius: 8px;
+         padding: 10px 12px;
+     }
+
+     summary {
+         cursor: pointer;
+         font-weight: 650;
+         color: #212529;
+     }
+
+     .mono {
+         font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, 'Liberation Mono', 'Courier New', monospace;
+         font-size: 12.5px;
+     }
+
+     .pill {
+         display: inline-block;
+         padding: 2px 8px;
+         border-radius: 999px;
+         background: #e7f1ff;
+         color: #0b5ed7;
+         font-size: 12px;
+         margin-left: 8px;
+     }
+
+     /* ===== Related Work Cards ===== */
+     .related-work-container {
+         font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
+         max-width: 100%;
+     }
+
+     .related-paper-card {
+         background: #f8f9fa;
+         border-left: 4px solid #007bff;
+         border-radius: 6px;
+         padding: 16px;
+         margin-bottom: 16px;
+         box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+         transition: transform 0.2s, box-shadow 0.2s;
+     }
+
+     .related-paper-card:hover {
+         transform: translateY(-2px);
+         box-shadow: 0 4px 8px rgba(0,0,0,0.15);
+     }
+
+     .paper-header {
+         font-weight: 600;
+         font-size: 1.1em;
+         color: #212529;
+         margin-bottom: 12px;
+     }
+
+     .paper-field {
+         margin: 8px 0;
+         padding: 8px;
+         background: white;
+         border-radius: 4px;
+     }
+
+     .paper-field-label {
+         font-weight: 600;
+         color: #495057;
+         font-size: 0.9em;
+         margin-bottom: 4px;
+     }
+
+     .paper-field-value {
+         color: #212529;
+         line-height: 1.5;
+     }
+
+     /* ===== Log/Status Text Area ===== */
+     .status-log textarea {
+         font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace !important;
+         font-size: 0.85em !important;
+         background: #1f2937 !important;
+         color: #10b981 !important;
+         border-radius: 8px !important;
+     }
+
+     /* ===== Footer ===== */
+     .app-footer {
+         text-align: center;
+         padding: 20px;
+         color: #9ca3af;
+         font-size: 0.85em;
+         border-top: 1px solid #e5e7eb;
+         margin-top: 30px;
+     }
+     """
gradio_app/components/upload_section.py ADDED
@@ -0,0 +1,117 @@
+ """
+ Upload section component for the Review Grounder Gradio app.
+
+ This module provides the left panel with:
+ - "How it works" instructions
+ - PDF upload area
+ - Action buttons
+ """
+
+ import gradio as gr
+ from typing import Tuple
+
+
+ def create_how_it_works() -> None:
+     """
+     Create the "How it works" instruction section.
+
+     Displays a numbered list of steps explaining the review process.
+     """
+     gr.HTML("""
+     <div class="how-it-works">
+         <div class="how-it-works-title">
+             ⚡ How it works
+         </div>
+         <div class="step-item">
+             <span class="step-number">1</span>
+             <span>Upload your research paper in PDF format</span>
+         </div>
+         <div class="step-item">
+             <span class="step-number">2</span>
+             <span>Configure your LLM settings (or use defaults)</span>
+         </div>
+         <div class="step-item">
+             <span class="step-number">3</span>
+             <span>Click "🚀 Generate AI Review" to start</span>
+         </div>
+         <div class="step-item">
+             <span class="step-number">4</span>
+             <span>Watch as our AI analyzes your paper in real-time</span>
+         </div>
+         <div class="step-item">
+             <span class="step-number">5</span>
+             <span>Get comprehensive, grounded feedback on your research</span>
+         </div>
+     </div>
+     """)
+
+
+ def create_upload_area() -> gr.File:
+     """
+     Create the PDF upload area component.
+
+     Returns:
+         gr.File: The file upload component for PDF files
+     """
+     pdf_input = gr.File(
+         label="",
+         file_types=[".pdf"],
+         type="filepath",
+         elem_classes=["upload-area", "file-upload-minimal"],
+         elem_id="pdf-upload",
+         show_label=False,
+     )
+
+     gr.HTML("""
+     <div style="text-align: center; color: #9ca3af; font-size: 0.85em; margin-top: -10px; margin-bottom: 15px;">
+         PDF files only (max 10MB)
+     </div>
+     """)
+
+     return pdf_input
+
+
+ def create_action_buttons() -> gr.Button:
+     """
+     Create the action buttons for starting the review.
+
+     Returns:
+         gr.Button: The primary "Generate Review" button
+     """
+     run_button = gr.Button(
+         "🚀 Generate AI Review",
+         variant="primary",
+         elem_classes=["primary-btn"],
+     )
+
+     # # Placeholder buttons for future features (disabled)
+     # gr.HTML("""
+     # <button class="secondary-btn" disabled>
+     #     📊 DeepReviewer
+     #     <span class="unavailable-badge">COMING SOON</span>
+     # </button>
+     # <button class="secondary-btn" disabled>
+     #     🛡️ SafeReviewer
+     #     <span class="unavailable-badge">COMING SOON</span>
+     # </button>
+     # """)
+
+     return run_button
+
+
+ def create_upload_section() -> Tuple[gr.File, gr.Button]:
+     """
+     Create the complete upload section panel.
+
+     Returns:
+         Tuple containing:
+         - pdf_input: The file upload component
+         - run_button: The generate review button
+     """
+     gr.HTML('<div class="panel-title">📁 Upload Your Paper</div>')
+ gr.HTML('<div class="panel-title">📁 Upload Your Paper</div>')
112
+
113
+ create_how_it_works()
114
+ pdf_input = create_upload_area()
115
+ run_button = create_action_buttons()
116
+
117
+ return pdf_input, run_button
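The `run_button` returned here is typically wired to a handler that consumes the stepwise pipeline events and folds them into status text. A minimal stand-alone sketch of that folding logic, with a stub generator standing in for `run_single_paper_review_from_pdf_stepwise` (the stub's event payloads are illustrative, not real pipeline output):

```python
from typing import Any, Dict, Iterator, List


def fake_stepwise_events() -> Iterator[Dict[str, Any]]:
    # Stub standing in for run_single_paper_review_from_pdf_stepwise(pdf_path)
    yield {"stage": "extract_pdf", "pdf_path": "/tmp/paper.pdf"}
    yield {"stage": "initial_review", "initial_review": {"summary": "..."}}
    yield {"stage": "final", "review": {"review_markdown": "# Review"}}


def fold_events(events: Iterator[Dict[str, Any]]) -> List[str]:
    """Turn pipeline events into human-readable status lines."""
    lines: List[str] = []
    for event in events:
        stage = event.get("stage", "unknown")
        if stage.endswith("_error"):
            lines.append(f"[warn] {stage}: {event.get('error')}")
        elif stage == "final":
            lines.append("[done] review ready")
        else:
            lines.append(f"[info] {stage}")
    return lines


status = fold_events(fake_stepwise_events())
```

In the real app the handler would yield each accumulated status string back to a Gradio output so the log updates as stages complete.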
gradio_app/utils_single_paper_inference.py ADDED
@@ -0,0 +1,276 @@
+"""
+Utility wrappers and a minimal CLI for single-paper inference using:
+- ASTA API key from environment variable `ASTA_API_KEY`
+- OpenAI/OpenRouter endpoint and key from environment variables
+
+This module is designed to be imported by Gradio or other web frontends,
+while still remaining executable as a standalone CLI tool for debugging.
+"""
+
+import argparse
+import logging
+import sys
+from pathlib import Path
+import json
+from typing import Any, Dict, Iterator, Optional
+
+PROJECT_ROOT = Path(__file__).parent.parent
+if str(PROJECT_ROOT) not in sys.path:
+    sys.path.insert(0, str(PROJECT_ROOT))
+
+from src.reviewer_agent.single_paper_inference import (
+    extract_text_from_pdf,
+    _split_paper_latex_sections,
+    _init_single_paper_pipeline,
+)
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+
+def run_single_paper_review_from_pdf(
+    pdf_path: str,
+    *,
+    enable_logging: bool = True,
+    verbose: bool = True,
+    api_base_url: str | None = None,
+    api_key: str | None = None,
+    model_name: str | None = None,
+) -> dict:
+    """
+    High-level utility to run the single-paper review pipeline on a PDF path.
+
+    This is the main entry point intended to be called by Gradio or other UIs.
+    It delegates to `review_single_paper_from_pdf`, which uses:
+    - the ASTA API key from `ASTA_API_KEY`
+    - LLM settings and OpenAI/OpenRouter keys from environment/config files,
+      which can be overridden via `api_base_url`, `api_key`, and `model_name`.
+    """
+    pdf_path = str(Path(pdf_path).expanduser())
+    logger.info(f"Running single-paper review for PDF: {pdf_path}")
+    # Keep the original one-shot behavior for backward compatibility.
+    # For true streaming updates, use `run_single_paper_review_from_pdf_stepwise`.
+    from src.reviewer_agent.single_paper_inference import review_single_paper_from_pdf
+
+    review = review_single_paper_from_pdf(
+        pdf_path,
+        enable_logging=enable_logging,
+        verbose=verbose,
+        gpt_api_key=api_key,
+        gpt_base_url=api_base_url,
+        gpt_model_name=model_name,
+    )
+    return review
+
+
+def _normalize_base_url(base_url: Optional[str]) -> Optional[str]:
+    """
+    Normalize an OpenAI-compatible base_url.
+
+    The local gateway expects requests at:
+        http://localhost:8000/chat/completions
+
+    The OpenAI client will append `/chat/completions` to whatever `base_url`
+    we pass in. That means:
+        base_url = "http://localhost:8000"
+        -> "http://localhost:8000/chat/completions" ✅
+
+    If a user accidentally includes `/chat/completions` in the textbox, we
+    strip that suffix so the final URL is still correct.
+    """
+    if not base_url:
+        return None
+    u = base_url.strip()
+    if not u:
+        return None
+
+    # Strip trailing slash for normalization
+    u = u.rstrip("/")
+
+    # If the user pasted the full path (…/chat/completions), strip it back to the host.
+    if u.endswith("/chat/completions"):
+        u = u[: -len("/chat/completions")]
+
+    return u
+
+
+def run_single_paper_review_from_pdf_stepwise(
+    pdf_path: str,
+    *,
+    enable_logging: bool = True,
+    verbose: bool = True,
+    api_base_url: str | None = None,
+    api_key: str | None = None,
+    model_name: str | None = None,
+) -> Iterator[Dict[str, Any]]:
+    """
+    Stepwise (streamable) single-paper pipeline.
+
+    Yields dict events like:
+        {"stage": "extract_pdf", ...}
+        {"stage": "initial_review", "initial_review": {...}}
+        {"stage": "results_analysis", "results_analyzer_json": "..."}
+        {"stage": "insights", "insight_miner_json": "..."}
+        {"stage": "related_work", "related_work_json_list": [...], "search_keywords": [...]}
+        {"stage": "final", "review": {...}}
+    """
+    pdf_path = str(Path(pdf_path).expanduser())
+    yield {"stage": "extract_pdf", "pdf_path": pdf_path}
+
+    paper_text = extract_text_from_pdf(pdf_path)
+    yield {"stage": "parsed_pdf_text", "text_len": len(paper_text)}
+
+    sections = _split_paper_latex_sections(paper_text)
+    title = (sections.get("title") or "").strip()
+    abstract = (sections.get("abstract") or "").strip()
+    content = (sections.get("content") or "").strip()
+    yield {"stage": "parsed_sections", "title": title, "abstract": abstract}
+
+    reviewer, refiner, related_work_searcher, paper_results_analyzer, paper_insight_miner = (
+        _init_single_paper_pipeline(
+            enable_logging=enable_logging,
+            use_test_llm=False,
+            gpt_api_key=api_key,
+            gpt_base_url=api_base_url,
+            gpt_model_name=model_name,
+        )
+    )
+
+    # Step 1: initial draft (reviewer)
+    initial_review = reviewer.review_paper(
+        title=title,
+        abstract=abstract,
+        content=content,
+        keywords=None,
+        review_format="ai_researcher",
+        auto_save_log=False,
+        verbose=verbose,
+    )
+    yield {"stage": "initial_review", "initial_review": initial_review}
+
+    # Helper: format initial review for analyzers.
+    try:
+        initial_review_text = (
+            refiner._format_review_dict(initial_review, "detailed")
+            if hasattr(refiner, "_format_review_dict")
+            else str(initial_review)
+        )
+    except Exception:
+        initial_review_text = str(initial_review)
+
+    # Step 2a: results analyzer
+    results_analyzer_json = None
+    if paper_results_analyzer and content:
+        try:
+            results_analyzer_json = paper_results_analyzer.analyze_paper_results(
+                content, initial_review_text
+            )
+        except Exception as e:
+            results_analyzer_json = None
+            yield {"stage": "results_analysis_error", "error": str(e)}
+    yield {"stage": "results_analysis", "results_analyzer_json": results_analyzer_json}
+
+    # Step 2b: insight miner
+    insight_miner_json = None
+    if paper_insight_miner and content:
+        try:
+            insight_miner_json = paper_insight_miner.mine_paper_insights(
+                content, initial_review_text
+            )
+        except Exception as e:
+            insight_miner_json = None
+            yield {"stage": "insights_error", "error": str(e)}
+    yield {"stage": "insights", "insight_miner_json": insight_miner_json}
+
+    # Step 2c: related work (structured list)
+    related_work_list = []
+    search_keywords = None
+    if related_work_searcher:
+        try:
+            related_work_list = related_work_searcher.generate_related_work_json_list(
+                title=title,
+                abstract=abstract,
+                content=content,
+                keywords=None,
+                publication_date_range=None,
+                venues=None,
+            )
+            search_keywords = getattr(related_work_searcher, "last_keywords", None)
+        except Exception as e:
+            related_work_list = []
+            yield {"stage": "related_work_error", "error": str(e)}
+    yield {
+        "stage": "related_work",
+        "related_work_json_list": related_work_list,
+        "search_keywords": search_keywords,
+    }
+
+    # Step 3: refine (final)
+    related_work_json_str = json.dumps(related_work_list, ensure_ascii=False)
+    refined = refiner.refine_review(
+        initial_review=initial_review,
+        insight_miner_json=insight_miner_json,
+        results_analyzer_json=results_analyzer_json,
+        related_work_json_list=related_work_json_str,
+        title=title,
+        abstract=abstract,
+        content=content,
+        review_format="detailed",
+        verbose=verbose,
+    )
+    if search_keywords is not None:
+        refined["search_keywords"] = search_keywords
+
+    yield {"stage": "final", "review": refined}
+
+
+def main() -> None:
+    """
+    Simple CLI wrapper mainly for local debugging.
+
+    Example:
+        python -m gradio_app.utils_single_paper_inference \\
+            --pdf /path/to/paper.pdf
+    """
+    parser = argparse.ArgumentParser(
+        description="Run single-paper review on a PDF file."
+    )
+    parser.add_argument(
+        "--pdf",
+        type=str,
+        required=True,
+        help="Path to the PDF file to review.",
+    )
+    parser.add_argument(
+        "--no-logging",
+        action="store_true",
+        help="Disable on-disk logging for this run.",
+    )
+    parser.add_argument(
+        "--quiet",
+        action="store_true",
+        help="Reduce console output verbosity.",
+    )
+    args = parser.parse_args()
+
+    from time import time
+
+    start_time = time()
+    review = run_single_paper_review_from_pdf(
+        args.pdf,
+        enable_logging=not args.no_logging,
+        verbose=not args.quiet,
+    )
+    end_time = time()
+
+    print(f"Time taken: {end_time - start_time:.2f} seconds")
+    print("\n=== Review keys ===")
+    print(list(review.keys()))
+    print("\n=== Review Markdown ===")
+    print(review.get("review_markdown", ""))
+
+
+if __name__ == "__main__":
+    main()
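The base-url normalization in `_normalize_base_url` is worth pinning down with examples, since it decides what the OpenAI client ends up calling. A self-contained re-implementation of the same logic (simplified copy for illustration, not an import from the module):

```python
from typing import Optional


def normalize_base_url(base_url: Optional[str]) -> Optional[str]:
    """Mirror of _normalize_base_url: trim whitespace, drop trailing
    slashes, and strip an accidental /chat/completions suffix."""
    if not base_url:
        return None
    u = base_url.strip()
    if not u:
        return None
    u = u.rstrip("/")
    if u.endswith("/chat/completions"):
        u = u[: -len("/chat/completions")]
    return u


# All of these collapse to the bare host the OpenAI client expects:
print(normalize_base_url("http://localhost:8000/"))
print(normalize_base_url("http://localhost:8000/chat/completions"))
print(normalize_base_url("  http://localhost:8000/chat/completions/  "))
```

Empty or whitespace-only input returns `None`, signalling "fall back to the configured default" rather than passing a bogus URL downstream.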
requirements.txt ADDED
@@ -0,0 +1,26 @@
+# Unified Review System Requirements
+
+# Core dependencies
+pandas>=1.5.0
+pyarrow>=10.0.0
+pyyaml>=6.0.0
+python-dotenv>=1.0.0
+
+# LLM and ML dependencies
+vllm>=0.6.0
+openai>=1.0.0
+transformers>=4.35.0
+torch>=2.0.0
+numpy>=1.24.0
+FlagEmbedding>=1.2.0
+
+# Utilities
+requests>=2.31.0
+pydantic>=2.0.0
+tqdm>=4.66.0
+scipy>=1.10.0
+scikit-learn>=1.3.0
+
+# PDF + UI (for Gradio / Hugging Face Space)
+pdfminer.six>=20221105
+gradio>=4.0.0
scripts/gpt_oss_start_vllm_service.sh ADDED
@@ -0,0 +1,47 @@
+#!/bin/bash
+# Script to start a vLLM service for openai/gpt-oss-120b
+
+# Optional: limit GPU usage
+# export CUDA_VISIBLE_DEVICES=0,1,2,3
+export CUDA_VISIBLE_DEVICES=4,5,6,7
+
+# Configuration
+# MODEL_NAME="Qwen/Qwen3-235B-A22B-Instruct-2507"
+MODEL_NAME="openai/gpt-oss-120b"
+PORT=${VLLM_PORT:-8000}
+TP_SIZE=${TP_SIZE:-4}  # Tensor parallelism size; must be <= the number of visible GPUs
+GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-0.85}  # ideally 0.85
+MAX_MODEL_LEN=${MAX_MODEL_LEN:-131072}  # Native context length; can be extended to 1010000
+
+# Check if a local model path is provided
+if [ -z "$MODEL_PATH" ]; then
+    MODEL_PATH="$MODEL_NAME"
+    echo "Using HuggingFace model: $MODEL_PATH"
+else
+    echo "Using local model: $MODEL_PATH"
+fi
+
+echo "Starting vLLM service..."
+echo "Model: $MODEL_PATH"
+echo "Port: $PORT"
+echo "Tensor Parallelism: $TP_SIZE"
+echo "GPU Memory Utilization: $GPU_MEMORY_UTILIZATION"
+echo "Max Model Length: $MAX_MODEL_LEN"
+
+# python3 -m vllm.entrypoints.openai.api_server \
+#     --model "$MODEL_PATH" \
+#     --port $PORT \
+#     --tensor-parallel-size $TP_SIZE \
+#     --gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
+#     --max-model-len $MAX_MODEL_LEN \
+#     --trust-remote-code \
+#     --dtype bfloat16
+
+# Use $MODEL_PATH (not a hardcoded model name) so the local-path override works
+vllm serve "$MODEL_PATH" \
+    --port $PORT \
+    --tensor-parallel-size $TP_SIZE \
+    --gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
+    --max-model-len $MAX_MODEL_LEN \
+    --trust-remote-code \
+    --dtype bfloat16
scripts/start_load_balancer.sh ADDED
@@ -0,0 +1,87 @@
+#!/bin/bash
+# Start Python Load Balancer for vLLM and Reranker services
+# Usage: ./scripts/start_load_balancer.sh [service_type] [num_instances] [base_port] [lb_port]
+
+set -e
+
+SERVICE_TYPE="${1:-vllm}"  # vllm or reranker
+NUM_INSTANCES="${2:-4}"
+BASE_PORT="${3:-8000}"
+LB_PORT="${4:-$BASE_PORT}"
+
+echo "Starting Load Balancer for $SERVICE_TYPE"
+echo "Number of instances: $NUM_INSTANCES"
+echo "Base port: $BASE_PORT"
+echo "Load balancer port: $LB_PORT"
+echo ""
+
+# Get script directory
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+cd "$PROJECT_ROOT"
+
+# Activate virtual environment if it exists
+if [ -d ".venv" ]; then
+    source .venv/bin/activate
+fi
+
+# Check if FastAPI is installed
+python3 -c "import fastapi" 2>/dev/null || {
+    echo "Error: FastAPI not installed. Install with: pip install fastapi uvicorn httpx"
+    exit 1
+}
+
+# Build backend list
+BACKENDS=()
+for i in $(seq 0 $((NUM_INSTANCES - 1))); do
+    PORT=$((BASE_PORT + i))
+    if [ "$SERVICE_TYPE" = "vllm" ]; then
+        BACKENDS+=("http://localhost:${PORT}/v1")
+    else
+        BACKENDS+=("http://localhost:${PORT}")
+    fi
+done
+
+# Create logs directory based on service type
+if [ "$SERVICE_TYPE" = "vllm" ]; then
+    LB_LOG_DIR="logs/vllm"
+else
+    LB_LOG_DIR="logs/reranker"
+fi
+mkdir -p "$LB_LOG_DIR"
+
+echo "Backends:"
+for backend in "${BACKENDS[@]}"; do
+    echo "  - $backend"
+done
+echo ""
+
+# Start load balancer
+echo "Starting load balancer..."
+python3 -m shared.utils.load_balancer \
+    --backends "${BACKENDS[@]}" \
+    --host 0.0.0.0 \
+    --port "$LB_PORT" \
+    --strategy round_robin \
+    --health-check-interval 10.0 \
+    > "${LB_LOG_DIR}/load_balancer_${SERVICE_TYPE}_port${LB_PORT}.log" 2>&1 &
+
+LB_PID=$!
+
+# Save PID to file based on service type
+if [ "$SERVICE_TYPE" = "vllm" ]; then
+    PID_FILE="logs/vllm/vllm_lb_pid.txt"
+    mkdir -p logs/vllm
+else
+    PID_FILE="logs/reranker/reranker_lb_pid.txt"
+    mkdir -p logs/reranker
+fi
+echo "$LB_PID" > "$PID_FILE"
+
+echo "Load balancer started with PID: $LB_PID"
+echo "Load balancer URL: http://localhost:${LB_PORT}"
+echo "PID saved to: $PID_FILE"
+echo ""
+echo "To check status: curl http://localhost:${LB_PORT}/health"
+echo "To stop: ./scripts/stop_vllm_services.sh (for vllm) or ./scripts/stop_reranker_services.sh (for reranker)"
+echo "Or manually: kill $LB_PID"
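The backend-list construction above is the part worth understanding before changing port or instance counts: vLLM backends get a `/v1` suffix, reranker backends do not. A stand-alone sketch of that loop with illustrative values (`NUM_INSTANCES=3`, `BASE_PORT=8000`):

```shell
#!/bin/bash
# Rebuild the backend list the way start_load_balancer.sh does, for vllm mode.
SERVICE_TYPE="vllm"
NUM_INSTANCES=3
BASE_PORT=8000
BACKENDS=()
for i in $(seq 0 $((NUM_INSTANCES - 1))); do
    PORT=$((BASE_PORT + i))
    if [ "$SERVICE_TYPE" = "vllm" ]; then
        BACKENDS+=("http://localhost:${PORT}/v1")  # OpenAI-compatible API root
    else
        BACKENDS+=("http://localhost:${PORT}")     # reranker serves at the bare host
    fi
done
printf "%s\n" "${BACKENDS[@]}"
```

Running this prints `http://localhost:8000/v1` through `http://localhost:8002/v1`, which is exactly what gets passed to `--backends`.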
scripts/start_reranker_service.sh ADDED
@@ -0,0 +1,116 @@
+#!/bin/bash
+# Start Reranker API Service on multiple GPUs
+# Usage: ./scripts/start_reranker_service.sh [model_path] [num_gpus] [base_port]
+
+set -e
+
+# Default values
+# MODEL_PATH="${1:-OpenScholar/OpenScholar_Reranker}"
+# NUM_GPUS="${2:-4}"
+# BASE_PORT="${3:-8005}"
+
+# MODEL_PATH="BAAI/bge-reranker-base"
+# MODEL_PATH="BAAI/bge-reranker-large"
+MODEL_PATH="${1:-OpenScholar/OpenScholar_Reranker}"
+NUM_GPUS=8    # Hardcoded here; the [num_gpus] and [base_port] arguments are currently ignored
+BASE_PORT=8008
+
+echo "Starting Reranker API Service"
+echo "Model: $MODEL_PATH"
+echo "Number of GPUs: $NUM_GPUS"
+echo "Base port: $BASE_PORT"
+echo ""
+
+# Get script directory
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+cd "$PROJECT_ROOT"
+
+# Activate virtual environment if it exists
+if [ -d ".venv" ]; then
+    source .venv/bin/activate
+fi
+
+# Check if FastAPI is installed
+python3 -c "import fastapi" 2>/dev/null || {
+    echo "Error: FastAPI not installed. Install with: pip install fastapi uvicorn"
+    exit 1
+}
+
+# Check if FlagEmbedding is installed
+python3 -c "from FlagEmbedding import FlagReranker" 2>/dev/null || {
+    echo "Error: FlagEmbedding not installed. Install with: pip install FlagEmbedding"
+    exit 1
+}
+
+# Create logs directory
+mkdir -p logs/reranker
+
+# PID file for stopping services later
+PID_FILE="logs/reranker/reranker_pids.txt"
+LB_PID_FILE="logs/reranker/reranker_lb_pid.txt"
+
+# Start services on each GPU
+PIDS=()
+ENDPOINTS=()
+
+# Start one service per GPU (GPU 0 .. NUM_GPUS-1)
+for i in $(seq 0 $((NUM_GPUS - 1))); do
+    PORT=$((BASE_PORT + i))
+    GPU_ID=$i
+
+    echo "Starting reranker service on GPU $GPU_ID, port $PORT..."
+
+    # Set CUDA device (each service will see only one GPU)
+    export CUDA_VISIBLE_DEVICES=$GPU_ID
+
+    # Start service in background
+    # Note: When CUDA_VISIBLE_DEVICES is set, cuda:0 refers to the visible GPU
+    nohup python3 -m shared.utils.reranker_api_service \
+        --model_path "$MODEL_PATH" \
+        --host 0.0.0.0 \
+        --port "$PORT" \
+        --use_fp16 \
+        --device "cuda:0" \
+        > "logs/reranker/reranker_service_gpu${GPU_ID}_port${PORT}.log" 2>&1 &
+
+    PID=$!
+    PIDS+=($PID)
+    ENDPOINTS+=("http://localhost:${PORT}")
+
+    echo "  Started with PID: $PID"
+    echo "  Endpoint: http://localhost:${PORT}"
+    sleep 2  # Give service time to start
+done
+
+echo ""
+echo "All reranker services started!"
+echo ""
+echo "Endpoints:"
+for endpoint in "${ENDPOINTS[@]}"; do
+    echo "  - $endpoint"
+done
+
+# Create endpoint pool file
+ENDPOINT_POOL_FILE="shared/configs/reranker_endpoint_pool.txt"
+mkdir -p "$(dirname "$ENDPOINT_POOL_FILE")"
+printf "%s\n" "${ENDPOINTS[@]}" > "$ENDPOINT_POOL_FILE"
+echo ""
+echo "Endpoint pool file created: $ENDPOINT_POOL_FILE"
+
+# Save PIDs to file (one per line)
+printf "%s\n" "${PIDS[@]}" > "$PID_FILE"
+echo ""
+echo "PIDs saved to: $PID_FILE"
+echo ""
+echo "To stop these specific reranker services, run:"
+echo "  ./scripts/stop_reranker_services.sh"
+echo ""
+echo "This will only kill the processes listed above, not other reranker services."
+echo ""
+echo "To check service status, run:"
+for endpoint in "${ENDPOINTS[@]}"; do
+    echo "curl $endpoint/health"
+done
scripts/start_vllm_with_balancer.sh ADDED
@@ -0,0 +1,216 @@
+#!/bin/bash
+# Script to start vLLM services on multiple GPUs and a load balancer
+# Usage: ./scripts/start_vllm_with_balancer.sh
+
+set -e
+
+# Get script directory
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+cd "$PROJECT_ROOT"
+
+# Configuration
+# MODEL_NAME="ZhuofengLi/Qwen3-4B-Instruct-2507-DeepReview-lora-sft"
+MODEL_NAME="openai/gpt-oss-120b"
+# GPU_CONFIG="2:8001,3:8002,4:8003,5:8004"  # GPU:PORT pairs
+# GPU_CONFIG="1:8001,2:8002,3:8003,4:8004,5:8005,6:8006,7:8007,0:7999"  # GPU:PORT pairs
+GPU_CONFIG="0:8001,1:8002,2:8003,3:8004,5:8005,6:8006,7:8007"  # GPU:PORT pairs (no spaces around commas)
+TP_SIZE=1  # Tensor parallelism size per instance
+GPU_MEMORY_UTILIZATION=0.85
+MAX_MODEL_LEN=131072
+
+# Load balancer configuration
+LB_PORT=8000  # Load balancer port
+LB_STRATEGY="round_robin"  # or "least_conn"
+LB_HEALTH_CHECK_INTERVAL=10.0
+
+# Log directory
+LOG_DIR="./logs/vllm"
+mkdir -p "$LOG_DIR"
+
+# Endpoint pool file
+ENDPOINT_POOL_FILE="shared/configs/vllm_endpoint_pool.txt"
+mkdir -p "$(dirname "$ENDPOINT_POOL_FILE")"
+
+echo "=========================================="
+echo "Starting vLLM Services + Load Balancer"
+echo "=========================================="
+echo "Model: $MODEL_NAME"
+echo "GPU Configuration: $GPU_CONFIG"
+echo "Load Balancer Port: $LB_PORT"
+echo "Log Directory: $LOG_DIR"
+echo ""
+
+# Step 1: Start vLLM services
+echo "=== Step 1: Starting vLLM services ==="
+echo ""
+
+# Clear existing endpoints
+> "$ENDPOINT_POOL_FILE"
+
+# Parse GPU configuration
+IFS=',' read -ra GPU_CONFIGS <<< "$GPU_CONFIG"
+
+# Array to store PIDs
+VLLM_PIDS=()
+
+for gpu_config in "${GPU_CONFIGS[@]}"; do
+    IFS=':' read -r gpu_id port <<< "$gpu_config"
+
+    echo "Starting vLLM on GPU $gpu_id, port $port..."
+
+    # Set CUDA_VISIBLE_DEVICES for this specific GPU
+    export CUDA_VISIBLE_DEVICES=$gpu_id
+
+    # Log file
+    LOG_FILE="$LOG_DIR/vllm_gpu${gpu_id}_port${port}.log"
+
+    # Start vLLM service in background
+    (
+        echo "=== GPU $gpu_id, Port $port ===" >> "$LOG_FILE"
+        echo "Starting at $(date)" >> "$LOG_FILE"
+        echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" >> "$LOG_FILE"
+        echo "" >> "$LOG_FILE"
+
+        vllm serve "$MODEL_NAME" \
+            --port "$port" \
+            --tensor-parallel-size "$TP_SIZE" \
+            --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
+            --max-model-len "$MAX_MODEL_LEN" \
+            --trust-remote-code \
+            --dtype bfloat16 \
+            >> "$LOG_FILE" 2>&1
+    ) &
+
+    PID=$!
+    VLLM_PIDS+=($PID)
+
+    # Add endpoint to pool file (for load balancer)
+    echo "http://localhost:$port/v1" >> "$ENDPOINT_POOL_FILE"
+
+    echo "  -> Started with PID $PID"
+    echo "  -> Endpoint: http://localhost:$port/v1"
+    echo "  -> Log: $LOG_FILE"
+
+    # Wait a bit before starting next service
+    sleep 3
+done
+
+# Save PIDs (one per line for easier parsing)
+printf "%s\n" "${VLLM_PIDS[@]}" > "$LOG_DIR/vllm_pids.txt"
+echo ""
+echo "vLLM service PIDs saved to: $LOG_DIR/vllm_pids.txt"
+echo ""
+
+# Step 2: Wait for services to be ready
+echo "=== Step 2: Waiting for vLLM services to be ready ==="
+echo "Waiting 90 seconds for services to initialize..."
+sleep 90
+
+# Check service health
+echo ""
+echo "Checking service health..."
+HEALTHY_COUNT=0
+for gpu_config in "${GPU_CONFIGS[@]}"; do
+    IFS=':' read -r gpu_id port <<< "$gpu_config"
+    if curl -s "http://localhost:$port/v1/models" > /dev/null 2>&1; then
+        echo "  GPU $gpu_id (port $port): HEALTHY"
+        HEALTHY_COUNT=$((HEALTHY_COUNT + 1))
+    else
+        echo "  GPU $gpu_id (port $port): NOT READY (may still be initializing)"
+    fi
+done
+
+if [ $HEALTHY_COUNT -eq 0 ]; then
+    echo ""
+    echo "WARNING: No services are healthy yet. They may still be loading the model."
+    echo "You can check logs in $LOG_DIR/ for progress."
+fi
+
+echo ""
+
+# Step 3: Start load balancer
+echo "=== Step 3: Starting Load Balancer ==="
+echo ""
+
+# Build backend URLs
+BACKEND_URLS=()
+for gpu_config in "${GPU_CONFIGS[@]}"; do
+    IFS=':' read -r gpu_id port <<< "$gpu_config"
+    BACKEND_URLS+=("http://localhost:$port/v1")
+done
+
+echo "Load Balancer Configuration:"
+echo "  Port: $LB_PORT"
+echo "  Strategy: $LB_STRATEGY"
+echo "  Backends: ${BACKEND_URLS[*]}"
+echo ""
+
+# Activate virtual environment if it exists
+if [ -d ".venv" ]; then
+    source .venv/bin/activate
+fi
+
+# Check if FastAPI is installed
+python3 -c "import fastapi" 2>/dev/null || {
+    echo "Error: FastAPI not installed. Install with: pip install fastapi uvicorn httpx"
+    exit 1
+}
+
+# Start load balancer in background
+echo "Starting load balancer..."
+nohup python3 -m shared.utils.load_balancer \
+    --backends "${BACKEND_URLS[@]}" \
+    --host 0.0.0.0 \
+    --port "$LB_PORT" \
+    --strategy "$LB_STRATEGY" \
+    --health-check-interval "$LB_HEALTH_CHECK_INTERVAL" \
+    > "$LOG_DIR/load_balancer_port${LB_PORT}.log" 2>&1 &
+
+LB_PID=$!
+
+# Save load balancer PID
+echo "$LB_PID" > "$LOG_DIR/vllm_lb_pid.txt"
+
+echo "  -> Load balancer started with PID $LB_PID"
+echo "  -> Endpoint: http://localhost:$LB_PORT"
+echo "  -> Log: $LOG_DIR/load_balancer_port${LB_PORT}.log"
+echo "  -> PID saved to: $LOG_DIR/vllm_lb_pid.txt"
+
+# Wait a bit for load balancer to start
+sleep 5
+
+# Check load balancer health
+echo ""
+echo "Checking load balancer health..."
+if curl -s "http://localhost:$LB_PORT/health" > /dev/null 2>&1; then
+    echo "  Load balancer: HEALTHY"
+    curl -s "http://localhost:$LB_PORT/health" | python3 -m json.tool 2>/dev/null || curl -s "http://localhost:$LB_PORT/health"
+else
+    echo "  Load balancer: NOT READY (check log: $LOG_DIR/load_balancer_port${LB_PORT}.log)"
+fi
+
+echo ""
+echo "=========================================="
+echo "Deployment Complete!"
+echo "=========================================="
+echo ""
+echo "vLLM Services:"
+for i in "${!GPU_CONFIGS[@]}"; do
+    gpu_config="${GPU_CONFIGS[$i]}"
+    IFS=':' read -r gpu_id port <<< "$gpu_config"
+    PID="${VLLM_PIDS[$i]}"
+    echo "  GPU $gpu_id: http://localhost:$port/v1 (PID: $PID)"
+done
+echo ""
+echo "Load Balancer:"
+echo "  http://localhost:$LB_PORT (PID: $LB_PID)"
+echo ""
+echo "Configuration:"
+echo "  Update llm_service_config.yaml: base_url: \"http://localhost:$LB_PORT/v1\""
+echo ""
+echo "To stop these specific services, run:"
+echo "  ./scripts/stop_vllm_services.sh"
+echo ""
+echo "This will only kill the processes listed above, not other vLLM services."
+echo ""
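The `GPU:PORT` parsing in the script above is sensitive to formatting: because each pair is split on `:` after splitting the whole string on `,`, a space after a comma would leak into `gpu_id` and corrupt `CUDA_VISIBLE_DEVICES`. A stand-alone sketch of the same two-stage parse with an illustrative three-pair config:

```shell
#!/bin/bash
# Parse a GPU:PORT configuration string the way start_vllm_with_balancer.sh does.
# The string must not contain spaces around commas, or gpu_id would pick up
# a leading space and break "export CUDA_VISIBLE_DEVICES=$gpu_id".
GPU_CONFIG="0:8001,1:8002,2:8003"

# Stage 1: split on commas into GPU:PORT pairs
IFS=',' read -ra GPU_CONFIGS <<< "$GPU_CONFIG"

# Stage 2: split each pair on ':' into gpu_id and port
GPU_IDS=()
PORTS=()
for gpu_config in "${GPU_CONFIGS[@]}"; do
    IFS=':' read -r gpu_id port <<< "$gpu_config"
    GPU_IDS+=("$gpu_id")
    PORTS+=("$port")
    echo "GPU $gpu_id -> port $port"
done
```

With the config above this prints one `GPU <id> -> port <port>` line per pair; note that `read -r gpu_id port` silently ignores anything after a second `:`, so malformed pairs fail quietly rather than loudly.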
scripts/stop_reranker_services.sh ADDED
@@ -0,0 +1,106 @@
1
+ #!/bin/bash
2
+ # Script to stop reranker services and load balancer (only the ones we started)
3
+ # Usage: ./scripts/stop_reranker_services.sh
4
+
5
+ set -e
6
+
7
+ # Get script directory
8
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
9
+ PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
10
+ cd "$PROJECT_ROOT"
11
+
12
+ LOG_DIR="./logs/reranker"
13
+ PID_FILE="$LOG_DIR/reranker_pids.txt"
14
+ LB_PID_FILE="$LOG_DIR/reranker_lb_pid.txt"
15
+
16
+ echo "=== Stopping Reranker Services and Load Balancer ==="
17
+ echo ""
18
+
19
+ # Step 1: Stop load balancer (if PID file exists)
20
+ echo "Step 1: Stopping reranker load balancer..."
21
+ if [ -f "$LB_PID_FILE" ]; then
22
+ LB_PID=$(cat "$LB_PID_FILE" 2>/dev/null | head -1)
23
+ if [ -n "$LB_PID" ] && ps -p $LB_PID > /dev/null 2>&1; then
24
+ echo " Killing load balancer PID $LB_PID..."
25
+ kill -TERM $LB_PID 2>/dev/null || true
26
+ sleep 2
27
+ if ps -p $LB_PID > /dev/null 2>&1; then
28
+ echo " Force killing load balancer PID $LB_PID..."
29
+ kill -KILL $LB_PID 2>/dev/null || true
30
+ fi
31
+ echo " Load balancer stopped"
32
+ rm -f "$LB_PID_FILE"
33
+ else
34
+ echo " Load balancer PID from file not found (may have already terminated)"
35
+ rm -f "$LB_PID_FILE"
36
+ fi
37
+ else
38
+ echo " No load balancer PID file found ($LB_PID_FILE)"
39
+ echo " If load balancer is running, you may need to find and kill it manually"
40
+ fi
41
+
42
+ echo ""
43
+
44
+ # Step 2: Stop reranker services (ONLY the ones we started)
45
+ echo "Step 2: Stopping reranker services (only the ones we started)..."
46
+
47
+ if [ -f "$PID_FILE" ]; then
48
+ echo " Reading PIDs from $PID_FILE"
49
+ KILLED_COUNT=0
50
+ NOT_FOUND_COUNT=0
51
+
52
+ while IFS= read -r pid || [ -n "$pid" ]; do
53
+ # Skip empty lines
54
+ [ -z "$pid" ] && continue
55
+
56
+ if ps -p $pid > /dev/null 2>&1; then
57
+ echo " Killing reranker service PID $pid..."
58
+ kill -TERM $pid 2>/dev/null || true
59
+ KILLED_COUNT=$((KILLED_COUNT + 1))
60
+ else
61
+ echo " PID $pid: Process not found (may have already terminated)"
62
+ NOT_FOUND_COUNT=$((NOT_FOUND_COUNT + 1))
63
+ fi
64
+ done < "$PID_FILE"
65
+
66
+ if [ $KILLED_COUNT -gt 0 ]; then
67
+ echo " Waiting 3 seconds for graceful shutdown..."
68
+ sleep 3
69
+
70
+ # Force kill if still running
71
+ while IFS= read -r pid || [ -n "$pid" ]; do
72
+ [ -z "$pid" ] && continue
73
+ if ps -p $pid > /dev/null 2>&1; then
74
+ echo " Force killing reranker service PID $pid..."
75
+ kill -KILL $pid 2>/dev/null || true
76
+ fi
77
+ done < "$PID_FILE"
78
+
79
+ echo " Stopped $KILLED_COUNT reranker service(s)"
80
+ else
81
+ echo " No running processes found from saved PIDs"
82
+ fi
83
+
84
+ if [ $NOT_FOUND_COUNT -gt 0 ]; then
85
+ echo " ($NOT_FOUND_COUNT process(es) were already terminated)"
86
+ fi
87
+
88
+ # Remove PID file after stopping
89
+ rm -f "$PID_FILE"
90
+ else
91
+ echo " WARNING: $PID_FILE not found!"
92
+ echo " Cannot safely stop services without PID file."
93
+ echo " If you know the PIDs, you can manually kill them."
94
+ echo " To avoid affecting other users, DO NOT use pkill!"
95
+ fi
96
+
97
+ echo ""
98
+ echo " NOTE: Only processes from reranker_pids.txt were killed."
99
+ echo " Other reranker services (if any) were NOT affected."
100
+
101
+ echo ""
102
+ echo "=== Checking GPU status ==="
103
+ nvidia-smi --query-gpu=index,memory.used --format=csv,noheader | grep -E '^[0-3],' || echo " Could not read status for GPUs 0-3"
104
+
105
+ echo ""
106
+ echo "Done!"
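The TERM-wait-KILL pattern used above is simple enough to test in isolation. A minimal Python sketch of the same logic — illustrative only, since the repository drives this from bash, and `stop_from_pid_file` is a made-up name:

```python
import os
import signal
import time

def stop_from_pid_file(pid_file, grace=3.0):
    """Mirror the script: SIGTERM every saved PID, wait, then SIGKILL survivors."""
    if not os.path.exists(pid_file):
        return 0
    with open(pid_file) as f:
        pids = [int(line) for line in f if line.strip()]  # skip empty lines
    stopped = 0
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)  # graceful shutdown first
            stopped += 1
        except ProcessLookupError:
            pass  # "may have already terminated" branch
    if stopped:
        time.sleep(grace)
        for pid in pids:
            try:
                os.kill(pid, signal.SIGKILL)  # force-kill stragglers
            except ProcessLookupError:
                pass
    os.remove(pid_file)  # remove the PID file after stopping
    return stopped
```

As in the script, only PIDs recorded in the file are touched, so unrelated services on the same host are never affected.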
scripts/stop_vllm_services.sh ADDED
@@ -0,0 +1,267 @@
1
+ #!/bin/bash
2
+ # Script to stop vLLM services and load balancer
3
+ # Usage: ./scripts/stop_vllm_services.sh
4
+
5
+ # Don't use set -e here because we want to continue even if some kills fail
6
+ # set -e
7
+
8
+ # Get script directory
9
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
10
+ PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
11
+ cd "$PROJECT_ROOT"
12
+
13
+ LOG_DIR="./logs/vllm"
14
+
15
+ # Function to recursively collect all descendant PIDs of a given PID
16
+ # Returns space-separated list of all PIDs in the process tree
17
+ collect_descendant_pids() {
18
+ local root_pid=$1
19
+ local all_pids="$root_pid"
20
+ local to_check="$root_pid"
21
+ local new_pids=""
22
+
23
+ # Iteratively collect all descendants until no new children are found
24
+ while [ -n "$to_check" ]; do
25
+ new_pids=""
26
+ for pid in $to_check; do
27
+ # Find direct children of this PID
28
+ local children=$(ps -o pid --no-headers --ppid $pid 2>/dev/null | tr '\n' ' ')
29
+ if [ -n "$children" ]; then
30
+ # Add children to the list
31
+ all_pids="$all_pids $children"
32
+ new_pids="$new_pids $children"
33
+ fi
34
+ done
35
+ to_check="$new_pids"
36
+ done
37
+
38
+ echo "$all_pids"
39
+ }
40
+
41
+ # Function to collect log files opened by a process and its descendants
42
+ # Returns newline-separated list of log file paths
43
+ collect_process_log_files() {
44
+ local root_pid=$1
45
+ local log_files=""
46
+
47
+ # Collect all descendant PIDs (including the root)
48
+ local all_pids=$(collect_descendant_pids $root_pid)
49
+
50
+ # Use lsof to find all log files opened by these processes
51
+ # Look for files in the log directory that are opened by any of these PIDs
52
+ for pid in $all_pids; do
53
+ [ -z "$pid" ] && continue
54
+ if ps -p $pid > /dev/null 2>&1; then
55
+ # Find log files opened by this PID (files with .log extension in LOG_DIR)
56
+ # lsof output format: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
57
+ # We need the last field (NAME) which is the file path
58
+ # Try both absolute and relative paths
59
+ local log_dir_abs=$(cd "$PROJECT_ROOT" && cd "$LOG_DIR" && pwd 2>/dev/null || echo "$LOG_DIR")
60
+ local pid_logs=$(lsof -p $pid 2>/dev/null | awk 'NR>1 {print $NF}' | grep -E "\.log$" | grep -E "(^$log_dir_abs/|$LOG_DIR/)" | sort -u)
61
+ if [ -n "$pid_logs" ]; then
62
+ log_files="$log_files"$'\n'"$pid_logs"
63
+ fi
64
+ fi
65
+ done
66
+
67
+ # Remove duplicates and empty lines, return unique log files
68
+ echo "$log_files" | grep -v '^$' | sort -u
69
+ }
70
+
71
+ # Function to kill a PID and all its descendants
72
+ # This ensures all child processes (including GPU processes) are terminated
73
+ kill_process_tree() {
74
+ local root_pid=$1
75
+ local signal=${2:-TERM}
76
+
77
+ if ! ps -p $root_pid > /dev/null 2>&1; then
78
+ return 1
79
+ fi
80
+
81
+ # Collect all descendant PIDs (including the root)
82
+ local all_pids=$(collect_descendant_pids $root_pid)
83
+
84
+ # Kill all processes
85
+ # For TERM, we kill from leaves to root (reverse order) for graceful shutdown
86
+ # For KILL, order doesn't matter
87
+ if [ "$signal" = "KILL" ]; then
88
+ # Force kill all processes
89
+ for pid in $all_pids; do
90
+ [ -z "$pid" ] && continue
91
+ kill -KILL $pid 2>/dev/null || true
92
+ done
93
+ else
94
+ # Graceful shutdown: kill children first, then parent
95
+ # Convert to array and kill in reverse order
96
+ local pids_array=($all_pids)
97
+ for ((idx=${#pids_array[@]}-1; idx>=0; idx--)); do
98
+ pid=${pids_array[$idx]}
99
+ [ -z "$pid" ] && continue
100
+ kill -TERM $pid 2>/dev/null || true
101
+ done
102
+ fi
103
+ }
104
+
105
+ echo "=== Stopping vLLM Services and Load Balancer ==="
106
+ echo ""
107
+
108
+ # Step 1: Stop load balancer (if PID file exists)
109
+ echo "Step 1: Stopping vLLM load balancer..."
110
+ LB_PID_FILE="$LOG_DIR/vllm_lb_pid.txt"
111
+ LB_LOG_FILES=""
112
+ if [ -f "$LB_PID_FILE" ]; then
113
+ LB_PID=$(cat "$LB_PID_FILE" 2>/dev/null | head -1)
114
+ if [ -n "$LB_PID" ] && ps -p $LB_PID > /dev/null 2>&1; then
115
+ echo " Killing load balancer PID $LB_PID..."
116
+ # Collect log files before killing
117
+ LB_LOG_FILES=$(collect_process_log_files $LB_PID)
118
+ # Also try to find load balancer log files by pattern (fallback if lsof doesn't work)
119
+ if [ -z "$LB_LOG_FILES" ]; then
120
+ LB_LOG_FILES=$(find "$LOG_DIR" -maxdepth 1 -name "load_balancer*.log" -type f 2>/dev/null)
121
+ fi
122
+ kill -TERM $LB_PID 2>/dev/null || true
123
+ sleep 2
124
+ if ps -p $LB_PID > /dev/null 2>&1; then
125
+ echo " Force killing load balancer PID $LB_PID..."
126
+ kill -KILL $LB_PID 2>/dev/null || true
127
+ fi
128
+ echo " Load balancer stopped"
129
+ rm -f "$LB_PID_FILE"
130
+
131
+ # Remove load balancer log files
132
+ if [ -n "$LB_LOG_FILES" ]; then
133
+ echo " Removing load balancer log files..."
134
+ while IFS= read -r log_file; do
135
+ [ -z "$log_file" ] && continue
136
+ if [ -f "$log_file" ]; then
137
+ rm -f "$log_file"
138
+ echo " Removed: $log_file"
139
+ fi
140
+ done <<< "$LB_LOG_FILES"
141
+ else
142
+ echo " Note: Could not detect load balancer log file (process may have already terminated)"
143
+ fi
144
+ else
145
+ echo " Load balancer PID from file not found (may have already terminated)"
146
+ rm -f "$LB_PID_FILE"
147
+ fi
148
+ else
149
+ echo " No load balancer PID file found ($LB_PID_FILE)"
150
+ echo " If load balancer is running, you may need to find and kill it manually"
151
+ fi
152
+
153
+ echo ""
154
+
155
+ # Step 2: Stop vLLM services (ONLY the ones we started)
156
+ echo "Step 2: Stopping vLLM services (only the ones we started)..."
157
+
158
+ # Try to read PIDs from file
159
+ if [ -f "$LOG_DIR/vllm_pids.txt" ]; then
160
+ echo " Reading PIDs from $LOG_DIR/vllm_pids.txt"
161
+
162
+ # Read all PIDs into an array
163
+ pids_array=()
164
+ while IFS= read -r pid || [ -n "$pid" ]; do
165
+ # Skip empty lines
166
+ [ -z "$pid" ] && continue
167
+ pids_array+=($pid)
168
+ done < "$LOG_DIR/vllm_pids.txt"
169
+
170
+ KILLED_COUNT=0
171
+ NOT_FOUND_COUNT=0
172
+
173
+ # Collect log files for all vLLM services before killing
174
+ vllm_log_files=""
175
+ for pid in "${pids_array[@]}"; do
176
+ if ps -p $pid > /dev/null 2>&1; then
177
+ # Collect log files for this PID
178
+ pid_logs=$(collect_process_log_files $pid)
179
+ if [ -n "$pid_logs" ]; then
180
+ vllm_log_files="$vllm_log_files"$'\n'"$pid_logs"
181
+ fi
182
+ fi
183
+ done
184
+
185
+ # First pass: graceful shutdown (TERM signal)
186
+ for pid in "${pids_array[@]}"; do
187
+ if ps -p $pid > /dev/null 2>&1; then
188
+ echo " Killing vLLM service PID $pid and all its descendant processes..."
189
+ # Collect and show how many processes will be killed
190
+ descendant_pids=$(collect_descendant_pids $pid)
191
+ pid_count=$(echo $descendant_pids | wc -w)
192
+ echo " Found $pid_count process(es) in the process tree"
193
+ # Use our recursive function to kill the entire process tree
194
+ kill_process_tree $pid TERM
195
+ KILLED_COUNT=$((KILLED_COUNT + 1))
196
+ else
197
+ echo " PID $pid: Process not found (may have already terminated)"
198
+ NOT_FOUND_COUNT=$((NOT_FOUND_COUNT + 1))
199
+ fi
200
+ done
201
+
202
+ if [ $KILLED_COUNT -gt 0 ]; then
203
+ echo " Waiting 3 seconds for graceful shutdown..."
204
+ sleep 3
205
+
206
+ # Second pass: force kill (KILL signal) if still running
207
+ for pid in "${pids_array[@]}"; do
208
+ if ps -p $pid > /dev/null 2>&1; then
209
+ echo " Force killing vLLM service PID $pid and all its descendant processes..."
210
+ # Collect and show how many processes will be force killed
211
+ descendant_pids=$(collect_descendant_pids $pid)
212
+ pid_count=$(echo $descendant_pids | wc -w)
213
+ echo " Force killing $pid_count process(es) in the process tree"
214
+ # Use our recursive function to force kill the entire process tree
215
+ kill_process_tree $pid KILL
216
+ fi
217
+ done
218
+
219
+ echo " Stopped $KILLED_COUNT vLLM service(s)"
220
+ else
221
+ echo " No running processes found from saved PIDs"
222
+ fi
223
+
224
+ if [ $NOT_FOUND_COUNT -gt 0 ]; then
225
+ echo " ($NOT_FOUND_COUNT process(es) were already terminated)"
226
+ fi
227
+
228
+ # Remove vLLM log files
229
+ if [ -n "$vllm_log_files" ]; then
230
+ echo ""
231
+ echo " Removing vLLM service log files..."
232
+ removed_count=0
233
+ while IFS= read -r log_file; do
234
+ [ -z "$log_file" ] && continue
235
+ if [ -f "$log_file" ]; then
236
+ rm -f "$log_file"
237
+ echo " Removed: $log_file"
238
+ removed_count=$((removed_count + 1))
239
+ fi
240
+ done <<< "$vllm_log_files"
241
+ if [ $removed_count -eq 0 ] && [ -n "$vllm_log_files" ]; then
242
+ echo " (No log files found to remove - they may have already been deleted)"
243
+ fi
244
+ else
245
+ echo ""
246
+ echo " Note: Could not detect vLLM log files (processes may have already terminated)"
247
+ fi
248
+
249
+ # Remove PID file after stopping
250
+ rm -f "$LOG_DIR/vllm_pids.txt"
251
+ else
252
+ echo " WARNING: $LOG_DIR/vllm_pids.txt not found!"
253
+ echo " Cannot safely stop services without PID file."
254
+ echo " If you know the PIDs, you can manually kill them."
255
+ echo " To avoid affecting other users, DO NOT use pkill!"
256
+ fi
257
+
258
+ echo ""
259
+ echo " NOTE: Only processes from vllm_pids.txt were killed."
260
+ echo " Other vLLM services (if any) were NOT affected."
261
+
262
+ echo ""
263
+ echo "=== Checking GPU status ==="
264
+ nvidia-smi --query-gpu=index,memory.used --format=csv,noheader | grep -E '^[4-7],' || echo " Could not read status for GPUs 4-7"
265
+
266
+ echo ""
267
+ echo "Done!"
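The iterative descendant walk in `collect_descendant_pids` translates directly to other languages. A Python sketch of the same breadth-first traversal, one `ps --ppid` call per node (GNU ps / Linux assumed, like the script itself):

```python
import subprocess

def collect_descendant_pids(root_pid):
    """Breadth-first walk of the process tree rooted at root_pid,
    mirroring the shell helper: returns the root plus all descendants."""
    all_pids = [root_pid]
    frontier = [root_pid]
    while frontier:
        next_frontier = []
        for pid in frontier:
            out = subprocess.run(
                ["ps", "-o", "pid", "--no-headers", "--ppid", str(pid)],
                capture_output=True, text=True,
            ).stdout
            children = [int(tok) for tok in out.split()]
            all_pids.extend(children)       # record every direct child
            next_frontier.extend(children)  # and scan it for grandchildren
        frontier = next_frontier
    return all_pids
```

Killing in `reversed(all_pids)` order approximates the script's leaves-first SIGTERM pass, which matters for vLLM because the GPU workers are children of the launcher process.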
shared/configs/config.yaml ADDED
@@ -0,0 +1,97 @@
1
+ # Configuration file for Paper Reviewer Agent
2
+ #
3
+ # Note: LLM service configuration is in configs/llm_service_config.yaml
4
+ # Note: Prompts configuration is in configs/prompts.yaml
5
+
6
+ # Global verbose mode - controls all step outputs (Step 1, Step 2, Step 3, etc.)
7
+ # Set to false to suppress intermediate progress output for faster execution
8
+ verbose: false
9
+
10
+ # Paper Search Configuration
11
+ paper_search:
12
+ # Asta API Configuration
13
+ asta:
14
+ # Single API key (backward compatible, lower priority)
15
+ api_key: null # Set via ASTA_API_KEY env var
16
+ # API key pool
17
+ api_key_pool_path: "asta_api_pool.txt" # path relative to shared/configs/ or absolute path
18
+ endpoint: "https://asta-tools.allen.ai/mcp/v1"
19
+
20
+ # Semantic Scholar API Configuration (alternative)
21
+ semantic_scholar:
22
+ api_key: null # Set via S2_API_KEY env var
23
+
24
+ # Reranker Configuration
25
+ reranker:
26
+ # Reranker model path (for direct mode)
27
+ model: "OpenScholar/OpenScholar_Reranker" # e.g., "OpenScholar/OpenScholar_Reranker" or "BAAI/bge-reranker-base"
28
+ use_fp16: true
29
+
30
+ # Reranker API Configuration (for API mode with load balancing)
31
+ # If base_url is set, use API mode with load balancer
32
+ # If endpoint_pool_path is set, use API mode with endpoint pool
33
+ # If both are None, use direct mode (load model directly)
34
+ api:
35
+ # Base URL for reranker API service (load balancer address)
36
+ # Example: "http://localhost:8009" (load balancer that distributes to 8005-8008)
37
+ # If set, will use API mode with load balancer
38
+ base_url: "http://localhost:8008"
39
+
40
+ # Endpoint pool file path (alternative to base_url)
41
+ # Example: "reranker_endpoint_pool.txt" (contains list of endpoints: http://localhost:8005, http://localhost:8006, ...)
42
+ # If set, will use API mode with endpoint pool (round-robin load balancing)
43
+ # endpoint_pool_path: "reranker_endpoint_pool.txt" # Set to use API mode with endpoint pool
44
+ endpoint_pool_path: null
45
+
46
+ # Request timeout in seconds
47
+ timeout: 30.0
48
+
49
+ # Retrieval Configuration
50
+ retrieval:
51
+ top_n: 10
52
+ use_abstract: true
53
+ norm_cite: false
54
+ min_citation: null
55
+ limit_per_keyword: 20
56
+
57
+ # Related Work Searcher Configuration
58
+ related_work_searcher:
59
+ max_related_papers: 10
60
+ max_parallel_summaries: 1
61
+ publication_date_range: null # e.g., "2020:" for papers from 2020 onwards
62
+ venues: null # e.g., "ICLR,NeurIPS"
63
+ verbose: false # Set to false to suppress intermediate progress output (faster, less output)
64
+
65
+ # Paper Reviewer Configuration
66
+ paper_reviewer:
67
+ review_format: "ai_researcher" # Options: "detailed", "summary", "structured", "ai_researcher"
68
+ max_tokens: 16384 # Maximum tokens for reviewer output (increased for detailed reviews)
69
+
70
+ # Review Refiner Configuration
71
+ review_refiner:
72
+ review_format: "strict_detailed" # Options: "detailed", "summary", "structured", "strict_detailed" - should match paper_reviewer format
73
+ max_tokens: 16384 # Maximum tokens for refiner output (increased for detailed reviews)
74
+
75
+ # Output Configuration
76
+ output:
77
+ save_reviews: true
78
+ output_dir: "./outputs"
79
+ format: "json" # Options: "json", "markdown", "txt"
80
+
81
+ # Evaluation Configuration
82
+ evaluation:
83
+ # Default number of worker threads for concurrent evaluation
84
+ max_workers: 16
85
+
86
+ # Component name for LLM service assignment (used with llm_service_config.yaml)
87
+ # Options: "keyword_generator", "paper_summarizer", "reviewer", "refiner"
88
+ # Defaults to "reviewer" if not specified
89
+ llm_component: "reviewer"
90
+
91
+ # Default model name (can be overridden by command line args or llm_service_config.yaml)
92
+ default_model_name: "Qwen/Qwen2.5-72B-Instruct"
93
+
94
+ # Prompt versions
95
+ rubric_generation_prompt_version: "v2" # Options: "v1", "v2"
96
+ evaluator_prompt_version: "v1" # Options: "v0", "v1"
97
+
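The reranker comments above describe a three-way mode selection: `base_url` takes priority, then `endpoint_pool_path`, and direct model loading is the fallback. A sketch of that precedence, with `resolve_reranker_mode` as a hypothetical helper name (the real logic lives in `shared/utils/`):

```python
def resolve_reranker_mode(api_cfg):
    """Selection order implied by the config comments:
    base_url -> API mode via load balancer,
    endpoint_pool_path -> API mode via round-robin endpoint pool,
    neither -> load the reranker model directly."""
    if api_cfg.get("base_url"):
        return "api_load_balancer"
    if api_cfg.get("endpoint_pool_path"):
        return "api_endpoint_pool"
    return "direct"
```

With the values shipped in this file (`base_url: "http://localhost:8008"`, `endpoint_pool_path: null`), the load-balancer API mode wins.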
shared/configs/llm_service_config.yaml ADDED
@@ -0,0 +1,57 @@
1
+ # vLLM Service Configuration
2
+
3
+ # vLLM server settings
4
+ vllm:
5
+ # Base URL for vLLM service
6
+ # In production, this should point to a load balancer (nginx/HAProxy) that distributes
7
+ # requests across multiple vLLM service instances running on different GPUs.
8
+ # Example: "http://localhost:8000/v1" (load balancer address)
9
+ base_url: "http://localhost:8000/" # directly to the load balancer, 8000-8003 are used by the vllm services
10
+
11
+ api_key: "dummy-key" # Not used for local vLLM, but required by OpenAI client
12
+ model_name: "openai/gpt-oss-120b"
13
+ # model_name: "Qwen/Qwen3-4B-Instruct-2507"
14
+ # model_name: "Qwen/Qwen3-235B-A22B-Instruct-2507"
15
+ timeout: 300
16
+
17
+ # Rate limiting: Maximum concurrent requests to vLLM server
18
+ # Lower this if you're getting 500 errors (suggests server overload)
19
+ # Recommended: 4-8 for small models, 2-4 for large models
20
+ max_concurrent_requests: 64
21
+
22
+ # Retry configuration for server errors
23
+ max_retries: 3 # Number of retries for 500/502/503/504 errors
24
+ retry_delay: 1.0 # Initial delay in seconds
25
+ retry_backoff: 2.0 # Exponential backoff multiplier
26
+
27
+ # Default sampling parameters
28
+ temperature: 0.7
29
+ top_p: 0.8
30
+ top_k: 20
31
+ max_tokens: 16384
32
+ presence_penalty: 0.0
33
+
34
+ # GPT / OpenRouter API Configuration
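With the retry settings above (`max_retries: 3`, `retry_delay: 1.0`, `retry_backoff: 2.0`), the waits before the three retries are 1 s, 2 s, and 4 s. A one-line sketch of that schedule (`retry_schedule` is a hypothetical helper, not a function in this repository):

```python
def retry_schedule(max_retries, retry_delay, retry_backoff):
    """Exponential backoff delays: retry_delay * retry_backoff**attempt."""
    return [retry_delay * (retry_backoff ** attempt) for attempt in range(max_retries)]
```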
35
+ gpt:
36
+ enabled: true
37
+ # Leave api_key null so it is taken from OPENAI_API_KEY or OPENROUTER_API_KEY
38
+ api_key: null
39
+ # Use the OpenRouter model you specified
40
+ model_name: "openai/gpt-oss-120b"
41
+ # Point the OpenAI-compatible client to OpenRouter
42
+ base_url: "http://localhost:8000/"
43
+ timeout: 300
44
+
45
+ # Default sampling parameters
46
+ temperature: 0.7
47
+ top_p: 0.95
48
+ max_tokens: 16384
49
+ presence_penalty: 0.0
50
+
51
+ # LLM Service Assignment
52
+ # Specify which LLM service to use for each component
53
+ llm_assignments:
54
+ keyword_generator: "gpt" # Options: "vllm", "gpt"
55
+ paper_summarizer: "gpt" # Options: "vllm", "gpt"
56
+ reviewer: "gpt" # Options: "vllm", "gpt"
57
+ refiner: "gpt" # Options: "vllm", "gpt" - defaults to reviewer if not specified
shared/configs/prompts.yaml ADDED
@@ -0,0 +1,580 @@
1
+ # Prompt templates for the paper reviewer agent
2
+
3
+ # Keyword generation prompt
4
+ # Based on OpenScholar's keyword extraction prompt, adapted for JSON output format
5
+ keyword_generation:
6
+ system: "You are an experienced research assistant helping to find related work for a paper. Always respond with valid JSON."
7
+ user: |
8
+ Suggest search queries to retrieve relevant papers related to the following paper. The search queries must be simple, short, and comma separated. Focus on core technical concepts, methods, and key techniques that would help find related work.
9
+
10
+ Here's an example:
11
+ ##
12
+ Paper: How has prior work incorporated personality attributes to train personalized dialogue generation models?
13
+ Search queries: personalized dialogue generation, personalized language models, personalized dialogue
14
+ ##
15
+ Paper: How do retrieval-augmented LMs perform well in knowledge-intensive tasks?
16
+ Search queries: retrieval-augmented LMs, knowledge-intensive tasks, large language models for knowledge-intensive tasks, retrieval-augmented generation
17
+ ##
18
+
19
+ Paper information:
20
+ {context}
21
+
22
+ IMPORTANT: You must respond with valid JSON only. Use this format:
23
+ {{
24
+ "keywords": ["keyword1", "keyword2", "keyword3", "keyword4", "keyword5"]
25
+ }}
26
+
27
+ Return only the JSON, no additional text or explanation. Generate 3-5 short, comma-separated keywords as search queries.
28
+
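Prompts like this one demand JSON-only output, but models often wrap the JSON in Markdown code fences anyway. A tolerant parsing sketch — illustrative only; the repository's actual parser is `shared/utils/json_parser.py`:

```python
import json
import re

def parse_keywords(raw):
    """Extract the "keywords" list from a response that should be pure JSON,
    stripping an optional Markdown code fence first."""
    text = raw.strip()
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)  # drop optional fences
    data = json.loads(text)
    keywords = data.get("keywords", [])
    return [k.strip() for k in keywords if isinstance(k, str) and k.strip()]
```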
29
+ # DOMAIN-SPECIFIC AGENTS
30
+ # Paper summarization prompt
31
+ # Generate structured summary of each related paper: summary, main methods, key findings, relation with the target paper
32
+ paper_summarization:
33
+ user: |
34
+ You are a senior research assistant who is proficient at identifying the main contributions and key findings of papers and the relations between different works.
35
+
36
+ For the reference paper below:
37
+
38
+ {reference_paper}
39
+
40
+ You are given this paper as a related work to the reference paper:
41
+
42
+ {related_paper}
43
+
44
+ Now, you need to provide concise information on what the related work is about, its main methods and results, and its relation to the reference paper, to make it easier for the supervisor to write the review.
45
+
46
+ Focusing on the relationship between the reference paper and the related work, concisely summarize the related work's main methods and results and how the two papers relate.
47
+
48
+ IMPORTANT: You must respond with valid JSON only. Use this format:
49
+ {{
50
+ "summary": "Your concise summary here in 2-3 sentences.",
51
+ "main_methods": "The main methods of the related work.",
52
+ "key_findings": "The key findings of the related work.",
53
+ "relation": "The relation between the related work and the paper you are reviewing, such as how they share similar ideas, solving the same problem, have diverged claims, etc."
54
+ }}
55
+ Return only the JSON, no additional text or explanation.
56
+
57
+ # Paper Insight Miner Prompt
58
+ paper_insight_miner:
59
+ user: |
60
+ You are an expert research assistant. Your task is to help refine the method/contribution parts of a candidate review, using the paper content as the source of truth.
61
+
62
+ SCOPE (strict):
63
+ - ONLY cover: core contributions, technical approach, model/algorithm design, mathematical formulation, assumptions, optimization/training, implementation details, and limitations of the method.
64
+ - Novelty: ONLY assess novelty claims AS PRESENTED IN THE PAPER (no external knowledge, no web search, no comparing to papers not mentioned in the text).
65
+ - Do NOT comment on experimental results, benchmarks, or score/decision fields (handled by another module).
66
+ - Do NOT do external related-work positioning (handled by another module).
67
+
68
+ Paper content:
69
+ {content}
70
+
71
+ Candidate review:
72
+ {candidate_review}
73
+
74
+ What to do:
75
+ 1) Extract the paper’s core contributions and method details (paper-grounded).
76
+ 2) Check the candidate review’s method/contribution claims against the paper and identify:
77
+ - incorrect/hallucinated/contradicted claims,
78
+ - missing key technical points,
79
+ - vague or generic statements that should be made specific.
80
+ 3) Provide short rewrite suggestions WITH evidence anchors (Section/Equation/Algorithm/Figure/snippet if available).
81
+
82
+ Output JSON only:
83
+ {
84
+ "facts": {
85
+ "core_contributions": [
86
+ {"claim": "...", "evidence": "..."}
87
+ ],
88
+ "method_summary": [
89
+ {"point": "key component / step / design choice", "evidence": "..."}
90
+ ],
91
+ "assumptions_and_scope": [
92
+ {"item": "...", "evidence": "..."}
93
+ ],
94
+ "novelty_claims_in_paper": [
95
+ {"claim": "as stated by the authors", "evidence": "..."}
96
+ ]
97
+ },
98
+ "review_issues": {
99
+ "incorrect_or_hallucinated": [
100
+ {"review_claim": "...", "why_wrong": "...", "evidence": "..."}
101
+ ],
102
+ "missing_key_points": [
103
+ {"what_missing": "...", "why_important": "...", "evidence": "..."}
104
+ ],
105
+ "needs_specificity": [
106
+ {"review_text": "...", "how_to_fix": "name the component/assumption/equation/step", "evidence": "..."}
107
+ ]
108
+ },
109
+ "rewrite_suggestions": [
110
+ {
111
+ "apply_to": "Summary|Strengths|Weaknesses|Suggestions|Questions (method-related only)",
112
+ "target": "Core Contribution Accuracy|Evidence-Based Critique|Critique Clarity",
113
+ "suggested_text": "1-2 sentences",
114
+ "evidence": "..."
115
+ }
116
+ ]
117
+ }
118
+
119
+ Rules:
120
+ - If you cannot find support in the paper text, set evidence to "not_found_in_text"; do NOT assert the paper is missing it.
121
+ - Keep each list short (<=5 items). Prefer the most important contributions/components/issues.
122
+ - Return JSON only. No extra text.
123
+
124
+ # Paper results summarization prompt for Result Analyzer Agent
125
+ # according to the paper content and candidate review, pinpoint the issues and provide refinement suggestions with concrete evidence to the candidate review.
126
+ paper_results_analyzer:
127
+ user: |
128
+ You are an expert research assistant. Your task is to help refine the experiment/evaluation parts of a candidate review, using the paper content as the source of truth.
129
+
130
+ SCOPE (strict):
131
+ - ONLY cover experimental evaluation: datasets, baselines, metrics, tables/figures, quantitative results, statistical evidence, ablations.
132
+ - Do NOT comment on novelty, related-work positioning, writing/presentation quality, or overall recommendation. Other agents will handle those.
133
+
134
+ Paper content:
135
+ {content}
136
+
137
+ Candidate review:
138
+ {candidate_review}
139
+
140
+ What to do:
141
+ 1) Extract key experimental facts from the paper.
142
+ 2) Check experiment-related claims in the candidate review and identify:
143
+ - incorrect/hallucinated/contradicted claims,
144
+ - missing key experimental points,
145
+ - vague statements that should be made specific.
146
+ 3) Provide short rewrite suggestions WITH evidence anchors (Table/Figure/Section/snippet if available).
147
+
148
+ Output JSON only:
149
+ {
150
+ "facts": {
151
+ "datasets": ["..."],
152
+ "metrics": ["..."],
153
+ "baselines": ["..."],
154
+ "key_results": [
155
+ {"claim": "...", "evidence": "..."}
156
+ ]
157
+ },
158
+ "review_issues": {
159
+ "incorrect_or_hallucinated": [
160
+ {"review_claim": "...", "why_wrong": "...", "evidence": "..."}
161
+ ],
162
+ "missing_key_points": [
163
+ {"what_missing": "...", "why_important": "...", "evidence": "..."}
164
+ ],
165
+ "needs_specificity": [
166
+ {"review_text": "...", "how_to_fix": "...", "evidence": "..."}
167
+ ]
168
+ },
169
+ "rewrite_suggestions": [
170
+ {
171
+ "apply_to": "Summary|Strengths|Weaknesses|Suggestions|Questions (experiment-related only)",
172
+ "target": "Results Interpretation|Evidence-Based Critique",
173
+ "suggested_text": "1-2 sentences",
174
+ "evidence": "..."
175
+ }
176
+ ]
177
+ }
178
+
179
+ Rules:
180
+ - If you cannot find support in the paper text, set evidence to "not_found_in_text"; do NOT assert the paper is missing it.
181
+ - Keep each list short (<=5 items). Prefer the most important issues/results.
182
+ - Return JSON only. No extra text.
183
+
184
+
185
+ # Reviewer model prompts (these may never be used)
186
+ # Needs to be adapted to the actual review model
187
+ review_prompts:
188
+ detailed: |
189
+ You are reviewing a research paper. Please provide a detailed review in markdown format covering the following sections in this exact order:
190
+
191
+ ## Summary
192
+ A brief summary of the paper's contributions and main findings.
193
+
194
+ ## Soundness
195
+ Rate the technical soundness and correctness of the methodology (provide a score from 1 to 5, where 1 is very poor and 5 is excellent). Do not add any explanation.
196
+
197
+ ## Presentation
198
+ Rate the clarity and quality of presentation (provide a score from 1 to 5, where 1 is very poor and 5 is excellent). Do not add any explanation.
199
+
200
+ ## Contribution
201
+ Rate the significance and novelty of the contribution (provide a score from 1 to 5, where 1 is very poor and 5 is excellent). Do not add any explanation.
202
+
203
+ ## Strengths
204
+ What are the main strengths of this paper? Consider:
205
+ - Novelty and originality
206
+ - Technical soundness
207
+ - Experimental validation
208
+ - Clarity of presentation
209
+ - Significance of contributions
210
+
211
+ ## Weaknesses
212
+ What are the main weaknesses or concerns? Consider:
213
+ - Methodological issues
214
+ - Missing experiments or baselines
215
+ - Limitations not acknowledged
216
+ - Clarity issues
217
+ - Reproducibility concerns
218
+
219
+ ## Questions
220
+ Any questions you have for the authors that would help clarify aspects of the work.
221
+
222
+ ## Rating
223
+ Overall rating of the paper (provide a score from 1 to 10, following the reviewer scale). Do not add any explanation.
224
+
225
+ ## Confidence
226
+ Your confidence in your assessment (provide a score from 1 to 5, following the reviewer scale). Do not add any explanation.
227
+
228
+ ## Decision
229
+ Your recommendation: "accept", "reject", or "undecided". Do not add any explanation.
230
+
231
+ IMPORTANT: Write your review in markdown format with the exact section headers above (## Summary, ## Soundness, etc.). Provide the score for each scoring section without added explanation, as instructed above. Be constructive, specific, and fair in your review.
232
+
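Because the "detailed" format is plain markdown, downstream code has to pull the numeric scores back out of the section bodies. A small extraction sketch — `extract_scores` is a hypothetical helper; the project's real formatters live under `gradio_app/components/`:

```python
import re

def extract_scores(review_md):
    """Pull numeric scores from the score-only sections of a markdown review
    (## Soundness, ## Presentation, ## Contribution, ## Rating, ## Confidence)."""
    scores = {}
    for name in ("Soundness", "Presentation", "Contribution", "Rating", "Confidence"):
        match = re.search(
            rf"^## {name}\s*\n+\s*(\d+(?:\.\d+)?)", review_md, flags=re.MULTILINE
        )
        if match:
            scores[name.lower()] = float(match.group(1))
    return scores
```

Keeping the score sections explanation-free, as the prompt instructs, is what makes this kind of regex extraction reliable.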
233
+ ai_researcher: |
234
+ You are an expert academic reviewer. Your task is to provide a thorough, structured, and balanced review of the following research paper.
235
+
236
+ Step 1: Read and Analyze the Paper Carefully
237
+ - Read the paper paragraph by paragraph.
238
+ - For each paragraph:
239
+ - Perform detailed analysis and document your thought process using <think></think> tags.
240
+ - Identify strengths, weaknesses, unclear points, logical flaws, technical inconsistencies, or missing references.
241
+ - Highlight both strengths and weaknesses.
242
+ - Ensure all observations are supported by reasoning inside <think></think> tags.
243
+
244
+ Step 2: Conduct the Review
245
+ - After completing paragraph-by-paragraph analysis, provide an overall assessment following the structure below.
246
+ - Provide scores, recommendations, and a final decision in the strict JSON format.
247
+ - Do **not** include <think> tags inside JSON format.
248
+ - Be concise yet sufficiently detailed for an academic review.
249
+
250
+ Step 3: Organize your reviews into the following JSON format:
251
+ {
252
+ "summary": [Concise, detailed summary covering methodology, key ideas, and results.],
253
+ "soundness": [Score 1-5],
254
+ "presentation": [Score 1-5],
255
+ "contribution": [Score 1-5],
256
+ "strengths": [List major strengths],
257
+ "weaknesses": [List major weaknesses, with confidence levels if applicable],
258
+ "suggestions": [Concrete recommendations to address weaknesses],
259
+ "questions": [Outstanding questions or clarifications needed],
260
+ "rating": [Overall score, e.g., 1-10],
261
+ "confidence": [Confidence in assessment, e.g., 1-5],
262
+ "decision": [Accept, Reject]
263
+ }
264
+
265
+ Few-shot example (for format reference only; do not copy the example):
266
+
267
+ {
268
+ "summary": "This paper introduces a novel algorithm for modeling shared dynamics between multiple observation processes, validated on both simulated and real-world data.",
269
+ "soundness": 3.0,
270
+ "presentation": 3.0,
271
+ "contribution": 3.0,
272
+ "strengths": "- Novel decomposition approach.\n - Separation of shared and residual dynamics.\n - Validated on real and simulated data.",
273
+ "weaknesses": "- Strong linearity assumption (high confidence).\n - Limited experiments (medium confidence).",
274
+ "suggestions": "- Test on nonlinear systems.\n - Expand evaluation datasets.",
275
+ "questions": "- Sensitivity to deviations from assumed structure?\n - Performance on nonlinear data?",
276
+ "rating": 6.5,
277
+ "confidence": 3.0,
278
+ "decision": "Accept"
279
+ }
280
+
281
+ Few-shot example (strictly follow this structure and do not copy the example):
282
+ {
283
+ "summary": "This paper introduces PG-LDS-ID, a novel algorithm designed to model the shared dynamics between two observation processes: a continuous-time Gaussian process and a discrete-time Poisson process. The core idea is to use a latent state-space model to capture the underlying dynamics that influence both observation streams, while also accounting for residual dynamics unique to the Poisson process. The authors propose a two-stage approach, where the first stage identifies the shared dynamics using a covariance-based subspace identification method, and the second stage identifies the residual dynamics that are only observable through the Poisson process. A key contribution is the introduction of a block-structured system matrix, which facilitates the separation of shared and residual dynamics. The method is motivated by applications in neuroscience, where one might want to model the relationship between continuous behavioral trajectories and discrete neural spiking activity. The authors validate their approach using both simulated data and a real-world dataset of non-human primate neural spiking activity and arm movements. The simulation results demonstrate the algorithm's ability to accurately recover the shared dynamics, and the real data experiment shows improved prediction accuracy compared to a prior method, PLDSID. The paper's significance lies in its ability to handle coupled observation processes with different statistical properties, while explicitly disentangling shared and residual dynamics, a capability not simultaneously offered by existing analytical methods. However, the paper's reliance on strong assumptions, such as linearity and a specific block structure, and the limited scope of its experimental validation, raise important questions about its broader applicability and robustness.",
+ "soundness": 2.8,
+ "presentation": 2.4,
+ "contribution": 2.6,
+ "strengths": "One of the primary strengths of this paper is the novel decomposition technique introduced through Equation (6), which employs a block-structured system matrix. This decomposition is a significant contribution as it simplifies the modeling of shared dynamics between data streams with different statistical properties, specifically Gaussian and Poisson processes. By breaking down the problem into manageable components, the authors enhance the overall approach's ease of handling and implementation. This is particularly valuable in practical scenarios where dealing with coupled dynamics can be complex. Furthermore, the paper's focus on explicitly disentangling shared and residual dynamics is a notable advancement. Existing methods often model the collective dynamics of multiple modalities in the same latent states, whereas PG-LDS-ID explicitly separates the shared dynamics from those unique to the Poisson process. This distinction is crucial for understanding the underlying mechanisms that drive different observation streams. The authors demonstrate the practical applicability of their method through both simulated and real-world experiments. The simulation results show that the proposed method can accurately recover the shared dynamics, and the real data experiment on non-human primate data shows that PG-LDS-ID achieves better prediction accuracies compared to PLDSID, a prior method. This empirical validation provides evidence for the effectiveness of the algorithm in a realistic setting. Finally, the method's ability to handle generalized linear processes, as opposed to being limited to Gaussian processes, is another strength. By using second-order moments, the proposed method can now deal with a broader class of observation models, making it more versatile and applicable to a wider range of problems.",
+ "weaknesses": "After a thorough review of the paper and the reviewer comments, I have identified several key weaknesses that significantly impact the paper's conclusions and broader applicability. First, the paper's strong reliance on the assumption of latent linear dynamics is a major limitation. The entire method is built upon linear dynamical state-space models, which, as the authors acknowledge in the 'Limitations' section, can only provide an approximation of nonlinear dynamics. This assumption is particularly concerning given that many real-world systems, especially those in neuroscience, exhibit nonlinear behavior. The authors do not provide any experimental results or analysis of how the method performs when applied to observations generated by a latent nonlinear system. This lack of evaluation makes it difficult to assess the method's robustness and applicability in real-world scenarios. The confidence level for this weakness is high, as the paper explicitly states its reliance on linear models and lacks any analysis of nonlinear systems. Second, the paper introduces a specific block structure in Equation (6) for the system matrices, which is a critical assumption for the method's ability to dissociate shared and residual dynamics. While the authors justify this structure as a design choice to facilitate the separation of dynamics, they do not sufficiently discuss the conditions under which this decomposition can be effectively implemented, or the consequences of deviations from this structure. Specifically, the paper does not explore what happens if the true coefficient matrix has non-zero values in the upper right block, which would violate the assumed block structure. The practical implications of this choice are not fully explored, and the paper lacks any sensitivity analysis to assess the robustness of the method to such deviations. 
The confidence level for this weakness is high, as the paper introduces the block structure as a key design choice without addressing its limitations or potential for misapplication. Third, the paper lacks a detailed comparison with recent, relevant subspace identification methods that also leverage multimodal data. The authors compare their method against PLDSID, a method from 2012, but do not compare against more recent techniques such as those presented in Ahmadipour et al. (2023) and Vahidi et al. (2023). This lack of comparison makes it difficult to assess the novelty and specific advantages of the proposed method compared to the current state-of-the-art. The paper mentions that existing methods do not explicitly tease apart shared and residual dynamics, but a more thorough comparison is needed to justify the contribution of this work. The confidence level for this weakness is high, as the paper does not include a comparison with recent, relevant methods. Fourth, the paper does not adequately address the estimation of the Gaussian observation noise variance. While the optimization procedure in Section 3.2.3 ensures valid noise statistics, the explicit estimation of the noise variance of the Gaussian observation process is not clearly outlined as a separate step before the optimization. This omission raises concerns about the method's sensitivity to variations in the noise variance and its impact on the accuracy of the estimated latent states. The confidence level for this weakness is medium, as the paper implicitly addresses noise statistics but does not explicitly detail the estimation of the Gaussian noise variance. Fifth, the paper's experimental evaluation is limited in scope. The authors primarily compare their method against PLDSID and do not include comparisons with more recent and competitive methods. This limited evaluation makes it difficult to assess the proposed algorithm's strengths and weaknesses in the current research landscape. 
Furthermore, the paper uses only one real-world dataset (NHP data), which limits the assessment of the model's broader applicability. The confidence level for this weakness is high, as the experimental section lacks comparisons with recent methods and uses a limited number of datasets. Finally, the paper claims that the algorithm can be generalized to non-Poisson/non-Gaussian models but does not provide any experimental evidence to support this claim. The paper states that the moment transformation step is key to extending the method, but no results are shown for any other distributions. This lack of empirical evidence makes the claim of generalizability unsubstantiated. The confidence level for this weakness is high, as the claim is made without any supporting experimental results.",
+ "suggestions": "To address the identified weaknesses, I recommend several concrete improvements. First, the authors should conduct a thorough analysis of the method's sensitivity to violations of the linearity assumption. This could involve simulating data from a variety of nonlinear dynamical systems and assessing the accuracy of the estimated latent states and their dimensions. For example, they could use simple nonlinear systems like the Duffing oscillator or the Lorenz attractor to generate synthetic data and then apply their method to this data. The performance of the method could be evaluated by comparing the estimated latent states and their dimensions to the true values. Furthermore, it would be beneficial to explore how the method's performance changes as the degree of nonlinearity increases. This analysis would provide a more comprehensive understanding of the method's limitations and its applicability to real-world scenarios where nonlinearities are common. Second, the authors should provide a more detailed analysis of the method's sensitivity to deviations from the assumed block structure in Equation (6). This could involve simulations where the true coefficient matrix has small non-zero values in the upper right block and assessing whether the method still converges to a reasonable estimate. A sensitivity analysis exploring the robustness of the method to such deviations would be crucial. Furthermore, the paper should provide more guidance on how to choose the dimensions of the latent spaces ($n_1$ and $n_x$). The current description is somewhat vague, and a more concrete procedure, perhaps based on information criteria or cross-validation, would be highly valuable. Third, the authors should include a detailed comparison of their approach with recent subspace identification techniques that use both behavioral and neural data, such as those presented in [1] and [2]. 
This comparison should include a discussion of the assumptions made by each method, the optimization procedures used, and the types of data that can be handled. For example, the authors should clearly explain how their method differs from the approaches presented in [1] and [2] in terms of the way they model the shared and residual dynamics. They should also discuss the advantages and disadvantages of their method compared to these existing techniques. Fourth, the authors should clarify the role of the noise variance of the Gaussian observation process in their method. They should provide a detailed analysis of the method's sensitivity to variations in the noise variance. This analysis could include simulations with different noise levels and a quantitative assessment of the error in latent state estimation and dimensionality identification. Furthermore, they should discuss how the method's performance is affected by the choice of the noise model. Fifth, the experimental evaluation should be expanded to include comparisons with more recent and competitive methods. While PLDSID is a relevant baseline, the field has seen significant advancements since 2012. Including comparisons with state-of-the-art methods, such as more recent deep learning approaches for time series modeling, would provide a more comprehensive assessment of the proposed algorithm's performance. This would not only highlight the strengths of the proposed method but also reveal its limitations and areas for future improvement. Finally, the authors should provide empirical support for their claim that the algorithm can be generalized to non-Poisson/non-Gaussian models. This could involve testing the method on synthetic datasets from simple models with alternative distributions. The authors should also consider including a simple example of how the moment transformation would be derived for a different distribution, such as Bernoulli, to further support their claim.",
+ "questions": "Several key uncertainties remain after my review of this paper. First, I am particularly interested in the justification for the block structure assumed in Equation (6). While the authors claim this structure does not lose generality, I would like to understand the practical implications of this choice more thoroughly. Specifically, how does the method behave when the true underlying system deviates from this block structure, even slightly? What happens if there are small non-zero values in the upper right block of the coefficient matrix? Does the method still converge to a reasonable estimate, or does it break down? A more detailed explanation of the assumptions underlying this block structure, and a sensitivity analysis exploring its robustness, would be highly beneficial. Second, I am curious about the method's performance when applied to nonlinear systems. The paper acknowledges the limitation of assuming linear dynamics but does not provide any analysis of how the method performs when this assumption is violated. How does the method perform when the underlying system is nonlinear? How does the accuracy of the estimated latent states and their dimensions change as the degree of nonlinearity increases? I would like to see more systematic evaluations of the method's performance under nonlinear conditions. Third, I would like to understand how the method compares to existing approaches that use subspace identification for multimodal data, specifically those mentioned in [1] and [2]. How does the proposed method differ in terms of the assumptions made, the optimization procedures used, and the types of data that can be handled? A more detailed comparison with these methods is needed to justify the specific contribution of this work. Fourth, I am interested in the role of the noise variance of the Gaussian observation process in the method. How does the noise variance affect the accuracy of the estimated latent states and their dimensions? 
How does the method's performance change as the noise variance varies? A more thorough analysis of the method's sensitivity to variations in the noise variance would be valuable. Finally, I would like to understand the practical limitations of the proposed method. What are the assumptions underlying the method, and when might these assumptions be violated in practice? Are there specific types of dynamical systems for which the method is not suitable? A clear discussion of these limitations would help readers understand the scope of the method and avoid misapplications.",
+ "rating": 6.8,
+ "confidence": 2.6,
+ "decision": "Reject"
+ }
+
+ NOTE: Output the JSON format ONLY. DO NOT output anything other than the JSON, and DO NOT include your thinking process.
+
+
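Callers of the reviewer prompt have to parse a JSON-only reply. A minimal sketch of how that reply could be validated — the repo's own `parse_json_response` in `shared/utils/json_parser.py` presumably handles this more robustly, and `parse_review_json` is an illustrative name, not from the codebase:

```python
import json

def parse_review_json(raw: str) -> dict:
    """Parse a reviewer response that should be JSON-only.

    Tolerates the common failure mode of the model wrapping its
    JSON in a ```json ... ``` code fence despite instructions.
    """
    text = raw.strip()
    if text.startswith("```"):
        # drop the opening fence (with optional language tag) and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit("```", 1)[0]
    return json.loads(text)
```

A stricter version could additionally check that the required keys (`summary`, `rating`, `decision`, ...) are present before accepting the review.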
+ # System message for reviewer
+ reviewer_system: "You are an expert academic reviewer with deep knowledge in the field. Always respond with valid JSON format."
+
+ # Refiner prompts
+ refiner_prompts:
+ detailed: |
+ You are a senior researcher refining an existing peer review. Your job is to improve factual grounding, coverage, and usefulness while preserving the draft’s structure and intent. Treat the paper text as the source of truth.
+
+ You will be given:
+ (1) Paper text (plain text converted from PDF)
+ (2) Draft review (structured)
+ (3) Method/Contribution audit report (from Paper Insight Miner; paper-grounded)
+ (4) Experiments/Results audit report (from Paper Results Analyzer; paper-grounded)
+ (5) Related-work summaries (each item is a JSON summary of one retrieved paper, written relative to the target paper)
+
+ ========================
+ Primary objectives (what to improve)
+ ========================
+ Refine the review to satisfy these content-quality dimensions:
+
+ 1) Core Contribution Accuracy
+ 2) Results Interpretation
+ 3) Comparative Analysis / Positioning
+ 4) Evidence-Based Critique
+ 5) Critique Clarity
+ 6) Completeness Coverage
+ 7) Constructive Tone
+ 8) Avoid False or Contradictory Claims (critical)
+
+ ========================
+ Hard constraints (must follow)
+ ========================
+ A. Paper-grounded correctness is mandatory:
+ - If the audit reports mark a draft claim as incorrect/hallucinated/contradicted, you MUST fix or remove it.
+ - Do NOT introduce new factual claims about the paper unless you can anchor them to the paper text or the audit reports’ evidence.
+
+ B. Evidence anchoring rule:
+ - Every major critique (esp. in Weaknesses/Suggestions/Questions) must include a verifiable anchor:
+ section name, table/figure identifier, equation/algorithm reference, dataset/metric name, or a short quote snippet (<= 20 words).
+ - If you cannot find support, convert the statement into a question or a suggestion for clarification (do not assert absence).
+
+ C. Related-work usage rule (anti-leak / anti-overclaim):
+ - Retrieved related-work summaries are NOT guaranteed to be cited by the submission.
+ - Never claim “the paper compares to/cites X” unless the paper text actually contains X.
+ - When using retrieved works, attribute them as external context:
+ “The related-work search suggests …; it would help to clarify/compare …”
+ - Use related work to: (i) sharpen positioning, (ii) propose missing baselines/comparisons, (iii) raise targeted questions.
+
+ D. Minimal-change policy:
+ - Keep the original structure and as much of the draft wording as possible.
+ - Do NOT shorten aggressively; do NOT rewrite into a totally new review.
+ - Prefer targeted edits, insertions, and corrections.
+
+ E. Numeric fields policy (IMPORTANT):
+ - Default: keep ALL numeric fields and the decision unchanged.
+ - Change numeric fields ONLY if the refined textual assessment would otherwise be clearly inconsistent, or if a major factual correction materially changes the evaluation.
+ - If you change any numeric field: change the minimum number of fields, and keep changes small unless necessary.
+
+ ========================
+ How to use the tool reports (operational)
+ ========================
+ 1) Apply Paper Insight Miner (method/contribution):
+ - Use `review_issues.incorrect_or_hallucinated` to remove/correct wrong claims in Summary/Strengths/Weaknesses.
+ - Use `missing_key_points` and `needs_specificity` to improve technical specificity.
+ - Incorporate `rewrite_suggestions` where appropriate (method-related only).
+
+ 2) Apply Paper Results Analyzer (experiments/results):
+ - Correct any wrong result interpretation.
+ - Add missing datasets/baselines/metrics/key results if they are important and supported.
+ - Convert vague experiment critiques into concrete, testable suggestions with anchors.
+ - Incorporate `rewrite_suggestions` where appropriate (experiment-related only).
+
+ 3) Use Related-work summaries:
+ - Use each item’s `relation` to craft 1–3 concrete positioning points:
+ - what is similar/different,
+ - what comparisons would strengthen the paper,
+ - what claims need clarification.
+ - Do NOT dump a bibliography; only mention the most relevant comparisons (typically <= 3 items).
+ - Phrase as external suggestions, not accusations.
+
+ ========================
+ Refinement checklist (do in order)
+ ========================
+ Step 1: Fix incorrect/hallucinated statements flagged by the two audit reports.
+ Step 2: Improve Summary and Strengths with paper-grounded method + results highlights.
+ Step 3: Strengthen Weaknesses with evidence anchors and clearer critique.
+ Step 4: Add actionable Suggestions (each mapped to a weakness).
+ Step 5: Improve Questions to resolve uncertainties (especially when evidence is not found).
+ Step 6: Improve Comparative Analysis using related-work summaries with proper attribution.
+ Step 7: Ensure constructive tone and completeness across method / experiments / positioning.
+
+ ========================
+ Output format (JSON ONLY)
+ ========================
+ Return a JSON object with the following keys ONLY.
+ - Numeric fields must be numbers (not strings).
+ - decision must be one of: "accept", "reject".
+ - Do not output any text outside JSON.
+
+ {
+ "summary": "...",
+ "strengths": "...",
+ "weaknesses": "...",
+ "questions": "...",
+ "soundness": 0,
+ "presentation": 0,
+ "contribution": 0,
+ "rating": 0,
+ "confidence": 0,
+ "decision": "your_decision"
+ }
+
+ ========================
+ Inputs
+ ========================
+ [Paper Text]
+ <<paper_text>>
+
+ [Draft Review]
+ <<draft_review>>
+
+ [Paper Insight Miner Output (JSON)]
+ <<insight_miner_json>>
+
+ [Paper Results Analyzer Output (JSON)]
+ <<results_analyzer_json>>
+
+ [Related-work Summaries (JSON list)]
+ <<related_work_json_list>>
+
+
+ # System message for refiner
+ refiner_system: "You are an expert review refiner with deep knowledge in academic review quality standards and meta rubrics."
+
+ # Evaluation prompts for review-based rubric generation and evaluation
+ # Rubric template for generating paper-specific rubrics
+ rubrics: |
+ [
+ {
+ "title": "Core Contribution Accuracy",
+ "description": "Essential Criteria: Identifies whether the review accurately describes the paper's main contributions and central methodological innovations in the summary and strengths sections without misinterpretation.",
+ "weight": 1
+ },
+ {
+ "title": "Results Interpretation",
+ "description": "Important Criteria: Explains whether the review correctly interprets the empirical results in the summary and strengths sections, including tables, figures, and statistical comparisons.",
+ "weight": 1
+ },
+ {
+ "title": "Comparative Analysis",
+ "description": "Important Criteria: States whether the review appropriately discusses comparisons with baselines and related work presented in the paper.",
+ "weight": 1
+ },
+ {
+ "title": "Evidence-Based Critique",
+ "description": "Essential Criteria: Identifies whether weaknesses and criticisms in the Weaknesses, Suggestions, and Questions sections are supported by specific references to paper content such as sections, equations, tables, or figures.",
+ "weight": 1
+ },
+ {
+ "title": "Critique Clarity",
+ "description": "Important Criteria: Explains whether the identified weaknesses in the Weaknesses, Suggestions, and Questions sections are stated clearly and specifically enough for authors to understand what needs improvement.",
+ "weight": 1
+ },
+ {
+ "title": "Completeness Coverage",
+ "description": "Important Criteria: Identifies whether the review addresses all major components of the paper including methodology, theory, experiments, and related work.",
+ "weight": 1
+ },
+ {
+ "title": "Constructive Tone",
+ "description": "Optional Criteria: States whether the review maintains a professional and constructive tone that encourages improvement rather than discouragement.",
+ "weight": 1
+ },
+ {
+ "title": "False or Contradictory Claims",
+ "description": "Pitfall Criteria: Does not mention content or experiments that are actually absent from the paper, incorrectly claim something is missing when it exists, or make statements that contradict the paper's stated results, conclusions, or explicitly documented design choices.",
+ "weight": -1
+ }
+ ]
+
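The template's weights only mark polarity (+1 for positive rubrics, -1 for the pitfall), while the evaluator assigns each rubric a score from -2 to 2. The config does not specify how per-rubric scores are combined into one number, so the normalization below is an assumed aggregation, shown purely for illustration:

```python
def aggregate_rubric_scores(rubrics: list, scores: dict) -> float:
    """Sum evaluator scores and normalize by the maximum attainable total.

    Positive rubrics contribute 0..2 each; the pitfall (negative-weight)
    rubric contributes -2..0, so only positive rubrics define the maximum.
    Missing scores default to 0.
    """
    total = sum(scores.get(r["title"], 0) for r in rubrics)
    max_total = 2 * sum(1 for r in rubrics if r["weight"] > 0)
    return total / max_total if max_total else 0.0

rubrics = [
    {"title": "Core Contribution Accuracy", "weight": 1},
    {"title": "Evidence-Based Critique", "weight": 1},
    {"title": "False or Contradictory Claims", "weight": -1},
]
scores = {
    "Core Contribution Accuracy": 2,
    "Evidence-Based Critique": 1,
    "False or Contradictory Claims": -1,
}
ratio = aggregate_rubric_scores(rubrics, scores)  # (2 + 1 - 1) / 4
```

Other aggregations (e.g. dropping the pitfall penalty, or averaging raw scores) are equally plausible; the repo's evaluation code defines the authoritative one.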
+ # Rubric generation prompt (v2 - uses template)
+ v2_rubric_generation_prompt: |
+ You are an expert rubric writer for evaluating the quality of paper reviews in AI-related academic fields, including:
+
+ - Machine Learning
+ - Deep Learning
+ - Natural Language Processing (NLP)
+ - Computer Vision
+ - Robotics
+ - Reinforcement Learning
+ - Optimization
+ - Data-centric AI
+ - Related subdisciplines
+
+ You are given the paper content, a golden review, and a review evaluation rubric template.
+ Based on the paper content and the golden review, your task is to combine the rubric template with them to form a complete review evaluation rubric set, so that it can be used to judge whether other candidate reviews capture the key contents of the golden review and candidly reflect the understanding, strengths, and weaknesses of the paper.
+ The combined rubrics should be precise, comprehensive, and actionable.
+
+ ## Rubric Construction Rules
+
+ ### Total Items
+
+ - Rewrite each rubric item in the template as a self-contained evaluation criterion following the above areas and criteria, adhering dynamically to the content of the golden review. Keep the original weight of the rubric item.
+ - Do NOT add or remove any rubric item. Strictly adhere to the areas and criteria in the template.
+
+ ### Category Guidance
+
+ - **Essential:** Critical facts or safety checks; missing this invalidates the response.
+ - **Important:** Key reasoning, completeness, or clarity; strongly affects quality.
+ - **Optional:** Helpful stylistic or depth additions; not required.
+ - **Pitfall:** Common important mistakes; each must begin with "Pitfall Criteria: Does not mention …" or "Pitfall Criteria: Recommends …"
+
+ ---
+
+ ## Output Requirements
+
+ - Provide a **JSON array** of rubric objects.
+ - Each object must contain **exactly three keys**: `{ "title": "...", "description": "...", "weight": ... }`
+ - No additional keys allowed.
+ - Do **not** copy large blocks of the question or reference answer.
+ - Every description must **start with its category prefix**.
+
+ Now, provided is the golden review for you to refer to:
+
+ <<golden_review>>
+
+ Following is the paper content for you to refer to:
+
+ <<paper_context>>
+
+ And below is the rubric template as JSON for you to refer to:
+
+ <<rubric_template>>
+
+ Please start your rubric generation following the above information provided and the instructions.
+
+ # Evaluator prompt (v1 - for evaluating reviews using rubrics)
+ v1_evaluator_prompt: |
+ You are an expert academic reviewer tasked with evaluating a research paper review following a list of rubrics.
+
+ Coupled with the paper content and the review, you need to score the review on each rubric and provide corresponding rationales.
+
+ The rubrics are as follows:
+
+ {rubrics_json}
+
+ The score should be in the range of -2 to 2. Do NOT refer to the value of the weight when assigning the score during evaluation; treat weights only as indications of whether a rubric is positive or negative.
+ If the weight is positive, this rubric is positive. If the weight is negative, this rubric is negative.
+
+ For each rubric:
+ - If this rubric is positive:
+ - If the review meets none of the key points in this rubric, assign 0.
+ - If the review meets at least half of the key points in this rubric, assign 1.
+ - If the review meets all of the key points in this rubric, assign 2.
+ - If this rubric is negative:
+ - If the review does NOT suffer from the pitfall (good), assign 0.
+ - If it DOES suffer from any of the key points in this pitfall rubric, assign -1.
+ - If it DOES suffer from all of the key points in this pitfall rubric, assign -2.
+
+ Your output format should be:
+ {{
+ "<first_rubric_title>": {{
+ "score": <-2 to 2>,
+ "rationale": "<rationale explaining the score>"
+ }},
+ ...
+ "<last_rubric_title>": {{
+ "score": <-2 to 2>,
+ "rationale": "<rationale explaining the score>"
+ }}
+ }}
+
+ DO NOT include any other text in your output. Output the JSON string ONLY.
+
+ Now, provided is the paper content:
+
+ <<paper_content>>
+
+ Now, provided is the review:
+
+ <<review>>
+
+ Please start your evaluation following the above information provided and the instructions.
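Prompts throughout this file embed their inputs via `<<name>>` placeholders (`<<paper_text>>`, `<<golden_review>>`, `<<review>>`, ...). A sketch of how such a template might be filled at call time — the repo's `shared/utils/prompt_loader.py` presumably does something equivalent, and `fill_prompt` is an illustrative name, not the actual API:

```python
import re

def fill_prompt(template: str, values: dict) -> str:
    """Replace every <<name>> placeholder; fail loudly if any is left unfilled."""
    filled = template
    for name, value in values.items():
        filled = filled.replace(f"<<{name}>>", value)
    leftover = re.findall(r"<<(\w+)>>", filled)
    if leftover:
        # catching this early beats sending a half-filled prompt to the LLM
        raise KeyError(f"unfilled placeholders: {leftover}")
    return filled

prompt = fill_prompt(
    "Now, provided is the review:\n\n<<review>>",
    {"review": '{"summary": "..."}'},
)
```

Failing on leftover placeholders is a design choice: silent `<<review>>` strings in a dispatched prompt are a common and hard-to-spot bug.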
+
shared/configs/reranker_endpoint_pool.txt ADDED
@@ -0,0 +1,8 @@
+ http://localhost:8008
+ http://localhost:8009
+ http://localhost:8010
+ http://localhost:8011
+ http://localhost:8012
+ http://localhost:8013
+ http://localhost:8014
+ http://localhost:8015
shared/configs/vllm_endpoint_pool.txt ADDED
@@ -0,0 +1,7 @@
+ http://localhost:8001/v1
+ http://localhost:8002/v1
+ http://localhost:8003/v1
+ http://localhost:8004/v1
+ http://localhost:8005/v1
+ http://localhost:8006/v1
+ http://localhost:8007/v1
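Both pool files use the same convention: one base URL per line, one line per local service instance. A minimal round-robin selector over such a file's contents — the actual `VLLMEndpointPool` and reranker pool classes in `shared/utils/` presumably add status tracking on top; this shows only the core idea:

```python
from itertools import cycle

def load_endpoints(text: str) -> list:
    """Parse a pool file: one endpoint per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# contents in the same format as shared/configs/vllm_endpoint_pool.txt
pool_file = """\
http://localhost:8001/v1
http://localhost:8002/v1
http://localhost:8003/v1
"""

endpoints = load_endpoints(pool_file)
rotation = cycle(endpoints)          # endless round-robin iterator
first, second = next(rotation), next(rotation)
```

In practice the file would be read with `Path(...).read_text()` and each endpoint health-checked before use.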
shared/utils/__init__.py ADDED
@@ -0,0 +1,113 @@
+ """
+ Shared utilities for the unified review system.
+ """
+ # Core utilities (always available)
+ from .json_parser import (
+     parse_review_markdown,
+     parse_keywords_json,
+     parse_summary_json,
+     parse_json_response,
+ )
+
+ from .prompt_loader import get_prompt_loader
+
+ # API Key Pool and Endpoint Pool (always available)
+ try:
+     from .asta_api_key_pool import AstaAPIKeyPool
+     _all_pools = ['AstaAPIKeyPool']
+ except ImportError:
+     AstaAPIKeyPool = None
+     _all_pools = []
+
+ try:
+     from .vllm_endpoint_pool import VLLMEndpointPool
+     _all_pools.append('VLLMEndpointPool')
+ except ImportError:
+     VLLMEndpointPool = None
+
+ if _all_pools:
+     __all__ = _all_pools
+
+ # Lazy imports for heavy dependencies (LLM-related)
+ # These may fail if dependencies are not installed, but that's okay
+ def _lazy_import_llm_services():
+     """Lazy import LLM services to avoid dependency issues"""
+     try:
+         from .llm_service import LLMService, ChatMessage
+         from .llm_service_factory import (
+             get_llm_service_factory,
+             LLMServiceFactory,
+             load_api_key_from_config,
+         )
+         return {
+             'LLMService': LLMService,
+             'ChatMessage': ChatMessage,
+             'get_llm_service_factory': get_llm_service_factory,
+             'LLMServiceFactory': LLMServiceFactory,
+             'load_api_key_from_config': load_api_key_from_config,
+         }
+     except ImportError:
+         return {}
+
+ def _lazy_import_llm_implementations():
+     """Lazy import LLM service implementations"""
+     result = {}
+     try:
+         from .vllm_service import VLLMService
+         result['VLLMService'] = VLLMService
+     except ImportError:
+         pass
+
+     try:
+         from .gpt_service import GPTService
+         result['GPTService'] = GPTService
+     except ImportError:
+         pass
+
+     try:
+         from .mock_llm_service import MockLLMService, extract_title_from_latex, extract_abstract_from_latex
+         result['MockLLMService'] = MockLLMService
+         result['extract_title_from_latex'] = extract_title_from_latex
+         result['extract_abstract_from_latex'] = extract_abstract_from_latex
+     except ImportError:
+         pass
+
+     return result
+
+ def _lazy_import_other():
+     """Lazy import other utilities"""
+     result = {}
+     try:
+         from .reranker import rerank_paragraphs_bge
+         result['rerank_paragraphs_bge'] = rerank_paragraphs_bge
+     except ImportError:
+         pass
+
+     try:
+         from .review_logger import ReviewLogger
+         result['ReviewLogger'] = ReviewLogger
+     except ImportError:
+         pass
+
+     return result
+
+ # Populate __all__ dynamically
+ _llm_services = _lazy_import_llm_services()
+ _llm_impls = _lazy_import_llm_implementations()
+ _other = _lazy_import_other()
+
+ # Make all lazy imports available at module level
+ globals().update(_llm_services)
+ globals().update(_llm_impls)
+ globals().update(_other)
+
+ __all__ = [
+     'parse_review_markdown',
+     'parse_keywords_json',
+     'parse_summary_json',
+     'parse_json_response',
+     'get_prompt_loader',
+ ] + list(_llm_services.keys()) + list(_llm_impls.keys()) + list(_other.keys())
+
+ # Include whichever pool classes imported successfully (AstaAPIKeyPool, VLLMEndpointPool)
+ __all__.extend(_all_pools)
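The `__init__.py` above builds its public surface from whatever optional imports succeed. The same pattern in miniature, using stdlib modules so it runs anywhere (names here are illustrative, not from the repo):

```python
def optional_import(module_name: str, attr: str):
    """Return a module attribute if importable, else None.

    Mirrors the try/except ImportError pattern used in
    shared/utils/__init__.py for heavy optional dependencies.
    """
    try:
        module = __import__(module_name)
        return getattr(module, attr)
    except ImportError:
        return None

exported = []
loads = optional_import("json", "loads")                      # stdlib: always present
if loads is not None:
    exported.append("loads")
missing = optional_import("definitely_not_a_real_module", "x")  # import fails -> None
```

The payoff is that `import shared.utils` never crashes on a machine without, say, `torch` installed; only the features backed by missing packages are absent.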
shared/utils/asta_api_key_pool.py ADDED
@@ -0,0 +1,205 @@
1
+ """
2
+ Asta API Key Pool Manager
3
+
4
+ Manage multiple Asta API keys, implement key rotation and error handling.
5
+ """
6
+ import os
7
+ import random
8
+ import time
9
+ from pathlib import Path
10
+ from typing import List, Optional, Dict
11
+ from threading import Lock
12
+
13
+
14
+ class AstaAPIKeyPool:
15
+ """
16
+ Asta API Key Pool Manager
17
+
18
+ Features:
19
+ 1. Load multiple API keys from file
20
+ 2. Randomly rotate keys
21
+ 3. Track each key's usage status and errors
22
+ 4. Penalize recently failed keys when selecting (debounce-style retry strategy)
23
+ """
24
+
25
+ def __init__(self, pool_path: Optional[str] = None, keys: Optional[List[str]] = None):
26
+ """
27
+ Initialize API Key Pool
28
+
29
+ Args:
30
+ pool_path: API keys file path (one key per line)
31
+ keys: directly provide a list of keys (takes priority over pool_path)
32
+ """
33
+ self.keys: List[str] = []
34
+ self.used_indices: List[int] = [] # indices used in current rotation
35
+ self.key_status: Dict[str, Dict] = {} # status information for each key
36
+ self.lock = Lock() # thread safe lock
37
+
38
+ # load keys
39
+ if keys:
40
+ self.keys = [k.strip() for k in keys if k.strip()]
41
+ elif os.environ.get("ASTA_API_KEY"):
42
+ # Try to get one or more keys from environment variable (comma-separated)
43
+ self.keys = [k.strip() for k in os.environ.get("ASTA_API_KEY").split(",") if k.strip()]
44
+ elif pool_path:
45
+ self._load_from_file(pool_path)
46
+ else:
47
+ raise ValueError(
48
+ "No API keys available. Provide keys via pool_path, keys parameter, "
49
+ "or ASTA_API_KEY environment variable."
50
+ )
51
+
52
+ if not self.keys:
53
+ raise ValueError(
54
+ "No API keys available. Provide keys via pool_path, keys parameter, "
55
+ "or ASTA_API_KEY environment variable."
56
+ )
57
+
58
+ # initialize status for each key
59
+ for key in self.keys:
60
+ self.key_status[key] = {
61
+ 'error_count': 0,
62
+ 'last_error_time': None,
63
+ 'consecutive_errors': 0,
64
+ 'total_requests': 0,
65
+ 'successful_requests': 0,
66
+ }
67
+
68
+ def _load_from_file(self, pool_path: str):
69
+ """Load API keys from file"""
70
+ path = Path(pool_path)
71
+
72
+ # if relative path, try to find file relative to shared/configs
73
+ if not path.is_absolute():
74
+ # try to find file relative to project root
75
+ project_root = Path(__file__).parent.parent.parent
76
+ path = project_root / "shared" / "configs" / pool_path
77
+ if not path.exists():
78
+ # try to find file relative to shared/configs
79
+ path = Path(__file__).parent.parent / "configs" / pool_path
80
+
81
+ if not path.exists():
82
+ raise FileNotFoundError(
83
+ f"API key pool file not found: {pool_path} (tried: {path})"
84
+ )
85
+
86
+ with open(path, 'r', encoding='utf-8') as f:
87
+ lines = f.readlines()
88
+
89
+ self.keys = [line.strip() for line in lines if line.strip() and not line.strip().startswith('#')]
90
+
91
+ if not self.keys:
92
+ raise ValueError(f"No valid API keys found in pool file: {pool_path}")
93
+
94
+ def get_key(self) -> str:
95
+ """
96
+ Get next available API key (rotation strategy)
97
+
98
+ Strategy:
99
+ 1. If current rotation is not complete, continue using unused keys
100
+ 2. If current rotation is complete, start a new round (reset used_indices)
101
+ 3. Prioritize keys with no recent errors
102
+
103
+ Returns:
104
+ Available API key
105
+ """
106
+ with self.lock:
107
+ if not self.keys:
108
+ raise ValueError("No API keys available in pool")
109
+
110
+ # if current rotation is complete, start a new round
111
+ if len(self.used_indices) >= len(self.keys):
112
+ self.used_indices = []
113
+
114
+ # get indices not used in current rotation
115
+ available_indices = [i for i in range(len(self.keys)) if i not in self.used_indices]
116
+
117
+ if not available_indices:
118
+ # all keys are used in current rotation, start a new round
119
+ available_indices = list(range(len(self.keys)))
120
+ self.used_indices = []
121
+
122
+ # score each key and prefer those with higher success rates and fewer recent errors
123
+ key_scores = []
124
+ for idx in available_indices:
125
+ key = self.keys[idx]
126
+ status = self.key_status[key]
127
+
128
+ # calculate score: fewer errors and a higher success rate yield a higher score
129
+ error_count = status['error_count']
130
+ total = status['total_requests']
131
+ success_rate = (status['successful_requests'] / total) if total > 0 else 1.0
132
+
133
+ # if recent error, reduce score
134
+ recent_error_penalty = 0
135
+ if status['last_error_time']:
136
+ time_since_error = time.time() - status['last_error_time']
137
+ if time_since_error < 60: # 1 minute
138
+ recent_error_penalty = 0.5
139
+
140
+ score = success_rate - (error_count * 0.1) - recent_error_penalty
141
+ key_scores.append((idx, score))
142
+
143
+ # sort by score, select highest score (but add some randomness)
144
+ key_scores.sort(key=lambda x: x[1], reverse=True)
145
+
146
+ # select from top 50% (add randomness but prioritize better keys)
147
+ top_n = max(1, len(key_scores) // 2) if len(key_scores) > 1 else 1
148
+ selected_idx, _ = random.choice(key_scores[:top_n])
149
+
150
+ # mark as used
151
+ self.used_indices.append(selected_idx)
152
+
153
+ selected_key = self.keys[selected_idx]
154
+ self.key_status[selected_key]['total_requests'] += 1
155
+
156
+ return selected_key
157
+
158
+ def mark_success(self, key: str):
159
+ """Mark a request with this key as successful"""
160
+ with self.lock:
161
+ if key in self.key_status:
162
+ self.key_status[key]['successful_requests'] += 1
163
+ self.key_status[key]['consecutive_errors'] = 0
164
+
165
+ def mark_error(self, key: str, error_type: str = "rate_limit"):
166
+ """
167
+ Mark a request with this key as failed
168
+
169
+ Args:
170
+ key: failed API key
171
+ error_type: error type ("rate_limit", "auth_error", "server_error", "other")
172
+ """
173
+ with self.lock:
174
+ if key in self.key_status:
175
+ status = self.key_status[key]
176
+ status['error_count'] += 1
177
+ status['consecutive_errors'] += 1
178
+ status['last_error_time'] = time.time()
179
+
180
+ def get_status(self) -> Dict:
181
+ """Get pool status information (for debugging)"""
182
+ with self.lock:
183
+ return {
184
+ 'total_keys': len(self.keys),
185
+ 'current_round_progress': f"{len(self.used_indices)}/{len(self.keys)}",
186
+ 'keys_status': {
187
+ key: {
188
+ 'error_count': status['error_count'],
189
+ 'successful_requests': status['successful_requests'],
190
+ 'total_requests': status['total_requests'],
191
+ 'success_rate': (
192
+ status['successful_requests'] / status['total_requests']
193
+ if status['total_requests'] > 0 else 0.0
194
+ ),
195
+ 'consecutive_errors': status['consecutive_errors'],
196
+ 'last_error_time': status['last_error_time'],
197
+ }
198
+ for key, status in self.key_status.items()
199
+ }
200
+ }
201
+
202
+ def reset_round(self):
203
+ """Reset the current rotation (force start of a new round)"""
204
+ with self.lock:
205
+ self.used_indices = []
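The scoring rule inside `get_key()` can be isolated for illustration. The sketch below (the `key_score` helper name is illustrative, not part of the class) mirrors the formula used above: success rate minus 0.1 per recorded error, minus a 0.5 penalty if the key failed within the last minute.

```python
import time

def key_score(status: dict, now: float) -> float:
    # Mirrors AstaAPIKeyPool.get_key(): success rate minus error penalties,
    # with an extra penalty for keys that failed within the last 60 seconds.
    total = status['total_requests']
    success_rate = status['successful_requests'] / total if total else 1.0
    recent = status['last_error_time']
    recent_penalty = 0.5 if recent is not None and now - recent < 60 else 0.0
    return success_rate - status['error_count'] * 0.1 - recent_penalty

# A fresh key (no requests yet) scores 1.0; a flaky key is penalized.
fresh = {'total_requests': 0, 'successful_requests': 0,
         'error_count': 0, 'last_error_time': None}
flaky = {'total_requests': 10, 'successful_requests': 5,
         'error_count': 5, 'last_error_time': time.time()}
assert key_score(fresh, time.time()) == 1.0
assert abs(key_score(flaky, time.time()) - (-0.5)) < 1e-9
```

Note that a brand-new key always outranks any key with errors, which is why fresh keys are tried first in a rotation.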
shared/utils/gpt_service.py ADDED
@@ -0,0 +1,210 @@
1
+ """
2
+ OpenAI GPT API service implementation
3
+ """
4
+ import os
5
+ from typing import List, Dict, Optional, Any, Union
6
+ from openai import OpenAI
7
+ from .llm_service import LLMService, ChatMessage
8
+
9
+
10
+ class GPTService(LLMService):
11
+ """
12
+ OpenAI GPT API service wrapper
13
+
14
+ This service connects to OpenAI's API (or compatible API)
15
+ """
16
+
17
+ def __init__(
18
+ self,
19
+ api_key: Optional[str] = None,
20
+ model_name: str = "gpt-4o",
21
+ base_url: Optional[str] = None,
22
+ timeout: int = 300,
23
+ ):
24
+ """
25
+ Initialize GPT service
26
+
27
+ Args:
28
+ api_key: OpenAI / OpenRouter API key (set via env if omitted)
29
+ model_name: Model name (e.g., gpt-4o, openai/gpt-oss-120b:free, etc.)
30
+ base_url: API base URL (default: https://api.openai.com/v1)
31
+ timeout: Request timeout in seconds
32
+ """
33
+ # Prefer explicit parameter, then common environment variables.
34
+ # This allows using OpenRouter (OPENROUTER_API_KEY) without hard-coding secrets.
35
+ self.api_key = (
36
+ api_key
37
+ or os.environ.get("OPENAI_API_KEY")
38
+ or os.environ.get("OPENROUTER_API_KEY")
39
+ )
40
+ if not self.api_key:
41
+ raise ValueError(
42
+ "API key is required. Set OPENAI_API_KEY or OPENROUTER_API_KEY "
43
+ "environment variable, or pass api_key parameter."
44
+ )
45
+
46
+ self.model_name = model_name
47
+ # Prefer explicit base_url, then environment variables, then OpenAI default.
48
+ # This allows swapping in any OpenAI-compatible endpoint (e.g., OpenRouter)
49
+ # without changing code.
50
+ self.base_url = (
51
+ base_url
52
+ or os.environ.get("OPENAI_BASE_URL")
53
+ or os.environ.get("OPENROUTER_BASE_URL")
54
+ or "https://api.openai.com/v1"
55
+ )
56
+ self.timeout = timeout
57
+
58
+ self.client = OpenAI(
59
+ api_key=self.api_key,
60
+ base_url=self.base_url,
61
+ timeout=self.timeout,
62
+ )
63
+
64
+ def _format_messages(self, messages: List[Union[ChatMessage, Dict[str, str]]]) -> List[Dict[str, str]]:
65
+ """Format messages for OpenAI API"""
66
+ formatted = []
67
+ for msg in messages:
68
+ if isinstance(msg, ChatMessage):
69
+ formatted.append({"role": msg.role, "content": msg.content})
70
+ elif isinstance(msg, dict):
71
+ formatted.append(msg)
72
+ else:
73
+ raise ValueError(f"Invalid message type: {type(msg)}")
74
+ return formatted
75
+
76
+ def generate(
77
+ self,
78
+ messages: List[Union[ChatMessage, Dict[str, str]]],
79
+ temperature: float = 0.7,
80
+ top_p: float = 0.95,
81
+ top_k: int = 20,
82
+ max_tokens: int = 16384,
83
+ presence_penalty: float = 0.0,
84
+ **kwargs
85
+ ) -> str:
86
+ """
87
+ Generate text from messages
88
+
89
+ Args:
90
+ messages: List of chat messages
91
+ temperature: Sampling temperature
92
+ top_p: Top-p sampling parameter
93
+ top_k: Top-k sampling parameter (not used by GPT API, but kept for compatibility)
94
+ max_tokens: Maximum tokens to generate
95
+ presence_penalty: Presence penalty (0-2)
96
+ **kwargs: Additional parameters
97
+
98
+ Returns:
99
+ Generated text
100
+ """
101
+ formatted_messages = self._format_messages(messages)
102
+
103
+ try:
104
+ # GPT API doesn't support top_k, so we exclude it
105
+ # Some newer models (e.g., the o1 family and GPT-5) use max_completion_tokens instead of max_tokens
106
+ params = {
107
+ "model": self.model_name,
108
+ "messages": formatted_messages,
109
+ "temperature": temperature,
110
+ "top_p": top_p,
111
+ "presence_penalty": presence_penalty,
112
+ }
113
+
114
+ # Check if model requires max_completion_tokens instead of max_tokens
115
+ # Models that use max_completion_tokens: o1, o1-preview, o1-mini, and newer models
116
+ if any(model_name in self.model_name.lower() for model_name in ["o1", "gpt-5", "gpt5"]):
117
+ params["max_completion_tokens"] = max_tokens
118
+ else:
119
+ params["max_tokens"] = max_tokens
120
+
121
+ params.update({k: v for k, v in kwargs.items() if k not in ["top_k", "max_tokens", "max_completion_tokens"]})
122
+
123
+ response = self.client.chat.completions.create(**params)
124
+
125
+ return response.choices[0].message.content
126
+
127
+ except Exception as e:
128
+ # If max_tokens fails, try max_completion_tokens as fallback
129
+ if "max_tokens" in str(e) and "max_completion_tokens" in str(e):
130
+ try:
131
+ params = {
132
+ "model": self.model_name,
133
+ "messages": formatted_messages,
134
+ "temperature": temperature,
135
+ "top_p": top_p,
136
+ "max_completion_tokens": max_tokens,
137
+ "presence_penalty": presence_penalty,
138
+ }
139
+ params.update({k: v for k, v in kwargs.items() if k not in ["top_k", "max_tokens", "max_completion_tokens"]})
140
+ response = self.client.chat.completions.create(**params)
141
+ return response.choices[0].message.content
142
+ except Exception as e2:
143
+ raise RuntimeError(f"Error generating text from GPT service: {e2}")
144
+ raise RuntimeError(f"Error generating text from GPT service: {e}")
145
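The token-parameter selection in `generate()` can be summarized as a small standalone sketch (the `pick_token_param` name is illustrative): models whose names match the o1/gpt-5 family get `max_completion_tokens`, everything else gets `max_tokens`.

```python
def pick_token_param(model_name: str, max_tokens: int) -> dict:
    # Newer reasoning models (o1 family, gpt-5) reject max_tokens and
    # require max_completion_tokens instead; older chat models use max_tokens.
    if any(tag in model_name.lower() for tag in ("o1", "gpt-5", "gpt5")):
        return {"max_completion_tokens": max_tokens}
    return {"max_tokens": max_tokens}

assert pick_token_param("o1-mini", 128) == {"max_completion_tokens": 128}
assert pick_token_param("gpt-4o", 128) == {"max_tokens": 128}
```

The substring check is deliberately loose; the try/except fallback in `generate()` covers models the heuristic misses.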
+
146
+ def stream_generate(
147
+ self,
148
+ messages: List[Union[ChatMessage, Dict[str, str]]],
149
+ temperature: float = 0.7,
150
+ top_p: float = 0.95,
151
+ top_k: int = 20,
152
+ max_tokens: int = 16384,
153
+ presence_penalty: float = 0.0,
154
+ **kwargs
155
+ ):
156
+ """
157
+ Stream generate text from messages
158
+
159
+ Yields:
160
+ Generated text chunks
161
+ """
162
+ formatted_messages = self._format_messages(messages)
163
+
164
+ try:
165
+ params = {
166
+ "model": self.model_name,
167
+ "messages": formatted_messages,
168
+ "temperature": temperature,
169
+ "top_p": top_p,
170
+ "presence_penalty": presence_penalty,
171
+ "stream": True,
172
+ }
173
+
174
+ # Check if model requires max_completion_tokens instead of max_tokens
175
+ if any(model_name in self.model_name.lower() for model_name in ["o1", "gpt-5", "gpt5"]):
176
+ params["max_completion_tokens"] = max_tokens
177
+ else:
178
+ params["max_tokens"] = max_tokens
179
+
180
+ params.update({k: v for k, v in kwargs.items() if k not in ["top_k", "max_tokens", "max_completion_tokens"]})
181
+
182
+ stream = self.client.chat.completions.create(**params)
183
+
184
+ for chunk in stream:
185
+ if chunk.choices[0].delta.content:
186
+ yield chunk.choices[0].delta.content
187
+
188
+ except Exception as e:
189
+ # If max_tokens fails, try max_completion_tokens as fallback
190
+ if "max_tokens" in str(e) and "max_completion_tokens" in str(e):
191
+ try:
192
+ params = {
193
+ "model": self.model_name,
194
+ "messages": formatted_messages,
195
+ "temperature": temperature,
196
+ "top_p": top_p,
197
+ "max_completion_tokens": max_tokens,
198
+ "presence_penalty": presence_penalty,
199
+ "stream": True,
200
+ }
201
+ params.update({k: v for k, v in kwargs.items() if k not in ["top_k", "max_tokens", "max_completion_tokens"]})
202
+ stream = self.client.chat.completions.create(**params)
203
+ for chunk in stream:
204
+ if chunk.choices[0].delta.content:
205
+ yield chunk.choices[0].delta.content
206
+ return
207
+ except Exception as e2:
208
+ raise RuntimeError(f"Error streaming text from GPT service: {e2}")
209
+ raise RuntimeError(f"Error streaming text from GPT service: {e}")
210
+
shared/utils/json_parser.py ADDED
@@ -0,0 +1,428 @@
1
+ """
2
+ Robust JSON parsing utilities for LLM responses
3
+ """
4
+ import json
5
+ import re
6
+ from typing import Any, Dict, List, Optional
7
+
8
+
9
+ def extract_json_from_text(text: str) -> Optional[str]:
10
+ """
11
+ Extract JSON from text by removing markdown code block markers
12
+
13
+ Args:
14
+ text: Text that may contain JSON in markdown code blocks or plain JSON
15
+
16
+ Returns:
17
+ Extracted JSON string or None if not found
18
+ """
19
+ if not text:
20
+ return None
21
+
22
+ text_stripped = text.strip()
23
+
24
+ # Try to parse as plain JSON first (no code blocks)
25
+ try:
26
+ json.loads(text_stripped)
27
+ return text_stripped
28
+ except json.JSONDecodeError:
29
+ pass
30
+
31
+ # Remove markdown code block markers: ```json ... ``` or ``` ... ```
32
+ if text_stripped.startswith('```json'):
33
+ # Remove ```json at start and ``` at end
34
+ if text_stripped.endswith('```'):
35
+ text_stripped = text_stripped[7:-3].strip()
36
+ else:
37
+ # No closing ```, just remove opening
38
+ text_stripped = text_stripped[7:].strip()
39
+ elif text_stripped.startswith('```'):
40
+ # Handle ``` ... ``` (without json label)
41
+ if text_stripped.endswith('```'):
42
+ text_stripped = text_stripped[3:-3].strip()
43
+ else:
44
+ return None
45
+
46
+ # Try to parse as JSON after removing code block markers
47
+ try:
48
+ json.loads(text_stripped)
49
+ return text_stripped
50
+ except json.JSONDecodeError:
51
+ return None
52
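The fence-stripping path above can be exercised in isolation. This is a condensed sketch of the same logic (the `strip_code_fence` name is illustrative): accept plain JSON, a ```json fence, or a bare ``` fence, and validate the result with `json.loads`.

```python
import json
from typing import Optional

def strip_code_fence(text: str) -> Optional[str]:
    """Return the JSON payload of `text`, unwrapping markdown code fences."""
    t = text.strip()
    try:
        json.loads(t)
        return t  # already plain JSON
    except json.JSONDecodeError:
        pass
    if t.startswith('```json'):
        t = t[7:-3].strip() if t.endswith('```') else t[7:].strip()
    elif t.startswith('```') and t.endswith('```'):
        t = t[3:-3].strip()
    else:
        return None
    try:
        json.loads(t)
        return t
    except json.JSONDecodeError:
        return None

assert strip_code_fence('```json\n{"a": 1}\n```') == '{"a": 1}'
assert strip_code_fence('{"a": 1}') == '{"a": 1}'
assert strip_code_fence('not json') is None
```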
+
53
+
54
+ def parse_json_response(text: str, fallback: Any = None) -> Any:
55
+ """
56
+ Parse JSON from LLM response with robust error handling
57
+
58
+ Args:
59
+ text: LLM response text
60
+ fallback: Fallback value if parsing fails
61
+
62
+ Returns:
63
+ Parsed JSON object or fallback
64
+ """
65
+ if not text:
66
+ return fallback
67
+
68
+ # Extract JSON from text
69
+ json_str = extract_json_from_text(text)
70
+
71
+ if json_str is None:
72
+ return fallback
73
+
74
+ try:
75
+ return json.loads(json_str)
76
+ except json.JSONDecodeError as e:
77
+ # Try to fix common JSON issues
78
+ json_str = fix_json_common_issues(json_str)
79
+ try:
80
+ return json.loads(json_str)
81
+ except json.JSONDecodeError:
82
+ return fallback
83
+
84
+
85
+ def fix_json_common_issues(json_str: str) -> str:
86
+ """
87
+ Fix common JSON formatting issues
88
+
89
+ Args:
90
+ json_str: JSON string that may have issues
91
+
92
+ Returns:
93
+ Fixed JSON string
94
+ """
95
+ # Remove trailing commas
96
+ json_str = re.sub(r',\s*}', '}', json_str)
97
+ json_str = re.sub(r',\s*]', ']', json_str)
98
+
99
+ # Fix single quotes to double quotes (basic)
100
+ json_str = re.sub(r"'(\w+)':", r'"\1":', json_str)
101
+
102
+ # Remove comments (basic)
103
+ json_str = re.sub(r'//.*?$', '', json_str, flags=re.MULTILINE)
104
+ json_str = re.sub(r'/\*.*?\*/', '', json_str, flags=re.DOTALL)
105
+
106
+ return json_str
107
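The trailing-comma repair above is the most common fix in practice; a minimal standalone demonstration of just that part (the single-quote and comment fixes are omitted here):

```python
import json
import re

def fix_trailing_commas(s: str) -> str:
    # Remove trailing commas before } or ] so json.loads accepts the string.
    s = re.sub(r',\s*}', '}', s)
    s = re.sub(r',\s*]', ']', s)
    return s

broken = '{"keywords": ["a", "b",], "n": 2,}'
assert json.loads(fix_trailing_commas(broken)) == {"keywords": ["a", "b"], "n": 2}
```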
+
108
+
109
+ def parse_keywords_json(response: str) -> List[str]:
110
+ """
111
+ Parse keywords from JSON response
112
+
113
+ Expected format:
114
+ {"keywords": ["keyword1", "keyword2", ...]}
115
+ or
116
+ ["keyword1", "keyword2", ...]
117
+
118
+ Args:
119
+ response: LLM response text
120
+
121
+ Returns:
122
+ List of at most five keywords, or an empty list if parsing fails
123
+ """
124
+ if response is None:
125
+ return []
126
+
127
+ parsed = parse_json_response(response, fallback=None)
128
+
129
+ if parsed is None:
130
+ return []
131
+
132
+ # Handle dict format: {"keywords": [...]}
133
+ if isinstance(parsed, dict):
134
+ if "keywords" in parsed and isinstance(parsed["keywords"], list):
135
+ return parsed["keywords"][:5]
136
+ return []
137
+
138
+ # Handle list format: ["keyword1", "keyword2", ...]
139
+ if isinstance(parsed, list):
140
+ return parsed[:5]
141
+
142
+ return []
143
+
144
+
145
+ def parse_summary_json(response: str) -> str:
146
+ """
147
+ Parse summary from JSON response
148
+
149
+ Expected format:
150
+ {"summary": "summary text"}
151
+ or
152
+ {"text": "summary text", "summary": "summary text"}
153
+
154
+ Args:
155
+ response: LLM response text
156
+
157
+ Returns:
158
+ Summary text
159
+ """
160
+ parsed = parse_json_response(response, fallback=None)
161
+
162
+ if parsed is None:
163
+ # Fallback to text parsing
164
+ return response.strip()
165
+
166
+ if isinstance(parsed, dict):
167
+ # Try different possible keys
168
+ for key in ["summary", "text", "content", "description"]:
169
+ if key in parsed:
170
+ summary = str(parsed[key]).strip()
171
+ if summary:
172
+ return summary
173
+
174
+ # Fallback to text parsing
175
+ return response.strip()
176
+
177
+
178
+ def parse_review_json(response: str, review_format: str = "detailed") -> Dict[str, Any]:
179
+ """
180
+ Parse review from JSON or markdown response
181
+
182
+ Expected formats:
183
+ - JSON: {"summary": "...", "soundness": 5, ...}
184
+ - Markdown: ## Summary\n\n...\n## Soundness\n\n...
185
+
186
+ Args:
187
+ response: LLM response text (JSON or markdown)
188
+ review_format: Review format type (detailed, summary, structured); currently unused
189
+
190
+ Returns:
191
+ Review dictionary with parsed fields
192
+ """
193
+ # First try to parse as JSON
194
+ parsed = parse_json_response(response, fallback=None)
195
+
196
+ if parsed is not None and isinstance(parsed, dict):
197
+ # JSON format - ensure it has required fields
198
+ if "review" not in parsed:
199
+ parsed["review"] = response.strip()
200
+ return parsed
201
+
202
+ # If not JSON, try to parse as markdown
203
+ if "## " in response or "##" in response:
204
+ markdown_parsed = parse_review_markdown(response)
205
+ if len(markdown_parsed) > 1: # More than just "review" field
206
+ return markdown_parsed
207
+
208
+ # Fallback to text parsing
209
+ return {"review": response.strip()}
210
+
211
+
212
+ def _extract_score(section_content: str) -> Optional[float]:
+     """
+     Extract a numeric score from a review section.
+ 
+     Handles a bare number ("3.5"), a fraction ("3 / 5", "**3 / 5**"),
+     and "score:" / "rating:" prefixes anywhere in the section.
+     """
+     lines = section_content.split('\n')
+     if lines:
+         first_line_clean = re.sub(r'[`\*]', '', lines[0].strip())
+ 
+         # Bare number at the start that is NOT followed by "/" (not a fraction)
+         num_match = re.match(r'^(\d+\.?\d*)(\s*)', first_line_clean)
+         if num_match:
+             remaining = first_line_clean[len(num_match.group(0)):].strip()
+             if not remaining.startswith('/'):
+                 try:
+                     return float(num_match.group(1))
+                 except ValueError:
+                     pass
+ 
+         # Fraction form: extract the number before the slash ("3 / 5" -> 3)
+         if '/' in first_line_clean:
+             fraction_match = re.match(r'^\s*(\d+\.?\d*)\s*/\s*\d+', first_line_clean)
+             if fraction_match:
+                 try:
+                     return float(fraction_match.group(1))
+                 except ValueError:
+                     pass
+ 
+     # Fall back to a number after "score:" or "rating:"
+     score_match = re.search(r'(?:score|rating)\s*[:=]\s*(\d+\.?\d*)', section_content, re.IGNORECASE)
+     if score_match:
+         try:
+             return float(score_match.group(1))
+         except ValueError:
+             pass
+ 
+     return None
+ 
+ 
+ def parse_review_markdown(markdown_text: str) -> Dict[str, Any]:
+     """
+     Parse review from markdown format with sections like:
+     ## Summary
+     ...
+     ## Soundness
+     ...
+     etc.
+ 
+     Args:
+         markdown_text: Markdown formatted review text
+ 
+     Returns:
+         Review dictionary with parsed fields (scores are kept as floats)
+     """
+     review_dict = {"review": markdown_text.strip()}
+ 
+     # Pattern to match markdown sections: ## SectionName\n\ncontent
+     section_pattern = r'##\s*([^\n]+)\s*\n\n(.*?)(?=\n##\s*|$)'
+     matches = re.finditer(section_pattern, markdown_text, re.DOTALL)
+ 
+     # Sections whose content is a numeric score
+     score_sections = ("soundness", "presentation", "contribution", "confidence")
+ 
+     for match in matches:
+         # Normalize section name (case-insensitive)
+         section_name_lower = match.group(1).strip().lower()
+         section_content = match.group(2).strip()
+ 
+         # Map section names to dictionary keys
+         if "summary" in section_name_lower:
+             review_dict["summary"] = section_content
+         elif "strength" in section_name_lower:
+             review_dict["strengths"] = section_content
+         elif "weakness" in section_name_lower:
+             review_dict["weaknesses"] = section_content
+         elif "question" in section_name_lower:
+             review_dict["questions"] = section_content
+         elif "decision" in section_name_lower:
+             review_dict["decision"] = section_content
+         elif "rating" in section_name_lower and "confidence" not in section_name_lower:
+             score_val = _extract_score(section_content)
+             if score_val is not None:
+                 review_dict["rating"] = score_val
+         else:
+             for field in score_sections:
+                 if field in section_name_lower:
+                     score_val = _extract_score(section_content)
+                     if score_val is not None:
+                         review_dict[field] = score_val
+                     break
+ 
+     return review_dict
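The section-splitting regex used in `parse_review_markdown` can be checked in isolation. A minimal demonstration (the `SECTION_RE` and `md` names are illustrative):

```python
import re

# Matches "## Name\n\ncontent", with content running lazily up to the next "##"
SECTION_RE = r'##\s*([^\n]+)\s*\n\n(.*?)(?=\n##\s*|$)'

md = "## Summary\n\nGood paper.\n## Rating\n\n7 / 10"
sections = {m.group(1).strip(): m.group(2).strip()
            for m in re.finditer(SECTION_RE, md, re.DOTALL)}
assert sections["Summary"] == "Good paper."
assert sections["Rating"] == "7 / 10"
```

Because the content group is lazy and anchored by the `(?=\n##\s*|$)` lookahead, multi-paragraph sections are captured whole without swallowing the next heading.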
shared/utils/llm_service.py ADDED
@@ -0,0 +1,64 @@
1
+ """
2
+ Abstract base class for LLM services
3
+ """
4
+ from abc import ABC, abstractmethod
5
+ from typing import List, Dict, Optional, Any, Union
6
+ from pydantic import BaseModel
7
+
8
+
9
+ class ChatMessage(BaseModel):
10
+ """Chat message model"""
11
+ role: str # "system", "user", "assistant"
12
+ content: str
13
+
14
+
15
+ class LLMService(ABC):
16
+ """Abstract base class for LLM services"""
17
+
18
+ @abstractmethod
19
+ def generate(
20
+ self,
21
+ messages: List[Union[ChatMessage, Dict[str, str]]],
22
+ temperature: float = 0.7,
23
+ top_p: float = 0.8,
24
+ top_k: int = 20,
25
+ max_tokens: int = 16384,
26
+ presence_penalty: float = 0.0,
27
+ **kwargs
28
+ ) -> str:
29
+ """
30
+ Generate text from messages
31
+
32
+ Args:
33
+ messages: List of chat messages
34
+ temperature: Sampling temperature
35
+ top_p: Top-p sampling parameter
36
+ top_k: Top-k sampling parameter
37
+ max_tokens: Maximum tokens to generate
38
+ presence_penalty: Presence penalty (0-2)
39
+ **kwargs: Additional parameters
40
+
41
+ Returns:
42
+ Generated text
43
+ """
44
+ pass
45
+
46
+ @abstractmethod
47
+ def stream_generate(
48
+ self,
49
+ messages: List[Union[ChatMessage, Dict[str, str]]],
50
+ temperature: float = 0.7,
51
+ top_p: float = 0.8,
52
+ top_k: int = 20,
53
+ max_tokens: int = 16384,
54
+ presence_penalty: float = 0.0,
55
+ **kwargs
56
+ ):
57
+ """
58
+ Stream generate text from messages
59
+
60
+ Yields:
61
+ Generated text chunks
62
+ """
63
+ pass
64
+
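A concrete subclass only needs to implement the two abstract methods. This toy sketch (a trimmed mirror of the interface; `EchoService` is illustrative and not part of the repo) shows the contract:

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class LLMService(ABC):  # trimmed mirror of the interface above
    @abstractmethod
    def generate(self, messages: List[Dict[str, str]], **kwargs) -> str: ...

    @abstractmethod
    def stream_generate(self, messages: List[Dict[str, str]], **kwargs): ...

class EchoService(LLMService):
    """Toy implementation: returns the last user message verbatim."""
    def generate(self, messages, **kwargs) -> str:
        return messages[-1]["content"]

    def stream_generate(self, messages, **kwargs):
        yield messages[-1]["content"]

svc = EchoService()
assert svc.generate([{"role": "user", "content": "hi"}]) == "hi"
assert list(svc.stream_generate([{"role": "user", "content": "hi"}])) == ["hi"]
```

`VLLMService` and `GPTService` in this package follow the same pattern against real backends.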
shared/utils/llm_service_factory.py ADDED
@@ -0,0 +1,191 @@
1
+ """
2
+ Factory for creating LLM services from configuration
3
+ """
4
+ import os
5
+ import yaml
6
+ from pathlib import Path
7
+ from typing import Optional, Dict, Any
8
+ from .llm_service import LLMService
9
+ from .vllm_service import VLLMService
10
+ from .gpt_service import GPTService
11
+
12
+
13
+ class LLMServiceFactory:
14
+ """Factory for creating LLM services from configuration"""
15
+
16
+ def __init__(self, config_file: Optional[str] = None):
17
+ """
18
+ Initialize factory with configuration
19
+
20
+ Args:
21
+ config_file: Path to vLLM service config YAML file
22
+ """
23
+ if config_file is None:
24
+ project_root = Path(__file__).parent.parent.parent
25
+ config_file = project_root / "shared" / "configs" / "llm_service_config.yaml"
26
+
27
+ self.config_file = Path(config_file)
28
+ self._config = None
29
+ self._load_config()
30
+
31
+ def _load_config(self):
32
+ """Load configuration from YAML file"""
33
+ if not self.config_file.exists():
34
+ raise FileNotFoundError(f"Config file not found: {self.config_file}")
35
+
36
+ with open(self.config_file, 'r', encoding='utf-8') as f:
37
+ self._config = yaml.safe_load(f)
38
+
39
+ def create_vllm_service(self, **override_params) -> VLLMService:
40
+ """
41
+ Create vLLM service from configuration
42
+
43
+ Args:
44
+ **override_params: Parameters to override config values
45
+
46
+ Returns:
47
+ VLLMService instance
48
+ """
49
+ vllm_config = self._config.get("vllm", {})
50
+
51
+ # Merge config with overrides
52
+ params = {
53
+ "base_url": vllm_config.get("base_url", "http://localhost:8000/v1"),
54
+ "api_key": vllm_config.get("api_key", "dummy-key"),
55
+ "model_name": vllm_config.get("model_name", "Qwen/Qwen3-4B-Instruct-2507"),
56
+ "timeout": vllm_config.get("timeout", 300),
57
+ }
58
+ params.update(override_params)
59
+
60
+ return VLLMService(**params)
61
+
62
+ def create_gpt_service(self, **override_params) -> GPTService:
63
+ """
64
+ Create GPT service from configuration
65
+
66
+ Args:
67
+ **override_params: Parameters to override config values
68
+
69
+ Returns:
70
+ GPTService instance
71
+ """
72
+ gpt_config = self._config.get("gpt", {})
73
+
74
+ # if not gpt_config.get("enabled", False):
75
+ # raise ValueError("GPT service is not enabled in configuration")
76
+
77
+ # Merge config with overrides
78
+ params = {
79
+ "api_key": gpt_config.get("api_key") or os.environ.get("OPENAI_API_KEY"),
80
+ "model_name": gpt_config.get("model_name", "gpt-4o"),
81
+ "base_url": gpt_config.get("base_url"),
82
+ "timeout": gpt_config.get("timeout", 300),
83
+ }
84
+ params.update(override_params)
85
+
86
+ return GPTService(**params)
87
+
88
+ def create_service(self, service_type: str = "vllm", **override_params) -> LLMService:
89
+ """
90
+ Create LLM service by type
91
+
92
+ Args:
93
+ service_type: Service type ("vllm" or "gpt")
94
+ **override_params: Parameters to override config values
95
+
96
+ Returns:
97
+ LLMService instance
98
+ """
99
+ if service_type == "vllm":
100
+ return self.create_vllm_service(**override_params)
101
+ elif service_type == "gpt":
102
+ return self.create_gpt_service(**override_params)
103
+ else:
104
+ raise ValueError(f"Unknown service type: {service_type}")
105
+
106
+ def get_llm_assignment(self, component: str) -> str:
107
+ """
108
+ Get LLM service assignment for a component
109
+
110
+ Args:
111
+ component: Component name ("keyword_generator", "paper_summarizer", "reviewer", "refiner")
112
+
113
+ Returns:
114
+ Service type ("vllm" or "gpt")
115
+
116
+ Raises:
117
+ KeyError: If component is not found and no fallback is available
118
+ """
119
+ assignments = self._config.get("llm_assignments", {})
120
+ if component in assignments:
121
+ return assignments[component]
122
+
123
+ # Fallback: if refiner not configured, use reviewer's assignment
124
+ if component == "refiner" and "reviewer" in assignments:
125
+ return assignments["reviewer"]
126
+
127
+ # Default fallback to vllm (may cause connection errors if vllm is not running)
+ return "vllm"
129
+
130
+ def create_service_for_component(self, component: str, **override_params) -> LLMService:
131
+ """
132
+ Create LLM service for a specific component based on configuration
133
+
134
+ Args:
135
+ component: Component name ("keyword_generator", "paper_summarizer", "reviewer")
136
+ **override_params: Parameters to override config values
137
+
138
+ Returns:
139
+ LLMService instance
140
+ """
141
+ service_type = self.get_llm_assignment(component)
142
+ return self.create_service(service_type, **override_params)
143
+
144
+
145
+ # Global factory instance
146
+ _factory: Optional[LLMServiceFactory] = None
147
+
148
+
149
+ def load_api_key_from_config(config_path: str) -> Optional[str]:
150
+ """
151
+ Load API key from a YAML config file.
152
+
153
+ Args:
154
+ config_path: Path to YAML config file
155
+
156
+ Returns:
157
+ API key string, or None if not found
158
+
159
+ Note:
160
+ Returns None (instead of raising) if file doesn't exist or key not found,
161
+ to allow graceful fallback to environment variables.
162
+ """
+
165
+ config_file = Path(config_path)
166
+ if not config_file.exists():
167
+ return None
168
+
169
+ try:
170
+ with open(config_file, 'r', encoding='utf-8') as f:
171
+ config = yaml.safe_load(f)
172
+ return config.get('api_key')
173
+ except Exception:
174
+ return None
175
+
176
+
177
+ def get_llm_service_factory(config_file: Optional[str] = None) -> LLMServiceFactory:
178
+ """
179
+ Get or create global LLM service factory
180
+
181
+ Args:
182
+ config_file: Optional path to config file
183
+
184
+ Returns:
185
+ LLMServiceFactory instance
186
+ """
187
+ global _factory
188
+ if _factory is None or config_file is not None:
189
+ _factory = LLMServiceFactory(config_file)
190
+ return _factory
191
+
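The override mechanism in `create_vllm_service` is a plain dict merge: config defaults are assembled first, then caller overrides win via `dict.update`. A self-contained sketch with an inline config dict (values are illustrative):

```python
def build_vllm_params(config: dict, **override_params) -> dict:
    """Mirror of the merge in LLMServiceFactory.create_vllm_service."""
    vllm_config = config.get("vllm", {})
    params = {
        "base_url": vllm_config.get("base_url", "http://localhost:8000/v1"),
        "api_key": vllm_config.get("api_key", "dummy-key"),
        "model_name": vllm_config.get("model_name", "Qwen/Qwen3-4B-Instruct-2507"),
        "timeout": vllm_config.get("timeout", 300),
    }
    params.update(override_params)  # caller overrides beat config values
    return params


config = {"vllm": {"base_url": "http://10.0.0.5:8000/v1", "timeout": 120}}
params = build_vllm_params(config, timeout=60)
```

Keys absent from both config and overrides keep their hard-coded defaults, which is why a partial YAML file is enough.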
shared/utils/load_balancer.py ADDED
@@ -0,0 +1,382 @@
1
+ """
2
+ Simple Python Load Balancer
3
+
4
+ A lightweight load balancer for vLLM and Reranker services.
5
+ Uses FastAPI to forward requests to multiple backend services.
6
+ """
7
+ import sys
+ import asyncio
+ import httpx
+ from typing import List, Dict, Optional
14
+ from threading import Lock
15
+ import argparse
16
+
17
+ try:
18
+ from fastapi import FastAPI, Request, HTTPException
19
+ from fastapi.responses import StreamingResponse, Response
20
+ from fastapi.middleware.cors import CORSMiddleware
21
+ import uvicorn
22
+ HAS_FASTAPI = True
23
+ except ImportError:
24
+ HAS_FASTAPI = False
25
+ print("Warning: FastAPI not installed. Install with: pip install fastapi uvicorn httpx")
26
+
27
+
28
+ class SimpleLoadBalancer:
29
+ """Simple load balancer with round-robin and least-connection strategies"""
30
+
31
+ def __init__(
32
+ self,
33
+ backends: List[str],
34
+ strategy: str = "round_robin",
35
+ health_check_interval: float = 10.0,
36
+ ):
37
+ """
38
+ Initialize load balancer
39
+
40
+ Args:
41
+ backends: List of backend URLs (e.g., ["http://localhost:8000", "http://localhost:8001"])
42
+ strategy: Load balancing strategy ("round_robin" or "least_conn")
43
+ health_check_interval: Health check interval in seconds
44
+ """
45
+ self.backends = backends
46
+ self.strategy = strategy
47
+ self.health_check_interval = health_check_interval
48
+
49
+ # Round-robin state
50
+ self.current_index = 0
51
+ self.index_lock = Lock()
52
+
53
+ # Least-connection state
54
+ self.connection_counts: Dict[str, int] = {backend: 0 for backend in backends}
55
+ self.conn_lock = Lock()
56
+
57
+ # Health check state
58
+ self.healthy_backends: Dict[str, bool] = {backend: True for backend in backends}
59
+ self.health_lock = Lock()
60
+
61
+ # HTTP client for forwarding requests
62
+ self.client = httpx.AsyncClient(timeout=300.0)
63
+
64
+ print(f"Load balancer initialized with {len(backends)} backends")
65
+ print(f"Strategy: {strategy}")
66
+ for i, backend in enumerate(backends):
67
+ print(f" [{i+1}] {backend}")
68
+
69
+ def get_backend(self) -> Optional[str]:
70
+ """Get next backend based on strategy"""
71
+ with self.health_lock:
72
+ available_backends = [b for b in self.backends if self.healthy_backends.get(b, True)]
73
+
74
+ if not available_backends:
75
+ # If no healthy backends, try all backends
76
+ available_backends = self.backends
77
+
78
+ if not available_backends:
79
+ return None
80
+
81
+ if self.strategy == "round_robin":
82
+ with self.index_lock:
83
+ backend = available_backends[self.current_index % len(available_backends)]
84
+ self.current_index = (self.current_index + 1) % len(available_backends)
85
+ return backend
86
+
87
+ elif self.strategy == "least_conn":
88
+ with self.conn_lock:
89
+ # Find backend with least connections
90
+ backend = min(available_backends, key=lambda b: self.connection_counts.get(b, 0))
91
+ self.connection_counts[backend] = self.connection_counts.get(backend, 0) + 1
92
+ return backend
93
+
94
+ else:
95
+ # Default to round-robin
96
+ with self.index_lock:
97
+ backend = available_backends[self.current_index % len(available_backends)]
98
+ self.current_index = (self.current_index + 1) % len(available_backends)
99
+ return backend
100
+
101
+ def release_backend(self, backend: str):
102
+ """Release a backend (for least-conn strategy)"""
103
+ if self.strategy == "least_conn":
104
+ with self.conn_lock:
105
+ self.connection_counts[backend] = max(0, self.connection_counts.get(backend, 0) - 1)
106
+
107
+ async def health_check(self, backend: str) -> bool:
108
+ """Check if a backend is healthy"""
109
+ try:
110
+ # For vLLM backends (URLs ending with /v1), use /models endpoint
111
+ # For other backends, try /health first, then root
112
+ if backend.endswith("/v1"):
113
+ # vLLM endpoint: try /models (which becomes /v1/models)
114
+ endpoints = ["/models", "/"]
115
+ else:
116
+ # Other services: try /health, then root
117
+ endpoints = ["/health", "/"]
118
+
119
+ for endpoint in endpoints:
120
+ try:
121
+ response = await self.client.get(f"{backend}{endpoint}", timeout=5.0)
122
+ if response.status_code < 500:
123
+ return True
124
+ except Exception:
125
+ continue
126
+ return False
127
+ except Exception:
128
+ return False
129
+
130
+ async def check_all_backends(self):
131
+ """Check health of all backends"""
132
+ while True:
133
+ for backend in self.backends:
134
+ is_healthy = await self.health_check(backend)
135
+ with self.health_lock:
136
+ self.healthy_backends[backend] = is_healthy
137
+ if not is_healthy:
138
+ print(f"Warning: Backend {backend} is unhealthy")
139
+ await asyncio.sleep(self.health_check_interval)
140
+
141
+ async def forward_request(
142
+ self,
143
+ method: str,
144
+ path: str,
145
+ request: Request,
146
+ backend: Optional[str] = None
147
+ ) -> Response:
148
+ """Forward a request to a backend"""
149
+ if backend is None:
150
+ backend = self.get_backend()
151
+
152
+ if backend is None:
153
+ raise HTTPException(status_code=503, detail="No healthy backends available")
154
+
155
+ try:
156
+ # Get request body
157
+ body = await request.body()
158
+
159
+ # Get query parameters
160
+ query_params = dict(request.query_params)
161
+
162
+ # Get headers (exclude host and connection)
163
+ headers = dict(request.headers)
164
+ headers.pop("host", None)
165
+ headers.pop("connection", None)
166
+ headers.pop("content-length", None)
167
+
168
+ # Forward request
169
+ url = f"{backend}{path}"
+
+ # Let httpx encode the query parameters instead of joining them by hand
+ response = await self.client.request(
+ method=method,
+ url=url,
+ content=body,
+ headers=headers,
+ params=query_params,
+ )
179
+
180
+ # Create response
181
+ return Response(
182
+ content=response.content,
183
+ status_code=response.status_code,
184
+ headers=dict(response.headers),
185
+ )
186
+
187
+ except Exception as e:
+ # Mark backend as unhealthy; the finally block handles the release
+ with self.health_lock:
+ self.healthy_backends[backend] = False
+
+ raise HTTPException(status_code=502, detail=f"Backend error: {str(e)}")
+ finally:
+ self.release_backend(backend)
196
+
197
+ async def forward_streaming_request(
198
+ self,
199
+ method: str,
200
+ path: str,
201
+ request: Request,
202
+ backend: Optional[str] = None
203
+ ):
204
+ """Forward a streaming request to a backend"""
205
+ if backend is None:
206
+ backend = self.get_backend()
207
+
208
+ if backend is None:
209
+ raise HTTPException(status_code=503, detail="No healthy backends available")
210
+
211
+ try:
212
+ # Get request body
213
+ body = await request.body()
214
+
215
+ # Get query parameters
216
+ query_params = dict(request.query_params)
217
+
218
+ # Get headers
219
+ headers = dict(request.headers)
220
+ headers.pop("host", None)
221
+ headers.pop("connection", None)
222
+ headers.pop("content-length", None)
223
+
224
+ # Forward request
225
+ url = f"{backend}{path}"
+
+ # Open the upstream stream without a context manager: the body is consumed
+ # lazily by StreamingResponse after this function returns, so the client
+ # and response must stay open until the generator finishes.
+ client = httpx.AsyncClient(timeout=300.0)
+ upstream = client.build_request(method=method, url=url, content=body, headers=headers, params=query_params)
+ response = await client.send(upstream, stream=True)
+
+ async def generate():
+ try:
+ async for chunk in response.aiter_bytes():
+ yield chunk
+ finally:
+ await response.aclose()
+ await client.aclose()
+
+ return StreamingResponse(
+ generate(),
+ status_code=response.status_code,
+ headers=dict(response.headers),
+ )
245
+
246
+ except Exception as e:
+ # Mark backend as unhealthy; the finally block handles the release
+ with self.health_lock:
+ self.healthy_backends[backend] = False
+
+ raise HTTPException(status_code=502, detail=f"Backend error: {str(e)}")
+ finally:
+ self.release_backend(backend)
255
+
256
+
257
+ def create_load_balancer_app(
258
+ backends: List[str],
259
+ strategy: str = "round_robin",
260
+ health_check_interval: float = 10.0,
261
+ ) -> FastAPI:
262
+ """Create FastAPI app with load balancer"""
263
+ if not HAS_FASTAPI:
264
+ raise RuntimeError("FastAPI not installed. Install with: pip install fastapi uvicorn httpx")
265
+
266
+ app = FastAPI(title="Simple Load Balancer", version="1.0.0")
267
+
268
+ # Add CORS middleware
269
+ app.add_middleware(
270
+ CORSMiddleware,
271
+ allow_origins=["*"],
272
+ allow_credentials=True,
273
+ allow_methods=["*"],
274
+ allow_headers=["*"],
275
+ )
276
+
277
+ # Create load balancer
278
+ lb = SimpleLoadBalancer(backends, strategy, health_check_interval)
279
+
280
+ # Start health check task
281
+ @app.on_event("startup")
282
+ async def start_health_check():
283
+ asyncio.create_task(lb.check_all_backends())
284
+
285
+ # Health check endpoint
286
+ @app.get("/health")
287
+ async def health():
288
+ healthy_count = sum(1 for h in lb.healthy_backends.values() if h)
289
+ return {
290
+ "status": "healthy" if healthy_count > 0 else "unhealthy",
291
+ "healthy_backends": healthy_count,
292
+ "total_backends": len(lb.backends),
293
+ "backends": [
294
+ {
295
+ "url": backend,
296
+ "healthy": lb.healthy_backends.get(backend, False),
297
+ "connections": lb.connection_counts.get(backend, 0) if lb.strategy == "least_conn" else None,
298
+ }
299
+ for backend in lb.backends
300
+ ]
301
+ }
302
+
303
+ # Forward all other requests
304
+ @app.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS"])
305
+ async def forward(request: Request, path: str):
306
+ method = request.method
307
+ # Heuristic: only detects streaming requested via the query string; an
+ # OpenAI-style JSON body with "stream": true falls through to the buffered path
+ is_streaming = "stream" in request.query_params or "stream=true" in str(request.url)
308
+
309
+ if is_streaming:
310
+ return await lb.forward_streaming_request(method, f"/{path}", request)
311
+ else:
312
+ return await lb.forward_request(method, f"/{path}", request)
313
+
314
+ return app
315
+
316
+
317
+ def main():
318
+ """Main entry point for load balancer"""
319
+ parser = argparse.ArgumentParser(description="Simple Python Load Balancer")
320
+ parser.add_argument(
321
+ "--backends",
322
+ type=str,
323
+ nargs="+",
324
+ required=True,
325
+ help="Backend URLs (e.g., http://localhost:8000 http://localhost:8001)"
326
+ )
327
+ parser.add_argument(
328
+ "--host",
329
+ type=str,
330
+ default="0.0.0.0",
331
+ help="Host to bind to (default: 0.0.0.0)"
332
+ )
333
+ parser.add_argument(
334
+ "--port",
335
+ type=int,
336
+ default=8000,
337
+ help="Port to bind to (default: 8000)"
338
+ )
339
+ parser.add_argument(
340
+ "--strategy",
341
+ type=str,
342
+ default="round_robin",
343
+ choices=["round_robin", "least_conn"],
344
+ help="Load balancing strategy (default: round_robin)"
345
+ )
346
+ parser.add_argument(
347
+ "--health-check-interval",
348
+ type=float,
349
+ default=10.0,
350
+ help="Health check interval in seconds (default: 10.0)"
351
+ )
352
+
353
+ args = parser.parse_args()
354
+
355
+ if not HAS_FASTAPI:
356
+ print("Error: FastAPI not installed. Install with: pip install fastapi uvicorn httpx")
357
+ sys.exit(1)
358
+
359
+ # Create app
360
+ app = create_load_balancer_app(
361
+ backends=args.backends,
362
+ strategy=args.strategy,
363
+ health_check_interval=args.health_check_interval,
364
+ )
365
+
366
+ # Run server
367
+ print(f"Starting load balancer on {args.host}:{args.port}")
368
+ print(f"Strategy: {args.strategy}")
369
+ print(f"Backends: {', '.join(args.backends)}")
370
+
371
+ uvicorn.run(
372
+ app,
373
+ host=args.host,
374
+ port=args.port,
375
+ log_level="info"
376
+ )
377
+
378
+
379
+ if __name__ == "__main__":
380
+ main()
381
+
382
+ # python -m shared.utils.load_balancer --backends http://localhost:8000 http://localhost:8001 http://localhost:8002 http://localhost:8003 --port 8004 --strategy round_robin
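The round-robin branch of `get_backend` reduces to a lock-guarded counter; stripped of the health-check state it behaves as sketched below (`RoundRobinPicker` is a hypothetical name for illustration):

```python
from threading import Lock


class RoundRobinPicker:
    """Round-robin core of SimpleLoadBalancer.get_backend, without health state."""

    def __init__(self, backends):
        self.backends = backends
        self.current_index = 0
        self.index_lock = Lock()

    def get_backend(self):
        with self.index_lock:
            # Pick the next backend, then advance the shared counter
            backend = self.backends[self.current_index % len(self.backends)]
            self.current_index = (self.current_index + 1) % len(self.backends)
            return backend


picker = RoundRobinPicker(["http://localhost:8000", "http://localhost:8001"])
order = [picker.get_backend() for _ in range(4)]
```

The lock makes selection safe under concurrent requests; fairness only degrades when the healthy-backend set changes size between calls, since the index is taken modulo the current list length.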
shared/utils/mock_llm_service.py ADDED
@@ -0,0 +1,280 @@
1
+ """
2
+ Mock LLM Service that returns pre-generated reviews from a JSON file
3
+ This is a hack for testing the refiner pipeline with existing reviews
4
+ """
5
+ import json
6
+ import re
7
+ from pathlib import Path
8
+ from typing import List, Dict, Optional, Any, Union
9
+
10
+ from .llm_service import LLMService, ChatMessage
11
+
12
+
13
+ def extract_title_from_latex(paper_context: str) -> Optional[str]:
14
+ """Extract title from LaTeX format \\title{...}"""
15
+ match = re.search(r'\\title\{([^}]+)\}', paper_context)
16
+ if match:
17
+ return match.group(1).strip()
18
+ return None
19
+
20
+
21
+ def extract_abstract_from_latex(paper_context: str) -> Optional[str]:
22
+ """Extract abstract from LaTeX format \\begin{abstract}...\\end{abstract}"""
23
+ match = re.search(r'\\begin\{abstract\}(.*?)\\end\{abstract\}', paper_context, re.DOTALL)
24
+ if match:
25
+ abstract = match.group(1).strip()
26
+ # Clean up LaTeX commands
27
+ abstract = re.sub(r'\\[a-zA-Z]+\{([^}]+)\}', r'\1', abstract) # Remove LaTeX commands
28
+ abstract = re.sub(r'\$([^$]+)\$', r'\1', abstract) # Remove math mode
29
+ return abstract
30
+ return None
31
+
32
+
33
+ class MockLLMService(LLMService):
34
+ """
35
+ Mock LLM Service that returns pre-generated reviews from a JSON file
36
+
37
+ This service matches papers by extracting title and abstract from paper_context
38
+ and returns the corresponding pred_fast_mode_baseline from the JSON file.
39
+ """
40
+
41
+ def __init__(self, json_file_path: str):
42
+ """
43
+ Initialize Mock LLM Service
44
+
45
+ Args:
46
+ json_file_path: Path to JSON file containing pre-generated reviews
47
+ """
48
+ self.json_file_path = Path(json_file_path)
49
+ if not self.json_file_path.exists():
50
+ raise FileNotFoundError(f"JSON file not found: {json_file_path}")
51
+
52
+ # Load JSON data
53
+ with open(self.json_file_path, 'r', encoding='utf-8') as f:
54
+ self.data = json.load(f)
55
+
56
+ # Build index for faster lookup
57
+ self._build_index()
58
+
59
+ def _build_index(self):
60
+ """Build index mapping (title, abstract) to review"""
61
+ self.index = {}
62
+ self.entries = [] # Store full entries for fallback matching
63
+ self.initial_scores_index = {} # Store initial scores and decision for each entry
64
+
65
+ for entry in self.data:
66
+ paper_context = entry.get('paper_context', '')
67
+ title = extract_title_from_latex(paper_context)
68
+ abstract = extract_abstract_from_latex(paper_context)
69
+
70
+ # Extract review content (prefer meta_review.content, fallback to raw_text)
71
+ model_prediction = entry.get('model_prediction', {})
72
+ meta_review = model_prediction.get('meta_review', {})
73
+ review_content = meta_review.get('content', '') or model_prediction.get('raw_text', '')
74
+
75
+ # Extract initial scores and decision
76
+ initial_scores = {
77
+ 'rating': meta_review.get('rating'),
78
+ 'soundness': meta_review.get('soundness'),
79
+ 'presentation': meta_review.get('presentation'),
80
+ 'contribution': meta_review.get('contribution'),
81
+ 'decision': model_prediction.get('decision'),
82
+ }
83
+
84
+ if title and abstract:
85
+ # Use normalized title and first 200 chars of abstract as key
86
+ normalized_title = title.lower().strip()
87
+ normalized_abstract = abstract[:200].lower().strip()
88
+ key = (normalized_title, normalized_abstract)
89
+ self.index[key] = review_content
90
+ self.initial_scores_index[key] = initial_scores
91
+
92
+
93
+ # Store entry for fallback matching
94
+ self.entries.append({
95
+ 'title': title,
96
+ 'abstract': abstract,
97
+ 'paper_context': paper_context,
98
+ 'review': review_content,
99
+ 'id': entry.get('id', ''),
100
+ 'initial_scores': initial_scores,
101
+ })
102
+
103
+ def _find_entry(self, messages: List[Union[ChatMessage, Dict[str, str]]]) -> Optional[Dict[str, Any]]:
104
+ """
105
+ Find entry by matching title and abstract from messages
106
+
107
+ Args:
108
+ messages: List of chat messages
109
+
110
+ Returns:
111
+ Entry dict with 'review' and 'initial_scores' or None if not found
112
+ """
113
+ # Extract paper context from user message
114
+ user_message = None
115
+ for msg in messages:
116
+ if isinstance(msg, dict):
117
+ if msg.get('role') == 'user':
118
+ user_message = msg.get('content', '')
119
+ elif isinstance(msg, ChatMessage):
120
+ if msg.role == 'user':
121
+ user_message = msg.content
122
+
123
+ if not user_message:
124
+ return None
125
+
126
+ # Try to extract title and abstract from user message
127
+ # Look for patterns like "Title: ..." or "Abstract: ..."
128
+ title_match = re.search(r'Title:\s*(.+?)(?:\n|$)', user_message, re.IGNORECASE)
129
+ abstract_match = re.search(r'Abstract:\s*(.+?)(?:\n\n|Content:|$)', user_message, re.DOTALL | re.IGNORECASE)
130
+
131
+ extracted_title = None
132
+ extracted_abstract = None
133
+
134
+ if title_match and abstract_match:
135
+ extracted_title = title_match.group(1).strip()
136
+ extracted_abstract = abstract_match.group(1).strip()
137
+ else:
138
+ # Fallback: search in paper_context if available
139
+ paper_context_match = re.search(r'Paper to review:\s*(.+?)(?:Please provide|$)', user_message, re.DOTALL)
140
+ if paper_context_match:
141
+ paper_context = paper_context_match.group(1)
142
+ extracted_title = extract_title_from_latex(paper_context)
143
+ extracted_abstract = extract_abstract_from_latex(paper_context)
144
+
145
+ if extracted_title and extracted_abstract:
146
+ # Normalize for matching
147
+ normalized_title = extracted_title.lower().strip()
148
+ normalized_abstract = extracted_abstract[:200].lower().strip()
149
+
150
+ # Try exact match first
151
+ key = (normalized_title, normalized_abstract)
152
+ if key in self.index:
153
+ return {
154
+ 'review': self.index[key],
155
+ 'initial_scores': self.initial_scores_index.get(key, {})
156
+ }
157
+
158
+ # Try fuzzy match (check if title matches)
159
+ for (index_title, index_abstract), review in self.index.items():
160
+ # Check title similarity (either contains or is contained)
161
+ title_similar = (
162
+ normalized_title in index_title or
163
+ index_title in normalized_title or
164
+ normalized_title == index_title
165
+ )
166
+
167
+ # Check abstract similarity (first 100 chars)
168
+ abstract_similar = (
169
+ normalized_abstract[:100] in index_abstract[:100] or
170
+ index_abstract[:100] in normalized_abstract[:100] or
171
+ normalized_abstract[:100] == index_abstract[:100]
172
+ )
173
+
174
+ if title_similar and abstract_similar:
175
+ return {
176
+ 'review': review,
177
+ 'initial_scores': self.initial_scores_index.get((index_title, index_abstract), {})
178
+ }
179
+
180
+ # Final fallback: try to match by paper_context in entries
181
+ for entry in self.entries:
182
+ if entry['paper_context']:
183
+ # Check if user message contains similar content
184
+ entry_title = entry['title']
185
+ if entry_title and extracted_title:
186
+ if entry_title.lower().strip() in extracted_title.lower() or extracted_title.lower() in entry_title.lower():
187
+ return {
188
+ 'review': entry['review'],
189
+ 'initial_scores': entry.get('initial_scores', {})
190
+ }
191
+
192
+ return None
193
+
194
+ def _find_review(self, messages: List[Union[ChatMessage, Dict[str, str]]]) -> Optional[str]:
195
+ """
196
+ Find review by matching title and abstract from messages
197
+
198
+ Args:
199
+ messages: List of chat messages
200
+
201
+ Returns:
202
+ Review text or None if not found
203
+ """
204
+ entry = self._find_entry(messages)
205
+ if entry:
206
+ return entry['review']
207
+ return None
208
+
209
+ def get_initial_scores(self, messages: List[Union[ChatMessage, Dict[str, str]]]) -> Optional[Dict[str, Any]]:
210
+ """
211
+ Get initial scores and decision by matching title and abstract from messages
212
+
213
+ Args:
214
+ messages: List of chat messages
215
+
216
+ Returns:
217
+ Dict with initial scores (rating, soundness, presentation, contribution, decision) or None if not found
218
+ """
219
+ entry = self._find_entry(messages)
220
+ if entry:
221
+ return entry.get('initial_scores', {})
222
+ return None
223
+
224
+ def generate(
225
+ self,
226
+ messages: List[Union[ChatMessage, Dict[str, str]]],
227
+ temperature: float = 0.7,
228
+ top_p: float = 0.8,
229
+ top_k: int = 20,
230
+ max_tokens: int = 16384,
231
+ presence_penalty: float = 0.0,
232
+ **kwargs
233
+ ) -> str:
234
+ """
235
+ Generate text from messages (returns pre-generated review)
236
+
237
+ Args:
238
+ messages: List of chat messages
239
+ temperature: Ignored (for compatibility)
240
+ top_p: Ignored (for compatibility)
241
+ top_k: Ignored (for compatibility)
242
+ max_tokens: Ignored (for compatibility)
243
+ presence_penalty: Ignored (for compatibility)
244
+ **kwargs: Additional parameters (ignored)
245
+
246
+ Returns:
247
+ Pre-generated review text
248
+ """
249
+ review = self._find_review(messages)
250
+ if review:
251
+ return review
252
+
253
+ # Fallback: return a default message
254
+ return "## Summary:\n\nReview not found in pre-generated data."
255
+
256
+ def stream_generate(
257
+ self,
258
+ messages: List[Union[ChatMessage, Dict[str, str]]],
259
+ temperature: float = 0.7,
260
+ top_p: float = 0.8,
261
+ top_k: int = 20,
262
+ max_tokens: int = 16384,
263
+ presence_penalty: float = 0.0,
264
+ **kwargs
265
+ ):
266
+ """
267
+ Stream generate text from messages (yields pre-generated review)
268
+
269
+ Yields:
270
+ Pre-generated review text chunks
271
+ """
272
+ review = self._find_review(messages)
273
+ if review:
274
+ # Yield in chunks to simulate streaming
275
+ chunk_size = 100
276
+ for i in range(0, len(review), chunk_size):
277
+ yield review[i:i + chunk_size]
278
+ else:
279
+ yield "## Summary:\n\nReview not found in pre-generated data."
280
+
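The two LaTeX helpers at the top of `mock_llm_service.py` can be smoke-tested in isolation (same regexes as in the file; note that the `[^}]+` pattern stops at the first closing brace, so titles containing nested braces are truncated):

```python
import re
from typing import Optional


def extract_title_from_latex(paper_context: str) -> Optional[str]:
    match = re.search(r'\\title\{([^}]+)\}', paper_context)
    return match.group(1).strip() if match else None


def extract_abstract_from_latex(paper_context: str) -> Optional[str]:
    match = re.search(r'\\begin\{abstract\}(.*?)\\end\{abstract\}', paper_context, re.DOTALL)
    if not match:
        return None
    abstract = match.group(1).strip()
    abstract = re.sub(r'\\[a-zA-Z]+\{([^}]+)\}', r'\1', abstract)  # strip commands, keep arguments
    abstract = re.sub(r'\$([^$]+)\$', r'\1', abstract)             # unwrap inline math
    return abstract


paper = r"\title{A Study} \begin{abstract}We \emph{show} that $x$ matters.\end{abstract}"
title = extract_title_from_latex(paper)
abstract = extract_abstract_from_latex(paper)
```

This is the normalization that feeds the (title, abstract-prefix) index keys, so any cleanup change here must be mirrored on both the index-build and lookup sides.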
shared/utils/prompt_loader.py ADDED
@@ -0,0 +1,220 @@
1
+ """
2
+ Utility for loading prompts from YAML configuration files
3
+ """
4
+ import yaml
5
+ from pathlib import Path
6
+ from typing import Dict, Any, Optional
7
+
8
+
9
+ class PromptLoader:
10
+ """Load and manage prompts from YAML files"""
11
+
12
+ def __init__(self, prompts_file: Optional[str] = None):
13
+ """
14
+ Initialize prompt loader
15
+
16
+ Args:
17
+ prompts_file: Path to prompts YAML file. If None, uses default location.
18
+ """
19
+ if prompts_file is None:
20
+ # Default to shared/configs/prompts.yaml relative to project root
21
+ project_root = Path(__file__).parent.parent.parent
22
+ prompts_file = project_root / "shared" / "configs" / "prompts.yaml"
23
+
24
+ self.prompts_file = Path(prompts_file)
25
+ self._prompts = None
26
+ self._load_prompts()
27
+
28
+ def _load_prompts(self):
29
+ """Load prompts from YAML file"""
30
+ if not self.prompts_file.exists():
31
+ raise FileNotFoundError(f"Prompts file not found: {self.prompts_file}")
32
+
33
+ with open(self.prompts_file, 'r', encoding='utf-8') as f:
34
+ self._prompts = yaml.safe_load(f)
35
+
36
+ def get_keyword_generation_prompt(self, context: str) -> str:
37
+ """
38
+ Get keyword generation prompt with context filled in
39
+
40
+ Args:
41
+ context: Paper information context
42
+
43
+ Returns:
44
+ Formatted prompt string
45
+ """
46
+ template = self._prompts["keyword_generation"]["user"]
47
+ return template.format(context=context)
48
+
49
+ def get_keyword_generation_system(self) -> str:
50
+ """Get keyword generation system message"""
51
+ return self._prompts["keyword_generation"].get("system", "")
52
+
53
+ def get_paper_summarization_prompt(self, reference_paper: str, related_paper: str) -> str:
54
+ """
55
+ Get paper summarization prompt with reference_paper and related_paper filled in
56
+
57
+ Args:
58
+ reference_paper: Reference paper information (the paper being reviewed)
59
+ related_paper: Related paper information
60
+
61
+ Returns:
62
+ Formatted prompt string
63
+ """
64
+ template = self._prompts["paper_summarization"]["user"]
65
+ return template.format(reference_paper=reference_paper, related_paper=related_paper)
66
+
67
+ def get_paper_results_summarization_prompt(self, content: str) -> str:
68
+ """
69
+ Get paper results summarization prompt with content filled in
70
+
71
+ Args:
72
+ content: Paper content (experiment results section)
73
+
74
+ Returns:
75
+ Formatted prompt string
76
+ """
77
+ template = self._prompts["paper_results_summarization"]["user"]
78
+ return template.format(content=content)
79
+
80
+ def get_paper_insight_miner_prompt(self, content: str, candidate_review: str) -> str:
81
+ """
82
+ Get paper insight miner prompt with content and candidate_review filled in
83
+
84
+ Args:
85
+ content: Paper content
86
+ candidate_review: Candidate review draft
87
+
88
+ Returns:
89
+ Formatted prompt string
90
+ """
91
+ template = self._prompts["paper_insight_miner"]["user"]
92
+ # Use replace instead of format to avoid issues with JSON braces in the template
93
+ prompt = template.replace("{content}", content)
94
+ prompt = prompt.replace("{candidate_review}", candidate_review)
95
+ return prompt
96
+
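The replace-based substitution above exists because `str.format` treats every brace as a field, so templates containing literal JSON (common in these prompts) raise; a quick illustration with a hypothetical template, not one from prompts.yaml:

```python
template = 'Return JSON like {"score": 5}. Paper: {content}'

# str.format chokes on the literal {"score": 5} braces
try:
    template.format(content="...")
    format_ok = True
except (KeyError, ValueError, IndexError):
    format_ok = False

# Sequential str.replace touches only the named placeholder
prompt = template.replace("{content}", "Attention Is All You Need")
```

The trade-off is that `replace` silently ignores missing placeholders instead of failing loudly, so template typos go unnoticed.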
97
+ def get_paper_results_analyzer_prompt(self, content: str, candidate_review: str) -> str:
98
+ """
99
+ Get paper results analyzer prompt with content and candidate_review filled in
100
+
101
+ Args:
102
+ content: Paper content
103
+ candidate_review: Candidate review draft
104
+
105
+ Returns:
106
+ Formatted prompt string
107
+ """
108
+ template = self._prompts["paper_results_analyzer"]["user"]
109
+ # Use replace instead of format to avoid issues with JSON braces in the template
110
+ prompt = template.replace("{content}", content)
111
+ prompt = prompt.replace("{candidate_review}", candidate_review)
112
+ return prompt
113
+
114
+ def get_review_prompt(self, review_format: str = "detailed") -> str:
115
+ """
116
+ Get review prompt for specified format
117
+
118
+ Args:
119
+ review_format: Review format ("detailed", "summary", "structured")
120
+
121
+ Returns:
122
+ Review prompt string
123
+ """
124
+ if review_format not in self._prompts["review_prompts"]:
125
+ review_format = "detailed"
126
+
127
+ return self._prompts["review_prompts"][review_format]
128
+
129
+ def get_reviewer_system_message(self) -> str:
130
+ """Get system message for reviewer"""
131
+ return self._prompts.get("reviewer_system", "You are an expert academic reviewer with deep knowledge in the field.")
132
+
133
+ def get_refiner_prompt(self, review_format: str = "detailed") -> str:
134
+ """
135
+ Get refiner prompt for specified format
136
+
137
+ Args:
138
+ review_format: Review format ("detailed", "summary", "structured")
139
+
140
+ Returns:
141
+ Refiner prompt string
142
+ """
143
+ if "refiner_prompts" not in self._prompts:
144
+ raise ValueError("refiner_prompts not found in prompts file")
145
+
146
+ if review_format not in self._prompts["refiner_prompts"]:
147
+ review_format = "detailed"
148
+
149
+ return self._prompts["refiner_prompts"][review_format]
150
+
151
+ def get_refiner_system_message(self) -> str:
152
+ """Get system message for refiner"""
153
+ return self._prompts.get("refiner_system", "You are an expert review refiner with deep knowledge in academic review quality standards and meta rubrics.")
154
+
155
+ def get_rubrics_template(self) -> str:
156
+ """
157
+ Get the rubrics template for generating paper-specific rubrics.
158
+
159
+ Returns:
160
+ Rubrics template string (JSON array format)
161
+ """
162
+ return self._prompts.get("rubrics", "")
163
+
164
+ def get_rubric_generation_prompt(self, version: str = "v2") -> str:
165
+ """
166
+ Get rubric generation prompt.
167
+
168
+ Args:
169
+ version: Prompt version ("v1" or "v2", default: "v2")
170
+
171
+ Returns:
172
+ Rubric generation prompt template string
173
+ """
174
+ key = f"{version}_rubric_generation_prompt"
175
+ prompt = self._prompts.get(key, "")
176
+
177
+ # For v2, replace rubric_template placeholder with actual template
178
+ if version == "v2" and "<<rubric_template>>" in prompt:
179
+ rubric_template = self.get_rubrics_template()
180
+ prompt = prompt.replace("<<rubric_template>>", rubric_template)
181
+
182
+ return prompt
183
+
184
+ def get_evaluator_prompt(self, version: str = "v1") -> str:
185
+ """
186
+ Get evaluator prompt for evaluating reviews using rubrics.
187
+
188
+ Args:
189
+ version: Prompt version ("v0" or "v1", default: "v1")
190
+
191
+ Returns:
192
+ Evaluator prompt template string
193
+ """
194
+ key = f"{version}_evaluator_prompt"
195
+ return self._prompts.get(key, "")
196
+
197
+ def reload(self):
198
+ """Reload prompts from file"""
199
+ self._load_prompts()
200
+
201
+
202
+ # Global prompt loader instance
203
+ _prompt_loader: Optional[PromptLoader] = None
204
+
205
+
206
+ def get_prompt_loader(prompts_file: Optional[str] = None) -> PromptLoader:
207
+ """
208
+ Get or create global prompt loader instance
209
+
210
+ Args:
211
+ prompts_file: Optional path to prompts file
212
+
213
+ Returns:
214
+ PromptLoader instance
215
+ """
216
+ global _prompt_loader
217
+ if _prompt_loader is None or prompts_file is not None:
218
+ _prompt_loader = PromptLoader(prompts_file)
219
+ return _prompt_loader
220
+
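The brace-handling comments in `get_paper_insight_miner_prompt` and `get_paper_results_analyzer_prompt` above can be illustrated with a standalone sketch. The template text here is a hypothetical stand-in; the real templates live in `shared/configs/prompts.yaml`.

```python
# Shows why these getters use str.replace rather than str.format:
# a template containing literal JSON braces trips up format().
# (Hypothetical template text, not the real prompts.yaml content.)
template = 'Respond in JSON: {"insights": [...]}\nPaper: {content}'

try:
    template.format(content="some paper text")
except KeyError as e:
    # format() treats {"insights": ...} as a replacement field and fails
    print(f"format() raised KeyError: {e}")

# replace() only touches the exact placeholder string
prompt = template.replace("{content}", "some paper text")
print(prompt)
```

With `replace`, any literal braces elsewhere in the template pass through untouched, at the cost of not validating that the placeholder actually exists.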
shared/utils/reranker.py ADDED
@@ -0,0 +1,275 @@
1
+ """
2
+ Reranker utilities for paper retrieval
3
+ Based on OpenScholar's rerank_paragraphs_bge function
4
+
5
+ Supports two modes:
6
+ 1. Direct mode: Use FlagReranker directly (requires global lock for thread-safety)
7
+ 2. API mode: Use reranker API service with load balancing (recommended for multi-GPU)
8
+ """
9
+ import os
10
+ import threading
11
+ import time
12
+ import requests
13
+ from typing import List, Dict, Any, Optional, Tuple, Union
14
+
15
+ # Suppress transformers progress bars
16
+ os.environ.setdefault('TRANSFORMERS_VERBOSITY', 'error')
17
+
18
+ # Global lock for reranker usage (FlagReranker's tokenizer is not thread-safe)
19
+ # This prevents "Already borrowed" errors when multiple threads use the same reranker
20
+ # NOTE: Not needed when using API mode
21
+ _reranker_usage_lock = threading.Lock()
22
+
23
+ # Try to import endpoint pool for API mode
24
+ try:
25
+ from .reranker_endpoint_pool import RerankerEndpointPool
26
+ HAS_ENDPOINT_POOL = True
27
+ except ImportError:
28
+ HAS_ENDPOINT_POOL = False
29
+ RerankerEndpointPool = None
30
+
31
+
32
+ def rerank_paragraphs_bge(
33
+ query: str,
34
+ paragraphs: List[Dict[str, Any]],
35
+ reranker: Optional[Any] = None,
36
+ reranker_endpoint_pool: Optional[Any] = None,
37
+ norm_cite: bool = False,
38
+ start_index: int = 0,
39
+ use_abstract: bool = False,
40
+ timeout: float = 30.0,
41
+ ) -> Tuple[List[Dict[str, Any]], Dict[int, float], Dict[int, int]]:
42
+ """
43
+ Rerank paragraphs using BGE reranker (from OpenScholar)
44
+
45
+ Supports two modes:
46
+ 1. Direct mode: Pass FlagReranker instance (uses global lock, thread-safe but serialized)
47
+ 2. API mode: Pass RerankerEndpointPool (recommended for multi-GPU, parallel requests)
48
+
49
+ Args:
50
+ query: Search query
51
+ paragraphs: List of paragraph/paper dictionaries
52
+ reranker: FlagReranker instance (for direct mode, optional if using API mode)
53
+ reranker_endpoint_pool: RerankerEndpointPool instance (for API mode, optional if using direct mode)
54
+ norm_cite: Whether to normalize citation counts and add to scores
55
+ start_index: Starting index for id mapping
56
+ use_abstract: Whether to include abstract in reranking text
57
+ timeout: Request timeout for API mode (seconds)
58
+
59
+ Returns:
60
+ Tuple of:
61
+ - reranked_paragraphs: List of reranked paragraphs
62
+ - result_dict: Dictionary mapping original index to score
63
+ - id_mapping: Dictionary mapping new index to original index
64
+ """
65
+ # Filter out paragraphs without text
66
+ paragraphs = [p for p in paragraphs if p.get("text") is not None]
67
+
68
+ if not paragraphs:
69
+ return [], {}, {}
70
+
71
+ # Build paragraph texts for reranking
72
+ if use_abstract:
73
+ paragraph_texts = [
74
+ p["title"] + "\n" + p["abstract"] + "\n" + p["text"]
75
+ if "title" in p and "abstract" in p and p.get("title") and p.get("abstract")
76
+ else p["text"]
77
+ for p in paragraphs
78
+ ]
79
+ else:
80
+ paragraph_texts = [
81
+ p["title"] + " " + p["text"]
82
+ if "title" in p and p.get("title") is not None
83
+ else p["text"]
84
+ for p in paragraphs
85
+ ]
86
+
87
+ # Filter out empty or None texts
88
+ valid_indices = []
89
+ valid_texts = []
90
+ for i, text in enumerate(paragraph_texts):
91
+ if text and isinstance(text, str) and text.strip():
92
+ valid_indices.append(i)
93
+ valid_texts.append(text)
94
+
95
+ # If no valid texts, return empty results
96
+ if not valid_texts:
97
+ return [], {}, {}
98
+
99
+ # If some texts were filtered out, update paragraphs list
100
+ if len(valid_indices) < len(paragraphs):
101
+ paragraphs = [paragraphs[i] for i in valid_indices]
102
+ paragraph_texts = valid_texts
103
+
104
+ # Compute reranking scores
105
+ if reranker is None and reranker_endpoint_pool is None:
106
+ # If no reranker, return original order
107
+ id_mapping = {i: i + start_index for i in range(len(paragraphs))}
108
+ result_dict = {i: 0.0 for i in range(len(paragraphs))}
109
+ return paragraphs, result_dict, id_mapping
110
+
111
+ # API mode: Use reranker API service (recommended for multi-GPU)
112
+ if reranker_endpoint_pool is not None:
113
+ return _rerank_via_api(
114
+ query=query,
115
+ paragraph_texts=paragraph_texts,
116
+ paragraphs=paragraphs,
117
+ reranker_endpoint_pool=reranker_endpoint_pool,
118
+ norm_cite=norm_cite,
119
+ start_index=start_index,
120
+ timeout=timeout
121
+ )
122
+
123
+ # Direct mode: Use FlagReranker directly (requires global lock)
124
+ # Suppress transformers warnings and progress bars during computation
125
+ original_verbosity = os.environ.get('TRANSFORMERS_VERBOSITY', '')
126
+ os.environ['TRANSFORMERS_VERBOSITY'] = 'error'
127
+
128
+ # Use lock to prevent "Already borrowed" errors from Rust tokenizer
129
+ # FlagReranker's tokenizer is not thread-safe, so we need to serialize access
130
+ with _reranker_usage_lock:
131
+ try:
132
+ # Ensure we have at least one valid text before calling compute_score
133
+ if not paragraph_texts:
134
+ return [], {}, {}
135
+ scores = reranker.compute_score([[query, p] for p in paragraph_texts], batch_size=100)
136
+ finally:
137
+ # Restore original verbosity
138
+ if original_verbosity:
139
+ os.environ['TRANSFORMERS_VERBOSITY'] = original_verbosity
140
+ elif 'TRANSFORMERS_VERBOSITY' in os.environ:
141
+ del os.environ['TRANSFORMERS_VERBOSITY']
142
+
143
+ # Handle score format (can be float or list)
144
+ if isinstance(scores, float):
145
+ result_dict = {0: scores}
146
+ else:
147
+ result_dict = {p_id: score for p_id, score in enumerate(scores)}
148
+
149
+ # Add normalized citation counts if enabled
150
+ if norm_cite:
151
+ citation_items = [
152
+ item["citation_counts"]
153
+ for item in paragraphs
154
+ if "citation_counts" in item and item["citation_counts"] is not None
155
+ ]
156
+ if len(citation_items) > 0:
157
+ max_citations = max(citation_items)
158
+ for p_id in result_dict:
159
+ if (
160
+ "citation_counts" in paragraphs[p_id]
161
+ and paragraphs[p_id]["citation_counts"] is not None
162
+ ):
163
+ result_dict[p_id] = result_dict[p_id] + (
164
+ paragraphs[p_id]["citation_counts"] / max_citations
165
+ )
166
+
167
+ # Sort by score
168
+ p_ids = sorted(result_dict.items(), key=lambda x: x[1], reverse=True)
169
+
170
+ # Build reranked list and id mapping
171
+ new_orders = []
172
+ id_mapping = {}
173
+ for i, (p_id, _) in enumerate(p_ids):
174
+ new_orders.append(paragraphs[p_id])
175
+ id_mapping[i] = int(p_id) + start_index
176
+
177
+ return new_orders, result_dict, id_mapping
178
+
179
+
180
+ def _rerank_via_api(
181
+ query: str,
182
+ paragraph_texts: List[str],
183
+ paragraphs: List[Dict[str, Any]],
184
+ reranker_endpoint_pool: Any,
185
+ norm_cite: bool = False,
186
+ start_index: int = 0,
187
+ timeout: float = 30.0,
188
+ ) -> Tuple[List[Dict[str, Any]], Dict[int, float], Dict[int, int]]:
189
+ """
190
+ Rerank paragraphs via API service (supports load balancing across multiple GPUs)
191
+
192
+ Args:
193
+ query: Search query
194
+ paragraph_texts: List of paragraph texts (already formatted)
195
+ paragraphs: List of paragraph dictionaries
196
+ reranker_endpoint_pool: RerankerEndpointPool instance
197
+ norm_cite: Whether to normalize citation counts
198
+ start_index: Starting index for id mapping
199
+ timeout: Request timeout
200
+
201
+ Returns:
202
+ Tuple of reranked paragraphs, result dict, and id mapping
203
+ """
204
+ if not paragraph_texts:
205
+ return [], {}, {}
206
+
207
+ # Get endpoint from pool (round-robin load balancing)
208
+ endpoint = reranker_endpoint_pool.get_endpoint()
209
+ api_url = f"{endpoint}/rerank"
210
+
211
+ # Prepare request
212
+ request_data = {
213
+ "query": query,
214
+ "paragraphs": paragraph_texts,
215
+ "batch_size": 100
216
+ }
217
+
218
+ start_time = time.time()
219
+ try:
220
+ # Make API request
221
+ response = requests.post(
222
+ api_url,
223
+ json=request_data,
224
+ timeout=timeout
225
+ )
226
+ response.raise_for_status()
227
+
228
+ result = response.json()
229
+ scores = result.get("scores", [])
230
+ response_time = time.time() - start_time
231
+
232
+ # Mark success
233
+ reranker_endpoint_pool.mark_success(endpoint, response_time)
234
+
235
+ except requests.exceptions.RequestException as e:
236
+ # Mark error
237
+ reranker_endpoint_pool.mark_error(endpoint, str(e))
238
+ raise RuntimeError(f"Reranker API request failed: {e}")
239
+
240
+ # Handle score format (should be list from API)
241
+ if isinstance(scores, float):
242
+ result_dict = {0: scores}
243
+ else:
244
+ result_dict = {p_id: score for p_id, score in enumerate(scores)}
245
+
246
+ # Add normalized citation counts if enabled
247
+ if norm_cite:
248
+ citation_items = [
249
+ item["citation_counts"]
250
+ for item in paragraphs
251
+ if "citation_counts" in item and item["citation_counts"] is not None
252
+ ]
253
+ if len(citation_items) > 0:
254
+ max_citations = max(citation_items)
255
+ for p_id in result_dict:
256
+ if (
257
+ "citation_counts" in paragraphs[p_id]
258
+ and paragraphs[p_id]["citation_counts"] is not None
259
+ ):
260
+ result_dict[p_id] = result_dict[p_id] + (
261
+ paragraphs[p_id]["citation_counts"] / max_citations
262
+ )
263
+
264
+ # Sort by score
265
+ p_ids = sorted(result_dict.items(), key=lambda x: x[1], reverse=True)
266
+
267
+ # Build reranked list and id mapping
268
+ new_orders = []
269
+ id_mapping = {}
270
+ for i, (p_id, _) in enumerate(p_ids):
271
+ new_orders.append(paragraphs[p_id])
272
+ id_mapping[i] = int(p_id) + start_index
273
+
274
+ return new_orders, result_dict, id_mapping
275
+
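The sort-and-remap step shared by the direct and API paths above can be sketched standalone. The paragraphs and scores below are made up; the real scores come from the reranker.

```python
# Standalone sketch of the final sort-and-remap step in rerank_paragraphs_bge.
# Paragraphs and scores are illustrative stand-ins.
paragraphs = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
result_dict = {0: 0.2, 1: 0.9, 2: 0.5}
start_index = 0

# Sort original indices by score, descending
p_ids = sorted(result_dict.items(), key=lambda x: x[1], reverse=True)

# Rebuild the paragraph list and map new positions to original indices
new_orders = [paragraphs[p_id] for p_id, _ in p_ids]
id_mapping = {i: int(p_id) + start_index for i, (p_id, _) in enumerate(p_ids)}

print(new_orders)   # highest-scoring paragraph first
print(id_mapping)   # new position -> original index
```

`id_mapping` is what lets callers trace a reranked position back to the paper's index in the original retrieval list.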
shared/utils/reranker_api_service.py ADDED
@@ -0,0 +1,221 @@
1
+ """
2
+ Reranker API Service
3
+
4
+ Wraps FlagReranker in an HTTP API service, with support for multi-GPU load balancing.
5
+ """
6
+ import os
7
+ import sys
8
+ from pathlib import Path
9
+ from typing import List, Dict, Any, Optional
10
+ import argparse
11
+
12
+ # Suppress transformers warnings
13
+ os.environ.setdefault('TRANSFORMERS_VERBOSITY', 'error')
14
+
15
+ try:
16
+ from fastapi import FastAPI, HTTPException
17
+ from fastapi.middleware.cors import CORSMiddleware
18
+ from pydantic import BaseModel
19
+ import uvicorn
20
+ HAS_FASTAPI = True
21
+ except ImportError:
22
+ HAS_FASTAPI = False
23
+ print("Warning: FastAPI not installed. Install with: pip install fastapi uvicorn")
24
+
25
+ try:
26
+ from FlagEmbedding import FlagReranker
27
+ HAS_FLAGEMBEDDING = True
28
+ except ImportError:
29
+ HAS_FLAGEMBEDDING = False
30
+ print("Warning: FlagEmbedding not installed. Install with: pip install FlagEmbedding")
31
+
32
+
33
+ # Request/Response models
34
+ class RerankRequest(BaseModel):
35
+ query: str
36
+ paragraphs: List[str]
37
+ batch_size: int = 100
38
+
39
+
40
+ class RerankResponse(BaseModel):
41
+ scores: List[float]
42
+ success: bool
43
+ message: Optional[str] = None
44
+
45
+
46
+ # Global reranker instance
47
+ _reranker: Optional[Any] = None
48
+
49
+
50
+ def create_app(model_path: str, use_fp16: bool = True, device: Optional[str] = None):
51
+ """Create FastAPI app with reranker"""
52
+ global _reranker
53
+
54
+ app = FastAPI(title="Reranker API Service", version="1.0.0")
55
+
56
+ # Add CORS middleware
57
+ app.add_middleware(
58
+ CORSMiddleware,
59
+ allow_origins=["*"],
60
+ allow_credentials=True,
61
+ allow_methods=["*"],
62
+ allow_headers=["*"],
63
+ )
64
+
65
+ @app.on_event("startup")
66
+ async def load_reranker():
67
+ """Load reranker model on startup"""
68
+ global _reranker
69
+ if not HAS_FLAGEMBEDDING:
70
+ raise RuntimeError("FlagEmbedding not installed")
71
+
72
+ print(f"Loading reranker model: {model_path}")
73
+ print(f"Using FP16: {use_fp16}")
74
+ if device:
75
+ print(f"Using device: {device}")
76
+
77
+ try:
78
+ _reranker = FlagReranker(
79
+ model_path,
80
+ use_fp16=use_fp16,
81
+ )
82
+ if device:
83
+ # Note: FlagReranker may not support explicit device setting
84
+ # This is a placeholder for future support
85
+ pass
86
+ print("Reranker model loaded successfully")
87
+ except Exception as e:
88
+ print(f"Error loading reranker: {e}")
89
+ raise
90
+
91
+ @app.get("/health")
92
+ async def health_check():
93
+ """Health check endpoint"""
94
+ return {
95
+ "status": "healthy",
96
+ "model_loaded": _reranker is not None
97
+ }
98
+
99
+ @app.post("/rerank", response_model=RerankResponse)
100
+ async def rerank(request: RerankRequest):
101
+ """Rerank paragraphs given a query"""
102
+ global _reranker
103
+
104
+ if _reranker is None:
105
+ raise HTTPException(status_code=503, detail="Reranker not loaded")
106
+
107
+ if not request.paragraphs:
108
+ return RerankResponse(
109
+ scores=[],
110
+ success=True,
111
+ message="No paragraphs to rerank"
112
+ )
113
+
114
+ try:
115
+ # Prepare sentence pairs: [[query, paragraph], ...]
116
+ sentence_pairs = [[request.query, p] for p in request.paragraphs]
117
+
118
+ # Compute scores
119
+ scores = _reranker.compute_score(
120
+ sentence_pairs,
121
+ batch_size=request.batch_size
122
+ )
123
+
124
+ # Handle score format (can be float or list)
125
+ if isinstance(scores, float):
126
+ scores = [scores]
127
+ elif not isinstance(scores, list):
128
+ scores = list(scores)
129
+
130
+ return RerankResponse(
131
+ scores=scores,
132
+ success=True
133
+ )
134
+ except Exception as e:
135
+ print(f"Error during reranking: {e}")
136
+ import traceback
137
+ traceback.print_exc()
138
+ raise HTTPException(status_code=500, detail=str(e))
139
+
140
+ return app
141
+
142
+
143
+ def main():
144
+ """Main entry point for reranker API service"""
145
+ parser = argparse.ArgumentParser(description="Reranker API Service")
146
+ parser.add_argument(
147
+ "--model_path",
148
+ type=str,
149
+ required=True,
150
+ help="Path to reranker model (e.g., 'OpenScholar/OpenScholar_Reranker')"
151
+ )
152
+ parser.add_argument(
153
+ "--host",
154
+ type=str,
155
+ default="0.0.0.0",
156
+ help="Host to bind to (default: 0.0.0.0)"
157
+ )
158
+ parser.add_argument(
159
+ "--port",
160
+ type=int,
161
+ default=8004,
162
+ help="Port to bind to (default: 8004)"
163
+ )
164
+ parser.add_argument(
165
+ "--use_fp16",
166
+ action="store_true",
167
+ default=True,
168
+ help="Use FP16 precision (default: True)"
169
+ )
170
+ parser.add_argument(
171
+ "--no_fp16",
172
+ dest="use_fp16",
173
+ action="store_false",
174
+ help="Disable FP16 precision"
175
+ )
176
+ parser.add_argument(
177
+ "--device",
178
+ type=str,
179
+ default=None,
180
+ help="Device to use (e.g., 'cuda:0', 'cuda:1')"
181
+ )
182
+ parser.add_argument(
183
+ "--workers",
184
+ type=int,
185
+ default=1,
186
+ help="Number of worker processes (default: 1, use 1 for reranker)"
187
+ )
188
+
189
+ args = parser.parse_args()
190
+
191
+ if not HAS_FASTAPI:
192
+ print("Error: FastAPI not installed. Install with: pip install fastapi uvicorn")
193
+ sys.exit(1)
194
+
195
+ if not HAS_FLAGEMBEDDING:
196
+ print("Error: FlagEmbedding not installed. Install with: pip install FlagEmbedding")
197
+ sys.exit(1)
198
+
199
+ # Create app
200
+ app = create_app(
201
+ model_path=args.model_path,
202
+ use_fp16=args.use_fp16,
203
+ device=args.device
204
+ )
205
+
206
+ # Run server
207
+ print(f"Starting reranker API service on {args.host}:{args.port}")
208
+ print(f"Model: {args.model_path}")
209
+ print(f"FP16: {args.use_fp16}")
210
+
211
+ uvicorn.run(
212
+ app,
213
+ host=args.host,
214
+ port=args.port,
215
+ workers=args.workers,
216
+ log_level="info"
217
+ )
218
+
219
+
220
+ if __name__ == "__main__":
221
+ main()
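A client call to the `/rerank` endpoint above would send a JSON body matching `RerankRequest`. The sketch below only builds and inspects the payload so it stands alone; the host/port and query text are illustrative, not part of the service code.

```python
import json

# Hypothetical client-side payload for the /rerank endpoint defined above.
request_data = {
    "query": "graph neural networks for molecules",
    "paragraphs": ["GNNs for chemistry ...", "CNNs for images ..."],
    "batch_size": 100,
}
body = json.dumps(request_data)
print(body)

# With a server running, the call would look like:
#   resp = requests.post("http://localhost:8004/rerank", json=request_data, timeout=30)
#   scores = resp.json()["scores"]  # one float per input paragraph
```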
shared/utils/reranker_endpoint_pool.py ADDED
@@ -0,0 +1,160 @@
1
+ """
2
+ Reranker Endpoint Pool Manager
3
+
4
+ Manages multiple reranker API endpoints, implementing round-robin access and load balancing.
5
+ Reuses the logic of VLLMEndpointPool.
6
+ """
7
+ from pathlib import Path
8
+ from typing import List, Optional, Dict
9
+ from threading import Lock
10
+
11
+
12
+ class RerankerEndpointPool:
13
+ """
14
+ Reranker Endpoint Pool Manager
15
+
16
+ Features:
17
+ 1. Load multiple reranker API endpoints from file
18
+ 2. Cycle through endpoints round-robin (ensures uniform distribution)
19
+ 3. Track usage status and errors for each endpoint
20
+ """
21
+
22
+ def __init__(self, pool_path: Optional[str] = None, endpoints: Optional[List[str]] = None, use_round_robin: bool = True):
23
+ """
24
+ Initialize Reranker Endpoint Pool
25
+
26
+ Args:
27
+ pool_path: Endpoints file path (one endpoint URL per line)
28
+ endpoints: List of endpoint URLs provided directly (takes priority over pool_path)
29
+ use_round_robin: Whether to use the round-robin strategy (default: True, recommended)
30
+ """
31
+ self.endpoints: List[str] = []
32
+ self.current_index: int = 0 # Round-robin current index
33
+ self.endpoint_status: Dict[str, Dict] = {} # Status information for each endpoint
34
+ self.lock = Lock() # Thread-safety lock
35
+ self.use_round_robin = use_round_robin # Whether to use round-robin selection
36
+
37
+ # load endpoints
38
+ if endpoints:
39
+ self.endpoints = endpoints
40
+ elif pool_path:
41
+ self._load_from_file(pool_path)
42
+ else:
43
+ raise ValueError("Either pool_path or endpoints must be provided")
44
+
45
+ if not self.endpoints:
46
+ raise ValueError("No endpoints loaded")
47
+
48
+ # initialize status for each endpoint
49
+ for endpoint in self.endpoints:
50
+ if endpoint not in self.endpoint_status:
51
+ self.endpoint_status[endpoint] = {
52
+ 'total_requests': 0,
53
+ 'successful_requests': 0,
54
+ 'failed_requests': 0,
55
+ 'total_response_time': 0.0,
56
+ 'last_error': None,
57
+ }
58
+
59
+ print(f"RerankerEndpointPool initialized with {len(self.endpoints)} endpoints")
60
+ for i, endpoint in enumerate(self.endpoints):
61
+ print(f" [{i+1}] {endpoint}")
62
+
63
+ def _load_from_file(self, pool_path: str):
64
+ """load endpoints from file"""
65
+ path = Path(pool_path)
66
+ if not path.is_absolute():
67
+ # try to find file relative to shared/configs/
68
+ project_root = Path(__file__).parent.parent.parent
69
+ path = project_root / "shared" / "configs" / pool_path
70
+
71
+ if not path.exists():
72
+ raise FileNotFoundError(f"Reranker endpoint pool file not found: {path}")
73
+
74
+ with open(path, 'r', encoding='utf-8') as f:
75
+ lines = f.readlines()
76
+
77
+ self.endpoints = []
78
+ for line in lines:
79
+ line = line.strip()
80
+ if line and not line.startswith('#'):
81
+ # Ensure the URL includes a scheme
82
+ if not line.startswith('http://') and not line.startswith('https://'):
83
+ line = f"http://{line}"
84
+ self.endpoints.append(line)
85
+
86
+ def get_endpoint(self) -> str:
87
+ """
88
+ Get next available endpoint (round-robin strategy)
89
+
90
+ Returns:
91
+ Available endpoint URL
92
+ """
93
+ with self.lock:
94
+ if not self.endpoints:
95
+ raise ValueError("No reranker endpoints available in pool")
96
+
97
+ if self.use_round_robin:
98
+ # Simple round-robin to ensure uniform distribution
99
+ selected_idx = self.current_index
100
+ self.current_index = (self.current_index + 1) % len(self.endpoints)
101
+
102
+ selected_endpoint = self.endpoints[selected_idx]
103
+ self.endpoint_status[selected_endpoint]['total_requests'] += 1
104
+
105
+ return selected_endpoint
106
+ else:
107
+ # Smart selection mode (could factor in error rate, etc.)
108
+ # Simple implementation: select the endpoint with the fewest requests
109
+ min_requests = min(
110
+ self.endpoint_status[ep]['total_requests']
111
+ for ep in self.endpoints
112
+ )
113
+ candidates = [
114
+ ep for ep in self.endpoints
115
+ if self.endpoint_status[ep]['total_requests'] == min_requests
116
+ ]
117
+ selected_endpoint = candidates[0]
118
+ self.endpoint_status[selected_endpoint]['total_requests'] += 1
119
+ return selected_endpoint
120
+
121
+ def mark_success(self, endpoint: str, response_time: float = 0.0):
122
+ """mark endpoint request as successful"""
123
+ with self.lock:
124
+ if endpoint in self.endpoint_status:
125
+ self.endpoint_status[endpoint]['successful_requests'] += 1
126
+ self.endpoint_status[endpoint]['total_response_time'] += response_time
127
+
128
+ def mark_error(self, endpoint: str, error: str):
129
+ """mark endpoint request as failed"""
130
+ with self.lock:
131
+ if endpoint in self.endpoint_status:
132
+ self.endpoint_status[endpoint]['failed_requests'] += 1
133
+ self.endpoint_status[endpoint]['last_error'] = error
134
+
135
+ def get_status(self) -> Dict:
136
+ """get pool status information"""
137
+ with self.lock:
138
+ endpoints_status = {}
139
+ for endpoint, status in self.endpoint_status.items():
140
+ total = status['total_requests']
141
+ success = status['successful_requests']
142
+ failed = status['failed_requests']
143
+ avg_time = (
144
+ status['total_response_time'] / success
145
+ if success > 0 else 0.0
146
+ )
147
+
148
+ endpoints_status[endpoint] = {
149
+ 'total_requests': total,
150
+ 'successful_requests': success,
151
+ 'failed_requests': failed,
152
+ 'success_rate': success / total if total > 0 else 0.0,
153
+ 'avg_response_time': avg_time,
154
+ 'last_error': status['last_error'],
155
+ }
156
+
157
+ return {
158
+ 'total_endpoints': len(self.endpoints),
159
+ 'endpoints_status': endpoints_status,
160
+ }
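The round-robin selection in `get_endpoint` above can be sketched standalone; locking and status tracking are omitted, and the endpoint URLs are made up.

```python
# Minimal standalone sketch of RerankerEndpointPool's round-robin selection.
endpoints = ["http://gpu0:8004", "http://gpu1:8004", "http://gpu2:8004"]
current_index = 0

def get_endpoint():
    global current_index
    selected = endpoints[current_index]
    current_index = (current_index + 1) % len(endpoints)
    return selected

picks = [get_endpoint() for _ in range(5)]
print(picks)  # wraps around after the third pick
```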
shared/utils/reranker_pool.py ADDED
@@ -0,0 +1,78 @@
1
+ """
2
+ Reranker Pool for multiprocessing-safe reranker sharing
3
+
4
+ Works around FlagReranker not being pickleable in multi-process/multi-thread environments.
5
+ """
6
+ import os
7
+ import threading
8
+ from typing import Optional, Dict
9
+ from pathlib import Path
10
+
11
+ # Global reranker storage (thread-safe)
12
+ # Note: Dictionary access is atomic in Python for simple operations,
13
+ # but we use a lock for thread-safety when modifying the dict
14
+ _reranker_pool: Dict[str, object] = {}
15
+ _reranker_lock = threading.Lock()
16
+
17
+
18
+ def get_reranker(model_path: str, use_fp16: bool = True):
19
+ """
20
+ Get or create reranker (thread-safe, process-shared)
21
+
22
+ Performance optimization:
23
+ - Load and cache on first call
24
+ - Return the cached instance on subsequent calls (lock-free fast path)
25
+ - Use the double-checked locking pattern to reduce lock contention
26
+
27
+ Args:
28
+ model_path: Reranker model path
29
+ use_fp16: Whether to use FP16 precision
30
+
31
+ Returns:
32
+ FlagReranker instance
33
+ """
34
+ global _reranker_pool
35
+
36
+ # Create a unique cache key
37
+ key = f"{model_path}_{use_fp16}"
38
+
39
+ # Fast path: check the cache without taking the lock
40
+ if key in _reranker_pool:
41
+ return _reranker_pool[key]
42
+
43
+ # Slow path: load the model while holding the lock
44
+ with _reranker_lock:
45
+ # Double-check: another thread may have loaded it while we waited for the lock
46
+ if key not in _reranker_pool:
47
+ # Lazy import so FlagEmbedding is only required when actually used
48
+ try:
49
+ from FlagEmbedding import FlagReranker
50
+
51
+ # Suppress the transformers progress bar while loading
52
+ original_verbosity = os.environ.get('TRANSFORMERS_VERBOSITY', '')
53
+ os.environ['TRANSFORMERS_VERBOSITY'] = 'error'
54
+
55
+ try:
56
+ # load model
57
+ reranker = FlagReranker(model_path, use_fp16=use_fp16)
58
+ _reranker_pool[key] = reranker
59
+ finally:
60
+ # restore original verbosity
61
+ if original_verbosity:
62
+ os.environ['TRANSFORMERS_VERBOSITY'] = original_verbosity
63
+ elif 'TRANSFORMERS_VERBOSITY' in os.environ:
64
+ del os.environ['TRANSFORMERS_VERBOSITY']
65
+
66
+ except ImportError:
67
+ raise ImportError(
68
+ "FlagEmbedding not installed. Install it with: pip install FlagEmbedding"
69
+ )
70
+
71
+ return _reranker_pool[key]
72
+
73
+
74
+ def clear_reranker_pool():
75
+ """clear reranker pool (mainly for testing)"""
76
+ global _reranker_pool
77
+ with _reranker_lock:
78
+ _reranker_pool.clear()
shared/utils/review_logger.py ADDED
@@ -0,0 +1,306 @@
1
+ """
2
+ Review Logger Utility
3
+
4
+ Captures and logs all intermediate outputs from the review pipeline
5
+ """
6
+ import json
7
+ import uuid
8
+ from datetime import datetime
9
+ from pathlib import Path
10
+ from typing import Dict, Any, Optional, List
11
+
12
+
13
+ class ReviewLogger:
14
+ """
15
+ Logger for capturing complete review pipeline execution logs
16
+ """
17
+
18
+ def __init__(self, log_dir: Optional[str] = None, enabled: bool = True):
19
+ """
20
+ Initialize Review Logger
21
+
22
+ Args:
23
+ log_dir: Directory to save log files. If None, uses current directory.
24
+ enabled: Whether logging is enabled
25
+ """
26
+ self.enabled = enabled
27
+ self.log_dir = Path(log_dir) if log_dir else Path.cwd()
28
+ self.log_dir.mkdir(parents=True, exist_ok=True)
29
+
30
+ # Current run data
31
+ self.current_run_id: Optional[str] = None
32
+ self.current_run_data: Optional[Dict[str, Any]] = None
33
+
34
+ def start_run(
35
+ self,
36
+ title: str,
37
+ abstract: str,
38
+ content: Optional[str] = None,
39
+ keywords: Optional[List[str]] = None,
40
+ publication_date_range: Optional[str] = None,
41
+ venues: Optional[str] = None,
42
+ review_format: str = "detailed",
43
+ ) -> str:
44
+ """
45
+ Start a new review run and generate UUID
46
+
47
+ IMPORTANT: If current_run_data already exists, this method will preserve existing
48
+ intermediate_outputs data to prevent data loss. Only input data and metadata are updated.
49
+
50
+ Args:
51
+ title: Paper title
52
+ abstract: Paper abstract
53
+ content: Paper content (optional)
54
+ keywords: Existing keywords (optional)
55
+ publication_date_range: Date range filter (optional)
56
+ venues: Venue filter (optional)
57
+ review_format: Review format
58
+
59
+ Returns:
60
+ Run UUID string
61
+ """
62
+ if not self.enabled:
63
+ return ""
64
+
65
+ # Generate UUID based on timestamp
66
+ timestamp = datetime.now()
67
+ # Use timestamp-based UUID (UUID1 uses MAC address + timestamp)
68
+ run_id = str(uuid.uuid1())
69
+
70
+ # PRESERVE existing intermediate_outputs if current_run_data already exists
71
+ # This prevents data loss if start_run() is called multiple times
72
+ existing_intermediate_outputs = None
73
+ existing_final_output = None
74
+ existing_errors = None
75
+ if self.current_run_data is not None:
76
+ existing_intermediate_outputs = self.current_run_data.get("intermediate_outputs")
77
+ existing_final_output = self.current_run_data.get("final_output")
78
+ existing_errors = self.current_run_data.get("errors", [])
79
+
80
+ self.current_run_id = run_id
81
+
82
+ # Initialize intermediate_outputs: use existing data if available, otherwise create new
83
+ if existing_intermediate_outputs is not None:
84
+ # Preserve existing intermediate outputs
85
+ intermediate_outputs = existing_intermediate_outputs
86
+ # Only initialize None fields if they don't exist
87
+ if "generated_keywords" not in intermediate_outputs:
88
+ intermediate_outputs["generated_keywords"] = None
89
+ if "retrieved_papers" not in intermediate_outputs:
90
+ intermediate_outputs["retrieved_papers"] = []
91
+ if "paper_summaries" not in intermediate_outputs:
92
+ intermediate_outputs["paper_summaries"] = []
93
+ if "related_work_json_list" not in intermediate_outputs:
94
+ intermediate_outputs["related_work_json_list"] = None
95
+ if "paper_results_analyzer_output" not in intermediate_outputs:
96
+ intermediate_outputs["paper_results_analyzer_output"] = None
97
+ if "paper_insight_miner_output" not in intermediate_outputs:
98
+ intermediate_outputs["paper_insight_miner_output"] = None
99
+ if "review_prompt" not in intermediate_outputs:
100
+ intermediate_outputs["review_prompt"] = None
101
+ if "review_llm_response" not in intermediate_outputs:
102
+ intermediate_outputs["review_llm_response"] = None
103
+ if "parsed_review" not in intermediate_outputs:
104
+ intermediate_outputs["parsed_review"] = None
105
+ if "refiner_prompt" not in intermediate_outputs:
106
+ intermediate_outputs["refiner_prompt"] = None
107
+ if "refiner_llm_response" not in intermediate_outputs:
108
+ intermediate_outputs["refiner_llm_response"] = None
109
+ if "parsed_refined_review" not in intermediate_outputs:
110
+ intermediate_outputs["parsed_refined_review"] = None
111
+ else:
112
+ # Create new intermediate_outputs structure
113
+ intermediate_outputs = {
114
+ "generated_keywords": None,
115
+ "retrieved_papers": [],
116
+ "paper_summaries": [],
117
+ "related_work_json_list": None,
118
+ "paper_results_analyzer_output": None,
119
+ "paper_insight_miner_output": None,
120
+ "review_prompt": None,
121
+ "review_llm_response": None,
122
+ "parsed_review": None,
123
+ "refiner_prompt": None,
124
+ "refiner_llm_response": None,
125
+ "parsed_refined_review": None,
126
+ }
127
+
128
+ self.current_run_data = {
129
+ "run_id": run_id,
130
+ "timestamp": timestamp.isoformat(),
131
+ "input": {
132
+ "title": title,
133
+ "abstract": abstract,
134
+ "content": content,
135
+ "keywords": keywords,
136
+ "publication_date_range": publication_date_range,
137
+ "venues": venues,
138
+ "review_format": review_format,
139
+ },
140
+ "intermediate_outputs": intermediate_outputs,
141
+ "final_output": existing_final_output,
142
+ "errors": existing_errors if existing_errors is not None else [],
143
+ }
144
+
145
+ return run_id
146
+
147
+ def log_keywords(self, keywords: List[str]):
148
+ """Log generated search keywords"""
149
+ if self.enabled and self.current_run_data:
150
+ # Ensure intermediate_outputs exists
151
+ if "intermediate_outputs" not in self.current_run_data:
152
+ self.current_run_data["intermediate_outputs"] = {}
153
+ self.current_run_data["intermediate_outputs"]["generated_keywords"] = keywords
154
+
155
+ def log_retrieved_papers(self, papers: List[Dict[str, Any]]):
156
+ """Log retrieved papers (raw)"""
157
+ if self.enabled and self.current_run_data:
158
+ # Ensure intermediate_outputs exists
159
+ if "intermediate_outputs" not in self.current_run_data:
160
+ self.current_run_data["intermediate_outputs"] = {}
161
+ # Store paper metadata (may be large, so we store essential info)
162
+ self.current_run_data["intermediate_outputs"]["retrieved_papers"] = [
163
+ {
164
+ "paper_id": p.get("paper_id"),
165
+ "title": p.get("title"),
166
+ "authors": p.get("authors", [])[:10], # Limit authors
167
+ "year": p.get("year"),
168
+ "venue": p.get("venue"),
169
+ "abstract": p.get("abstract", "")[:500], # Truncate abstract
170
+ "citation_counts": p.get("citation_counts", 0),
171
+ }
172
+ for p in papers
173
+ ]
174
+
175
+ def log_paper_summary(self, paper_title: str, summary: str, paper_index: int):
176
+ """Log a single paper summary"""
177
+ if self.enabled and self.current_run_data:
178
+ # Ensure intermediate_outputs exists
179
+ if "intermediate_outputs" not in self.current_run_data:
180
+ self.current_run_data["intermediate_outputs"] = {}
181
+ if "paper_summaries" not in self.current_run_data["intermediate_outputs"]:
182
+ self.current_run_data["intermediate_outputs"]["paper_summaries"] = []
183
+ self.current_run_data["intermediate_outputs"]["paper_summaries"].append({
184
+ "paper_index": paper_index,
185
+ "paper_title": paper_title,
186
+ "summary": summary,
187
+ })
188
+
189
+ def log_related_work_json_list(self, related_work_json_list: List[Dict[str, Any]]):
190
+ """Log the final related work JSON list"""
191
+ if self.enabled and self.current_run_data:
192
+ # Ensure intermediate_outputs exists
193
+ if "intermediate_outputs" not in self.current_run_data:
194
+ self.current_run_data["intermediate_outputs"] = {}
195
+ self.current_run_data["intermediate_outputs"]["related_work_json_list"] = related_work_json_list
196
+
197
+ def log_paper_results_analyzer_output(self, results_analyzer_output: str):
198
+ """Log the paper results analyzer JSON output"""
199
+ if self.enabled and self.current_run_data:
200
+ # Ensure intermediate_outputs exists
201
+ if "intermediate_outputs" not in self.current_run_data:
202
+ self.current_run_data["intermediate_outputs"] = {}
203
+ self.current_run_data["intermediate_outputs"]["paper_results_analyzer_output"] = results_analyzer_output
204
+
205
+ def log_paper_insight_miner_output(self, insight_miner_output: str):
206
+ """Log the paper insight miner JSON output"""
207
+ if self.enabled and self.current_run_data:
208
+ # Ensure intermediate_outputs exists
209
+ if "intermediate_outputs" not in self.current_run_data:
210
+ self.current_run_data["intermediate_outputs"] = {}
211
+ self.current_run_data["intermediate_outputs"]["paper_insight_miner_output"] = insight_miner_output
212
+
213
+ def log_review_prompt(self, prompt: str, system_message: Optional[str] = None):
214
+ """Log the review prompt sent to LLM"""
215
+ if self.enabled and self.current_run_data:
216
+ # Ensure intermediate_outputs exists
217
+ if "intermediate_outputs" not in self.current_run_data:
218
+ self.current_run_data["intermediate_outputs"] = {}
219
+ self.current_run_data["intermediate_outputs"]["review_prompt"] = {
220
+ "system_message": system_message,
221
+ "user_prompt": prompt,
222
+ }
223
+
224
+ def log_review_llm_response(self, response: str):
225
+ """Log the raw LLM response for review"""
226
+ if self.enabled and self.current_run_data:
227
+ # Ensure intermediate_outputs exists
228
+ if "intermediate_outputs" not in self.current_run_data:
229
+ self.current_run_data["intermediate_outputs"] = {}
230
+ self.current_run_data["intermediate_outputs"]["review_llm_response"] = response
231
+
232
+ def log_parsed_review(self, parsed_review: Dict[str, Any]):
233
+ """Log the parsed review dictionary"""
234
+ if self.enabled and self.current_run_data:
235
+ # Ensure intermediate_outputs exists
236
+ if "intermediate_outputs" not in self.current_run_data:
237
+ self.current_run_data["intermediate_outputs"] = {}
238
+ self.current_run_data["intermediate_outputs"]["parsed_review"] = parsed_review
239
+
240
+ def log_refiner_prompt(self, prompt: str, system_message: Optional[str] = None):
241
+ """Log the refiner prompt sent to LLM"""
242
+ if self.enabled and self.current_run_data:
243
+ # Ensure intermediate_outputs exists
244
+ if "intermediate_outputs" not in self.current_run_data:
245
+ self.current_run_data["intermediate_outputs"] = {}
246
+ self.current_run_data["intermediate_outputs"]["refiner_prompt"] = {
247
+ "system_message": system_message,
248
+ "user_prompt": prompt,
249
+ }
250
+
251
+ def log_refiner_llm_response(self, response: str):
252
+ """Log the raw LLM response for refiner"""
253
+ if self.enabled and self.current_run_data:
254
+ # Ensure intermediate_outputs exists
255
+ if "intermediate_outputs" not in self.current_run_data:
256
+ self.current_run_data["intermediate_outputs"] = {}
257
+ self.current_run_data["intermediate_outputs"]["refiner_llm_response"] = response
258
+
259
+ def log_parsed_refined_review(self, parsed_review: Dict[str, Any]):
260
+ """Log the parsed refined review dictionary"""
261
+ if self.enabled and self.current_run_data:
262
+ # Ensure intermediate_outputs exists
263
+ if "intermediate_outputs" not in self.current_run_data:
264
+ self.current_run_data["intermediate_outputs"] = {}
265
+ self.current_run_data["intermediate_outputs"]["parsed_refined_review"] = parsed_review
266
+
267
+ def log_final_output(self, final_output: Dict[str, Any]):
268
+ """Log the final review output"""
269
+ if self.enabled and self.current_run_data:
270
+ self.current_run_data["final_output"] = final_output
271
+
272
+ def log_error(self, error: str, step: Optional[str] = None):
273
+ """Log an error that occurred during execution"""
274
+ if self.enabled and self.current_run_data:
275
+ if "errors" not in self.current_run_data:
276
+ self.current_run_data["errors"] = []
277
+ self.current_run_data["errors"].append({
278
+ "step": step,
279
+ "error": error,
280
+ "timestamp": datetime.now().isoformat(),
281
+ })
282
+
283
+ def save_run(self) -> Optional[str]:
284
+ """
285
+ Save the current run to a JSON file
286
+
287
+ Returns:
288
+ Path to saved log file, or None if logging is disabled
289
+ """
290
+ if not self.enabled or not self.current_run_data:
291
+ return None
292
+
293
+ # Generate filename with timestamp and UUID
294
+ timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")
295
+ filename = f"review_log_{timestamp_str}_{self.current_run_id[:8]}.json"
296
+ log_path = self.log_dir / filename
297
+
298
+ # Save to JSON
299
+ with open(log_path, 'w', encoding='utf-8') as f:
300
+ json.dump(self.current_run_data, f, indent=2, ensure_ascii=False)
301
+
302
+ return str(log_path)
303
+
304
+ def get_current_run_id(self) -> Optional[str]:
305
+ """Get the current run ID"""
306
+ return self.current_run_id if self.enabled else None
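The `start_run()` preservation rule above — reuse existing `intermediate_outputs` and fill in only the missing keys — can be sketched as a small standalone helper. This is an illustrative sketch only; `DEFAULT_OUTPUTS` and `init_intermediate_outputs` are hypothetical names, not part of the module.

```python
# Hypothetical sketch of start_run()'s merge rule: existing intermediate
# outputs survive a repeated start_run() call; only missing keys get defaults.
from typing import Any, Dict, Optional

DEFAULT_OUTPUTS: Dict[str, Any] = {
    "generated_keywords": None,
    "retrieved_papers": [],
    "paper_summaries": [],
}

def init_intermediate_outputs(existing: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Reuse existing outputs when present; otherwise build a fresh structure."""
    if existing is None:
        # copy list defaults so separate runs never share mutable state
        return {k: (list(v) if isinstance(v, list) else v) for k, v in DEFAULT_OUTPUTS.items()}
    for key, default in DEFAULT_OUTPUTS.items():
        existing.setdefault(key, list(default) if isinstance(default, list) else default)
    return existing

outputs = {"generated_keywords": ["llm"]}
merged = init_intermediate_outputs(outputs)
# merged keeps the existing keywords and gains only the missing default keys
```

The same in-place `setdefault` pattern is what prevents data loss when `start_run()` is called twice on one logger instance.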
shared/utils/vllm_endpoint_pool.py ADDED
@@ -0,0 +1,257 @@
1
+ """
2
+ VLLM Endpoint Pool Manager
3
+
4
+ Manage multiple vLLM endpoints, implement round-robin access and load balancing.
5
+ """
6
+ import random
7
+ from pathlib import Path
8
+ from typing import List, Optional, Dict
9
+ from threading import Lock
10
+
11
+
12
+ class VLLMEndpointPool:
13
+ """
14
+ VLLM Endpoint Pool Manager
15
+
16
+ Features:
17
+ 1. Load multiple vLLM endpoints from file
18
+ 2. Round-robin access endpoints (ensure uniform distribution)
19
+ 3. Track usage status and errors for each endpoint
20
+ 4. Smart selection (based on error rate and success rate, as backup)
21
+ """
22
+
23
+ def __init__(self, pool_path: Optional[str] = None, endpoints: Optional[List[str]] = None, use_round_robin: bool = True):
24
+ """
25
+ Initialize VLLM Endpoint Pool
26
+
27
+ Args:
28
+ pool_path: Endpoints file path (one endpoint URL per line)
29
+ endpoints: explicit list of endpoint URLs (takes precedence over pool_path)
30
+ use_round_robin: whether to use round-robin strategy (True=uniform distribution, False=smart selection)
31
+ """
32
+ self.endpoints: List[str] = []
33
+ self.current_index: int = 0 # Round-robin current index
34
+ self.used_indices: List[int] = [] # used indices in current round (for smart selection)
35
+ self.endpoint_status: Dict[str, Dict] = {} # status information for each endpoint
36
+ self.lock = Lock() # thread safe lock
37
+ self.use_round_robin = use_round_robin # whether to use round-robin
38
+
39
+ # load endpoints
40
+ if endpoints:
41
+ self.endpoints = [e.strip() for e in endpoints if e.strip()]
42
+ elif pool_path:
43
+ self._load_from_file(pool_path)
44
+ else:
45
+ # try to get single endpoint from environment variable (backward compatibility)
46
+ import os
47
+ env_endpoint = os.environ.get("VLLM_BASE_URL")
48
+ if env_endpoint:
49
+ # ensure format is correct (may need to add /v1)
50
+ if not env_endpoint.endswith('/v1'):
51
+ if env_endpoint.endswith('/'):
52
+ env_endpoint = env_endpoint.rstrip('/') + '/v1'
53
+ else:
54
+ env_endpoint = env_endpoint + '/v1'
55
+ self.endpoints = [env_endpoint]
56
+
57
+ if not self.endpoints:
58
+ raise ValueError(
59
+ "No vLLM endpoints available. Provide endpoints via pool_path, endpoints parameter, "
60
+ "or VLLM_BASE_URL environment variable."
61
+ )
62
+
63
+ # initialize status for each endpoint
64
+ for endpoint in self.endpoints:
65
+ self.endpoint_status[endpoint] = {
66
+ 'error_count': 0,
67
+ 'last_error_time': None,
68
+ 'consecutive_errors': 0,
69
+ 'total_requests': 0,
70
+ 'successful_requests': 0,
71
+ 'total_response_time': 0.0,  # cumulative response time (seconds)
72
+ }
73
+
74
+ def _load_from_file(self, pool_path: str):
75
+ """load vLLM endpoints from file"""
76
+ path = Path(pool_path)
77
+
78
+ # if relative path, try to find file relative to shared/configs
79
+ if not path.is_absolute():
80
+ # try to find file relative to project root
81
+ project_root = Path(__file__).parent.parent.parent
82
+ path = project_root / "shared" / "configs" / pool_path
83
+ if not path.exists():
84
+ # fall back to the configs directory next to this package
85
+ path = Path(__file__).parent.parent / "configs" / pool_path
86
+
87
+ if not path.exists():
88
+ raise FileNotFoundError(
89
+ f"VLLM endpoint pool file not found: {pool_path} (tried: {path})"
90
+ )
91
+
92
+ with open(path, 'r', encoding='utf-8') as f:
93
+ lines = f.readlines()
94
+
95
+ endpoints = []
96
+ for line in lines:
97
+ line = line.strip()
98
+ if line and not line.startswith('#'):
99
+ # ensure format is correct (may need to add /v1)
100
+ if not line.endswith('/v1'):
101
+ if line.endswith('/'):
102
+ line = line.rstrip('/') + '/v1'
103
+ else:
104
+ line = line + '/v1'
105
+ endpoints.append(line)
106
+
107
+ self.endpoints = endpoints
108
+
109
+ if not self.endpoints:
110
+ raise ValueError(f"No valid vLLM endpoints found in pool file: {pool_path}")
111
+
112
+ def get_endpoint(self) -> str:
113
+ """
114
+ Get next available endpoint (round-robin strategy)
115
+
116
+ Strategy:
117
+ - Round-robin mode (default): simple round-robin, ensure uniform distribution
118
+ - Smart selection mode: select best endpoint based on error rate, success rate, response time
119
+
120
+ Returns:
121
+ Available endpoint URL
122
+ """
123
+ import time
124
+
125
+ with self.lock:
126
+ if not self.endpoints:
127
+ raise ValueError("No vLLM endpoints available in pool")
128
+
129
+ if self.use_round_robin:
130
+ # Round-robin: simple round-robin, ensure uniform distribution
131
+ selected_idx = self.current_index
132
+ self.current_index = (self.current_index + 1) % len(self.endpoints)
133
+
134
+ selected_endpoint = self.endpoints[selected_idx]
135
+ self.endpoint_status[selected_endpoint]['total_requests'] += 1
136
+
137
+ return selected_endpoint
138
+ else:
139
+ # smart selection mode (original logic)
140
+ # if current round is complete, start a new round
141
+ if len(self.used_indices) >= len(self.endpoints):
142
+ self.used_indices = []
143
+
144
+ # get indices not used in current round
145
+ available_indices = [i for i in range(len(self.endpoints)) if i not in self.used_indices]
146
+
147
+ if not available_indices:
148
+ # all endpoints were used this round; start a new round
149
+ available_indices = list(range(len(self.endpoints)))
150
+ self.used_indices = []
151
+
152
+ # prioritize endpoints with fewer errors and higher success rate
153
+ endpoint_scores = []
154
+ for idx in available_indices:
155
+ endpoint = self.endpoints[idx]
156
+ status = self.endpoint_status[endpoint]
157
+
158
+ # compute a score from error count, success rate, and response time (higher is better)
159
+ error_count = status['error_count']
160
+ total = status['total_requests']
161
+ success_rate = (status['successful_requests'] / total) if total > 0 else 1.0
162
+
163
+ # calculate average response time (shorter is better)
164
+ avg_response_time = (
165
+ status['total_response_time'] / status['successful_requests']
166
+ if status['successful_requests'] > 0 else 0.0
167
+ )
168
+ # normalize response time into a score (10-second baseline; faster responses score higher)
169
+ response_time_score = 1.0 / (1.0 + avg_response_time / 10.0)
170
+
171
+ # if recent error, reduce score
172
+ recent_error_penalty = 0
173
+ if status['last_error_time']:
174
+ time_since_error = time.time() - status['last_error_time']
175
+ if time_since_error < 60:  # within the last minute
176
+ recent_error_penalty = 0.5
177
+
178
+ score = success_rate - (error_count * 0.1) - recent_error_penalty + (response_time_score * 0.2)
179
+ endpoint_scores.append((idx, score))
180
+
181
+ # sort by score, select highest score (but add some randomness)
182
+ endpoint_scores.sort(key=lambda x: x[1], reverse=True)
183
+
184
+ # pick randomly from the top 50% (randomness while still favoring better endpoints)
185
+ top_n = max(1, len(endpoint_scores) // 2) if len(endpoint_scores) > 1 else 1
186
+ selected_idx, _ = random.choice(endpoint_scores[:top_n])
187
+
188
+ # mark as used
189
+ self.used_indices.append(selected_idx)
190
+
191
+ selected_endpoint = self.endpoints[selected_idx]
192
+ self.endpoint_status[selected_endpoint]['total_requests'] += 1
193
+
194
+ return selected_endpoint
195
+
196
+ def mark_success(self, endpoint: str, response_time: float = 0.0):
197
+ """
198
+ Mark a request to an endpoint as successful.
199
+
200
+ Args:
201
+ endpoint: successful endpoint URL
202
+ response_time: response time (seconds)
203
+ """
204
+ with self.lock:
205
+ if endpoint in self.endpoint_status:
206
+ status = self.endpoint_status[endpoint]
207
+ status['successful_requests'] += 1
208
+ status['consecutive_errors'] = 0
209
+ status['total_response_time'] += response_time
210
+
211
+ def mark_error(self, endpoint: str, error_type: str = "server_error"):
212
+ """
213
+ Mark a request to an endpoint as failed.
214
+
215
+ Args:
216
+ endpoint: failed endpoint URL
217
+ error_type: error type ("server_error", "timeout", "connection_error", "other")
218
+ """
219
+ import time
220
+
221
+ with self.lock:
222
+ if endpoint in self.endpoint_status:
223
+ status = self.endpoint_status[endpoint]
224
+ status['error_count'] += 1
225
+ status['consecutive_errors'] += 1
226
+ status['last_error_time'] = time.time()
227
+
228
+ def get_status(self) -> Dict:
229
+ """get pool status information (for debugging)"""
230
+ with self.lock:
231
+ return {
232
+ 'total_endpoints': len(self.endpoints),
233
+ 'current_round_progress': f"{len(self.used_indices)}/{len(self.endpoints)}",
234
+ 'endpoints_status': {
235
+ endpoint: {
236
+ 'error_count': status['error_count'],
237
+ 'successful_requests': status['successful_requests'],
238
+ 'total_requests': status['total_requests'],
239
+ 'success_rate': (
240
+ status['successful_requests'] / status['total_requests']
241
+ if status['total_requests'] > 0 else 0.0
242
+ ),
243
+ 'avg_response_time': (
244
+ status['total_response_time'] / status['successful_requests']
245
+ if status['successful_requests'] > 0 else 0.0
246
+ ),
247
+ 'consecutive_errors': status['consecutive_errors'],
248
+ 'last_error_time': status['last_error_time'],
249
+ }
250
+ for endpoint, status in self.endpoint_status.items()
251
+ }
252
+ }
253
+
254
+ def reset_round(self):
255
+ """reset current round (force start a new round)"""
256
+ with self.lock:
257
+ self.used_indices = []
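The round-robin branch of `get_endpoint()` reduces to a single index that wraps modulo the pool size under a lock. A minimal sketch under that assumption — the `RoundRobin` class name is illustrative, not part of the module:

```python
from threading import Lock
from typing import List

class RoundRobin:
    """Minimal round-robin selector: each call returns the next endpoint in order."""

    def __init__(self, endpoints: List[str]):
        self.endpoints = endpoints
        self.index = 0
        self.lock = Lock()  # same thread-safety guarantee as VLLMEndpointPool

    def next(self) -> str:
        with self.lock:
            endpoint = self.endpoints[self.index]
            # wrap around so requests are distributed uniformly across the pool
            self.index = (self.index + 1) % len(self.endpoints)
            return endpoint

rr = RoundRobin(["http://a:8000/v1", "http://b:8000/v1"])
picks = [rr.next() for _ in range(4)]
# picks → ["http://a:8000/v1", "http://b:8000/v1", "http://a:8000/v1", "http://b:8000/v1"]
```

The lock makes selection safe when many worker threads request endpoints concurrently; the smart-selection mode layers scoring on top of this same index bookkeeping.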
shared/utils/vllm_service.py ADDED
@@ -0,0 +1,314 @@
1
+ """
2
+ Simplified vLLM service for review system
3
+
4
+ This service only handles API calls, no load balancing logic.
5
+ Load balancing should be handled at the deployment service level (e.g., nginx reverse proxy).
6
+ """
7
+ import os
8
+ import time
9
+ import random
10
+ import yaml
11
+ from pathlib import Path
12
+ from typing import List, Dict, Optional, Any, Union
13
+ from threading import Semaphore, Lock as ThreadLock
14
+ from openai import OpenAI
15
+ from .llm_service import LLMService, ChatMessage
16
+
17
+
18
+ class VLLMService(LLMService):
19
+ """
20
+ Simplified vLLM service wrapper for local LLM deployment
21
+
22
+ This service connects to a vLLM server endpoint.
23
+ Load balancing should be handled at the deployment level (e.g., nginx, multiple services behind a load balancer).
24
+
25
+ Features:
26
+ - Simple API calls to a single endpoint
27
+ - Automatic retry with exponential backoff for 500 errors
28
+ - Configurable max concurrent requests (per service instance)
29
+ """
30
+
31
+ # Class-level semaphore for rate limiting (shared across all instances of this service)
32
+ # Use lazy initialization to avoid pickle issues with multiprocessing
33
+ _request_semaphore: Optional[Semaphore] = None
34
+ _max_concurrent_requests: int = 8 # Default limit
35
+ _semaphore_lock = ThreadLock() # Thread-safe initialization lock
36
+
37
+ def __init__(
38
+ self,
39
+ base_url: Optional[str] = None,
40
+ api_key: Optional[str] = None,
41
+ model_name: Optional[str] = None,
42
+ timeout: Optional[int] = None,
43
+ config_file: Optional[str] = None,
44
+ max_concurrent_requests: Optional[int] = None,
45
+ max_retries: int = 3,
46
+ retry_delay: float = 1.0,
47
+ retry_backoff: float = 2.0,
48
+ ):
49
+ """
50
+ Initialize vLLM service
51
+
52
+ Args:
53
+ base_url: vLLM server base URL (default: from config or http://localhost:8000/v1)
54
+ api_key: API key (overrides config)
55
+ model_name: Model name identifier (overrides config)
56
+ timeout: Request timeout in seconds (overrides config)
57
+ config_file: Path to config file (default: configs/llm_service_config.yaml)
58
+ max_concurrent_requests: Maximum concurrent requests per service instance (default: 8)
59
+ max_retries: Maximum number of retries for failed requests (default: 3)
60
+ retry_delay: Initial retry delay in seconds (default: 1.0)
61
+ retry_backoff: Retry delay multiplier (default: 2.0)
62
+ """
63
+ # Load config from YAML
64
+ config = self._load_config(config_file)
65
+ vllm_config = config.get("vllm", {})
66
+
67
+ # Use provided values or fall back to config, then environment variables
68
+ self.base_url = base_url or vllm_config.get("base_url") or os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")
69
+ self.model_name = model_name or vllm_config.get("model_name", "Qwen/Qwen3-4B-Instruct-2507")
70
+ self.api_key = api_key or vllm_config.get("api_key", "dummy-key")
71
+ self.timeout = timeout or vllm_config.get("timeout", 300)
72
+
73
+ # Retry configuration
74
+ self.max_retries = max_retries
75
+ self.retry_delay = retry_delay
76
+ self.retry_backoff = retry_backoff
77
+
78
+ # Rate limiting: Initialize class-level semaphore if not already initialized
79
+ # Use lazy initialization with thread-safe check to avoid pickle issues
80
+ if max_concurrent_requests is not None:
81
+ VLLMService._max_concurrent_requests = max_concurrent_requests
82
+ else:
83
+ # Try to get from config
84
+ config_max_concurrent = vllm_config.get("max_concurrent_requests")
85
+ if config_max_concurrent is not None:
86
+ VLLMService._max_concurrent_requests = config_max_concurrent
87
+
88
+ # Lazy initialization of semaphore will happen on first use
89
+ # This avoids pickle issues when using multiprocessing/ThreadPoolExecutor
90
+
91
+ # Store default sampling parameters from config
92
+ self.default_temperature = vllm_config.get("temperature", 0.7)
93
+ self.default_top_p = vllm_config.get("top_p", 0.8)
94
+ self.default_top_k = vllm_config.get("top_k", 20)
95
+ self.default_max_tokens = vllm_config.get("max_tokens", 16384)
96
+ self.default_presence_penalty = vllm_config.get("presence_penalty", 0.0)
97
+
98
+ # Create OpenAI client
99
+ self.client = OpenAI(
100
+ api_key=self.api_key,
101
+ base_url=self.base_url,
102
+ timeout=self.timeout,
103
+ )
104
+
105
+ @staticmethod
106
+ def _load_config(config_file: Optional[str] = None) -> Dict[str, Any]:
107
+ """
108
+ Load configuration from YAML file
109
+
110
+ Args:
111
+ config_file: Path to config file
112
+
113
+ Returns:
114
+ Configuration dictionary
115
+ """
116
+ if config_file is None:
117
+ project_root = Path(__file__).parent.parent.parent
118
+ config_file = project_root / "shared" / "configs" / "llm_service_config.yaml"
119
+
120
+ config_path = Path(config_file)
121
+ if not config_path.exists():
122
+ # Return defaults if config file doesn't exist
123
+ return {
124
+ "vllm": {
125
+ "base_url": "http://localhost:8000/v1",
126
+ "api_key": "dummy-key",
127
+ "model_name": "Qwen/Qwen3-4B-Instruct-2507",
128
+ "timeout": 300,
129
+ "max_concurrent_requests": 8,
130
+ "max_retries": 3,
131
+ "retry_delay": 1.0,
132
+ "retry_backoff": 2.0,
133
+ "temperature": 0.7,
134
+ "top_p": 0.8,
135
+ "top_k": 20,
136
+ "max_tokens": 16384,
137
+ "presence_penalty": 0.0,
138
+ }
139
+ }
140
+
141
+ with open(config_path, 'r', encoding='utf-8') as f:
142
+ return yaml.safe_load(f) or {}
143
+
144
+ @classmethod
145
+ def _ensure_semaphore(cls):
146
+ """Thread-safe lazy initialization of semaphore to avoid pickle issues"""
147
+ if cls._request_semaphore is None:
148
+ with cls._semaphore_lock:
149
+ # Double-check pattern
150
+ if cls._request_semaphore is None:
151
+ cls._request_semaphore = Semaphore(cls._max_concurrent_requests)
152
+
153
+ def _format_messages(self, messages: List[Union[ChatMessage, Dict[str, str]]]) -> List[Dict[str, str]]:
154
+ """Format messages for OpenAI API"""
155
+ formatted = []
156
+ for msg in messages:
157
+ if isinstance(msg, ChatMessage):
158
+ formatted.append({"role": msg.role, "content": msg.content})
159
+ elif isinstance(msg, dict):
160
+ formatted.append(msg)
161
+ else:
162
+ raise ValueError(f"Invalid message type: {type(msg)}")
163
+ return formatted
164
+
165
+ def generate(
166
+ self,
167
+ messages: List[Union[ChatMessage, Dict[str, str]]],
168
+ temperature: Optional[float] = None,
169
+ top_p: Optional[float] = None,
170
+ top_k: Optional[int] = None,
171
+ max_tokens: Optional[int] = None,
172
+ presence_penalty: Optional[float] = None,
173
+ **kwargs
174
+ ) -> str:
175
+ """
176
+ Generate text from messages
177
+
178
+ Args:
179
+ messages: List of chat messages
180
+ temperature: Sampling temperature (uses config default if None)
181
+ top_p: Top-p sampling parameter (uses config default if None)
182
+ top_k: Top-k sampling parameter (uses config default if None)
183
+ max_tokens: Maximum tokens to generate (uses config default if None)
184
+ presence_penalty: Presence penalty (uses config default if None)
185
+ **kwargs: Additional parameters
186
+
187
+ Returns:
188
+ Generated text
189
+ """
190
+ formatted_messages = self._format_messages(messages)
191
+
192
+ # Use provided values or fall back to config defaults
193
+ temperature = temperature if temperature is not None else self.default_temperature
194
+ top_p = top_p if top_p is not None else self.default_top_p
195
+ max_tokens = max_tokens if max_tokens is not None else self.default_max_tokens
196
+ presence_penalty = presence_penalty if presence_penalty is not None else self.default_presence_penalty
197
+
198
+ # Ensure semaphore is initialized (lazy, thread-safe)
199
+ self._ensure_semaphore()
200
+
201
+ # Use semaphore to limit concurrent requests
202
+ with VLLMService._request_semaphore:
203
+ last_exception = None
204
+
205
+ for retry_attempt in range(self.max_retries + 1):
206
+ try:
207
+ response = self.client.chat.completions.create(
208
+ model=self.model_name,
209
+ messages=formatted_messages,
210
+ temperature=temperature,
211
+ top_p=top_p,
212
+ max_tokens=max_tokens,
213
+ presence_penalty=presence_penalty,
214
+ **kwargs
215
+ )
216
+
217
+ return response.choices[0].message.content
218
+
219
+ except Exception as e:
220
+ last_exception = e
221
+
222
+ # Check if it's a server error (500, 502, 503, 504) that we should retry
223
+ should_retry = False
224
+ error_str = str(e).lower()
225
+
226
+ if any(code in error_str for code in ["500", "502", "503", "504"]):
227
+ should_retry = True
228
+ elif "server error" in error_str or "internal server error" in error_str:
229
+ should_retry = True
230
+
231
+ # Don't retry on last attempt
232
+ if retry_attempt < self.max_retries and should_retry:
233
+ # Calculate delay with exponential backoff and jitter
234
+ delay = self.retry_delay * (self.retry_backoff ** retry_attempt)
235
+ jitter = random.uniform(0, delay * 0.1) # 10% jitter
236
+ time.sleep(delay + jitter)
237
+ continue
238
+ else:
239
+ # Either not a retryable error or out of retries
240
+ raise last_exception
241
+
242
+ def stream_generate(
243
+ self,
244
+ messages: List[Union[ChatMessage, Dict[str, str]]],
245
+ temperature: Optional[float] = None,
246
+ top_p: Optional[float] = None,
247
+ top_k: Optional[int] = None,
248
+ max_tokens: Optional[int] = None,
249
+ presence_penalty: Optional[float] = None,
250
+ **kwargs
251
+ ):
252
+ """
253
+ Stream generate text from messages
254
+
255
+ Yields:
256
+ Generated text chunks
257
+ """
258
+ formatted_messages = self._format_messages(messages)
259
+
260
+ # Use provided values or fall back to config defaults
261
+ temperature = temperature if temperature is not None else self.default_temperature
262
+ top_p = top_p if top_p is not None else self.default_top_p
263
+ max_tokens = max_tokens if max_tokens is not None else self.default_max_tokens
264
+ presence_penalty = presence_penalty if presence_penalty is not None else self.default_presence_penalty
265
+
266
+ # Ensure semaphore is initialized (lazy, thread-safe)
267
+ self._ensure_semaphore()
268
+
269
+ # Use semaphore to limit concurrent requests
270
+ with VLLMService._request_semaphore:
271
+ last_exception = None
272
+
273
+ for retry_attempt in range(self.max_retries + 1):
274
+ try:
275
+ stream = self.client.chat.completions.create(
276
+ model=self.model_name,
277
+ messages=formatted_messages,
278
+ temperature=temperature,
279
+ top_p=top_p,
280
+ max_tokens=max_tokens,
281
+ presence_penalty=presence_penalty,
282
+ stream=True,
283
+ **kwargs
284
+ )
285
+
286
+ # Stream chunks
287
+ for chunk in stream:
288
+ if chunk.choices[0].delta.content:
289
+ yield chunk.choices[0].delta.content
290
+
291
+ return # Success, exit retry loop
292
+
293
+ except Exception as e:
294
+ last_exception = e
295
+
296
+ # Check if it's a server error that we should retry
297
+ should_retry = False
298
+ error_str = str(e).lower()
299
+
300
+ if any(code in error_str for code in ["500", "502", "503", "504"]):
301
+ should_retry = True
302
+ elif "server error" in error_str or "internal server error" in error_str:
303
+ should_retry = True
304
+
305
+ # Don't retry on last attempt
306
+ if retry_attempt < self.max_retries and should_retry:
307
+ # Calculate delay with exponential backoff and jitter
308
+ delay = self.retry_delay * (self.retry_backoff ** retry_attempt)
309
+ jitter = random.uniform(0, delay * 0.1) # 10% jitter
310
+ time.sleep(delay + jitter)
311
+ continue
312
+ else:
313
+ # Either not a retryable error or out of retries
314
+ raise last_exception
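The retry schedule used by both `generate()` and `stream_generate()` — `retry_delay * retry_backoff ** attempt` plus up to 10% jitter — can be previewed in isolation. `backoff_delays` is a hypothetical helper written only for this sketch; the service computes the same quantities inline:

```python
import random
from typing import List

def backoff_delays(retry_delay: float = 1.0, retry_backoff: float = 2.0,
                   max_retries: int = 3, seed: int = 0) -> List[float]:
    """Return the sleep duration before each retry: exponential base plus 10% jitter."""
    rng = random.Random(seed)  # seeded here only to make the sketch reproducible
    delays = []
    for attempt in range(max_retries):
        base = retry_delay * (retry_backoff ** attempt)
        delays.append(base + rng.uniform(0, base * 0.1))
    return delays

delays = backoff_delays()
# roughly 1s, 2s, 4s, each nudged upward by at most 10%
```

The jitter spreads out retries from concurrent workers so a recovering endpoint is not hit by a synchronized burst of requests.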
shared/utils/vllm_service_simple.py ADDED
@@ -0,0 +1,314 @@
1
+ """
2
+ Simplified vLLM service for review system
3
+
4
+ This service handles API calls only; it contains no load-balancing logic.
5
+ Load balancing should be handled at the deployment service level (e.g., nginx reverse proxy).
6
+ """
7
+ import os
8
+ import time
9
+ import random
10
+ import yaml
11
+ from pathlib import Path
12
+ from typing import List, Dict, Optional, Any, Union
13
+ from threading import Semaphore, Lock as ThreadLock
14
+ from openai import OpenAI
15
+ from .llm_service import LLMService, ChatMessage
16
+
17
+
18
+ class VLLMService(LLMService):
19
+ """
20
+ Simplified vLLM service wrapper for local LLM deployment
21
+
22
+ This service connects to a vLLM server endpoint.
23
+ Load balancing should be handled at the deployment level (e.g., nginx, multiple services behind a load balancer).
24
+
25
+ Features:
26
+ - Simple API calls to a single endpoint
27
+ - Automatic retry with exponential backoff for 500 errors
28
+ - Configurable max concurrent requests (per service instance)
29
+ """
30
+
31
+ # Class-level semaphore for rate limiting (shared across all instances of this service)
32
+ # Use lazy initialization to avoid pickle issues with multiprocessing
33
+ _request_semaphore: Optional[Semaphore] = None
34
+ _max_concurrent_requests: int = 8 # Default limit
35
+ _semaphore_lock = ThreadLock() # Thread-safe initialization lock
36
+
37
+ def __init__(
38
+ self,
39
+ base_url: Optional[str] = None,
40
+ api_key: Optional[str] = None,
41
+ model_name: Optional[str] = None,
42
+ timeout: Optional[int] = None,
43
+ config_file: Optional[str] = None,
44
+ max_concurrent_requests: Optional[int] = None,
45
+ max_retries: int = 3,
46
+ retry_delay: float = 1.0,
47
+ retry_backoff: float = 2.0,
48
+ ):
49
+ """
50
+ Initialize vLLM service
51
+
52
+ Args:
53
+ base_url: vLLM server base URL (default: from config or http://localhost:8000/v1)
54
+ api_key: API key (overrides config)
55
+ model_name: Model name identifier (overrides config)
56
+ timeout: Request timeout in seconds (overrides config)
57
+ config_file: Path to config file (default: configs/llm_service_config.yaml)
58
+ max_concurrent_requests: Maximum concurrent requests per service instance (default: 8)
59
+ max_retries: Maximum number of retries for failed requests (default: 3)
60
+ retry_delay: Initial retry delay in seconds (default: 1.0)
61
+ retry_backoff: Retry delay multiplier (default: 2.0)
62
+ """
63
+ # Load config from YAML
64
+ config = self._load_config(config_file)
65
+ vllm_config = config.get("vllm", {})
66
+
67
+ # Use provided values or fall back to config, then environment variables
68
+ self.base_url = base_url or vllm_config.get("base_url") or os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")
69
+ self.model_name = model_name or vllm_config.get("model_name", "Qwen/Qwen3-4B-Instruct-2507")
70
+ self.api_key = api_key or vllm_config.get("api_key", "dummy-key")
71
+ self.timeout = timeout or vllm_config.get("timeout", 300)
72
+
73
+ # Retry configuration
74
+ self.max_retries = max_retries
75
+ self.retry_delay = retry_delay
76
+ self.retry_backoff = retry_backoff
77
+
78
+ # Rate limiting: Initialize class-level semaphore if not already initialized
79
+ # Use lazy initialization with thread-safe check to avoid pickle issues
80
+ if max_concurrent_requests is not None:
81
+ VLLMService._max_concurrent_requests = max_concurrent_requests
82
+ else:
83
+ # Try to get from config
84
+ config_max_concurrent = vllm_config.get("max_concurrent_requests")
85
+ if config_max_concurrent is not None:
86
+ VLLMService._max_concurrent_requests = config_max_concurrent
87
+
88
+ # Lazy initialization of semaphore will happen on first use
89
+ # This avoids pickle issues when using multiprocessing/ThreadPoolExecutor
90
+
91
+ # Store default sampling parameters from config
92
+ self.default_temperature = vllm_config.get("temperature", 0.7)
93
+ self.default_top_p = vllm_config.get("top_p", 0.8)
94
+ self.default_top_k = vllm_config.get("top_k", 20)
95
+ self.default_max_tokens = vllm_config.get("max_tokens", 16384)
96
+ self.default_presence_penalty = vllm_config.get("presence_penalty", 0.0)
97
+
98
+ # Create OpenAI client
99
+ self.client = OpenAI(
100
+ api_key=self.api_key,
101
+ base_url=self.base_url,
102
+ timeout=self.timeout,
103
+ )
104
+
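`__init__` resolves each setting through an `or`-chain: explicit argument, then YAML config, then environment variable or hard-coded default. A minimal sketch of that precedence (the helper name `resolve_setting` is ours); note that `or`-chaining treats falsy values such as `0` or `""` as unset:

```python
import os

def resolve_setting(explicit, config, key, env_var, default):
    """Mirrors the or-chain in __init__: argument > config > env > default.
    Caveat: falsy values (0, "", None) are treated as unset."""
    return explicit or config.get(key) or os.environ.get(env_var) or default

cfg = {"base_url": "http://10.0.0.5:8000/v1"}
print(resolve_setting(None, cfg, "base_url", "VLLM_BASE_URL", "http://localhost:8000/v1"))
```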
105
+ @staticmethod
106
+ def _load_config(config_file: Optional[str] = None) -> Dict[str, Any]:
107
+ """
108
+ Load configuration from YAML file
109
+
110
+ Args:
111
+ config_file: Path to config file
112
+
113
+ Returns:
114
+ Configuration dictionary
115
+ """
116
+ if config_file is None:
117
+ project_root = Path(__file__).parent.parent.parent
118
+ config_file = project_root / "shared" / "configs" / "llm_service_config.yaml"
119
+
120
+ config_path = Path(config_file)
121
+ if not config_path.exists():
122
+ # Return defaults if config file doesn't exist
123
+ return {
124
+ "vllm": {
125
+ "base_url": "http://localhost:8000/v1",
126
+ "api_key": "dummy-key",
127
+ "model_name": "Qwen/Qwen3-4B-Instruct-2507",
128
+ "timeout": 300,
129
+ "max_concurrent_requests": 8,
130
+ "max_retries": 3,
131
+ "retry_delay": 1.0,
132
+ "retry_backoff": 2.0,
133
+ "temperature": 0.7,
134
+ "top_p": 0.8,
135
+ "top_k": 20,
136
+ "max_tokens": 16384,
137
+ "presence_penalty": 0.0,
138
+ }
139
+ }
140
+
141
+ with open(config_path, 'r', encoding='utf-8') as f:
142
+ return yaml.safe_load(f) or {}
143
+
144
+ @classmethod
145
+ def _ensure_semaphore(cls):
146
+ """Thread-safe lazy initialization of semaphore to avoid pickle issues"""
147
+ if cls._request_semaphore is None:
148
+ with cls._semaphore_lock:
149
+ # Double-check pattern
150
+ if cls._request_semaphore is None:
151
+ cls._request_semaphore = Semaphore(cls._max_concurrent_requests)
152
+
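`_ensure_semaphore` uses the double-checked locking pattern: a lock-free fast path once initialized, and a re-check under the lock so racing threads build exactly one Semaphore. A standalone illustration (the class name `LazySemaphore` and the `_creations` counter are ours, for instrumentation only):

```python
from threading import Lock, Semaphore, Thread

class LazySemaphore:
    """Standalone illustration of the double-checked locking in _ensure_semaphore."""
    _sem = None
    _lock = Lock()
    _creations = 0  # instrumentation: how many times the Semaphore was built

    @classmethod
    def ensure(cls):
        if cls._sem is None:          # fast path: no lock once initialized
            with cls._lock:
                if cls._sem is None:  # re-check under the lock
                    cls._creations += 1
                    cls._sem = Semaphore(8)
        return cls._sem

threads = [Thread(target=LazySemaphore.ensure) for _ in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(LazySemaphore._creations)  # prints 1: only one thread builds the semaphore
```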
153
+ def _format_messages(self, messages: List[Union[ChatMessage, Dict[str, str]]]) -> List[Dict[str, str]]:
154
+ """Format messages for OpenAI API"""
155
+ formatted = []
156
+ for msg in messages:
157
+ if isinstance(msg, ChatMessage):
158
+ formatted.append({"role": msg.role, "content": msg.content})
159
+ elif isinstance(msg, dict):
160
+ formatted.append(msg)
161
+ else:
162
+ raise ValueError(f"Invalid message type: {type(msg)}")
163
+ return formatted
164
+
165
+ def generate(
166
+ self,
167
+ messages: List[Union[ChatMessage, Dict[str, str]]],
168
+ temperature: Optional[float] = None,
169
+ top_p: Optional[float] = None,
170
+ top_k: Optional[int] = None,
171
+ max_tokens: Optional[int] = None,
172
+ presence_penalty: Optional[float] = None,
173
+ **kwargs
174
+ ) -> str:
175
+ """
176
+ Generate text from messages
177
+
178
+ Args:
179
+ messages: List of chat messages
180
+ temperature: Sampling temperature (uses config default if None)
181
+ top_p: Top-p sampling parameter (uses config default if None)
182
+ top_k: Top-k sampling parameter (uses config default if None)
183
+ max_tokens: Maximum tokens to generate (uses config default if None)
184
+ presence_penalty: Presence penalty (uses config default if None)
185
+ **kwargs: Additional parameters
186
+
187
+ Returns:
188
+ Generated text
189
+ """
190
+ formatted_messages = self._format_messages(messages)
191
+
192
+ # Use provided values or fall back to config defaults
193
+ temperature = temperature if temperature is not None else self.default_temperature
194
+ top_p = top_p if top_p is not None else self.default_top_p
195
+ max_tokens = max_tokens if max_tokens is not None else self.default_max_tokens
196
+ presence_penalty = presence_penalty if presence_penalty is not None else self.default_presence_penalty
197
+
198
+ # Ensure semaphore is initialized (lazy, thread-safe)
199
+ self._ensure_semaphore()
200
+
201
+ # Use semaphore to limit concurrent requests
202
+ with VLLMService._request_semaphore:
203
+ last_exception = None
204
+
205
+ for retry_attempt in range(self.max_retries + 1):
206
+ try:
207
+ response = self.client.chat.completions.create(
208
+ model=self.model_name,
209
+ messages=formatted_messages,
210
+ temperature=temperature,
211
+ top_p=top_p,
212
+ max_tokens=max_tokens,
213
+ presence_penalty=presence_penalty,
214
+ **kwargs
215
+ )
216
+
217
+ return response.choices[0].message.content
218
+
219
+ except Exception as e:
220
+ last_exception = e
221
+
222
+ # Check if it's a server error (500, 502, 503, 504) that we should retry
223
+ should_retry = False
224
+ error_str = str(e).lower()
225
+
226
+ if any(code in error_str for code in ["500", "502", "503", "504"]):
227
+ should_retry = True
228
+ elif "server error" in error_str or "internal server error" in error_str:
229
+ should_retry = True
230
+
231
+ # Don't retry on last attempt
232
+ if retry_attempt < self.max_retries and should_retry:
233
+ # Calculate delay with exponential backoff and jitter
234
+ delay = self.retry_delay * (self.retry_backoff ** retry_attempt)
235
+ jitter = random.uniform(0, delay * 0.1) # 10% jitter
236
+ time.sleep(delay + jitter)
237
+ continue
238
+ else:
239
+ # Either not a retryable error or out of retries
240
+ raise last_exception
241
+
242
+ def stream_generate(
243
+ self,
244
+ messages: List[Union[ChatMessage, Dict[str, str]]],
245
+ temperature: Optional[float] = None,
246
+ top_p: Optional[float] = None,
247
+ top_k: Optional[int] = None,
248
+ max_tokens: Optional[int] = None,
249
+ presence_penalty: Optional[float] = None,
250
+ **kwargs
251
+ ):
252
+ """
253
+ Stream generate text from messages
254
+
255
+ Yields:
256
+ Generated text chunks
257
+ """
258
+ formatted_messages = self._format_messages(messages)
259
+
260
+ # Use provided values or fall back to config defaults
261
+ temperature = temperature if temperature is not None else self.default_temperature
262
+ top_p = top_p if top_p is not None else self.default_top_p
263
+ max_tokens = max_tokens if max_tokens is not None else self.default_max_tokens
264
+ presence_penalty = presence_penalty if presence_penalty is not None else self.default_presence_penalty
265
+
266
+ # Ensure semaphore is initialized (lazy, thread-safe)
267
+ self._ensure_semaphore()
268
+
269
+ # Use semaphore to limit concurrent requests
270
+ with VLLMService._request_semaphore:
271
+ last_exception = None
272
+
273
+ for retry_attempt in range(self.max_retries + 1):
274
+ try:
275
+ stream = self.client.chat.completions.create(
276
+ model=self.model_name,
277
+ messages=formatted_messages,
278
+ temperature=temperature,
279
+ top_p=top_p,
280
+ max_tokens=max_tokens,
281
+ presence_penalty=presence_penalty,
282
+ stream=True,
283
+ **kwargs
284
+ )
285
+
286
+ # Stream chunks
287
+ for chunk in stream:
288
+ if chunk.choices[0].delta.content:
289
+ yield chunk.choices[0].delta.content
290
+
291
+ return # Success, exit retry loop
292
+
293
+ except Exception as e:
294
+ last_exception = e
295
+
296
+ # Check if it's a server error that we should retry
297
+ should_retry = False
298
+ error_str = str(e).lower()
299
+
300
+ if any(code in error_str for code in ["500", "502", "503", "504"]):
301
+ should_retry = True
302
+ elif "server error" in error_str or "internal server error" in error_str:
303
+ should_retry = True
304
+
305
+ # Don't retry on last attempt
306
+ if retry_attempt < self.max_retries and should_retry:
307
+ # Calculate delay with exponential backoff and jitter
308
+ delay = self.retry_delay * (self.retry_backoff ** retry_attempt)
309
+ jitter = random.uniform(0, delay * 0.1) # 10% jitter
310
+ time.sleep(delay + jitter)
311
+ continue
312
+ else:
313
+ # Either not a retryable error or out of retries
314
+ raise last_exception
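With the defaults above (`retry_delay=1.0`, `retry_backoff=2.0`, `max_retries=3`), the sleep between attempts grows as 1s, 2s, 4s, each plus up to 10% random jitter. A small sketch of that schedule (the function name `backoff_delays` is ours):

```python
import random

def backoff_delays(max_retries=3, retry_delay=1.0, retry_backoff=2.0, seed=0):
    """Delays (seconds) the retry loop above would sleep between attempts:
    retry_delay * retry_backoff**attempt, plus up to 10% random jitter."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_retries):
        base = retry_delay * (retry_backoff ** attempt)
        delays.append(base + rng.uniform(0, base * 0.1))
    return delays

print(backoff_delays())  # roughly [1.0, 2.0, 4.0] plus jitter
```

The jitter spreads out retries from many concurrent workers so they do not hammer a recovering server in lockstep.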
src/__init__.py ADDED
@@ -0,0 +1,6 @@
1
+ """
2
+ Unified Review System
3
+
4
+ A comprehensive system for paper review generation and evaluation.
5
+ """
6
+ __version__ = "1.0.0"
src/evaluator/1_get_rubrics.py ADDED
@@ -0,0 +1,601 @@
1
+ """
2
+ Generate review-based rubrics by querying LLMs with concurrent requests.
3
+
4
+ This script:
5
+ 1. Reads the JSON file with review data
6
+ 2. Extracts entries with 'id', 'pred_fast_mode_baseline', 'paper_context', and 'decision'
7
+ 3. Loads the rubric generation prompt from prompts.yaml
8
+ 4. Loads LLM configuration from configs.yaml (supports gpt and vllm modes)
9
+ 5. For each entry, generates rubrics by replacing <<golden_review>> with the ground truth review
10
+ 6. Uses concurrent requests (ThreadPoolExecutor) for efficient LLM queries
11
+ 7. Extracts rubrics from LLM responses and saves to eval_rubrics.json
12
+
13
+ Output JSON file (eval_rubrics.json) contains a list of dicts with:
14
+ - id: Entry identifier
15
+ - paper_context: Paper content
16
+ - decision: Decision field from input
17
+ - golden_review: The pred_fast_mode_baseline review (ground truth)
18
+ - rubrics: List of rubric objects, each with title, description, and weight
19
+
20
+ Usage:
21
+ python 1_get_rubrics.py \
22
+ --json_path input.json \
23
+ --output_path eval_rubrics.json \
24
+ --yaml_path prompts.yaml \
25
+ --config_path configs.yaml \
26
+ --max_workers 5
27
+
28
+ The configs.yaml should specify either "gpt" or "vllm" mode and corresponding settings.
29
+ """
30
+ import json
31
+ import os
32
+ import sys
33
+ import argparse
34
+ import yaml
35
+ from typing import Dict, List, Any, Optional
36
+ from concurrent.futures import ThreadPoolExecutor, as_completed
37
+ from tqdm import tqdm
38
+ import pandas as pd
39
+ from dotenv import load_dotenv
40
+
41
+ # Add parent directory to path to import llm_service
42
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
43
+ # Import parse_llm_response from local llm_service module (for parsing LLM responses)
44
+ import llm_service as local_llm_service
45
+ parse_llm_response = local_llm_service.parse_llm_response
46
+
47
+ # Import from shared/utils for gpt/vllm support
48
+ # Add project root to path to enable absolute imports from shared.utils
49
+ project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
50
+ if project_root not in sys.path:
51
+ sys.path.insert(0, project_root)
52
+
53
+ # Use absolute imports from shared.utils package
54
+ from shared.utils.llm_service import LLMService
55
+ from shared.utils.vllm_service import VLLMService
56
+ from shared.utils.gpt_service import GPTService
57
+
58
+ # Load environment variables
59
+ load_dotenv()
60
+
61
+ class ReviewProcessor:
62
+ """Handles the extraction and processing of reviews from different sources."""
63
+
64
+ @staticmethod
65
+ def extract_review_content(pred_context):
66
+ """
67
+ Extract the review content from the prediction context.
68
+
69
+ Args:
70
+ pred_context: Raw prediction data that contains the review
71
+
72
+ Returns:
73
+ str: Extracted review content
74
+ """
75
+ try:
76
+ # First attempt to extract from boxed format
77
+ return pred_context.split(r'\boxed_review{')[-1].split('\n}')[0]
78
+ except Exception:
79
+ # Alternative extraction if the first method fails
80
+ if isinstance(pred_context, dict) and 'output' in pred_context:
81
+ return pred_context['output'].split(r'\boxed_review{')[-1].split('\n}')[0]
82
+ else:
83
+ # Return as is if extraction fails
84
+ return pred_context
85
+
86
+
87
+
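`extract_review_content`'s happy path pulls the text between `\boxed_review{` and the next `\n}` with a chained split, and falls back to returning the input unchanged when the marker is absent. A minimal demo of that split chain (the helper name `extract_boxed` is ours):

```python
def extract_boxed(pred: str) -> str:
    # Same split chain as extract_review_content's happy path
    return pred.split(r'\boxed_review{')[-1].split('\n}')[0]

raw = "preamble \\boxed_review{\nSolid paper; accept.\n} trailing"
print(extract_boxed(raw).strip())  # prints: Solid paper; accept.
```

Because `split` on a missing separator returns the whole string, input without the marker passes through unchanged, matching the method's fallback behavior.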
88
+ def load_json_data(json_path: str) -> List[Dict[str, Any]]:
89
+ """
90
+ Load JSON data from file.
91
+ Handles both list and dict formats.
92
+ """
93
+ with open(json_path, 'r', encoding='utf-8') as f:
94
+ data = json.load(f)
95
+
96
+ # Convert dict to list if needed
97
+ if isinstance(data, dict):
98
+ data = list(data.values())
99
+
100
+ return data
101
+
102
+
103
+ def load_prompt_template(yaml_path: str) -> str:
104
+ """
105
+ Load the rubric generation prompt from YAML file.
106
+ """
107
+ with open(yaml_path, 'r', encoding='utf-8') as f:
108
+ prompts = yaml.safe_load(f)
109
+
110
+ prompt_template = prompts.get('v2_rubric_generation_prompt', '')
111
+ rubric_template = prompts.get('rubrics', '')
112
+
113
+ prompt_template = prompt_template.replace('<<rubric_template>>', rubric_template)
114
+
115
+ return prompt_template
116
+
117
+
118
+ def clean_rubrics_json(json_str: str) -> str:
119
+ """
120
+ Clean JSON string by escaping unescaped double quotes inside string values.
121
+
122
+ This function handles cases where the model outputs double quotes inside
123
+ string values (especially in description fields) without proper escaping.
124
+
125
+ The expected format is a JSON array of objects with "title", "description", "weight" fields.
126
+ Strategy: Find each field's value and escape unescaped quotes inside it.
127
+ """
128
+ import re
129
+
130
+ # First, try to extract JSON array if wrapped in markdown code blocks
131
+ json_match = re.search(r'```json\s*(\[.*?\])\s*```', json_str, re.DOTALL)
132
+ if json_match:
133
+ json_str = json_match.group(1)
134
+ else:
135
+ # Try to find JSON array directly
136
+ json_match = re.search(r'(\[.*?\])', json_str, re.DOTALL)
137
+ if json_match:
138
+ json_str = json_match.group(1)
139
+
140
+ # Process the JSON string character by character to find and fix string values
141
+ # We'll look for patterns like "field": " and then find the matching closing quote
142
+ result = []
143
+ i = 0
144
+
145
+ while i < len(json_str):
146
+ # Look for field pattern: "field_name": "
147
+ field_match = re.search(r'"(title|description|weight)"\s*:\s*"', json_str[i:])
148
+ if not field_match:
149
+ # No more fields to process, append rest and break
150
+ result.append(json_str[i:])
151
+ break
152
+
153
+ # Append everything before the match
154
+ match_start = i + field_match.start()
155
+ result.append(json_str[i:match_start])
156
+
157
+ # Process the field value
158
+ value_start = i + field_match.end() # Position after opening quote
159
+
160
+ # Find the closing quote by scanning character by character
161
+ # The closing quote should be followed by comma, closing brace, or closing bracket
162
+ j = value_start
163
+ found_closing = False
164
+
165
+ while j < len(json_str):
166
+ if json_str[j] == '\\':
167
+ # Skip escaped character (could be \", \\, etc.)
168
+ if j + 1 < len(json_str):
169
+ j += 2
170
+ continue
171
+ else:
172
+ j += 1
173
+ break
174
+ elif json_str[j] == '"':
175
+ # Found a quote - check if it's the closing quote
176
+ # Look ahead (skip whitespace) to see if followed by comma, brace, or bracket
177
+ k = j + 1
178
+ while k < len(json_str) and json_str[k] in ' \t\n\r':
179
+ k += 1
180
+
181
+ if k < len(json_str) and json_str[k] in ',}]':
182
+ # This is the closing quote!
183
+ value_content = json_str[value_start:j]
184
+ closing_part = json_str[j:k+1] # " followed by , } or ]
185
+
186
+ # Fix unescaped quotes in value_content
187
+ # Strategy: preserve already-escaped quotes, escape others
188
+ fixed_content = value_content.replace('\\"', '__TEMP_ESC__')
189
+ fixed_content = fixed_content.replace('"', '\\"')
190
+ fixed_content = fixed_content.replace('__TEMP_ESC__', '\\"')
191
+
192
+ # Append the fixed field
193
+ result.append(json_str[match_start:value_start]) # "field": "
194
+ result.append(fixed_content) # fixed value content
195
+ result.append(closing_part) # " followed by punctuation
196
+
197
+ i = k + 1
198
+ found_closing = True
199
+ break
200
+ j += 1
201
+
202
+ if not found_closing:
203
+ # Couldn't find proper closing quote, append rest and break
204
+ result.append(json_str[match_start:])
205
+ break
206
+
207
+ return ''.join(result)
208
+
209
+
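The core trick in `clean_rubrics_json` is a three-step replace that escapes bare quotes inside a string value while leaving already-escaped ones alone: protect existing `\"` with a sentinel, escape the rest, then restore. Isolated (the helper name `escape_bare_quotes` is ours):

```python
import json

def escape_bare_quotes(value: str) -> str:
    """The sentinel trick from clean_rubrics_json: escape bare " characters
    while leaving already-escaped \\" sequences untouched."""
    tmp = value.replace('\\"', '__TEMP_ESC__')   # protect existing escapes
    tmp = tmp.replace('"', '\\"')                # escape everything else
    return tmp.replace('__TEMP_ESC__', '\\"')    # restore the protected ones

broken = 'calls the "novel" method'
print(json.loads('{"description": "' + escape_bare_quotes(broken) + '"}')["description"])
```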
210
+ def extract_rubrics_from_response(response: str) -> Optional[List[Dict[str, Any]]]:
211
+ """
212
+ Extract rubrics (JSON array) from LLM response.
213
+ Handles cases where description fields contain unescaped double quotes.
214
+ Returns None if parsing fails (silently, no error messages printed).
215
+ """
216
+ try:
217
+ # First, try using parse_llm_response (handles markdown blocks)
218
+ try:
219
+ parsed = parse_llm_response(response)
220
+
221
+ # Check if parsed result is a list (array of rubrics)
222
+ if isinstance(parsed, list):
223
+ return parsed
224
+
225
+ # If parsed result is a dict, check for common keys that might contain the array
226
+ if isinstance(parsed, dict):
227
+ # Check for common keys
228
+ for key in ['rubrics', 'rubric', 'items', 'criteria']:
229
+ if key in parsed and isinstance(parsed[key], list):
230
+ return parsed[key]
231
+
232
+ # If no key found, try to find the first list value
233
+ for value in parsed.values():
234
+ if isinstance(value, list):
235
+ return value
236
+ except Exception:
237
+ # parse_llm_response failed, try manual cleaning
238
+ pass
239
+
240
+ # If parse_llm_response failed, try manual extraction and cleaning
241
+ import re
242
+
243
+ # Try to find JSON array in response
244
+ json_match = re.search(r'\[.*?\]', response, re.DOTALL)
245
+ if json_match:
246
+ json_str = json_match.group(0)
247
+
248
+ # Try direct parsing first
249
+ try:
250
+ rubrics = json.loads(json_str)
251
+ if isinstance(rubrics, list):
252
+ return rubrics
253
+ except json.JSONDecodeError:
254
+ # JSON parsing failed, try cleaning
255
+ try:
256
+ cleaned_json = clean_rubrics_json(json_str)
257
+ rubrics = json.loads(cleaned_json)
258
+ if isinstance(rubrics, list):
259
+ return rubrics
260
+ except Exception:
261
+ # Last resort: try a more aggressive cleaning approach
262
+ try:
263
+ # Replace unescaped quotes in description fields more aggressively
264
+ # Pattern: "description": "..." where ... may contain quotes
265
+ def fix_description_quotes(match):
266
+ prefix = match.group(1) # "description": "
267
+ content = match.group(2) # the content
268
+ suffix = match.group(3) # closing quote
269
+
270
+ # Escape all quotes in content, but preserve escaped ones
271
+ # First, mark escaped quotes temporarily
272
+ content = content.replace('\\"', '__ESCAPED_QUOTE__')
273
+ # Escape all remaining quotes
274
+ content = content.replace('"', '\\"')
275
+ # Restore escaped quotes
276
+ content = content.replace('__ESCAPED_QUOTE__', '\\"')
277
+
278
+ return prefix + content + suffix
279
+
280
+ # More specific pattern for description field
281
+ desc_pattern = r'("description"\s*:\s*")(.*?)("(?:\s*[,}])?)'
282
+ fixed_json = re.sub(desc_pattern, fix_description_quotes, json_str, flags=re.DOTALL)
283
+
284
+ rubrics = json.loads(fixed_json)
285
+ if isinstance(rubrics, list):
286
+ return rubrics
287
+ except Exception:
288
+ pass
289
+
290
+ # If all else fails, return None (silently)
291
+ return None
292
+
293
+ except Exception:
294
+ # Any unexpected error, return None (silently)
295
+ return None
296
+
297
+
298
+ def generate_rubrics_for_entry(
299
+ entry: Dict[str, Any],
300
+ prompt_template: str,
301
+ llm_service: LLMService,
302
+ max_retries: int = 16
303
+ ) -> Dict[str, Any]:
304
+ """
305
+ Generate rubrics for a single entry with retry mechanism.
306
+
307
+ Args:
308
+ entry: Dictionary with 'id', 'pred_fast_mode_baseline', 'paper_context', 'decision'
309
+ prompt_template: Prompt template with <<golden_review>> placeholder
310
+ llm_service: LLMService instance (VLLMService or GPTService)
311
+ max_retries: Maximum number of retries if JSON parsing fails (default: 16)
312
+
313
+ Returns:
314
+ Dictionary with 'id', 'paper_context', 'decision', 'golden_review', 'rubrics' (list)
315
+ """
316
+ entry_id = entry.get('id', 'unknown')
317
+ golden_review = entry.get('pred_fast_mode_baseline', '')
318
+ paper_context = entry.get('paper_context', '')
319
+ decision = entry.get('decision', '')
320
+
321
+ # Replace placeholder in prompt template
322
+ prompt = prompt_template.replace('<<golden_review>>', golden_review)
323
+ prompt = prompt.replace('<<paper_context>>', paper_context)
324
+
325
+ # Convert prompt to messages format (shared/utils services use messages format)
326
+ messages = [{"role": "user", "content": prompt}]
327
+
328
+ # Retry loop for JSON parsing failures
329
+ last_error = None
330
+ for attempt in range(max_retries):
331
+ try:
332
+ # Generate response from LLM
333
+ response = llm_service.generate(messages=messages)
334
+
335
+ # Extract rubrics from response
336
+ rubrics_list = extract_rubrics_from_response(response)
337
+
338
+ # If successful, return the result (silently, no output during retries)
339
+ if rubrics_list is not None and isinstance(rubrics_list, list):
340
+ return {
341
+ 'id': entry_id,
342
+ 'paper_context': paper_context,
343
+ 'decision': decision,
344
+ 'golden_review': golden_review,
345
+ 'rubrics': rubrics_list
346
+ }
347
+
348
+ # If extraction failed, continue retrying silently
349
+ # Store the error message for the last attempt
350
+ if attempt == max_retries - 1:
351
+ last_error = "Failed to extract valid rubrics from response"
352
+
353
+ except Exception as e:
354
+ # Store the error (will be overwritten by subsequent attempts until the last one)
355
+ last_error = e
356
+
357
+ # All retries failed, output warning only once
358
+ if last_error:
359
+ print(f"[WARN] Failed to generate rubrics for entry {entry_id} after {max_retries} attempts: {last_error}")
360
+
361
+ # All retries failed, return with empty rubrics
362
+ result = {
363
+ 'id': entry_id,
364
+ 'paper_context': paper_context,
365
+ 'decision': decision,
366
+ 'golden_review': golden_review,
367
+ 'rubrics': [] # Empty list as fallback
368
+ }
369
+ if last_error:
370
+ result['error'] = str(last_error)
371
+ return result
372
+
373
+
374
+ def load_llm_config(config_path: str) -> Dict[str, Any]:
375
+ """
376
+ Load LLM configuration from YAML file.
377
+
378
+ Args:
379
+ config_path: Path to configs.yaml file
380
+
381
+ Returns:
382
+ Configuration dictionary
383
+ """
384
+ with open(config_path, 'r', encoding='utf-8') as f:
385
+ config = yaml.safe_load(f)
386
+ return config
387
+
388
+
389
+ def create_llm_service_from_config(config: Dict[str, Any]) -> LLMService:
390
+ """
391
+ Create LLM service from configuration.
392
+
393
+ Args:
394
+ config: Configuration dictionary from configs.yaml
395
+
396
+ Returns:
397
+ LLMService instance (VLLMService or GPTService)
398
+ """
399
+ mode = config.get('mode', 'gpt').lower()
400
+
401
+ if mode == 'gpt':
402
+ gpt_config = config.get('gpt', {})
403
+ api_key = gpt_config.get('api_key') or os.getenv('OPENAI_API_KEY')
404
+ if not api_key:
405
+ raise ValueError("GPT mode requires api_key in configs.yaml or OPENAI_API_KEY environment variable")
406
+
407
+ service = GPTService(
408
+ api_key=api_key,
409
+ model_name=gpt_config.get('model_name', 'gpt-4o'),
410
+ base_url=gpt_config.get('base_url'),
411
+ timeout=gpt_config.get('timeout', 300)
412
+ )
413
+ return service
414
+
415
+ elif mode == 'vllm':
416
+ vllm_config = config.get('vllm', {})
417
+ service = VLLMService(
418
+ base_url=vllm_config.get('base_url', 'http://localhost:8000/v1'),
419
+ api_key=vllm_config.get('api_key', 'dummy-key'),
420
+ model_name=vllm_config.get('model_name'),
421
+ timeout=vllm_config.get('timeout', 300),
422
+ max_concurrent_requests=vllm_config.get('max_concurrent_requests', 64),
423
+ max_retries=vllm_config.get('max_retries', 3),
424
+ retry_delay=vllm_config.get('retry_delay', 1.0),
425
+ retry_backoff=vllm_config.get('retry_backoff', 2.0)
426
+ )
427
+ return service
428
+
429
+ else:
430
+ raise ValueError(f"Unknown mode: {mode}. Must be 'gpt' or 'vllm'")
431
+
432
+
433
+ def parse_args():
434
+ """Parse command line arguments."""
435
+ parser = argparse.ArgumentParser(description="Generate review-based rubrics using LLMs")
436
+
437
+ # Input/Output paths
438
+ parser.add_argument("--json_path", type=str, required=True,
439
+ help="Path to input JSON file with review data")
440
+ parser.add_argument("--output_path", type=str, default=None,
441
+ help="Path to output JSON file (default: eval_rubrics.json in same dir as input)")
442
+ parser.add_argument("--yaml_path", type=str, default=None,
443
+ help="Path to prompts.yaml file (default: prompts.yaml in same dir as script)")
444
+ parser.add_argument("--config_path", type=str, default=None,
445
+ help="Path to configs.yaml file (default: configs.yaml in same dir as script)")
446
+
447
+ # Multi-threading
448
+ parser.add_argument("--max_workers", type=int, default=None,
449
+ help="Maximum number of worker threads (default: from MAX_WORKERS env var or 5)")
450
+
451
+     return parser.parse_args()
+
+
+ def main():
+     """Main execution function."""
+     args = parse_args()
+
+     script_dir = os.path.dirname(os.path.abspath(__file__))
+
+     # File paths
+     json_path = args.json_path
+     if not os.path.isabs(json_path):
+         json_path = os.path.join(script_dir, json_path)
+
+     if args.output_path:
+         output_path = args.output_path
+         if not os.path.isabs(output_path):
+             output_path = os.path.join(script_dir, output_path)
+     else:
+         # Default: same directory as input JSON, with eval_rubrics.json name
+         output_dir = os.path.dirname(json_path)
+         output_path = os.path.join(output_dir, 'eval_rubrics.json')
+
+     if args.yaml_path:
+         yaml_path = args.yaml_path
+         if not os.path.isabs(yaml_path):
+             yaml_path = os.path.join(script_dir, yaml_path)
+     else:
+         yaml_path = os.path.join(script_dir, 'prompts.yaml')
+
+     if args.config_path:
+         config_path = args.config_path
+         if not os.path.isabs(config_path):
+             config_path = os.path.join(script_dir, config_path)
+     else:
+         config_path = os.path.join(script_dir, 'configs.yaml')
+
+     max_workers = args.max_workers or int(os.getenv("MAX_WORKERS", "5"))
+
+     # Check if files exist
+     if not os.path.exists(json_path):
+         raise FileNotFoundError(f"JSON file not found: {json_path}")
+     if not os.path.exists(yaml_path):
+         raise FileNotFoundError(f"YAML file not found: {yaml_path}")
+     if not os.path.exists(config_path):
+         raise FileNotFoundError(f"Config file not found: {config_path}")
+
+     print(f"Loading JSON data from {json_path}...")
+     data = load_json_data(json_path)
+     print(f"Loaded {len(data)} entries")
+
+     print(f"Loading prompt template from {yaml_path}...")
+     prompt_template = load_prompt_template(yaml_path)
+     if not prompt_template:
+         raise ValueError("Could not find 'v2_rubric_generation_prompt' in YAML file")
+     print("Prompt template loaded successfully")
+
+     # Load LLM configuration and create service
+     print(f"Loading LLM configuration from {config_path}...")
+     llm_config = load_llm_config(config_path)
+     llm_service = create_llm_service_from_config(llm_config)
+     mode = llm_config.get('mode', 'gpt')
+     print(f"LLM service initialized (mode: {mode})")
+     if hasattr(llm_service, 'model_name'):
+         print(f"Using model: {llm_service.model_name}")
+
+     # Extract required fields from each entry
+     print("Extracting required fields from entries...")
+     entries = []
+     for item in data:
+         if 'id' in item and 'pred_fast_mode_baseline' in item:
+             entries.append(item)
+         else:
+             print(f"[WARN] Skipping entry missing required fields: {item.get('id', 'unknown')}")
+
+     print(f"Processing {len(entries)} entries with {max_workers} workers...")
+
+     # Generate rubrics using concurrent processing
+     results = []
+     with ThreadPoolExecutor(max_workers=max_workers) as executor:
+         # Submit all tasks
+         future_to_entry = {
+             executor.submit(
+                 generate_rubrics_for_entry,
+                 entry,
+                 prompt_template,
+                 llm_service
+             ): entry
+             for entry in entries
+         }
+
+         # Process completed tasks with progress bar
+         for future in tqdm(as_completed(future_to_entry), total=len(entries), desc="Generating rubrics"):
+             try:
+                 result = future.result()
+                 results.append(result)
+             except Exception as e:
+                 entry = future_to_entry[future]
+                 entry_id = entry.get('id', 'unknown')
+                 print(f"\n[ERROR] Failed to process entry {entry_id}: {e}")
+                 # Add error entry with empty rubrics
+                 results.append({
+                     'id': entry_id,
+                     'paper_context': entry.get('paper_context', ''),
+                     'decision': entry.get('decision', ''),
+                     'golden_review': entry.get('pred_fast_mode_baseline', ''),
+                     'rubrics': [],
+                     'error': str(e)
+                 })
+
+     print(f"\nSuccessfully generated rubrics for {len(results)} entries")
+
+     # Save to JSON
+     with open(output_path, 'w', encoding='utf-8') as f:
+         json.dump(results, f, ensure_ascii=False, indent=2)
+     print(f"\nResults saved to {output_path}")
+
+     # Print summary statistics
+     print("\n" + "="*80)
+     print("SUMMARY STATISTICS")
+     print("="*80)
+     print(f"Total entries processed: {len(results)}")
+
+     # Count successful vs failed
+     successful = sum(1 for r in results if 'error' not in r and len(r.get('rubrics', [])) > 0)
+     failed = len(results) - successful
+     print(f"Successful: {successful}")
+     print(f"Failed: {failed}")
+
+     # Check rubrics statistics
+     rubric_counts = [len(r.get('rubrics', [])) for r in results if isinstance(r.get('rubrics'), list)]
+
+     if rubric_counts:
+         print("\nRubrics per entry:")
+         print(f"  Mean: {sum(rubric_counts) / len(rubric_counts):.2f}")
+         print(f"  Min: {min(rubric_counts)}")
+         print(f"  Max: {max(rubric_counts)}")
+
+
+ if __name__ == "__main__":
+     main()
+
+ """
+ Example usage:
+     python 1_generate_review_based_rubrics.py \
+         --json_path ./examples/input.json \
+         --output_path eval_rubrics.json \
+         --yaml_path prompts.yaml \
+         --config_path configs.yaml \
+         --max_workers 5
+ """
src/evaluator/2_evaluate.py ADDED
@@ -0,0 +1,1730 @@
+ """
+ Unified evaluation script for semantic (LLM-based) and auto_metric (rule-based) evaluation.
+
+ This script:
+ 1. Reads eval_rubrics.json (from 1_generate_review_based_rubrics.py) containing rubrics for each paper
+ 2. Reads input JSON file containing model reviews (supports multiple formats)
+ 3. Supports three evaluation modes:
+    - semantic: LLM-based rubrics evaluation (from 2_evaluate_direct.py)
+    - auto_metric: Rule-based metrics evaluation (from 3_rule_evaluate.py)
+    - both: Run both evaluations separately
+ 4. Supports strict mode: normalize scores to discrete scales before computing metrics (--strict_mode)
+ 5. Outputs separate JSON files for results and summaries
+
+ Usage:
+     # Semantic evaluation only
+     python 2_evaluate.py \
+         --rubrics_path eval_rubrics.json \
+         --reviews_path model_reviews.json \
+         --mode semantic \
+         --yaml_path prompts.yaml \
+         --config_path configs.yaml \
+         --semantic_output semantic_results.json \
+         --max_workers 5
+
+     # Auto-metric evaluation only
+     python 2_evaluate.py \
+         --rubrics_path eval_rubrics.json \
+         --reviews_path model_reviews.json \
+         --mode auto_metric \
+         --auto_metric_output auto_metric_results.json
+
+     # Auto-metric evaluation with strict mode (normalize scores to discrete scales)
+     python 2_evaluate.py \
+         --rubrics_path eval_rubrics.json \
+         --reviews_path model_reviews.json \
+         --mode auto_metric \
+         --auto_metric_output auto_metric_results.json \
+         --strict_mode
+
+     # Auto-metric evaluation with manually specified input format (refined)
+     python 2_evaluate.py \
+         --rubrics_path eval_rubrics.json \
+         --reviews_path model_reviews.json \
+         --mode auto_metric \
+         --auto_metric_output auto_metric_results.json \
+         --input_format refined
+
+     # Auto-metric evaluation with manually specified input format (original)
+     python 2_evaluate.py \
+         --rubrics_path eval_rubrics.json \
+         --reviews_path ours.json \
+         --mode auto_metric \
+         --auto_metric_output auto_metric_results.json \
+         --input_format original
+
+     # Both evaluations
+     python 2_evaluate.py \
+         --rubrics_path eval_rubrics.json \
+         --reviews_path model_reviews.json \
+         --mode both \
+         --yaml_path prompts.yaml \
+         --config_path configs.yaml \
+         --semantic_output semantic_results.json \
+         --auto_metric_output auto_metric_results.json \
+         --max_workers 32
+ """
+ from __future__ import annotations
+
+ import json
+ import os
+ import sys
+ import argparse
+ import yaml
+ import math
+ from typing import Dict, List, Any, Optional
+ from concurrent.futures import ThreadPoolExecutor, as_completed
+ from tqdm import tqdm
+ from itertools import combinations
+ from scipy.stats import spearmanr
+ from sklearn.metrics import precision_recall_fscore_support
+
+ # Add parent directory to path
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+ # Import parse_llm_response from local llm_service module
+ import llm_service as local_llm_service
+ parse_llm_response = local_llm_service.parse_llm_response
+
+ # Import from shared/utils for gpt/vllm support
+ project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+ if project_root not in sys.path:
+     sys.path.insert(0, project_root)
+
+ from shared.utils.llm_service import LLMService
+ from shared.utils.vllm_service import VLLMService
+ from shared.utils.gpt_service import GPTService
+ sys.path.insert(0, os.path.join(project_root, 'shared', 'utils'))
+ from json_parser import parse_review_markdown
+
+
+ class ReviewProcessor:
+     """Handles the extraction and processing of reviews from different sources."""
+
+     @staticmethod
+     def extract_review_content(pred_context):
+         """
+         Extract the review content from the prediction context.
+
+         Args:
+             pred_context: Raw prediction data that contains the review
+
+         Returns:
+             str: Extracted review content
+         """
+         try:
+             # First attempt to extract from boxed format
+             return pred_context.split(r'\boxed_review{')[-1].split('\n}')[0]
+         except Exception:
+             # Alternative extraction if the first method fails
+             if isinstance(pred_context, dict) and 'output' in pred_context:
+                 return pred_context['output'].split(r'\boxed_review{')[-1].split('\n}')[0]
+             else:
+                 # Return as is if extraction fails
+                 return pred_context
+
+
+ # ============================================================================
+ # Semantic Evaluation Functions (from 2_evaluate_direct.py)
+ # ============================================================================
+
+ def load_prompt_template(yaml_path: str) -> str:
+     """Load the evaluator prompt from YAML file."""
+     with open(yaml_path, 'r', encoding='utf-8') as f:
+         prompts = yaml.safe_load(f)
+     return prompts.get('v1_evaluator_prompt', '')
+
+
+ def build_evaluation_prompt(
+     rubrics: List[Dict[str, Any]],
+     paper_content: str,
+     review: str,
+     prompt_template: str
+ ) -> str:
+     """Build the evaluation prompt by replacing placeholders."""
+     rubrics_json = json.dumps(rubrics, indent=4, ensure_ascii=False)
+     prompt = prompt_template.replace('{rubrics_json}', rubrics_json)
+     prompt = prompt.replace('<<paper_content>>', paper_content)
+     prompt = prompt.replace('<<review>>', review)
+     return prompt
+
+
+ def calculate_weighted_scores(
+     raw_scores: Dict[str, Dict[str, Any]],
+     rubrics: List[Dict[str, Any]]
+ ) -> Dict[str, float]:
+     """Calculate weighted scores for each rubric."""
+     rubric_weights = {r['title']: r['weight'] for r in rubrics}
+     weighted_scores = {}
+
+     for rubric_title, rubric_data in raw_scores.items():
+         if rubric_title not in rubric_weights:
+             continue
+
+         rubric_score = rubric_data.get('score', 0)
+         if isinstance(rubric_score, str):
+             try:
+                 rubric_score = int(rubric_score)
+             except ValueError:
+                 rubric_score = 0
+
+         if rubric_score not in [0, 1]:
+             rubric_score = 1 if rubric_score > 0 else 0
+
+         weight = rubric_weights[rubric_title]
+         weighted_scores[rubric_title] = rubric_score * weight
+
+     return weighted_scores
+
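To make the weighting step concrete, here is a self-contained sketch that restates the function above with toy inputs (the rubric titles, weights, and raw scores are invented for the example):

```python
# Self-contained restatement of the weighted-scoring step, with toy data.
def calculate_weighted_scores(raw_scores, rubrics):
    rubric_weights = {r['title']: r['weight'] for r in rubrics}
    weighted = {}
    for title, data in raw_scores.items():
        if title not in rubric_weights:
            continue  # drop rubric titles the evaluator invented
        score = data.get('score', 0)
        if isinstance(score, str):
            try:
                score = int(score)
            except ValueError:
                score = 0
        if score not in (0, 1):
            score = 1 if score > 0 else 0  # clamp to binary pass/fail
        weighted[title] = score * rubric_weights[title]
    return weighted

rubrics = [
    {'title': 'Identifies missing baselines', 'weight': 2.0},
    {'title': 'Notes limited datasets', 'weight': 1.0},
]
raw = {
    'Identifies missing baselines': {'score': 1},
    'Notes limited datasets': {'score': 0},
    'Unknown rubric': {'score': 1},  # ignored: not in the rubric list
}
print(calculate_weighted_scores(raw, rubrics))
# -> {'Identifies missing baselines': 2.0, 'Notes limited datasets': 0.0}
```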
+
+ def calculate_scores(raw_scores: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
+     """Calculate scores for each rubric."""
+     scores = {}
+     for rubric_title, rubric_data in raw_scores.items():
+         scores[rubric_title] = rubric_data.get('score', 0)
+     return scores
+
+
+ def evaluate_review_semantic(
+     entry: Dict[str, Any],
+     paper_content: str,
+     prompt_template: str,
+     llm_service: LLMService
+ ) -> Dict[str, Any]:
+     """Evaluate a single review using article-specific rubrics."""
+     entry_id = entry.get('id', 'unknown')
+     rubrics = entry.get('rubrics', [])
+     model_review = entry.get('model_review', '')
+
+     if not rubrics:
+         return {
+             'id': entry_id,
+             'raw_scores': {},
+             'weighted_scores': {},
+             'total_score': 0.0,
+             'error': 'No valid rubrics found',
+             'raw_response': ''
+         }
+
+     # Build prompt
+     prompt = build_evaluation_prompt(rubrics, paper_content, model_review, prompt_template)
+
+     # Call LLM
+     try:
+         messages = [{"role": "user", "content": prompt}]
+         response = llm_service.generate(messages=messages)
+
+         # Parse response
+         raw_scores = parse_llm_response(response)
+         weighted_scores = calculate_scores(raw_scores)
+         total_score = sum(weighted_scores.values())
+
+         return {
+             'id': entry_id,
+             'raw_scores': raw_scores,
+             'weighted_scores': weighted_scores,
+             'total_score': total_score,
+             'raw_response': response
+         }
+     except Exception as e:
+         print(f"[ERROR] Error evaluating review {entry_id}: {e}")
+         return {
+             'id': entry_id,
+             'raw_scores': {},
+             'weighted_scores': {},
+             'total_score': 0.0,
+             'error': str(e),
+             'raw_response': ''
+         }
+
+
+ def calculate_per_rubric_statistics(
+     valid_results: List[Dict[str, Any]],
+     rubric_titles: List[str]
+ ) -> Dict[str, Dict[str, float]]:
+     """Calculate per-rubric statistics from evaluation results."""
+     rubric_scores = {title: [] for title in rubric_titles}
+
+     for result in valid_results:
+         weighted_scores = result.get('weighted_scores', {})
+         if not isinstance(weighted_scores, dict):
+             continue
+
+         for rubric_title in rubric_titles:
+             if rubric_title in weighted_scores:
+                 score = weighted_scores[rubric_title]
+                 if isinstance(score, str):
+                     try:
+                         score = float(score)
+                     except ValueError:
+                         continue
+                 elif isinstance(score, (int, float)):
+                     score = float(score)
+                 else:
+                     continue
+                 rubric_scores[rubric_title].append(score)
+
+     per_rubric_stats = {}
+     for rubric_title in rubric_titles:
+         scores = rubric_scores[rubric_title]
+         if not scores:
+             continue
+
+         mean_score = sum(scores) / len(scores)
+         min_score = min(scores)
+         max_score = max(scores)
+         count = len(scores)
+
+         if rubric_title == "False or Contradictory Claims":
+             pass_count = sum(1 for s in scores if s >= 0)
+         else:
+             pass_count = sum(1 for s in scores if s >= 1)
+         pass_rate = pass_count / count if count > 0 else 0.0
+
+         per_rubric_stats[rubric_title] = {
+             'mean': mean_score,
+             'min': min_score,
+             'max': max_score,
+             'count': count,
+             'pass_rate': pass_rate
+         }
+
+     return per_rubric_stats
+
+
+ # ============================================================================
+ # Auto-Metric Evaluation Functions (from 3_rule_evaluate.py)
+ # ============================================================================
+
+ def extract_scores_from_review(review_text: str) -> Dict[str, Any]:
+     """Extract numeric scores and decision from a review markdown text."""
+     if not review_text:
+         return {'soundness': None, 'presentation': None, 'rating': None, 'confidence': None, 'decision': None}
+
+     try:
+         parsed = parse_review_markdown(review_text)
+         decision = parsed.get('decision', '')
+         if decision:
+             decision_lower = decision.lower().strip()
+             if 'accept' in decision_lower:
+                 decision = 'accept'
+             elif 'reject' in decision_lower:
+                 decision = 'reject'
+             elif 'undecided' in decision_lower:
+                 decision = 'undecided'
+             else:
+                 decision = decision_lower
+         else:
+             decision = None
+
+         return {
+             'soundness': parsed.get('soundness'),
+             'presentation': parsed.get('presentation'),
+             'rating': parsed.get('rating'),
+             'confidence': parsed.get('confidence'),
+             'decision': decision
+         }
+     except Exception as e:
+         print(f"Warning: Failed to parse review text: {e}")
+         return {'soundness': None, 'presentation': None, 'rating': None, 'confidence': None, 'decision': None}
+
+
+ def calculate_mse(predicted: float, ground_truth: float) -> Optional[float]:
+     """Calculate the squared error for a single value pair."""
+     if predicted is None or ground_truth is None:
+         return None
+     return (predicted - ground_truth) ** 2
+
+
+ def calculate_mae(predicted: float, ground_truth: float) -> Optional[float]:
+     """Calculate the absolute error for a single value pair."""
+     if predicted is None or ground_truth is None:
+         return None
+     return abs(predicted - ground_truth)
+
+
+ def normalize_to_discrete_scale(score: Optional[float], scale_type: str) -> Optional[float]:
+     """
+     Normalize a float score to the nearest discrete value based on scale type.
+     Uses round-half-up tie-breaking (e.g., 3.5 rounds to 4, 1.5 rounds to 2).
+
+     Args:
+         score: The float score to normalize (can be None)
+         scale_type: Either '0-5' for 0-5 scale (discrete: 0,1,2,3,4,5)
+                     or '0-10' for 0-10 scale (discrete: 0,2,4,6,8,10)
+
+     Returns:
+         Normalized discrete score, or None if input is None
+     """
+     if score is None:
+         return None
+
+     try:
+         score = float(score)
+     except (ValueError, TypeError):
+         return None
+
+     if scale_type == '0-5':
+         discrete_values = [0, 1, 2, 3, 4, 5]
+         score = max(0, min(5, score))  # Clamp to valid range
+     elif scale_type == '0-10':
+         discrete_values = [0, 2, 4, 6, 8, 10]
+         score = max(0, min(10, score))  # Clamp to valid range
+     else:
+         raise ValueError(f"Unknown scale_type: {scale_type}. Must be '0-5' or '0-10'")
+
+     # Find the nearest discrete value; on ties, prefer the higher value (round-half-up)
+     best_value = None
+     best_distance = float('inf')
+     for val in discrete_values:
+         distance = abs(val - score)
+         if distance < best_distance:
+             best_distance = distance
+             best_value = val
+         elif distance == best_distance and val > best_value:
+             best_value = val
+     return best_value
+
+
+ def normalize_scores_dict(scores: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
+     """
+     Normalize all scores in a dictionary to their appropriate discrete scales.
+
+     Args:
+         scores: Dictionary with keys 'soundness', 'presentation', 'rating', 'confidence'
+
+     Returns:
+         Dictionary with normalized scores
+     """
+     normalized = {}
+
+     # soundness, presentation, confidence use 0-5 scale
+     for key in ['soundness', 'presentation', 'confidence']:
+         normalized[key] = normalize_to_discrete_scale(scores.get(key), '0-5')
+
+     # rating uses 0-10 scale
+     normalized['rating'] = normalize_to_discrete_scale(scores.get('rating'), '0-10')
+
+     return normalized
+
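The strict-mode snapping above can be illustrated with a small self-contained sketch; the generic `snap` helper below is a rephrasing for the example, not part of the script:

```python
# Snap a raw score to the nearest value on a discrete scale,
# breaking ties upward (round-half-up), as strict mode does.
def snap(score, values):
    if score is None:
        return None
    score = max(values[0], min(values[-1], float(score)))  # clamp to the scale
    best, best_d = None, float('inf')
    for v in values:
        d = abs(v - score)
        if d < best_d or (d == best_d and v > best):
            best, best_d = v, d
    return best

assert snap(3.5, [0, 1, 2, 3, 4, 5]) == 4       # tie rounds up
assert snap(2.4, [0, 1, 2, 3, 4, 5]) == 2       # nearest value wins
assert snap(7.0, [0, 2, 4, 6, 8, 10]) == 8      # tie between 6 and 8 rounds up
assert snap(11.2, [0, 2, 4, 6, 8, 10]) == 10    # clamped to the top of the scale
```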
+
+ def calculate_score_metrics(
+     model_scores: Dict[str, float],
+     ground_truth_scores: Dict[str, float],
+     normalize: bool = False
+ ) -> Dict[str, Any]:
+     """
+     Calculate MSE and MAE metrics for each scoring dimension.
+
+     Args:
+         model_scores: Dictionary with model scores
+         ground_truth_scores: Dictionary with ground truth scores
+         normalize: If True, normalize scores to discrete scales before computing metrics
+
+     Returns:
+         Dictionary with MSE, MAE metrics and optionally normalized scores
+     """
+     dimensions = ['soundness', 'presentation', 'rating', 'confidence']
+
+     # Normalize scores to discrete scales if requested
+     if normalize:
+         model_scores_normalized = normalize_scores_dict(model_scores)
+         gt_scores_normalized = normalize_scores_dict(ground_truth_scores)
+     else:
+         model_scores_normalized = model_scores
+         gt_scores_normalized = ground_truth_scores
+
+     mse_values = {}
+     mae_values = {}
+     valid_count = 0
+
+     for dim in dimensions:
+         # Use normalized scores for metric calculation
+         mse = calculate_mse(model_scores_normalized.get(dim), gt_scores_normalized.get(dim))
+         mae = calculate_mae(model_scores_normalized.get(dim), gt_scores_normalized.get(dim))
+         mse_values[f'{dim}_mse'] = mse
+         mae_values[f'{dim}_mae'] = mae
+         if mse is not None:
+             valid_count += 1
+
+     overall_error = sum(v for v in mse_values.values() if v is not None)
+
+     result = {
+         **mse_values,
+         **mae_values,
+         'overall_error': overall_error if valid_count > 0 else None,
+         'valid_dimensions': valid_count
+     }
+
+     # Include normalized scores in result for transparency (only if normalize=True)
+     if normalize:
+         result['model_scores_normalized'] = model_scores_normalized
+         result['gt_scores_normalized'] = gt_scores_normalized
+
+     return result
+
+
+ def normalize_score_value(value):
+     """Normalize a score value to float, handling string representations."""
+     if value is None:
+         return None
+     if isinstance(value, (int, float)):
+         return float(value)
+     if isinstance(value, str):
+         # Try to extract a numeric value from the string (e.g., "2.75" -> 2.75)
+         try:
+             import re
+             match = re.search(r'(\d+\.?\d*)', value)
+             if match:
+                 return float(match.group(1))
+         except Exception:
+             pass
+     return None
+
+
+ def normalize_decision(decision):
+     """Normalize a decision string to a standard format."""
+     if decision is None:
+         return None
+     decision_lower = str(decision).lower().strip()
+     if 'accept' in decision_lower:
+         return 'accept'
+     elif 'reject' in decision_lower:
+         return 'reject'
+     elif 'undecided' in decision_lower:
+         return 'undecided'
+     else:
+         return decision_lower
+
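A few concrete cases show how the decision normalization above behaves (self-contained restatement for illustration; the sample decision strings are invented):

```python
# Map free-form decision strings onto a small canonical vocabulary.
def normalize_decision(decision):
    if decision is None:
        return None
    d = str(decision).lower().strip()
    if 'accept' in d:
        return 'accept'
    elif 'reject' in d:
        return 'reject'
    elif 'undecided' in d:
        return 'undecided'
    return d  # unknown labels pass through lowercased

assert normalize_decision('Accept (poster)') == 'accept'
assert normalize_decision('  REJECT ') == 'reject'
assert normalize_decision('borderline') == 'borderline'  # passes through
assert normalize_decision(None) is None
```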
514
+
515
+ def extract_scores_from_dict(scores_dict: Dict[str, Any]) -> Dict[str, Any]:
516
+ """
517
+ Extract scores from a structured dictionary (scores or initial_scores format).
518
+
519
+ Args:
520
+ scores_dict: Dict containing scores (e.g., {'rating': 5.75, 'soundness': '2.75', ...})
521
+
522
+ Returns:
523
+ Dict with normalized scores: {'soundness', 'presentation', 'rating', 'confidence', 'decision'}
524
+ """
525
+ if not scores_dict:
526
+ return {
527
+ 'soundness': None,
528
+ 'presentation': None,
529
+ 'rating': None,
530
+ 'confidence': None,
531
+ 'decision': None
532
+ }
533
+
534
+ return {
535
+ 'soundness': normalize_score_value(scores_dict.get('soundness')),
536
+ 'presentation': normalize_score_value(scores_dict.get('presentation')),
537
+ 'rating': normalize_score_value(scores_dict.get('rating')),
538
+ 'confidence': normalize_score_value(scores_dict.get('confidence')),
539
+ 'decision': normalize_decision(scores_dict.get('decision'))
540
+ }
541
+
542
+
543
+ def evaluate_review_auto_metric(entry: Dict[str, Any], use_initial_scores: bool = False, strict_mode: bool = False) -> Dict[str, Any]:
544
+ """
545
+ Evaluate a single entry by extracting scores and calculating metrics.
546
+
547
+ Args:
548
+ entry: Evaluation entry containing model_review, scores, initial_scores, etc.
549
+ use_initial_scores: If True, use initial_scores instead of refined scores (for refined format)
550
+
551
+ Returns:
552
+ Dict containing evaluation metrics
553
+ """
554
+ entry_id = entry.get('id', 'unknown')
555
+ model_review = entry.get('model_review', '')
556
+ format_type = entry.get('format', 'unknown')
557
+
558
+ # Extract scores based on format
559
+ model_scores = {}
560
+ model_decision = None
561
+
562
+ if format_type == 'refined' and not use_initial_scores:
563
+ # Use refined scores from structured data
564
+ scores_dict = entry.get('scores', {})
565
+ model_data = extract_scores_from_dict(scores_dict)
566
+ model_scores = {
567
+ 'soundness': model_data.get('soundness'),
568
+ 'presentation': model_data.get('presentation'),
569
+ 'rating': model_data.get('rating'),
570
+ 'confidence': model_data.get('confidence')
571
+ }
572
+ model_decision = model_data.get('decision')
573
+ elif format_type == 'refined' and use_initial_scores:
574
+ # Use initial scores from structured data
575
+ initial_scores_dict = entry.get('initial_scores', {})
576
+ model_data = extract_scores_from_dict(initial_scores_dict)
577
+ model_scores = {
578
+ 'soundness': model_data.get('soundness'),
579
+ 'presentation': model_data.get('presentation'),
580
+ 'rating': model_data.get('rating'),
581
+ 'confidence': model_data.get('confidence')
582
+ }
583
+ model_decision = model_data.get('decision')
584
+ elif format_type == 'original':
585
+ # Use initial scores from structured data
586
+ initial_scores_dict = entry.get('initial_scores', {})
587
+ model_data = extract_scores_from_dict(initial_scores_dict)
588
+ model_scores = {
589
+ 'soundness': model_data.get('soundness'),
590
+ 'presentation': model_data.get('presentation'),
591
+ 'rating': model_data.get('rating'),
592
+ 'confidence': model_data.get('confidence')
593
+ }
594
+ model_decision = model_data.get('decision')
595
+
596
+ # Fallback: If confidence is missing from structured data, try to extract from review text
597
+ # (meta_review may not have confidence field, but review text might)
598
+ if model_scores.get('confidence') is None and model_review:
599
+ try:
600
+ review_data = extract_scores_from_review(model_review)
601
+ if review_data.get('confidence') is not None:
602
+ model_scores['confidence'] = review_data.get('confidence')
603
+ except Exception:
604
+ pass # Keep confidence as None if extraction fails
605
+ else:
606
+ # Fallback: extract from markdown review text
607
+ model_data = extract_scores_from_review(model_review)
608
+ model_scores = {
609
+ 'soundness': model_data.get('soundness'),
610
+ 'presentation': model_data.get('presentation'),
611
+ 'rating': model_data.get('rating'),
612
+ 'confidence': model_data.get('confidence')
613
+ }
614
+ model_decision = model_data.get('decision')
615
+
616
+ # Get ground truth scores from golden_review ONLY
617
+ # Ground truth must ONLY come from golden_review, never from model output
618
+ # If extraction fails, leave fields as None (do not use model_review as fallback)
619
+ ground_truth_review = entry.get('golden_review', '')
620
+ ground_truth_scores = {}
621
+ gt_decision = None
622
+
623
+ if not ground_truth_review:
624
+ print(f"Warning: No golden_review found for entry {entry_id}. Ground truth scores will be empty.")
625
+ else:
626
+ try:
627
+ # Extract scores from golden_review markdown text
628
+ gt_data = extract_scores_from_review(ground_truth_review)
629
+ if not gt_data:
630
+ print(f"Warning: Failed to parse golden_review for entry {entry_id}. Ground truth scores will be empty.")
631
+ else:
632
+ ground_truth_scores = {
633
+ 'soundness': gt_data.get('soundness'),
634
+ 'presentation': gt_data.get('presentation'),
635
+ 'rating': gt_data.get('rating'),
636
+ 'confidence': gt_data.get('confidence')
637
+ }
638
+ gt_decision = normalize_decision(gt_data.get('decision'))
639
+ # Note: If any field is None, it stays None - we do NOT use model_review as fallback
640
+ # Using model output as ground truth would inflate evaluation scores
641
+ except Exception as e:
642
+ print(f"Warning: Failed to extract scores from golden_review for {entry_id}: {e}")
643
+ print(f" Ground truth scores will be empty. Error: {str(e)}")
644
+
645
+ # Calculate MSE and MAE metrics (with optional normalization in strict mode)
646
+ score_metrics = calculate_score_metrics(model_scores, ground_truth_scores, normalize=strict_mode)
647
+
648
+ # Calculate decision accuracy
649
+ decision_match = False
650
+ decision_accuracy = None
651
+ if model_decision is not None and gt_decision is not None:
652
+ model_decision_normalized = normalize_decision(model_decision)
653
+ decision_match = (model_decision_normalized == gt_decision)
654
+ decision_accuracy = 1.0 if decision_match else 0.0
655
+
656
+ result = {
657
+ 'id': entry_id,
658
+ 'format': format_type,
659
+ 'model_soundness': model_scores.get('soundness'),
660
+ 'model_presentation': model_scores.get('presentation'),
661
+ 'model_rating': model_scores.get('rating'),
662
+ 'model_confidence': model_scores.get('confidence'),
663
+ 'model_decision': model_decision,
664
+ 'gt_soundness': ground_truth_scores.get('soundness'),
665
+ 'gt_presentation': ground_truth_scores.get('presentation'),
666
+ 'gt_rating': ground_truth_scores.get('rating'),
667
+ 'gt_confidence': ground_truth_scores.get('confidence'),
668
+ 'gt_decision': gt_decision,
669
+ 'decision_match': decision_match,
670
+ 'decision_accuracy': decision_accuracy,
671
+ **score_metrics
672
+ }
673
+
674
+ # Add prefix to indicate which scores were used
675
+ if format_type == 'refined':
676
+ if use_initial_scores:
677
+ result['score_type'] = 'initial'
678
+ else:
679
+ result['score_type'] = 'refined'
680
+ else:
681
+ result['score_type'] = 'auto'
682
+
683
+ return result
684
+
685
+
686
+ def calculate_pairwise_accuracies(paper_scores: List[Dict[str, float]]) -> Dict[str, float]:
687
+ """Calculate pairwise accuracy for each metric by comparing rankings."""
688
+ if len(paper_scores) < 2:
689
+ return {}
690
+
691
+ total_valid_pairs = {'rating': 0, 'soundness': 0, 'presentation': 0, 'confidence': 0}
692
+ correct_pairs = {'rating': 0, 'soundness': 0, 'presentation': 0, 'confidence': 0}
693
+
694
+ for paper1, paper2 in combinations(paper_scores, 2):
695
+ # Check rating ranking
696
+ if (paper1.get('true_rating') is not None and paper2.get('true_rating') is not None and
697
+ paper1.get('pred_rating') is not None and paper2.get('pred_rating') is not None):
698
+ total_valid_pairs['rating'] += 1
699
+ true_order = paper1['true_rating'] > paper2['true_rating']
700
+ pred_order = paper1['pred_rating'] > paper2['pred_rating']
701
+ if true_order == pred_order:
702
+ correct_pairs['rating'] += 1
703
+
704
+            # Apply the same pairwise comparison to the remaining dimensions
705
706
+ for metric in ['soundness', 'presentation', 'confidence']:
707
+ true_key = f'true_{metric}'
708
+ pred_key = f'pred_{metric}'
709
+ if (paper1.get(true_key) is not None and paper2.get(true_key) is not None and
710
+ paper1.get(pred_key) is not None and paper2.get(pred_key) is not None):
711
+ total_valid_pairs[metric] += 1
712
+ true_order = paper1[true_key] > paper2[true_key]
713
+ pred_order = paper1[pred_key] > paper2[pred_key]
714
+ if true_order == pred_order:
715
+ correct_pairs[metric] += 1
716
+
717
+ pairwise_accuracies = {
718
+ metric: correct_pairs[metric] / total_valid_pairs[metric] if total_valid_pairs[metric] > 0 else 0.0
719
+ for metric in ['rating', 'soundness', 'presentation', 'confidence']
720
+ }
721
+
722
+ return pairwise_accuracies
723
+
724
+
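The pairwise-accuracy idea above can be sketched in isolation. This is a minimal, self-contained version for a single metric; `pairwise_rating_accuracy` and the `true`/`pred` keys are illustrative names, not part of the script.

```python
from itertools import combinations

def pairwise_rating_accuracy(papers):
    # Fraction of paper pairs whose predicted order matches the true order
    total, correct = 0, 0
    for a, b in combinations(papers, 2):
        if None in (a["true"], b["true"], a["pred"], b["pred"]):
            continue  # skip pairs with missing scores, as the evaluator does
        total += 1
        if (a["true"] > b["true"]) == (a["pred"] > b["pred"]):
            correct += 1
    return correct / total if total else 0.0

papers = [
    {"true": 8, "pred": 7},
    {"true": 5, "pred": 6},
    {"true": 3, "pred": 2},
]
print(pairwise_rating_accuracy(papers))  # 1.0: every pair keeps its relative order
```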
725
+ # ============================================================================
726
+ # Data Loading Functions
727
+ # ============================================================================
728
+
729
+ def load_rubrics_json(rubrics_path: str) -> Dict[str, Dict[str, Any]]:
730
+ """Load rubrics JSON and create lookup by id."""
731
+ with open(rubrics_path, 'r', encoding='utf-8') as f:
732
+ data = json.load(f)
733
+
734
+ if isinstance(data, list):
735
+ return {item['id']: item for item in data}
736
+ elif isinstance(data, dict):
737
+ return data
738
+ else:
739
+ raise ValueError(f"Invalid rubrics JSON format: expected list or dict, got {type(data)}")
740
+
741
+
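The list-or-dict normalization performed by `load_rubrics_json` can be illustrated standalone; `rubrics_by_id` is a hypothetical helper name for this sketch.

```python
def rubrics_by_id(data):
    # Accept either a list of entries carrying an 'id' or an already id-keyed dict
    if isinstance(data, list):
        return {item["id"]: item for item in data}
    if isinstance(data, dict):
        return data
    raise ValueError(f"expected list or dict, got {type(data)}")

as_list = [{"id": "p1", "rubrics": []}, {"id": "p2", "rubrics": []}]
print(sorted(rubrics_by_id(as_list)))  # ['p1', 'p2']
```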
742
+ def load_model_reviews_json(reviews_path: str, format_override: Optional[str] = None) -> Dict[str, Dict[str, Any]]:
743
+ """
744
+ Load model reviews JSON and extract reviews by id.
745
+
746
+ Supports two input formats:
747
+ 1. Refined format: Contains 'scores' and 'initial_scores' fields (from refinement pipeline)
748
+ 2. Original format: Contains 'model_prediction' with 'meta_review' and 'decision' (like ours.json)
749
+
750
+ Args:
751
+ reviews_path: Path to JSON file containing model reviews
752
+ format_override: Optional format override ('refined', 'original', or None for auto-detect)
753
+
754
+ Returns:
755
+ Dict mapping paper_id to dict containing:
756
+ - 'review': review text (markdown)
757
+ - 'scores': refined scores dict (if available)
758
+ - 'initial_scores': initial scores dict (if available)
759
+ - 'format': 'refined' or 'original'
760
+ """
761
+ with open(reviews_path, 'r', encoding='utf-8') as f:
762
+ data = json.load(f)
763
+
764
+ if isinstance(data, dict):
765
+ data = list(data.values())
766
+
767
+ reviews_dict = {}
768
+ for item in data:
769
+ item_id = None
770
+ review_text = ''
771
+ scores = None
772
+ initial_scores = None
773
+ format_type = None
774
+
775
+ # Use format override if provided, otherwise auto-detect
776
+ if format_override and format_override != 'auto':
777
+ # Force use specified format
778
+ if format_override == 'refined':
779
+ item_id = item.get('paper_id') or item.get('id')
780
+ if not item_id:
781
+ continue
782
+ format_type = 'refined'
783
+ review_text = item.get('review_markdown', '') or item.get('review', '')
784
+ scores = item.get('scores', {})
785
+ initial_scores = item.get('initial_scores', {})
786
+ elif format_override == 'original':
787
+ item_id = item.get('id')
788
+ if not item_id:
789
+ continue
790
+ format_type = 'original'
791
+ model_prediction = item.get('model_prediction', {})
792
+ meta_review = model_prediction.get('meta_review', {})
793
+ review_text = meta_review.get('content', '') or model_prediction.get('raw_text', '')
794
+ initial_scores = {
795
+ 'rating': meta_review.get('rating'),
796
+ 'soundness': meta_review.get('soundness'),
797
+ 'presentation': meta_review.get('presentation'),
798
+ 'contribution': meta_review.get('contribution'),
799
+ 'decision': model_prediction.get('decision'),
800
+ }
801
+ else:
802
+ raise ValueError(f"Unknown format_override: {format_override}. Must be 'refined', 'original', or 'auto'")
803
+ else:
804
+ # Auto-detect format
805
+ if "paper_id" in item:
806
+            # Entry keyed by paper_id: refined (from refinement pipeline) or standard
807
+ item_id = item.get('paper_id')
808
+ if not item_id:
809
+ continue
810
+
811
+ # Check if this is refined format (has scores and initial_scores)
812
+ if 'scores' in item and 'initial_scores' in item:
813
+ format_type = 'refined'
814
+ review_text = item.get('review_markdown', '') or item.get('review', '')
815
+ scores = item.get('scores', {})
816
+ initial_scores = item.get('initial_scores', {})
817
+ else:
818
+ # Standard format with paper_id
819
+ format_type = 'standard'
820
+ review_text = item.get('review_markdown', '') or item.get('review', '')
821
+ elif "model_prediction" in item:
822
+ # Original format (like ours.json)
823
+ item_id = item.get('id')
824
+ if not item_id:
825
+ continue
826
+
827
+ format_type = 'original'
828
+ model_prediction = item.get('model_prediction', {})
829
+ meta_review = model_prediction.get('meta_review', {})
830
+
831
+ # Extract review content (prefer meta_review.content, fallback to raw_text)
832
+ review_text = meta_review.get('content', '') or model_prediction.get('raw_text', '')
833
+
834
+ # Extract initial scores
835
+ initial_scores = {
836
+ 'rating': meta_review.get('rating'),
837
+ 'soundness': meta_review.get('soundness'),
838
+ 'presentation': meta_review.get('presentation'),
839
+ 'contribution': meta_review.get('contribution'),
840
+ 'decision': model_prediction.get('decision'),
841
+ }
842
+ else:
843
+ # Legacy format (pred_fast_mode)
844
+ item_id = item.get('id')
845
+ if not item_id:
846
+ continue
847
+
848
+ format_type = 'legacy'
849
+ review_dict = item.get('pred_fast_mode', {})
850
+ if isinstance(review_dict, dict):
851
+ review_text = review_dict
853
+ else:
854
+ review_text = str(review_dict)
855
+
856
+ # Extract review content from the review text field
857
+ try:
858
+ if review_text:
859
+ extracted_review = ReviewProcessor.extract_review_content(review_text)
860
+ else:
861
+ extracted_review = ''
862
+
863
+ reviews_dict[item_id] = {
864
+ 'review': extracted_review,
865
+ 'scores': scores,
866
+ 'initial_scores': initial_scores,
867
+ 'format': format_type
868
+ }
869
+ except Exception as e:
870
+ print(f"[WARN] Failed to extract review for {item_id}: {e}")
871
+ continue
872
+
873
+ return reviews_dict
874
+
875
+
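The auto-detection branch above keys off which fields an entry carries. A toy sketch, with field names taken from the loader but entry contents invented for illustration:

```python
# Minimal sketches of the auto-detected entry shapes
refined_entry = {
    "paper_id": "p1",
    "review_markdown": "## Review ...",
    "scores": {"rating": 6, "soundness": 3},
    "initial_scores": {"rating": 5, "soundness": 3},
}
original_entry = {
    "id": "p2",
    "model_prediction": {
        "meta_review": {"content": "## Review ...", "rating": 4},
        "decision": "Reject",
    },
}

def detect_format(item):
    # Same precedence as the loader: paper_id first, then model_prediction, else legacy
    if "paper_id" in item:
        return "refined" if "scores" in item and "initial_scores" in item else "standard"
    return "original" if "model_prediction" in item else "legacy"

print(detect_format(refined_entry), detect_format(original_entry))  # refined original
```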
876
+ def combine_rubrics_and_reviews(
877
+ rubrics_data: Dict[str, Dict[str, Any]],
878
+ reviews_dict: Dict[str, Dict[str, Any]]
879
+ ) -> List[Dict[str, Any]]:
880
+ """
881
+ Combine rubrics and reviews into evaluation entries.
882
+
883
+ Args:
884
+ rubrics_data: Dict mapping paper_id to rubric entry
885
+ reviews_dict: Dict mapping paper_id to dict containing 'review', 'scores', 'initial_scores', 'format'
886
+
887
+ Returns:
888
+ List of evaluation entries with model_review, scores, initial_scores, and format info
889
+ """
890
+ combined = []
891
+ missing_reviews = []
892
+
893
+ for paper_id, rubric_entry in rubrics_data.items():
894
+ review_data = reviews_dict.get(paper_id)
895
+ if not review_data or not review_data.get('review'):
896
+ missing_reviews.append(paper_id)
897
+ continue
898
+
899
+ entry = {
900
+ 'id': paper_id,
901
+ 'paper_context': rubric_entry.get('paper_context', ''),
902
+ 'decision': rubric_entry.get('decision', ''),
903
+ 'golden_review': rubric_entry.get('golden_review', ''),
904
+ 'rubrics': rubric_entry.get('rubrics', []),
905
+ 'model_review': review_data.get('review', ''),
906
+ 'scores': review_data.get('scores'), # Refined scores (if available)
907
+ 'initial_scores': review_data.get('initial_scores'), # Initial scores (if available)
908
+ 'format': review_data.get('format', 'unknown') # Format type
909
+ }
910
+ combined.append(entry)
911
+
912
+ if missing_reviews:
913
+ print(f"[WARN] {len(missing_reviews)} papers have no model review, skipping them")
914
+
915
+ return combined
916
+
917
+
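The join performed by `combine_rubrics_and_reviews` (keep papers with a non-empty review, record the rest as missing) reduces to a short sketch; the entries here are invented for illustration.

```python
rubrics = {"p1": {"rubrics": []}, "p2": {"rubrics": []}}
reviews = {"p1": {"review": "Solid paper.", "format": "original"}}

combined, missing = [], []
for paper_id, rubric in rubrics.items():
    review = reviews.get(paper_id)
    if not review or not review.get("review"):
        missing.append(paper_id)  # paper has no usable model review
        continue
    combined.append({"id": paper_id, **rubric, "model_review": review["review"]})

print([e["id"] for e in combined], missing)  # ['p1'] ['p2']
```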
918
+ # ============================================================================
919
+ # LLM Service Configuration
920
+ # ============================================================================
921
+
922
+ def load_llm_config(config_path: str) -> Dict[str, Any]:
923
+ """Load LLM configuration from YAML file."""
924
+ with open(config_path, 'r', encoding='utf-8') as f:
925
+ config = yaml.safe_load(f)
926
+ return config
927
+
928
+
929
+ def create_llm_service_from_config(config: Dict[str, Any]) -> LLMService:
930
+ """Create LLM service from configuration."""
931
+ mode = config.get('mode', 'gpt').lower()
932
+
933
+ if mode == 'gpt':
934
+ gpt_config = config.get('gpt', {})
935
+ api_key = gpt_config.get('api_key') or os.getenv('OPENAI_API_KEY')
936
+ if not api_key:
937
+ raise ValueError("GPT mode requires api_key in configs.yaml or OPENAI_API_KEY environment variable")
938
+
939
+ service = GPTService(
940
+ api_key=api_key,
941
+ model_name=gpt_config.get('model_name', 'gpt-4o'),
942
+ base_url=gpt_config.get('base_url'),
943
+ timeout=gpt_config.get('timeout', 300)
944
+ )
945
+ return service
946
+
947
+ elif mode == 'vllm':
948
+ vllm_config = config.get('vllm', {})
949
+ service = VLLMService(
950
+ base_url=vllm_config.get('base_url', 'http://localhost:8000/v1'),
951
+ api_key=vllm_config.get('api_key', 'dummy-key'),
952
+ model_name=vllm_config.get('model_name'),
953
+ timeout=vllm_config.get('timeout', 300),
954
+ max_concurrent_requests=vllm_config.get('max_concurrent_requests', 64),
955
+ max_retries=vllm_config.get('max_retries', 3),
956
+ retry_delay=vllm_config.get('retry_delay', 1.0),
957
+ retry_backoff=vllm_config.get('retry_backoff', 2.0)
958
+ )
959
+ return service
960
+
961
+ else:
962
+ raise ValueError(f"Unknown mode: {mode}. Must be 'gpt' or 'vllm'")
963
+
964
+
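The config keys and fallback defaults read by `create_llm_service_from_config` can be exercised without constructing a real service. The dict below mirrors the keys used above; `resolve_service_kwargs` is a hypothetical helper for this sketch only.

```python
# Hypothetical config dict mirroring the vllm section keys read above
config = {
    "mode": "vllm",
    "vllm": {
        "base_url": "http://localhost:8000/v1",
        "model_name": "my-model",
        "max_concurrent_requests": 32,
    },
}

def resolve_service_kwargs(config):
    # Reproduce the fallback defaults used in create_llm_service_from_config
    mode = config.get("mode", "gpt").lower()
    if mode == "vllm":
        c = config.get("vllm", {})
        return {
            "base_url": c.get("base_url", "http://localhost:8000/v1"),
            "timeout": c.get("timeout", 300),
            "max_concurrent_requests": c.get("max_concurrent_requests", 64),
        }
    raise ValueError(f"Unknown mode: {mode}. Must be 'gpt' or 'vllm'")

print(resolve_service_kwargs(config))
```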
965
+ # ============================================================================
966
+ # Main Evaluation Functions
967
+ # ============================================================================
968
+
969
+ def run_semantic_evaluation(
970
+ evaluation_data: List[Dict[str, Any]],
971
+ prompt_template: str,
972
+ llm_service: LLMService,
973
+ max_workers: int
974
+ ) -> tuple:
975
+ """Run semantic evaluation and return results and summary."""
976
+ print(f"\n{'='*80}")
977
+ print("RUNNING SEMANTIC EVALUATION")
978
+ print(f"{'='*80}")
979
+ print(f"Evaluating {len(evaluation_data)} reviews using {max_workers} workers...")
980
+
981
+ results = []
982
+ with ThreadPoolExecutor(max_workers=max_workers) as executor:
983
+ future_to_entry = {
984
+ executor.submit(
985
+ evaluate_review_semantic,
986
+ entry,
987
+ entry['paper_context'],
988
+ prompt_template,
989
+ llm_service
990
+ ): entry
991
+ for entry in evaluation_data
992
+ }
993
+
994
+ for future in tqdm(as_completed(future_to_entry), total=len(evaluation_data), desc="Semantic evaluation"):
995
+ try:
996
+ result = future.result()
997
+ results.append(result)
998
+ except Exception as e:
999
+ entry = future_to_entry[future]
1000
+ print(f"\n[ERROR] Failed to process entry {entry.get('id', 'unknown')}: {e}")
1001
+ results.append({
1002
+ 'id': entry.get('id', 'unknown'),
1003
+ 'raw_scores': {},
1004
+ 'weighted_scores': {},
1005
+ 'total_score': 0.0,
1006
+ 'error': str(e),
1007
+ 'raw_response': ''
1008
+ })
1009
+
1010
+ # Calculate statistics
1011
+ valid_results = [r for r in results if 'error' not in r and r.get('weighted_scores')]
1012
+ review_scores = [r.get('total_score', 0.0) for r in valid_results]
1013
+
1014
+ summary = {
1015
+ 'total_entries': len(results),
1016
+ 'valid_entries': len(valid_results),
1017
+ 'failed_entries': len(results) - len(valid_results)
1018
+ }
1019
+
1020
+ if review_scores:
1021
+ summary['overall_score'] = {
1022
+ 'mean': sum(review_scores) / len(review_scores),
1023
+ 'min': min(review_scores),
1024
+ 'max': max(review_scores)
1025
+ }
1026
+
1027
+ # Calculate per-rubric statistics (extract rubric titles from first entry)
1028
+ if evaluation_data and evaluation_data[0].get('rubrics'):
1029
+ rubric_titles = [r['title'] for r in evaluation_data[0]['rubrics']]
1030
+ per_rubric_stats = calculate_per_rubric_statistics(valid_results, rubric_titles)
1031
+ summary['per_rubric_statistics'] = per_rubric_stats
1032
+
1033
+ return results, summary
1034
+
1035
+
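The fan-out pattern used by `run_semantic_evaluation` (submit one future per entry, collect as they complete, convert per-entry exceptions into error records) can be shown with a stand-in evaluator; `evaluate` here is a toy function, not the real `evaluate_review_semantic`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate(entry):
    # Stand-in for evaluate_review_semantic: score is just the id length
    return {"id": entry["id"], "total_score": float(len(entry["id"]))}

entries = [{"id": "a"}, {"id": "bb"}, {"id": "ccc"}]
results = []
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = {executor.submit(evaluate, e): e for e in entries}
    for future in as_completed(futures):
        try:
            results.append(future.result())
        except Exception as exc:  # mirror the per-entry error handling above
            results.append({"id": futures[future]["id"], "error": str(exc)})

print(sorted(r["total_score"] for r in results))  # [1.0, 2.0, 3.0]
```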
1036
+ def run_auto_metric_evaluation(
1037
+ evaluation_data: List[Dict[str, Any]],
1038
+ strict_mode: bool = False
1039
+ ) -> tuple:
1040
+ """
1041
+ Run auto-metric evaluation and return results and summary.
1042
+
1043
+ For refined format (has scores and initial_scores), evaluates both:
1044
+ - Refined scores evaluation
1045
+ - Initial scores evaluation
1046
+
1047
+ For original format (only initial_scores), evaluates:
1048
+ - Initial scores evaluation only
1049
+
1050
+ Returns:
1051
+ Tuple of (results_list, summary_dict)
1052
+ - results_list: List of evaluation results (may contain both refined and initial results for refined format)
1053
+ - summary_dict: Summary statistics
1054
+ """
1055
+ print(f"\n{'='*80}")
1056
+ print("RUNNING AUTO-METRIC EVALUATION")
1057
+ print(f"{'='*80}")
1058
+ print(f"Evaluating {len(evaluation_data)} entries...")
1059
+
1060
+ # Detect format types
1061
+ refined_format_count = sum(1 for e in evaluation_data if e.get('format') == 'refined')
1062
+ original_format_count = sum(1 for e in evaluation_data if e.get('format') == 'original')
1063
+
1064
+ if refined_format_count > 0:
1065
+ print(f"Detected {refined_format_count} entries in refined format (will evaluate both refined and initial scores)")
1066
+ if original_format_count > 0:
1067
+ print(f"Detected {original_format_count} entries in original format (will evaluate initial scores only)")
1068
+
1069
+ results = []
1070
+ for entry in tqdm(evaluation_data, desc="Auto-metric evaluation"):
1071
+ format_type = entry.get('format', 'unknown')
1072
+
1073
+ if format_type == 'refined':
1074
+ # Evaluate both refined scores and initial scores
1075
+ try:
1076
+ entry_id = entry.get('id', 'unknown')
1077
+
1078
+ # Evaluate refined scores
1079
+ refined_result = evaluate_review_auto_metric(entry, use_initial_scores=False, strict_mode=strict_mode)
1080
+ refined_result['paper_id'] = entry_id # Keep original paper_id
1081
+ refined_result['id'] = f"{entry_id}_refined"
1082
+ results.append(refined_result)
1083
+
1084
+ # Evaluate initial scores
1085
+ initial_result = evaluate_review_auto_metric(entry, use_initial_scores=True, strict_mode=strict_mode)
1086
+ initial_result['paper_id'] = entry_id # Keep original paper_id
1087
+ initial_result['id'] = f"{entry_id}_initial"
1088
+ results.append(initial_result)
1089
+ except Exception as e:
1090
+ print(f"Error evaluating entry {entry.get('id', 'unknown')}: {e}")
1091
+ results.append({
1092
+ 'id': entry.get('id', 'unknown'),
1093
+ 'error': str(e)
1094
+ })
1095
+ else:
1096
+ # Evaluate initial scores only (or extract from markdown)
1097
+ try:
1098
+ result = evaluate_review_auto_metric(entry, use_initial_scores=False, strict_mode=strict_mode)
1099
+ results.append(result)
1100
+ except Exception as e:
1101
+ print(f"Error evaluating entry {entry.get('id', 'unknown')}: {e}")
1102
+ results.append({
1103
+ 'id': entry.get('id', 'unknown'),
1104
+ 'error': str(e)
1105
+ })
1106
+
1107
+ # Calculate statistics
1108
+ valid_results = [r for r in results if 'error' not in r]
1109
+ mse_results = [r for r in valid_results if r.get('overall_error') is not None]
1110
+
1111
+ # Separate refined and initial results for refined format
1112
+ refined_results = [r for r in valid_results if r.get('score_type') == 'refined']
1113
+ initial_results = [r for r in valid_results if r.get('score_type') == 'initial']
1114
+ auto_results = [r for r in valid_results if r.get('score_type') == 'auto' or r.get('score_type') is None]
1115
+
1116
+ summary = {
1117
+ 'total_entries': len(results),
1118
+ 'valid_entries': len(valid_results),
1119
+ 'mse_entries': len(mse_results),
1120
+ 'refined_results_count': len(refined_results),
1121
+ 'initial_results_count': len(initial_results),
1122
+ 'auto_results_count': len(auto_results)
1123
+ }
1124
+
1125
+ # Calculate MSE/MAE statistics
1126
+ # For refined format, only use refined results for overall statistics (avoid double counting)
1127
+ # For other formats, use all results
1128
+ if refined_format_count > 0:
1129
+ # Refined format: use only refined results for overall statistics
1130
+ stats_results = [r for r in refined_results if r.get('overall_error') is not None]
1131
+ else:
1132
+ # Original/other formats: use all results
1133
+ stats_results = mse_results
1134
+
1135
+ if stats_results:
1136
+ dimensions = ['soundness', 'presentation', 'confidence', 'rating']
1137
+ mse_stats = {}
1138
+ mae_stats = {}
1139
+
1140
+ for dim in dimensions:
1141
+ mse_list = [r.get(f'{dim}_mse') for r in stats_results if r.get(f'{dim}_mse') is not None]
1142
+ mae_list = [r.get(f'{dim}_mae') for r in stats_results if r.get(f'{dim}_mae') is not None]
1143
+
1144
+ mse_clean = [x for x in mse_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
1145
+ mae_clean = [x for x in mae_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
1146
+
1147
+ if mse_clean:
1148
+ mse_stats[dim] = {
1149
+ 'mean': sum(mse_clean) / len(mse_clean),
1150
+ 'count': len(mse_clean)
1151
+ }
1152
+ if mae_clean:
1153
+ mae_stats[dim] = {
1154
+ 'mean': sum(mae_clean) / len(mae_clean),
1155
+ 'count': len(mae_clean)
1156
+ }
1157
+
1158
+ overall_errors = [r.get('overall_error') for r in stats_results if r.get('overall_error') is not None]
1159
+ overall_clean = [x for x in overall_errors if x is not None and not (isinstance(x, float) and math.isnan(x))]
1160
+
1161
+ if overall_clean:
1162
+ summary['overall_error'] = {
1163
+ 'mean': sum(overall_clean) / len(overall_clean),
1164
+ 'count': len(overall_clean)
1165
+ }
1166
+
1167
+ summary['mse_statistics'] = mse_stats
1168
+ summary['mae_statistics'] = mae_stats
1169
+
1170
+ # Calculate separate statistics for refined and initial results
1171
+ if refined_results:
1172
+ refined_mse_results = [r for r in refined_results if r.get('overall_error') is not None]
1173
+ if refined_mse_results:
1174
+ refined_mse_stats = {}
1175
+ refined_mae_stats = {}
1176
+ for dim in dimensions:
1177
+ mse_list = [r.get(f'{dim}_mse') for r in refined_mse_results if r.get(f'{dim}_mse') is not None]
1178
+ mae_list = [r.get(f'{dim}_mae') for r in refined_mse_results if r.get(f'{dim}_mae') is not None]
1179
+ mse_clean = [x for x in mse_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
1180
+ mae_clean = [x for x in mae_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
1181
+ if mse_clean:
1182
+ refined_mse_stats[dim] = {'mean': sum(mse_clean) / len(mse_clean), 'count': len(mse_clean)}
1183
+ if mae_clean:
1184
+ refined_mae_stats[dim] = {'mean': sum(mae_clean) / len(mae_clean), 'count': len(mae_clean)}
1185
+ summary['refined_mse_statistics'] = refined_mse_stats
1186
+ summary['refined_mae_statistics'] = refined_mae_stats
1187
+
1188
+ if initial_results:
1189
+ initial_mse_results = [r for r in initial_results if r.get('overall_error') is not None]
1190
+ if initial_mse_results:
1191
+ initial_mse_stats = {}
1192
+ initial_mae_stats = {}
1193
+ for dim in dimensions:
1194
+ mse_list = [r.get(f'{dim}_mse') for r in initial_mse_results if r.get(f'{dim}_mse') is not None]
1195
+ mae_list = [r.get(f'{dim}_mae') for r in initial_mse_results if r.get(f'{dim}_mae') is not None]
1196
+ mse_clean = [x for x in mse_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
1197
+ mae_clean = [x for x in mae_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
1198
+ if mse_clean:
1199
+ initial_mse_stats[dim] = {'mean': sum(mse_clean) / len(mse_clean), 'count': len(mse_clean)}
1200
+ if mae_clean:
1201
+ initial_mae_stats[dim] = {'mean': sum(mae_clean) / len(mae_clean), 'count': len(mae_clean)}
1202
+ summary['initial_mse_statistics'] = initial_mse_stats
1203
+ summary['initial_mae_statistics'] = initial_mae_stats
1204
+
1205
+        # Helper: drop (true, pred) pairs with missing or NaN values
1206
+ def filter_valid_pairs(true_list, pred_list):
1207
+ filtered_true = []
1208
+ filtered_pred = []
1209
+ for t, p in zip(true_list, pred_list):
1210
+ if (t is not None and p is not None and
1211
+ not (isinstance(t, float) and math.isnan(t)) and
1212
+ not (isinstance(p, float) and math.isnan(p))):
1213
+ filtered_true.append(t)
1214
+ filtered_pred.append(p)
1215
+ return filtered_true, filtered_pred
1216
+
1217
+ # Calculate Spearman correlations
1218
+ # For refined format, calculate separately for refined and initial, and use refined for overall
1219
+ # For other formats, use all results
1220
+ if refined_format_count > 0:
1221
+ # Calculate refined spearman correlations
1222
+ refined_spearman_stats = {}
1223
+ dimensions = ['soundness', 'presentation', 'confidence', 'rating']
1224
+ for dim in dimensions:
1225
+ true_values = [r.get(f'gt_{dim}') for r in refined_results]
1226
+ pred_values = [r.get(f'model_{dim}') for r in refined_results]
1227
+ true_clean, pred_clean = filter_valid_pairs(true_values, pred_values)
1228
+
1229
+ if len(true_clean) >= 2 and len(pred_clean) >= 2:
1230
+ try:
1231
+ corr, _ = spearmanr(true_clean, pred_clean)
1232
+ if not math.isnan(corr):
1233
+ refined_spearman_stats[dim] = {
1234
+ 'correlation': corr,
1235
+ 'count': len(true_clean)
1236
+ }
1237
+ except Exception:
1238
+ pass
1239
+
1240
+ # Calculate initial spearman correlations
1241
+ initial_spearman_stats = {}
1242
+ for dim in dimensions:
1243
+ true_values = [r.get(f'gt_{dim}') for r in initial_results]
1244
+ pred_values = [r.get(f'model_{dim}') for r in initial_results]
1245
+ true_clean, pred_clean = filter_valid_pairs(true_values, pred_values)
1246
+
1247
+ if len(true_clean) >= 2 and len(pred_clean) >= 2:
1248
+ try:
1249
+ corr, _ = spearmanr(true_clean, pred_clean)
1250
+ if not math.isnan(corr):
1251
+ initial_spearman_stats[dim] = {
1252
+ 'correlation': corr,
1253
+ 'count': len(true_clean)
1254
+ }
1255
+ except Exception:
1256
+ pass
1257
+
1258
+ # Use refined for overall statistics (avoid double counting)
1259
+ summary['spearman_correlations'] = refined_spearman_stats
1260
+ summary['refined_spearman_correlations'] = refined_spearman_stats
1261
+ summary['initial_spearman_correlations'] = initial_spearman_stats
1262
+ else:
1263
+ # Original/other formats: use all results
1264
+ correlation_results = valid_results
1265
+ spearman_stats = {}
1266
+ dimensions = ['soundness', 'presentation', 'confidence', 'rating']
1267
+ for dim in dimensions:
1268
+ true_values = [r.get(f'gt_{dim}') for r in correlation_results]
1269
+ pred_values = [r.get(f'model_{dim}') for r in correlation_results]
1270
+ true_clean, pred_clean = filter_valid_pairs(true_values, pred_values)
1271
+
1272
+ if len(true_clean) >= 2 and len(pred_clean) >= 2:
1273
+ try:
1274
+ corr, _ = spearmanr(true_clean, pred_clean)
1275
+ if not math.isnan(corr):
1276
+ spearman_stats[dim] = {
1277
+ 'correlation': corr,
1278
+ 'count': len(true_clean)
1279
+ }
1280
+ except Exception:
1281
+ pass
1282
+
1283
+ summary['spearman_correlations'] = spearman_stats
1284
+
1285
+ # Calculate Decision metrics
1286
+ # For refined format, calculate separately for refined and initial, and use refined for overall
1287
+ # For other formats, use all results
1288
+ if refined_format_count > 0:
1289
+ # Calculate refined decision metrics
1290
+ refined_decision_results = [r for r in refined_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
1291
+ if refined_decision_results:
1292
+ true_decisions = []
1293
+ pred_decisions = []
1294
+ decision_acc = []
1295
+
1296
+ for r in refined_decision_results:
1297
+ gt_decision = str(r.get('gt_decision', '')).lower().strip()
1298
+ pred_decision = str(r.get('model_decision', '')).lower().strip()
1299
+
1300
+ if 'accept' in pred_decision:
1301
+ pred_binary = 1
1302
+ else:
1303
+ pred_binary = 0
1304
+
1305
+ if 'accept' in gt_decision:
1306
+ gt_binary = 1
1307
+ else:
1308
+ gt_binary = 0
1309
+
1310
+ true_decisions.append(gt_binary)
1311
+ pred_decisions.append(pred_binary)
1312
+
1313
+ if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
1314
+ decision_acc.append(1.0)
1315
+ else:
1316
+ decision_acc.append(0.0)
1317
+
1318
+ if decision_acc:
1319
+ decision_accuracy = sum(decision_acc) / len(decision_acc)
1320
+ try:
1321
+ _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
1322
+ refined_decision_metrics = {
1323
+ 'accuracy': decision_accuracy,
1324
+ 'f1_macro': f1_score,
1325
+ 'count': len(decision_acc)
1326
+ }
1327
+ except Exception:
1328
+ refined_decision_metrics = {
1329
+ 'accuracy': decision_accuracy,
1330
+ 'count': len(decision_acc)
1331
+ }
1332
+ summary['refined_decision_metrics'] = refined_decision_metrics
1333
+ summary['decision_metrics'] = refined_decision_metrics # Use refined for overall
1334
+
1335
+ # Calculate initial decision metrics
1336
+ initial_decision_results = [r for r in initial_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
1337
+ if initial_decision_results:
1338
+ true_decisions = []
1339
+ pred_decisions = []
1340
+ decision_acc = []
1341
+
1342
+ for r in initial_decision_results:
1343
+ gt_decision = str(r.get('gt_decision', '')).lower().strip()
1344
+ pred_decision = str(r.get('model_decision', '')).lower().strip()
1345
+
1346
+ if 'accept' in pred_decision:
1347
+ pred_binary = 1
1348
+ else:
1349
+ pred_binary = 0
1350
+
1351
+ if 'accept' in gt_decision:
1352
+ gt_binary = 1
1353
+ else:
1354
+ gt_binary = 0
1355
+
1356
+ true_decisions.append(gt_binary)
1357
+ pred_decisions.append(pred_binary)
1358
+
1359
+ if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
1360
+ decision_acc.append(1.0)
1361
+ else:
1362
+ decision_acc.append(0.0)
1363
+
1364
+ if decision_acc:
1365
+ decision_accuracy = sum(decision_acc) / len(decision_acc)
1366
+ try:
1367
+ _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
1368
+ initial_decision_metrics = {
1369
+ 'accuracy': decision_accuracy,
1370
+ 'f1_macro': f1_score,
1371
+ 'count': len(decision_acc)
1372
+ }
1373
+ except Exception:
1374
+ initial_decision_metrics = {
1375
+ 'accuracy': decision_accuracy,
1376
+ 'count': len(decision_acc)
1377
+ }
1378
+ summary['initial_decision_metrics'] = initial_decision_metrics
1379
+ else:
1380
+ # Original/other formats: use all results
1381
+ decision_results = [r for r in valid_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
1382
+ if decision_results:
1383
+ true_decisions = []
1384
+ pred_decisions = []
1385
+ decision_acc = []
1386
+
1387
+ for r in decision_results:
1388
+ gt_decision = str(r.get('gt_decision', '')).lower().strip()
1389
+ pred_decision = str(r.get('model_decision', '')).lower().strip()
1390
+
1391
+ if 'accept' in pred_decision:
1392
+ pred_binary = 1
1393
+ else:
1394
+ pred_binary = 0
1395
+
1396
+ if 'accept' in gt_decision:
1397
+ gt_binary = 1
1398
+ else:
1399
+ gt_binary = 0
1400
+
1401
+ true_decisions.append(gt_binary)
1402
+ pred_decisions.append(pred_binary)
1403
+
1404
+ if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
1405
+ decision_acc.append(1.0)
1406
+ else:
1407
+ decision_acc.append(0.0)
1408
+
1409
+ if decision_acc:
1410
+ decision_accuracy = sum(decision_acc) / len(decision_acc)
1411
+ try:
1412
+ _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
1413
+ summary['decision_metrics'] = {
1414
+ 'accuracy': decision_accuracy,
1415
+ 'f1_macro': f1_score,
1416
+ 'count': len(decision_acc)
1417
+ }
1418
+ except Exception:
1419
+ summary['decision_metrics'] = {
1420
+ 'accuracy': decision_accuracy,
1421
+ 'count': len(decision_acc)
1422
+ }
1423
+
1424
+ # Calculate Pairwise comparison
1425
+ # For refined format, only use refined results (avoid double counting)
1426
+ # For other formats, use all results
1427
+ if refined_format_count > 0:
1428
+ pairwise_results = refined_results
1429
+ else:
1430
+ pairwise_results = valid_results
1431
+
1432
+ paper_scores = []
1433
+ for r in pairwise_results:
1434
+ if (r.get('gt_rating') is not None and r.get('model_rating') is not None) or \
1435
+ (r.get('gt_soundness') is not None and r.get('model_soundness') is not None):
1436
+ paper_scores.append({
1437
+ 'true_rating': r.get('gt_rating'),
1438
+ 'pred_rating': r.get('model_rating'),
1439
+ 'true_soundness': r.get('gt_soundness'),
1440
+ 'pred_soundness': r.get('model_soundness'),
1441
+ 'true_presentation': r.get('gt_presentation'),
1442
+ 'pred_presentation': r.get('model_presentation'),
1443
+ 'true_confidence': r.get('gt_confidence'),
1444
+ 'pred_confidence': r.get('model_confidence')
1445
+ })
1446
+
1447
+ if len(paper_scores) >= 2:
1448
+ pairwise_accuracies = calculate_pairwise_accuracies(paper_scores)
1449
+ summary['pairwise_accuracies'] = pairwise_accuracies
1450
+
1451
+ return results, summary
1452
+
1453
+
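The decision binarization repeated in the blocks above (any string containing "accept" maps to 1, everything else to 0) is easy to isolate; `binarize_decision` is an illustrative name for this sketch.

```python
def binarize_decision(decision):
    # Same substring rule the evaluator applies when building true/pred labels
    return 1 if "accept" in str(decision).lower().strip() else 0

pairs = [("Accept (poster)", "accept"), ("Reject", "reject"), ("Accept (oral)", "reject")]
acc = sum(binarize_decision(p) == binarize_decision(g) for p, g in pairs) / len(pairs)
print(acc)  # 2 of 3 decisions match
```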
1454
+ # ============================================================================
1455
+ # Main Function
1456
+ # ============================================================================
1457
+
1458
+ def parse_args():
1459
+ """Parse command line arguments."""
1460
+ parser = argparse.ArgumentParser(description="Unified evaluation script for semantic and auto-metric evaluation")
1461
+
1462
+ # Input paths
1463
+ parser.add_argument("--rubrics_path", type=str, required=True,
1464
+ help="Path to eval_rubrics.json file (from 1_generate_review_based_rubrics.py)")
1465
+ parser.add_argument("--reviews_path", type=str, required=True,
1466
+                        help="Path to JSON file with model reviews (refined, original, or legacy pred_fast_mode format)")
1467
+
1468
+ # Evaluation mode
1469
+ parser.add_argument("--mode", type=str, choices=["semantic", "auto_metric", "both"], default="both",
1470
+ help="Evaluation mode: semantic (LLM-based), auto_metric (rule-based), or both")
1471
+
1472
+ # Output paths
1473
+ parser.add_argument("--semantic_output", type=str, default=None,
1474
+ help="Path to output JSON file for semantic evaluation results (required if mode is semantic or both)")
1475
+ parser.add_argument("--auto_metric_output", type=str, default=None,
1476
+ help="Path to output JSON file for auto-metric evaluation results (required if mode is auto_metric or both)")
1477
+
1478
+ # Semantic evaluation settings
1479
+ parser.add_argument("--yaml_path", type=str, default=None,
1480
+ help="Path to prompts.yaml file (required for semantic evaluation)")
1481
+ parser.add_argument("--config_path", type=str, default=None,
1482
+ help="Path to configs.yaml file (required for semantic evaluation)")
1483
+
1484
+ # Multi-threading
1485
+ parser.add_argument("--max_workers", type=int, default=None,
1486
+ help="Maximum number of worker threads for semantic evaluation (default: 5)")
1487
+
1488
+ # Strict mode (normalize scores to discrete scales)
1489
+ parser.add_argument("--strict_mode", action="store_true", default=False,
1490
+ help="Enable strict mode: normalize scores to discrete scales before computing metrics (default: False)")
1491
+
1492
+ # Input format override
1493
+ parser.add_argument("--input_format", type=str, choices=['auto', 'refined', 'original'], default='auto',
1494
+ help="Manually specify input JSON format: 'refined' (has scores and initial_scores), 'original' (has model_prediction), or 'auto' for auto-detection (default: 'auto')")
1495
+
1496
+ return parser.parse_args()
1497
+
1498
+
1499
+ def main():
1500
+ """Main execution function."""
1501
+ args = parse_args()
1502
+
1503
+ script_dir = os.path.dirname(os.path.abspath(__file__))
1504
+
1505
+ # Resolve paths
1506
+ rubrics_path = args.rubrics_path
1507
+ if not os.path.isabs(rubrics_path):
1508
+ rubrics_path = os.path.join(script_dir, rubrics_path)
1509
+
1510
+ reviews_path = args.reviews_path
1511
+ if not os.path.isabs(reviews_path):
1512
+ reviews_path = os.path.join(script_dir, reviews_path)
1513
+
1514
+ max_workers = args.max_workers or int(os.getenv("MAX_WORKERS", "5"))
1515
+
1516
+ # Validate mode and output paths
1517
+ if args.mode in ["semantic", "both"]:
1518
+ if not args.semantic_output:
1519
+ raise ValueError("--semantic_output is required when mode is 'semantic' or 'both'")
1520
+ if not args.yaml_path:
1521
+ raise ValueError("--yaml_path is required for semantic evaluation")
1522
+ if not args.config_path:
1523
+ raise ValueError("--config_path is required for semantic evaluation")
1524
+
1525
+ if args.mode in ["auto_metric", "both"]:
1526
+ if not args.auto_metric_output:
1527
+ raise ValueError("--auto_metric_output is required when mode is 'auto_metric' or 'both'")
1528
+
1529
+ # Check if files exist
1530
+ if not os.path.exists(rubrics_path):
1531
+ raise FileNotFoundError(f"Rubrics file not found: {rubrics_path}")
1532
+ if not os.path.exists(reviews_path):
1533
+ raise FileNotFoundError(f"Reviews file not found: {reviews_path}")
1534
+
1535
+ # Load data
1536
+ print(f"Loading rubrics from {rubrics_path}...")
1537
+ rubrics_data = load_rubrics_json(rubrics_path)
1538
+ print(f"Loaded {len(rubrics_data)} rubrics entries")
1539
+
1540
+ print(f"Loading model reviews from {reviews_path}...")
1541
+ if args.input_format != 'auto':
1542
+ print(f"Using manually specified format: {args.input_format}")
1543
+ else:
1544
+ print("Auto-detecting input format...")
1545
+ reviews_dict = load_model_reviews_json(reviews_path, format_override=args.input_format if args.input_format != 'auto' else None)
1546
+ print(f"Loaded {len(reviews_dict)} model reviews")
1547
+
1548
+ # Combine rubrics and reviews
1549
+ print("Combining rubrics and reviews...")
1550
+ evaluation_data = combine_rubrics_and_reviews(rubrics_data, reviews_dict)
1551
+ print(f"Prepared {len(evaluation_data)} entries for evaluation")
1552
+
1553
+ # Run evaluations based on mode
1554
+ if args.mode in ["semantic", "both"]:
1555
+ # Resolve semantic evaluation paths
1556
+ yaml_path = args.yaml_path
1557
+ if not os.path.isabs(yaml_path):
1558
+ yaml_path = os.path.join(script_dir, yaml_path)
1559
+
1560
+ config_path = args.config_path
1561
+ if not os.path.isabs(config_path):
1562
+ config_path = os.path.join(script_dir, config_path)
1563
+
1564
+ if not os.path.exists(yaml_path):
1565
+ raise FileNotFoundError(f"YAML file not found: {yaml_path}")
1566
+ if not os.path.exists(config_path):
1567
+ raise FileNotFoundError(f"Config file not found: {config_path}")
1568
+
1569
+ # Load prompt template
1570
+ print(f"Loading prompt template from {yaml_path}...")
1571
+ prompt_template = load_prompt_template(yaml_path)
1572
+ if not prompt_template:
1573
+ raise ValueError("Could not find 'v1_evaluator_prompt' in YAML file")
1574
+
1575
+ # Initialize LLM service
1576
+ print(f"Loading LLM configuration from {config_path}...")
1577
+ llm_config = load_llm_config(config_path)
1578
+ llm_service = create_llm_service_from_config(llm_config)
1579
+ mode = llm_config.get('mode', 'gpt')
1580
+ print(f"LLM service initialized (mode: {mode})")
1581
+ if hasattr(llm_service, 'model_name'):
1582
+ print(f"Using model: {llm_service.model_name}")
1583
+
1584
+ # Run semantic evaluation
1585
+ semantic_results, semantic_summary = run_semantic_evaluation(
1586
+ evaluation_data, prompt_template, llm_service, max_workers
1587
+ )
1588
+
1589
+ # Save semantic results
1590
+ semantic_output = args.semantic_output
1591
+ if not os.path.isabs(semantic_output):
1592
+ semantic_output = os.path.join(script_dir, semantic_output)
1593
+
1594
+ output_dir = os.path.dirname(semantic_output)
1595
+ os.makedirs(output_dir, exist_ok=True)
1596
+
1597
+ with open(semantic_output, 'w', encoding='utf-8') as f:
1598
+ json.dump(semantic_results, f, ensure_ascii=False, indent=2)
1599
+ print(f"\nSemantic evaluation results saved to {semantic_output}")
1600
+
1601
+ # Save semantic summary
1602
+ semantic_summary_path = semantic_output.replace('.json', '_summary.json')
1603
+ with open(semantic_summary_path, 'w', encoding='utf-8') as f:
1604
+ json.dump(semantic_summary, f, ensure_ascii=False, indent=2)
1605
+ print(f"Semantic evaluation summary saved to {semantic_summary_path}")
1606
+
1607
+ # Print semantic summary
1608
+ print("\n" + "="*80)
1609
+ print("SEMANTIC EVALUATION SUMMARY")
1610
+ print("="*80)
1611
+ print(f"Total entries: {semantic_summary['total_entries']}")
1612
+ print(f"Valid entries: {semantic_summary['valid_entries']}")
1613
+ print(f"Failed entries: {semantic_summary['failed_entries']}")
1614
+ if 'overall_score' in semantic_summary:
1615
+ score = semantic_summary['overall_score']
1616
+ print(f"\nOverall Score:")
1617
+ print(f" Mean: {score['mean']:.2f}")
1618
+ print(f" Min: {score['min']:.2f}")
1619
+ print(f" Max: {score['max']:.2f}")
1620
+
1621
+ if args.mode in ["auto_metric", "both"]:
1622
+ # Run auto-metric evaluation
1623
+ auto_metric_results, auto_metric_summary = run_auto_metric_evaluation(
1624
+ evaluation_data,
1625
+ strict_mode=args.strict_mode
1626
+ )
1627
+
1628
+ # Save auto-metric results
1629
+ auto_metric_output = args.auto_metric_output
1630
+ if not os.path.isabs(auto_metric_output):
1631
+ auto_metric_output = os.path.join(script_dir, auto_metric_output)
1632
+
1633
+ output_dir = os.path.dirname(auto_metric_output)
1634
+ os.makedirs(output_dir, exist_ok=True)
1635
+
1636
+ with open(auto_metric_output, 'w', encoding='utf-8') as f:
1637
+ json.dump(auto_metric_results, f, ensure_ascii=False, indent=2)
1638
+ print(f"\nAuto-metric evaluation results saved to {auto_metric_output}")
1639
+
1640
+ # Save auto-metric summary
1641
+ auto_metric_summary_path = auto_metric_output.replace('.json', '_summary.json')
1642
+ with open(auto_metric_summary_path, 'w', encoding='utf-8') as f:
1643
+ json.dump(auto_metric_summary, f, ensure_ascii=False, indent=2)
1644
+ print(f"Auto-metric evaluation summary saved to {auto_metric_summary_path}")
1645
+
1646
+ # Print auto-metric summary
1647
+ print("\n" + "="*80)
1648
+ print("AUTO-METRIC EVALUATION SUMMARY")
1649
+ print("="*80)
1650
+ print(f"Total entries: {auto_metric_summary['total_entries']}")
1651
+ print(f"Valid entries: {auto_metric_summary['valid_entries']}")
1652
+ print(f"MSE entries: {auto_metric_summary['mse_entries']}")
1653
+
1654
+ if 'mse_statistics' in auto_metric_summary:
1655
+ print("\nMSE Statistics:")
1656
+ for dim, stats in auto_metric_summary['mse_statistics'].items():
1657
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1658
+
1659
+ if 'mae_statistics' in auto_metric_summary:
1660
+ print("\nMAE Statistics:")
1661
+ for dim, stats in auto_metric_summary['mae_statistics'].items():
1662
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1663
+
1664
+ # Print refined and initial statistics if available
1665
+ if 'refined_mse_statistics' in auto_metric_summary:
1666
+ print("\nRefined Scores - MSE Statistics:")
1667
+ for dim, stats in auto_metric_summary['refined_mse_statistics'].items():
1668
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1669
+
1670
+ if 'refined_mae_statistics' in auto_metric_summary:
1671
+ print("\nRefined Scores - MAE Statistics:")
1672
+ for dim, stats in auto_metric_summary['refined_mae_statistics'].items():
1673
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1674
+
1675
+ if 'initial_mse_statistics' in auto_metric_summary:
1676
+ print("\nInitial Scores - MSE Statistics:")
1677
+ for dim, stats in auto_metric_summary['initial_mse_statistics'].items():
1678
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1679
+
1680
+ if 'initial_mae_statistics' in auto_metric_summary:
1681
+ print("\nInitial Scores - MAE Statistics:")
1682
+ for dim, stats in auto_metric_summary['initial_mae_statistics'].items():
1683
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1684
+
1685
+ if 'spearman_correlations' in auto_metric_summary:
1686
+ print("\nSpearman Correlations:")
1687
+ for dim, stats in auto_metric_summary['spearman_correlations'].items():
1688
+ print(f" {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
1689
+
1690
+ # Print refined and initial spearman correlations if available
1691
+ if 'refined_spearman_correlations' in auto_metric_summary:
1692
+ print("\nRefined Scores - Spearman Correlations:")
1693
+ for dim, stats in auto_metric_summary['refined_spearman_correlations'].items():
1694
+ print(f" {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
1695
+
1696
+ if 'initial_spearman_correlations' in auto_metric_summary:
1697
+ print("\nInitial Scores - Spearman Correlations:")
1698
+ for dim, stats in auto_metric_summary['initial_spearman_correlations'].items():
1699
+ print(f" {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
1700
+
1701
+ if 'decision_metrics' in auto_metric_summary:
1702
+ dm = auto_metric_summary['decision_metrics']
1703
+ print(f"\nDecision Metrics:")
1704
+ print(f" Accuracy: {dm['accuracy']:.4f} (n={dm['count']})")
1705
+ if 'f1_macro' in dm:
1706
+ print(f" F1 (macro): {dm['f1_macro']:.4f}")
1707
+
1708
+ # Print refined and initial decision metrics if available
1709
+ if 'refined_decision_metrics' in auto_metric_summary:
1710
+ print("\nRefined Scores - Decision Metrics:")
1711
+ rdm = auto_metric_summary['refined_decision_metrics']
1712
+ print(f" Accuracy: {rdm['accuracy']:.4f} (n={rdm['count']})")
1713
+ if 'f1_macro' in rdm:
1714
+ print(f" F1 (macro): {rdm['f1_macro']:.4f}")
1715
+
1716
+ if 'initial_decision_metrics' in auto_metric_summary:
1717
+ print("\nInitial Scores - Decision Metrics:")
1718
+ idm = auto_metric_summary['initial_decision_metrics']
1719
+ print(f" Accuracy: {idm['accuracy']:.4f} (n={idm['count']})")
1720
+ if 'f1_macro' in idm:
1721
+ print(f" F1 (macro): {idm['f1_macro']:.4f}")
1722
+
1723
+ print("\n" + "="*80)
1724
+ print("EVALUATION COMPLETE")
1725
+ print("="*80)
1726
+
1727
+
1728
+ if __name__ == "__main__":
1729
+ main()
1730
+
src/evaluator/2_evaluate_agenticreview.py ADDED
@@ -0,0 +1,1866 @@
1
+ """
2
+ Unified evaluation script for semantic (LLM-based) and auto_metric (rule-based) evaluation.
3
+
4
+ This script:
5
+ 1. Reads eval_rubrics.json (from 1_generate_review_based_rubrics.py) containing rubrics for each paper
6
+ 2. Reads input JSON file containing model reviews (supports multiple formats)
7
+ 3. Supports three evaluation modes:
8
+ - semantic: LLM-based rubrics evaluation (from 2_evaluate_direct.py)
9
+ - auto_metric: Rule-based metrics evaluation (from 3_rule_evaluate.py)
10
+ - both: Run both evaluations separately
11
+ 4. Supports strict mode: normalize scores to discrete scales before computing metrics (--strict_mode)
12
+ 5. Outputs separate JSON files for results and summaries
13
+
14
+ Usage:
15
+ # Semantic evaluation only
16
+ python 2_evaluate_agenticreview.py \
17
+ --rubrics_path eval_rubrics.json \
18
+ --reviews_path model_reviews.json \
19
+ --mode semantic \
20
+ --yaml_path prompts.yaml \
21
+ --config_path configs.yaml \
22
+ --semantic_output semantic_results.json \
23
+ --max_workers 5
24
+
25
+ # Auto-metric evaluation only
26
+ python 2_evaluate_agenticreview.py \
27
+ --rubrics_path eval_rubrics.json \
28
+ --reviews_path model_reviews.json \
29
+ --mode auto_metric \
30
+ --auto_metric_output auto_metric_results.json
31
+
32
+ # Auto-metric evaluation with strict mode (normalize scores to discrete scales)
33
+ python 2_evaluate_agenticreview.py \
34
+ --rubrics_path eval_rubrics.json \
35
+ --reviews_path model_reviews.json \
36
+ --mode auto_metric \
37
+ --auto_metric_output auto_metric_results.json \
38
+ --strict_mode
39
+
40
+ # Auto-metric evaluation with manually specified input format (refined)
41
+ python 2_evaluate_agenticreview.py \
42
+ --rubrics_path eval_rubrics.json \
43
+ --reviews_path model_reviews.json \
44
+ --mode auto_metric \
45
+ --auto_metric_output auto_metric_results.json \
46
+ --input_format refined
47
+
48
+ # Auto-metric evaluation with manually specified input format (original)
49
+ python 2_evaluate_agenticreview.py \
50
+ --rubrics_path eval_rubrics.json \
51
+ --reviews_path ours.json \
52
+ --mode auto_metric \
53
+ --auto_metric_output auto_metric_results.json \
54
+ --input_format original
55
+
56
+ # Both evaluations
57
+ python 2_evaluate_agenticreview.py \
58
+ --rubrics_path eval_rubrics.json \
59
+ --reviews_path model_reviews.json \
60
+ --mode both \
61
+ --yaml_path prompts.yaml \
62
+ --config_path configs.yaml \
63
+ --semantic_output semantic_results.json \
64
+ --auto_metric_output auto_metric_results.json \
65
+ --max_workers 32
66
+ """
67
+ from __future__ import annotations
68
+
69
+ import json
70
+ import os
71
+ import sys
72
+ import argparse
73
+ import yaml
74
+ import math
75
+ import re
76
+ from typing import Dict, List, Any, Optional
77
+ from concurrent.futures import ThreadPoolExecutor, as_completed
78
+ from tqdm import tqdm
79
+ from itertools import combinations
80
+ from scipy.stats import spearmanr
81
+ from sklearn.metrics import precision_recall_fscore_support
82
+
83
+ # Add parent directory to path
84
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
85
+ # Import parse_llm_response from local llm_service module
86
+ import llm_service as local_llm_service
87
+ parse_llm_response = local_llm_service.parse_llm_response
88
+
89
+ # Import from shared/utils for gpt/vllm support
90
+ project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
91
+ if project_root not in sys.path:
92
+ sys.path.insert(0, project_root)
93
+
94
+ from shared.utils.llm_service import LLMService
95
+ from shared.utils.vllm_service import VLLMService
96
+ from shared.utils.gpt_service import GPTService
97
+ sys.path.insert(0, os.path.join(project_root, 'shared', 'utils'))
98
+ from json_parser import parse_review_markdown
99
+
100
+
101
+ def convert_ai_researcher(review: dict) -> tuple:
102
+ """
103
+ Convert the review text from ai-researcher format to unified review system format.
104
+ """
105
+
106
+ summary = review["Summary"]
107
+ strengths = "\n".join(f"- {s}" for s in review["Strengths"])
108
+ weaknesses = "\n".join(f"- {w}" for w in review["Weaknesses"])
109
+
110
+ # scores
111
+ originality = review["Originality"]
112
+ quality = review["Quality"]
113
+ clarity = review["Clarity"]
114
+ significance = review["Significance"]
115
+
116
+ questions = "\n".join(f"- {q}" for q in review["Questions"])
117
+ limitations = "\n".join(f"- {l}" for l in review["Limitations"])
118
+ ethical_concerns = review["Ethical Concerns"]
119
+
120
+ # scores again
121
+ soundness = review["Soundness"]
122
+ presentation = review["Presentation"]
123
+ contribution = review["Contribution"]
124
+ overall = review["Overall"]
125
+ confidence = review["Confidence"]
126
+
127
+ # final decision
128
+ decision = review["Decision"]
129
+
130
+ meta_review = {
131
+ "rating": overall,
132
+ "soundness": soundness,
133
+ "presentation": presentation,
134
+ "contribution": contribution,
135
+ "confidence": confidence,
136
+ "decision": decision.lower().strip(),
137
+ }
138
+
139
+ review_text = (
+ f"Summary: {summary}\nStrengths: {strengths}\nWeaknesses: {weaknesses}\n"
+ f"Originality: {originality}\nQuality: {quality}\nClarity: {clarity}\n"
+ f"Significance: {significance}\nQuestions: {questions}\nLimitations: {limitations}\n"
+ f"Ethical Concerns: {ethical_concerns}\nSoundness: {soundness}\n"
+ f"Presentation: {presentation}\nContribution: {contribution}\n"
+ f"Overall: {overall}\nConfidence: {confidence}\nDecision: {decision}"
+ )
+ return review_text, meta_review
140
+
141
+
142
+ def convert_agenticreview(review_text: str) -> tuple:
143
+ """
144
+ Convert the review text from agenticreview format to unified review system format.
145
+
146
+ The agenticreview format has text like:
147
+ "Overall rating: 5\n\nSignificance and novelty: ..."
148
+
149
+ Args:
150
+ review_text: Raw review text string
151
+
152
+ Returns:
153
+ Tuple of (formatted_review_text, meta_review_dict)
154
+ """
155
+ # Extract rating from "Overall rating: x" format
156
+ rating = None
157
+ rating_match = re.search(r'Overall\s+rating\s*[:=]\s*(\d+\.?\d*)', review_text, re.IGNORECASE)
158
+ if rating_match:
159
+ try:
160
+ rating = float(rating_match.group(1))
161
+ except (ValueError, IndexError):
162
+ pass
163
+
164
+ # If not found, try alternative patterns
165
+ if rating is None:
166
+ rating_match = re.search(r'(?:rating|score)\s*[:=]\s*(\d+\.?\d*)', review_text, re.IGNORECASE)
167
+ if rating_match:
168
+ try:
169
+ rating = float(rating_match.group(1))
170
+ except (ValueError, IndexError):
171
+ pass
172
+
173
+ # Try to extract from parse_review_markdown as fallback
174
+ if rating is None:
175
+ try:
176
+ parsed = parse_review_markdown(review_text)
177
+ rating = parsed.get('rating')
178
+ except Exception:
179
+ pass
180
+
181
+ # Create meta_review dict - agenticreview only has rating, no other scores
182
+ meta_review = {
183
+ "rating": rating,
184
+ "soundness": None,
185
+ "presentation": None,
186
+ "contribution": None,
187
+ "confidence": None,
188
+ "decision": None,
189
+ }
190
+
191
+ # Return the review text as-is (it's already in a readable format)
192
+ return review_text, meta_review
193
+
194
+
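The rating extraction above leans on a tolerant regex with several fallbacks. A minimal standalone sketch of the primary pattern (the sample review text is made up for illustration):

```python
import re

def extract_overall_rating(review_text: str):
    """Pull the numeric rating from 'Overall rating: x', case-insensitively."""
    match = re.search(r'Overall\s+rating\s*[:=]\s*(\d+\.?\d*)', review_text, re.IGNORECASE)
    return float(match.group(1)) if match else None

sample = "Overall rating: 5\n\nSignificance and novelty: strong empirical results."
print(extract_overall_rating(sample))  # 5.0
```

Because `[:=]` accepts either separator and the flag ignores case, variants like `overall RATING = 7.5` also parse; anything without the phrase falls through to the secondary patterns in the function above.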
195
+ class ReviewProcessor:
196
+ """Handles the extraction and processing of reviews from different sources."""
197
+
198
+ @staticmethod
199
+ def extract_review_content(pred_context):
200
+ """
201
+ Extract the review content from the prediction context.
202
+
203
+ Args:
204
+ pred_context: Raw prediction data that contains the review
205
+
206
+ Returns:
207
+ str: Extracted review content
208
+ """
209
+ try:
210
+ # First attempt to extract from boxed format
211
+ return pred_context.split(r'\boxed_review{')[-1].split('\n}')[0]
212
+ except Exception:
213
+ # Alternative extraction if the first method fails
214
+ if isinstance(pred_context, dict) and 'output' in pred_context:
215
+ return pred_context['output'].split(r'\boxed_review{')[-1].split('\n}')[0]
216
+ else:
217
+ # Return as is if extraction fails
218
+ return pred_context
219
+
220
+
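The split-based extraction in `extract_review_content` can be illustrated on its own; the sample string below is hypothetical:

```python
def extract_boxed_review(text: str) -> str:
    # Take everything after the last literal '\boxed_review{' marker,
    # then cut at the first '\n}' that closes the box. If the marker is
    # absent, the input passes through unchanged.
    return text.split(r'\boxed_review{')[-1].split('\n}')[0]

raw = "Reasoning trace...\n\\boxed_review{Summary: solid paper.\n}\nFooter"
print(extract_boxed_review(raw))  # Summary: solid paper.
```

Note the graceful degradation: on plain text with no box, `split` returns the whole string, which matches the fallback behavior in the class above.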
221
+ # ============================================================================
222
+ # Semantic Evaluation Functions (from 2_evaluate_direct.py)
223
+ # ============================================================================
224
+
225
+ def load_prompt_template(yaml_path: str) -> str:
226
+ """Load the evaluator prompt from YAML file."""
227
+ with open(yaml_path, 'r', encoding='utf-8') as f:
228
+ prompts = yaml.safe_load(f)
229
+ return prompts.get('v1_evaluator_prompt', '')
230
+
231
+
232
+ def build_evaluation_prompt(
233
+ rubrics: List[Dict[str, Any]],
234
+ paper_content: str,
235
+ review: str,
236
+ prompt_template: str
237
+ ) -> str:
238
+ """Build the evaluation prompt by replacing placeholders."""
239
+ rubrics_json = json.dumps(rubrics, indent=4, ensure_ascii=False)
240
+ prompt = prompt_template.replace('{rubrics_json}', rubrics_json)
241
+ prompt = prompt.replace('<<paper_content>>', paper_content)
242
+ prompt = prompt.replace('<<review>>', review)
243
+ return prompt
244
+
245
+
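`build_evaluation_prompt` does plain string substitution rather than `str.format`, so literal braces elsewhere in the template are safe. A minimal sketch with a made-up template:

```python
import json

template = "Rubrics:\n{rubrics_json}\nPaper:\n<<paper_content>>\nReview:\n<<review>>"
rubrics = [{"title": "Clarity", "weight": 0.5}]

# Same three-step replacement as the function above.
prompt = template.replace('{rubrics_json}', json.dumps(rubrics, indent=4, ensure_ascii=False))
prompt = prompt.replace('<<paper_content>>', 'PAPER TEXT')
prompt = prompt.replace('<<review>>', 'REVIEW TEXT')
print(prompt)
```

Using `replace` keeps the template free of escaping concerns, at the cost of silently leaving a placeholder intact if its spelling drifts.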
246
+ def calculate_weighted_scores(
247
+ raw_scores: Dict[str, Dict[str, Any]],
248
+ rubrics: List[Dict[str, Any]]
249
+ ) -> Dict[str, float]:
250
+ """Calculate weighted scores for each rubric."""
251
+ rubric_weights = {r['title']: r['weight'] for r in rubrics}
252
+ weighted_scores = {}
253
+
254
+ for rubric_title, rubric_data in raw_scores.items():
255
+ if rubric_title not in rubric_weights:
256
+ continue
257
+
258
+ rubric_score = rubric_data.get('score', 0)
259
+ if isinstance(rubric_score, str):
260
+ try:
261
+ rubric_score = int(rubric_score)
262
+ except ValueError:
263
+ rubric_score = 0
264
+
265
+ if rubric_score not in [0, 1]:
266
+ rubric_score = 1 if rubric_score > 0 else 0
267
+
268
+ weight = rubric_weights[rubric_title]
269
+ weighted_scores[rubric_title] = rubric_score * weight
270
+
271
+ return weighted_scores
272
+
273
+
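The weighting logic above clamps each rubric score to {0, 1} and multiplies by the rubric's weight, skipping any rubric title the evaluator emits that is not in the rubric list. An illustrative sketch (function name and sample data are hypothetical):

```python
def weight_rubric_scores(raw_scores, rubrics):
    """Clamp each rubric score to 0/1, then multiply by the rubric's weight."""
    weights = {r['title']: r['weight'] for r in rubrics}
    weighted = {}
    for title, data in raw_scores.items():
        if title not in weights:
            continue  # ignore rubric titles the evaluator invented
        try:
            score = int(data.get('score', 0))
        except (ValueError, TypeError):
            score = 0
        score = 1 if score > 0 else 0
        weighted[title] = score * weights[title]
    return weighted

rubrics = [{'title': 'Clarity', 'weight': 2.0}, {'title': 'Novelty', 'weight': 3.0}]
raw = {'Clarity': {'score': 1}, 'Novelty': {'score': '0'}, 'Unknown': {'score': 1}}
print(weight_rubric_scores(raw, rubrics))  # Clarity -> 2.0, Novelty -> 0.0
```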
274
+ def calculate_scores(raw_scores: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
275
+ """Calculate scores for each rubric."""
276
+ scores = {}
277
+ for rubric_title, rubric_data in raw_scores.items():
278
+ scores[rubric_title] = rubric_data.get('score', 0)
279
+ return scores
280
+
281
+
282
+ def evaluate_review_semantic(
283
+ entry: Dict[str, Any],
284
+ paper_content: str,
285
+ prompt_template: str,
286
+ llm_service: LLMService
287
+ ) -> Dict[str, Any]:
288
+ """Evaluate a single review using article-specific rubrics."""
289
+ entry_id = entry.get('id', 'unknown')
290
+ rubrics = entry.get('rubrics', [])
291
+ model_review = entry.get('model_review', '')
292
+
293
+ if not rubrics:
294
+ return {
295
+ 'id': entry_id,
296
+ 'raw_scores': {},
297
+ 'weighted_scores': {},
298
+ 'total_score': 0.0,
299
+ 'error': 'No valid rubrics found',
300
+ 'raw_response': ''
301
+ }
302
+
303
+ # Build prompt
304
+ prompt = build_evaluation_prompt(rubrics, paper_content, model_review, prompt_template)
305
+
306
+ # Call LLM
307
+ try:
308
+ messages = [{"role": "user", "content": prompt}]
309
+ response = llm_service.generate(messages=messages)
310
+
311
+ # Parse response
312
+ raw_scores = parse_llm_response(response)
313
+ # Note: this uses the unweighted per-rubric scores from calculate_scores();
+ # calculate_weighted_scores() is available above if weight-adjusted totals are desired.
+ weighted_scores = calculate_scores(raw_scores)
314
+ total_score = sum(weighted_scores.values())
315
+
316
+ return {
317
+ 'id': entry_id,
318
+ 'raw_scores': raw_scores,
319
+ 'weighted_scores': weighted_scores,
320
+ 'total_score': total_score,
321
+ 'raw_response': response
322
+ }
323
+ except Exception as e:
324
+ print(f"[ERROR] Error evaluating review {entry_id}: {e}")
325
+ return {
326
+ 'id': entry_id,
327
+ 'raw_scores': {},
328
+ 'weighted_scores': {},
329
+ 'total_score': 0.0,
330
+ 'error': str(e),
331
+ 'raw_response': ''
332
+ }
333
+
334
+
335
+ def calculate_per_rubric_statistics(
336
+ valid_results: List[Dict[str, Any]],
337
+ rubric_titles: List[str]
338
+ ) -> Dict[str, Dict[str, float]]:
339
+ """Calculate per-rubric statistics from evaluation results."""
340
+ rubric_scores = {title: [] for title in rubric_titles}
341
+
342
+ for result in valid_results:
343
+ weighted_scores = result.get('weighted_scores', {})
344
+ if not isinstance(weighted_scores, dict):
345
+ continue
346
+
347
+ for rubric_title in rubric_titles:
348
+ if rubric_title in weighted_scores:
349
+ score = weighted_scores[rubric_title]
350
+ if isinstance(score, str):
351
+ try:
352
+ score = float(score)
353
+ except ValueError:
354
+ continue
355
+ elif isinstance(score, (int, float)):
356
+ score = float(score)
357
+ else:
358
+ continue
359
+ rubric_scores[rubric_title].append(score)
360
+
361
+ per_rubric_stats = {}
362
+ for rubric_title in rubric_titles:
363
+ scores = rubric_scores[rubric_title]
364
+ if not scores:
365
+ continue
366
+
367
+ mean_score = sum(scores) / len(scores)
368
+ min_score = min(scores)
369
+ max_score = max(scores)
370
+ count = len(scores)
371
+
372
+ if rubric_title == "False or Contradictory Claims":
373
+ pass_count = sum(1 for s in scores if s >= 0)
374
+ else:
375
+ pass_count = sum(1 for s in scores if s >= 1)
376
+ pass_rate = pass_count / count if count > 0 else 0.0
377
+
378
+ per_rubric_stats[rubric_title] = {
379
+ 'mean': mean_score,
380
+ 'min': min_score,
381
+ 'max': max_score,
382
+ 'count': count,
383
+ 'pass_rate': pass_rate
384
+ }
385
+
386
+ return per_rubric_stats
387
+
388
+
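The pass-rate computation above applies a different threshold to the "False or Contradictory Claims" rubric. Assuming the intent is that this rubric carries penalties (so a non-negative score counts as a pass, while ordinary rubrics require at least one full point), the rule can be sketched standalone:

```python
def pass_rate(scores, rubric_title):
    # Assumed semantics: the claims rubric is penalty-based, so any
    # non-negative score passes; other rubrics need a score >= 1.
    threshold = 0 if rubric_title == "False or Contradictory Claims" else 1
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)

print(pass_rate([1.0, 0.0, 2.0], "Clarity"))                    # 2 of 3 pass
print(pass_rate([0.0, -1.0], "False or Contradictory Claims"))  # 1 of 2 pass
```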
389
+ # ============================================================================
390
+ # Auto-Metric Evaluation Functions (from 3_rule_evaluate.py)
391
+ # ============================================================================
392
+
393
+ def extract_scores_from_review(review_text: str) -> Dict[str, Any]:
394
+ """Extract numeric scores and decision from a review markdown text."""
395
+ if not review_text:
396
+ return {'soundness': None, 'presentation': None, 'rating': None, 'confidence': None, 'decision': None}
397
+
398
+ try:
399
+ parsed = parse_review_markdown(review_text)
400
+ decision = parsed.get('decision', '')
401
+ if decision:
402
+ decision_lower = decision.lower().strip()
403
+ if 'accept' in decision_lower:
404
+ decision = 'accept'
405
+ elif 'reject' in decision_lower:
406
+ decision = 'reject'
407
+ elif 'undecided' in decision_lower:
408
+ decision = 'undecided'
409
+ else:
410
+ decision = decision_lower
411
+ else:
412
+ decision = None
413
+
414
+ return {
415
+ 'soundness': parsed.get('soundness'),
416
+ 'presentation': parsed.get('presentation'),
417
+ 'rating': parsed.get('rating'),
418
+ 'confidence': parsed.get('confidence'),
419
+ 'decision': decision
420
+ }
421
+ except Exception as e:
422
+ print(f"Warning: Failed to parse review text: {e}")
423
+ return {'soundness': None, 'presentation': None, 'rating': None, 'confidence': None, 'decision': None}
424
+
425
+
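The decision normalization inside `extract_scores_from_review` maps free-form decision strings onto a small label set by substring matching, checking 'accept' first. A standalone sketch of that rule (function name is illustrative):

```python
def normalize_decision(decision):
    """Map a free-form decision string to accept/reject/undecided, else pass through."""
    if not decision:
        return None
    d = decision.lower().strip()
    for label in ('accept', 'reject', 'undecided'):
        if label in d:
            return label
    return d  # unknown labels survive lowercased, as in the function above

print(normalize_decision("Strong Accept"))       # accept
print(normalize_decision("Borderline reject "))  # reject
print(normalize_decision(""))                    # None
```

One consequence of checking 'accept' first is that a phrase like "do not accept" still maps to 'accept'; the original function shares this behavior.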
426
+ def calculate_mse(predicted: float, ground_truth: float) -> Optional[float]:
427
+ """Calculate Mean Squared Error for a single value."""
428
+ if predicted is None or ground_truth is None:
429
+ return None
430
+ return (predicted - ground_truth) ** 2
431
+
432
+
433
+ def calculate_mae(predicted: float, ground_truth: float) -> Optional[float]:
434
+ """Calculate Mean Absolute Error for a single value."""
435
+ if predicted is None or ground_truth is None:
436
+ return None
437
+ return abs(predicted - ground_truth)
438
+
439
+
440
+def normalize_to_discrete_scale(score: Optional[float], scale_type: str) -> Optional[float]:
+    """
+    Normalize a float score to the nearest discrete value for the given scale type.
+    Ties break upward (round-half-up), e.g. 3.5 rounds to 4 and 1.5 rounds to 2.
+
+    Args:
+        score: The float score to normalize (may be None)
+        scale_type: Either '0-5' for the 0-5 scale (discrete: 0,1,2,3,4,5)
+            or '0-10' for the 0-10 scale (discrete: 0,2,4,6,8,10)
+
+    Returns:
+        Normalized discrete score, or None if the input is None or not numeric
+    """
+    if score is None:
+        return None
+
+    try:
+        score = float(score)
+    except (ValueError, TypeError):
+        return None
+
+    if scale_type == '0-5':
+        discrete_values = [0, 1, 2, 3, 4, 5]
+        score = max(0, min(5, score))
+    elif scale_type == '0-10':
+        discrete_values = [0, 2, 4, 6, 8, 10]
+        score = max(0, min(10, score))
+    else:
+        raise ValueError(f"Unknown scale_type: {scale_type}. Must be '0-5' or '0-10'")
+
+    # Find the nearest discrete value; on a tie, prefer the higher value (round-half-up)
+    best_value = None
+    best_distance = float('inf')
+    for val in discrete_values:
+        distance = abs(val - score)
+        if distance < best_distance or (distance == best_distance and val > best_value):
+            best_distance = distance
+            best_value = val
+    return best_value
+
+
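The snapping rule above can be sanity-checked with a compact equivalent. The `snap` helper below is hypothetical (not part of the script); it inlines the same clamp-then-nearest-with-upward-ties behavior:

```python
# Hypothetical compact equivalent of the round-half-up snapping used above:
# clamp to the allowed range, then pick the nearest allowed value,
# breaking ties toward the higher one.
def snap(score, allowed):
    score = max(allowed[0], min(allowed[-1], float(score)))
    return max(allowed, key=lambda v: (-abs(v - score), v))

print(snap(3.5, [0, 1, 2, 3, 4, 5]))    # -> 4 (tie rounds up)
print(snap(4.9, [0, 2, 4, 6, 8, 10]))   # -> 4
print(snap(7.2, [0, 2, 4, 6, 8, 10]))   # -> 8
```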
+def normalize_scores_dict(scores: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
+    """
+    Normalize all scores in a dictionary to their appropriate discrete scales.
+
+    Args:
+        scores: Dictionary with keys 'soundness', 'presentation', 'rating', 'confidence'
+
+    Returns:
+        Dictionary with normalized scores
+    """
+    normalized = {}
+
+    # soundness, presentation, confidence use the 0-5 scale
+    for key in ['soundness', 'presentation', 'confidence']:
+        normalized[key] = normalize_to_discrete_scale(scores.get(key), '0-5')
+
+    # rating uses the 0-10 scale
+    normalized['rating'] = normalize_to_discrete_scale(scores.get('rating'), '0-10')
+
+    return normalized
+
+
+def calculate_score_metrics(
+    model_scores: Dict[str, float],
+    ground_truth_scores: Dict[str, float],
+    normalize: bool = False
+) -> Dict[str, Any]:
+    """
+    Calculate MSE and MAE metrics for each scoring dimension.
+
+    Args:
+        model_scores: Dictionary with model scores
+        ground_truth_scores: Dictionary with ground truth scores
+        normalize: If True, normalize scores to discrete scales before computing metrics
+
+    Returns:
+        Dictionary with MSE and MAE metrics and, when normalize=True, the normalized scores
+    """
+    dimensions = ['soundness', 'presentation', 'rating', 'confidence']
+
+    # Normalize scores to discrete scales if requested
+    if normalize:
+        model_scores_normalized = normalize_scores_dict(model_scores)
+        gt_scores_normalized = normalize_scores_dict(ground_truth_scores)
+    else:
+        model_scores_normalized = model_scores
+        gt_scores_normalized = ground_truth_scores
+
+    mse_values = {}
+    mae_values = {}
+    valid_count = 0
+
+    for dim in dimensions:
+        mse = calculate_mse(model_scores_normalized.get(dim), gt_scores_normalized.get(dim))
+        mae = calculate_mae(model_scores_normalized.get(dim), gt_scores_normalized.get(dim))
+        mse_values[f'{dim}_mse'] = mse
+        mae_values[f'{dim}_mae'] = mae
+        if mse is not None:
+            valid_count += 1
+
+    # Overall error is the sum of the per-dimension MSE values that could be computed
+    overall_error = sum(v for v in mse_values.values() if v is not None)
+
+    result = {
+        **mse_values,
+        **mae_values,
+        'overall_error': overall_error if valid_count > 0 else None,
+        'valid_dimensions': valid_count
+    }
+
+    # Include normalized scores in the result for transparency (only if normalize=True)
+    if normalize:
+        result['model_scores_normalized'] = model_scores_normalized
+        result['gt_scores_normalized'] = gt_scores_normalized
+
+    return result
+
+
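The per-dimension error computation reduces to squared and absolute differences over the dimensions where both sides have a value, with missing dimensions skipped. A minimal self-contained sketch (helper functions inlined, scores illustrative):

```python
# Illustrative model and ground-truth scores; 'confidence' has no ground truth
# and is skipped, matching the None-handling in the metric helpers above.
model = {'soundness': 3, 'presentation': 2, 'rating': 6, 'confidence': 4}
truth = {'soundness': 4, 'presentation': 2, 'rating': 8, 'confidence': None}

mse = {d: (model[d] - truth[d]) ** 2 for d in model if truth[d] is not None}
mae = {d: abs(model[d] - truth[d]) for d in model if truth[d] is not None}
print(mse)                # {'soundness': 1, 'presentation': 0, 'rating': 4}
print(sum(mse.values()))  # overall_error = 5
```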
+def normalize_score_value(value):
+    """Normalize a score value to float, handling string representations."""
+    if value is None:
+        return None
+    if isinstance(value, (int, float)):
+        return float(value)
+    if isinstance(value, str):
+        # Extract the first numeric token from the string (e.g., "2.75" -> 2.75)
+        match = re.search(r'(\d+\.?\d*)', value)
+        if match:
+            return float(match.group(1))
+    return None
+
+
+def normalize_decision(decision):
+    """Normalize a decision string to a standard form ('accept', 'reject', or 'undecided')."""
+    if decision is None:
+        return None
+    decision_lower = str(decision).lower().strip()
+    if 'accept' in decision_lower:
+        return 'accept'
+    elif 'reject' in decision_lower:
+        return 'reject'
+    elif 'undecided' in decision_lower:
+        return 'undecided'
+    else:
+        return decision_lower
+
+
+def extract_scores_from_dict(scores_dict: Dict[str, Any]) -> Dict[str, Any]:
+    """
+    Extract scores from a structured dictionary (scores or initial_scores format).
+
+    Args:
+        scores_dict: Dict containing scores (e.g., {'rating': 5.75, 'soundness': '2.75', ...})
+
+    Returns:
+        Dict with normalized scores: {'soundness', 'presentation', 'rating', 'confidence', 'decision'}
+    """
+    if not scores_dict:
+        return {
+            'soundness': None,
+            'presentation': None,
+            'rating': None,
+            'confidence': None,
+            'decision': None
+        }
+
+    return {
+        'soundness': normalize_score_value(scores_dict.get('soundness')),
+        'presentation': normalize_score_value(scores_dict.get('presentation')),
+        'rating': normalize_score_value(scores_dict.get('rating')),
+        'confidence': normalize_score_value(scores_dict.get('confidence')),
+        'decision': normalize_decision(scores_dict.get('decision'))
+    }
+
+
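Scores arrive as ints, floats, or strings, and decisions as free-form text. A small self-contained sketch of the normalization behavior (the `to_float` helper is hypothetical, inlining the regex extraction used above):

```python
import re

# Hypothetical inlined version of the string-to-float normalization above.
def to_float(value):
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        m = re.search(r'(\d+\.?\d*)', value)
        return float(m.group(1)) if m else None
    return None

print(to_float('2.75: somewhat sound'))       # -> 2.75
print(to_float(3))                            # -> 3.0
print('accept' in 'Accept (poster)'.lower())  # substring match used for decisions -> True
```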
+def evaluate_review_auto_metric(entry: Dict[str, Any], use_initial_scores: bool = False, strict_mode: bool = False) -> Dict[str, Any]:
+    """
+    Evaluate a single entry by extracting scores and calculating metrics.
+
+    Args:
+        entry: Evaluation entry containing model_review, scores, initial_scores, etc.
+        use_initial_scores: If True, use initial_scores instead of refined scores (for refined format)
+        strict_mode: If True, normalize scores to discrete scales before computing metrics
+
+    Returns:
+        Dict containing evaluation metrics
+    """
+    entry_id = entry.get('id', 'unknown')
+    model_review = entry.get('model_review', '')
+    format_type = entry.get('format', 'unknown')
+
+    # Extract model scores based on format
+    if format_type == 'refined' and not use_initial_scores:
+        # Use refined scores from structured data
+        model_data = extract_scores_from_dict(entry.get('scores', {}))
+    elif format_type == 'refined' and use_initial_scores:
+        # Use initial scores from structured data
+        model_data = extract_scores_from_dict(entry.get('initial_scores', {}))
+    elif format_type == 'original':
+        # Use initial scores from structured data
+        model_data = extract_scores_from_dict(entry.get('initial_scores', {}))
+    else:
+        # Fallback: extract from the markdown review text
+        model_data = extract_scores_from_review(model_review)
+
+    model_scores = {
+        'soundness': model_data.get('soundness'),
+        'presentation': model_data.get('presentation'),
+        'rating': model_data.get('rating'),
+        'confidence': model_data.get('confidence')
+    }
+    model_decision = model_data.get('decision')
+
+    # Fallback for the original format: if confidence is missing from structured data,
+    # try to extract it from the review text (meta_review may not carry a confidence field)
+    if format_type == 'original' and model_scores.get('confidence') is None and model_review:
+        try:
+            review_data = extract_scores_from_review(model_review)
+            if review_data.get('confidence') is not None:
+                model_scores['confidence'] = review_data.get('confidence')
+        except Exception:
+            pass  # Keep confidence as None if extraction fails
+
+    # Ground truth scores come from golden_review ONLY, never from model output.
+    # If extraction fails, the fields stay None; using model output as ground truth
+    # would inflate evaluation scores.
+    ground_truth_review = entry.get('golden_review', '')
+    ground_truth_scores = {}
+    gt_decision = None
+
+    if not ground_truth_review:
+        print(f"Warning: No golden_review found for entry {entry_id}. Ground truth scores will be empty.")
+    else:
+        try:
+            # Extract scores from the golden_review markdown text
+            gt_data = extract_scores_from_review(ground_truth_review)
+            if not gt_data:
+                print(f"Warning: Failed to parse golden_review for entry {entry_id}. Ground truth scores will be empty.")
+            else:
+                ground_truth_scores = {
+                    'soundness': gt_data.get('soundness'),
+                    'presentation': gt_data.get('presentation'),
+                    'rating': gt_data.get('rating'),
+                    'confidence': gt_data.get('confidence')
+                }
+                gt_decision = normalize_decision(gt_data.get('decision'))
+        except Exception as e:
+            print(f"Warning: Failed to extract scores from golden_review for {entry_id}: {e}")
+            print(f"  Ground truth scores will be empty. Error: {str(e)}")
+
+    # Calculate MSE and MAE metrics (with normalization in strict mode)
+    score_metrics = calculate_score_metrics(model_scores, ground_truth_scores, normalize=strict_mode)
+
+    # Calculate decision accuracy
+    decision_match = False
+    decision_accuracy = None
+    if model_decision is not None and gt_decision is not None:
+        model_decision_normalized = normalize_decision(model_decision)
+        decision_match = (model_decision_normalized == gt_decision)
+        decision_accuracy = 1.0 if decision_match else 0.0
+
+    result = {
+        'id': entry_id,
+        'format': format_type,
+        'model_soundness': model_scores.get('soundness'),
+        'model_presentation': model_scores.get('presentation'),
+        'model_rating': model_scores.get('rating'),
+        'model_confidence': model_scores.get('confidence'),
+        'model_decision': model_decision,
+        'gt_soundness': ground_truth_scores.get('soundness'),
+        'gt_presentation': ground_truth_scores.get('presentation'),
+        'gt_rating': ground_truth_scores.get('rating'),
+        'gt_confidence': ground_truth_scores.get('confidence'),
+        'gt_decision': gt_decision,
+        'decision_match': decision_match,
+        'decision_accuracy': decision_accuracy,
+        **score_metrics
+    }
+
+    # Record which scores were used
+    if format_type == 'refined':
+        result['score_type'] = 'initial' if use_initial_scores else 'refined'
+    else:
+        result['score_type'] = 'auto'
+
+    return result
+
+
+def calculate_pairwise_accuracies(paper_scores: List[Dict[str, float]]) -> Dict[str, float]:
+    """Calculate pairwise ranking accuracy for each metric by comparing predicted and true orderings."""
+    if len(paper_scores) < 2:
+        return {}
+
+    metrics = ['rating', 'soundness', 'presentation', 'confidence']
+    total_valid_pairs = {metric: 0 for metric in metrics}
+    correct_pairs = {metric: 0 for metric in metrics}
+
+    for paper1, paper2 in combinations(paper_scores, 2):
+        for metric in metrics:
+            true_key = f'true_{metric}'
+            pred_key = f'pred_{metric}'
+            # Only count pairs where both papers have both true and predicted values
+            if (paper1.get(true_key) is not None and paper2.get(true_key) is not None and
+                    paper1.get(pred_key) is not None and paper2.get(pred_key) is not None):
+                total_valid_pairs[metric] += 1
+                true_order = paper1[true_key] > paper2[true_key]
+                pred_order = paper1[pred_key] > paper2[pred_key]
+                if true_order == pred_order:
+                    correct_pairs[metric] += 1
+
+    pairwise_accuracies = {
+        metric: correct_pairs[metric] / total_valid_pairs[metric] if total_valid_pairs[metric] > 0 else 0.0
+        for metric in metrics
+    }
+
+    return pairwise_accuracies
+
+
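Pairwise accuracy counts a pair as correct when the predicted ordering of two papers agrees with the true ordering. A self-contained sketch on one metric with illustrative values:

```python
from itertools import combinations

# Three hypothetical papers: true vs predicted ratings.
papers = [
    {'true_rating': 8, 'pred_rating': 6},
    {'true_rating': 4, 'pred_rating': 5},
    {'true_rating': 6, 'pred_rating': 7},
]

total = correct = 0
for a, b in combinations(papers, 2):
    total += 1
    # A pair is correct when predicted and true orderings agree
    if (a['true_rating'] > b['true_rating']) == (a['pred_rating'] > b['pred_rating']):
        correct += 1
print(correct / total)  # 2 of 3 pairs agree
```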
+# ============================================================================
+# Data Loading Functions
+# ============================================================================
+
+def load_rubrics_json(rubrics_path: str) -> Dict[str, Dict[str, Any]]:
+    """Load a rubrics JSON file and build a lookup keyed by paper id."""
+    with open(rubrics_path, 'r', encoding='utf-8') as f:
+        data = json.load(f)
+
+    if isinstance(data, list):
+        return {item['id']: item for item in data}
+    elif isinstance(data, dict):
+        return data
+    else:
+        raise ValueError(f"Invalid rubrics JSON format: expected list or dict, got {type(data)}")
+
+
+def load_model_reviews_json(reviews_path: str, format_override: Optional[str] = None) -> Dict[str, Dict[str, Any]]:
+    """
+    Load a model-reviews JSON file and extract reviews by id.
+
+    Supports two input formats:
+    1. Refined format: contains 'scores' and 'initial_scores' fields (from the refinement pipeline)
+    2. Original format: contains 'model_prediction' with 'meta_review' and 'decision' (like ours.json)
+
+    Args:
+        reviews_path: Path to the JSON file containing model reviews
+        format_override: Optional format override ('refined', 'original', or None for auto-detect)
+
+    Returns:
+        Dict mapping paper_id to a dict containing:
+        - 'review': review text (markdown)
+        - 'scores': refined scores dict (if available)
+        - 'initial_scores': initial scores dict (if available)
+        - 'format': format type ('refined', 'original', 'standard', or 'legacy')
+    """
+    with open(reviews_path, 'r', encoding='utf-8') as f:
+        data = json.load(f)
+
+    if isinstance(data, dict):
+        data = list(data.values())
+
+    reviews_dict = {}
+    for item in data:
+        item_id = None
+        review_text = ''
+        scores = None
+        initial_scores = None
+        format_type = None
+
+        # Use the format override if provided, otherwise auto-detect
+        if format_override and format_override != 'auto':
+            if format_override == 'refined':
+                item_id = item.get('paper_id') or item.get('id')
+                if not item_id:
+                    continue
+                format_type = 'refined'
+                review_text = item.get('review_markdown', '') or item.get('review', '')
+                scores = item.get('scores', {})
+                initial_scores = item.get('initial_scores', {})
+            elif format_override == 'original':
+                item_id = item.get('id')
+                if not item_id:
+                    continue
+                format_type = 'original'
+                model_prediction = item.get('model_prediction', {})
+                meta_review = model_prediction.get('meta_review', {})
+                review_text = meta_review.get('content', '') or model_prediction.get('raw_text', '')
+                initial_scores = {
+                    'rating': meta_review.get('rating'),
+                    'soundness': meta_review.get('soundness'),
+                    'presentation': meta_review.get('presentation'),
+                    'contribution': meta_review.get('contribution'),
+                    'confidence': meta_review.get('confidence'),
+                    'decision': model_prediction.get('decision'),
+                }
+            else:
+                raise ValueError(f"Unknown format_override: {format_override}. Must be 'refined', 'original', or 'auto'")
+        else:
+            # Auto-detect format
+            if "paper_id" in item:
+                # Refined format (from the refinement pipeline)
+                item_id = item.get('paper_id')
+                if not item_id:
+                    continue
+
+                # Check whether this is the refined format (has scores and initial_scores)
+                if 'scores' in item and 'initial_scores' in item:
+                    format_type = 'refined'
+                    review_text = item.get('review_markdown', '') or item.get('review', '')
+                    scores = item.get('scores', {})
+                    initial_scores = item.get('initial_scores', {})
+                else:
+                    # Standard format with paper_id
+                    format_type = 'standard'
+                    review_text = item.get('review_markdown', '') or item.get('review', '')
+            elif "model_prediction" in item:
+                # Original format (like ours.json) or agenticreview format
+                item_id = item.get('id')
+                if not item_id:
+                    continue
+
+                format_type = 'original'
+                model_prediction = item.get('model_prediction', {})
+
+                review_text = model_prediction.get('raw_text', '')
+
+                if review_text is None:
+                    continue
+
+                # Detect format: agenticreview has raw_text as a string containing "Overall rating: x";
+                # the ai_researcher format has raw_text as a dict or a JSON string with structured fields
+                is_agenticreview = False
+                if isinstance(review_text, str):
+                    # Check whether it is a JSON string (ai_researcher format)
+                    try:
+                        parsed_json = json.loads(review_text)
+                        if isinstance(parsed_json, dict) and any(key in parsed_json for key in ["Summary", "Strengths", "Overall", "Decision"]):
+                            # ai_researcher format
+                            review_text, meta_review = convert_ai_researcher(parsed_json)
+                        else:
+                            # agenticreview format (plain text with "Overall rating: x")
+                            is_agenticreview = True
+                    except json.JSONDecodeError:
+                        # Not JSON: check for the "Overall rating:" pattern
+                        if re.search(r'Overall\s+rating\s*[:=]', review_text, re.IGNORECASE):
+                            is_agenticreview = True
+                        else:
+                            # Neither JSON nor agenticreview text: fall back to an empty review
+                            review_text = 'Empty Review'
+                            meta_review = {}
+                elif isinstance(review_text, dict):
+                    # ai_researcher format (dict)
+                    review_text, meta_review = convert_ai_researcher(review_text)
+                else:
+                    review_text = 'Empty Review'
+                    meta_review = {}
+
+                # Handle the agenticreview format
+                if is_agenticreview:
+                    review_text, meta_review = convert_agenticreview(review_text)
+
+                # Extract initial scores.
+                # meta_review (from convert_ai_researcher or convert_agenticreview) is the primary
+                # source; fall back to model_prediction.get('decision') when it carries no decision.
+                initial_scores = {
+                    'rating': meta_review.get('rating'),
+                    'soundness': meta_review.get('soundness'),
+                    'presentation': meta_review.get('presentation'),
+                    'contribution': meta_review.get('contribution'),
+                    'confidence': meta_review.get('confidence'),
+                    'decision': meta_review.get('decision') or model_prediction.get('decision'),
+                }
+            else:
+                # Legacy format (pred_fast_mode)
+                item_id = item.get('id')
+                if not item_id:
+                    continue
+
+                format_type = 'legacy'
+                review_dict = item.get('pred_fast_mode', {})
+                if isinstance(review_dict, dict):
+                    review_text = review_dict.get('raw_text', '')
+                else:
+                    review_text = str(review_dict)
+
+        # Store the extracted review
+        try:
+            extracted_review = review_text if review_text else ''
+
+            reviews_dict[item_id] = {
+                'review': extracted_review,
+                'scores': scores,
+                'initial_scores': initial_scores,
+                'format': format_type
+            }
+        except Exception as e:
+            print(f"[WARN] Failed to extract review for {item_id}: {e}")
+            continue
+
+    return reviews_dict
+
+
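Auto-detection keys off the presence of `paper_id` versus `model_prediction`. Two illustrative input items (all field values hypothetical) showing the shapes the loader distinguishes:

```python
# Refined-format item: detected via "paper_id" plus "scores"/"initial_scores".
refined_item = {
    "paper_id": "paper-001",
    "review_markdown": "## Summary\n...",
    "scores": {"rating": 6, "soundness": 3, "presentation": 3, "decision": "accept"},
    "initial_scores": {"rating": 5, "soundness": 3, "presentation": 2, "decision": "reject"},
}

# Original-format item: detected via "model_prediction".
original_item = {
    "id": "paper-002",
    "model_prediction": {"raw_text": "...review text... Overall rating: 7", "decision": "accept"},
}

print("paper_id" in refined_item, "model_prediction" in original_item)  # True True
```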
+def combine_rubrics_and_reviews(
+    rubrics_data: Dict[str, Dict[str, Any]],
+    reviews_dict: Dict[str, Dict[str, Any]]
+) -> List[Dict[str, Any]]:
+    """
+    Combine rubrics and reviews into evaluation entries.
+
+    Args:
+        rubrics_data: Dict mapping paper_id to rubric entry
+        reviews_dict: Dict mapping paper_id to dict containing 'review', 'scores', 'initial_scores', 'format'
+
+    Returns:
+        List of evaluation entries with model_review, scores, initial_scores, and format info
+    """
+    combined = []
+    missing_reviews = []
+
+    for paper_id, rubric_entry in rubrics_data.items():
+        review_data = reviews_dict.get(paper_id)
+        if not review_data or not review_data.get('review'):
+            missing_reviews.append(paper_id)
+            continue
+
+        entry = {
+            'id': paper_id,
+            'paper_context': rubric_entry.get('paper_context', ''),
+            'decision': rubric_entry.get('decision', ''),
+            'golden_review': rubric_entry.get('golden_review', ''),
+            'rubrics': rubric_entry.get('rubrics', []),
+            'model_review': review_data.get('review', ''),
+            'scores': review_data.get('scores'),  # Refined scores (if available)
+            'initial_scores': review_data.get('initial_scores'),  # Initial scores (if available)
+            'format': review_data.get('format', 'unknown')  # Format type
+        }
+        combined.append(entry)
+
+    if missing_reviews:
+        print(f"[WARN] {len(missing_reviews)} papers have no model review; skipping them")
+
+    return combined
+
+
+# ============================================================================
+# LLM Service Configuration
+# ============================================================================
+
+def load_llm_config(config_path: str) -> Dict[str, Any]:
+    """Load the LLM configuration from a YAML file."""
+    with open(config_path, 'r', encoding='utf-8') as f:
+        config = yaml.safe_load(f)
+    return config
+
+
+def create_llm_service_from_config(config: Dict[str, Any]) -> LLMService:
+    """Create an LLM service from the configuration."""
+    mode = config.get('mode', 'gpt').lower()
+
+    if mode == 'gpt':
+        gpt_config = config.get('gpt', {})
+        api_key = gpt_config.get('api_key') or os.getenv('OPENAI_API_KEY')
+        if not api_key:
+            raise ValueError("GPT mode requires api_key in the LLM config file or the OPENAI_API_KEY environment variable")
+
+        service = GPTService(
+            api_key=api_key,
+            model_name=gpt_config.get('model_name', 'gpt-4o'),
+            base_url=gpt_config.get('base_url'),
+            timeout=gpt_config.get('timeout', 300)
+        )
+        return service
+
+    elif mode == 'vllm':
+        vllm_config = config.get('vllm', {})
+        service = VLLMService(
+            base_url=vllm_config.get('base_url', 'http://localhost:8000/v1'),
+            api_key=vllm_config.get('api_key', 'dummy-key'),
+            model_name=vllm_config.get('model_name'),
+            timeout=vllm_config.get('timeout', 300),
+            max_concurrent_requests=vllm_config.get('max_concurrent_requests', 64),
+            max_retries=vllm_config.get('max_retries', 3),
+            retry_delay=vllm_config.get('retry_delay', 1.0),
+            retry_backoff=vllm_config.get('retry_backoff', 2.0)
+        )
+        return service
+
+    else:
+        raise ValueError(f"Unknown mode: {mode}. Must be 'gpt' or 'vllm'")
+
+
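A minimal YAML config matching the fields the factory reads (field names taken from the reader above; values are illustrative defaults, and the model names are placeholders, not confirmed choices):

```yaml
mode: vllm  # or "gpt"

gpt:
  model_name: gpt-4o
  timeout: 300
  # api_key omitted here: falls back to the OPENAI_API_KEY environment variable

vllm:
  base_url: http://localhost:8000/v1
  api_key: dummy-key
  model_name: your-served-model-name  # placeholder
  timeout: 300
  max_concurrent_requests: 64
  max_retries: 3
  retry_delay: 1.0
  retry_backoff: 2.0
```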
+# ============================================================================
+# Main Evaluation Functions
+# ============================================================================
+
+def run_semantic_evaluation(
+    evaluation_data: List[Dict[str, Any]],
+    prompt_template: str,
+    llm_service: LLMService,
+    max_workers: int
+) -> tuple:
+    """Run semantic evaluation and return results and a summary."""
+    print(f"\n{'='*80}")
+    print("RUNNING SEMANTIC EVALUATION")
+    print(f"{'='*80}")
+    print(f"Evaluating {len(evaluation_data)} reviews using {max_workers} workers...")
+
+    results = []
+    with ThreadPoolExecutor(max_workers=max_workers) as executor:
+        future_to_entry = {
+            executor.submit(
+                evaluate_review_semantic,
+                entry,
+                entry['paper_context'],
+                prompt_template,
+                llm_service
+            ): entry
+            for entry in evaluation_data
+        }
+
+        for future in tqdm(as_completed(future_to_entry), total=len(evaluation_data), desc="Semantic evaluation"):
+            try:
+                result = future.result()
+                results.append(result)
+            except Exception as e:
+                entry = future_to_entry[future]
+                print(f"\n[ERROR] Failed to process entry {entry.get('id', 'unknown')}: {e}")
+                results.append({
+                    'id': entry.get('id', 'unknown'),
+                    'raw_scores': {},
+                    'weighted_scores': {},
+                    'total_score': 0.0,
+                    'error': str(e),
+                    'raw_response': ''
+                })
+
+    # Calculate statistics
+    valid_results = [r for r in results if 'error' not in r and r.get('weighted_scores')]
+    review_scores = [r.get('total_score', 0.0) for r in valid_results]
+
+    summary = {
+        'total_entries': len(results),
+        'valid_entries': len(valid_results),
+        'failed_entries': len(results) - len(valid_results)
+    }
+
+    if review_scores:
+        summary['overall_score'] = {
+            'mean': sum(review_scores) / len(review_scores),
+            'min': min(review_scores),
+            'max': max(review_scores)
+        }
+
+    # Calculate per-rubric statistics (rubric titles are taken from the first entry,
+    # assuming all entries share the same rubric set)
+    if evaluation_data and evaluation_data[0].get('rubrics'):
+        rubric_titles = [r['title'] for r in evaluation_data[0]['rubrics']]
+        per_rubric_stats = calculate_per_rubric_statistics(valid_results, rubric_titles)
+        summary['per_rubric_statistics'] = per_rubric_stats
+
+    return results, summary
+
+
+def run_auto_metric_evaluation(
+    evaluation_data: List[Dict[str, Any]],
+    strict_mode: bool = False
+) -> tuple:
+    """
+    Run auto-metric evaluation and return results and a summary.
+
+    For the refined format (has scores and initial_scores), evaluates both:
+    - Refined scores
+    - Initial scores
+
+    For the original format (only initial_scores), evaluates:
+    - Initial scores only
+
+    Returns:
+        Tuple of (results_list, summary_dict)
+        - results_list: List of evaluation results (may contain both refined and initial results for the refined format)
+        - summary_dict: Summary statistics
+    """
+    print(f"\n{'='*80}")
+    print("RUNNING AUTO-METRIC EVALUATION")
+    print(f"{'='*80}")
+    print(f"Evaluating {len(evaluation_data)} entries...")
+
+    # Detect format types
+    refined_format_count = sum(1 for e in evaluation_data if e.get('format') == 'refined')
+    original_format_count = sum(1 for e in evaluation_data if e.get('format') == 'original')
+
+    if refined_format_count > 0:
+        print(f"Detected {refined_format_count} entries in refined format (will evaluate both refined and initial scores)")
+    if original_format_count > 0:
+        print(f"Detected {original_format_count} entries in original format (will evaluate initial scores only)")
+
+    results = []
+    for entry in tqdm(evaluation_data, desc="Auto-metric evaluation"):
+        format_type = entry.get('format', 'unknown')
+
+        if format_type == 'refined':
+            # Evaluate both refined and initial scores
+            try:
+                entry_id = entry.get('id', 'unknown')
+
+                refined_result = evaluate_review_auto_metric(entry, use_initial_scores=False, strict_mode=strict_mode)
+                refined_result['paper_id'] = entry_id  # Keep the original paper_id
+                refined_result['id'] = f"{entry_id}_refined"
+                results.append(refined_result)
+
+                initial_result = evaluate_review_auto_metric(entry, use_initial_scores=True, strict_mode=strict_mode)
+                initial_result['paper_id'] = entry_id  # Keep the original paper_id
+                initial_result['id'] = f"{entry_id}_initial"
+                results.append(initial_result)
+            except Exception as e:
+                print(f"Error evaluating entry {entry.get('id', 'unknown')}: {e}")
+                results.append({
+                    'id': entry.get('id', 'unknown'),
+                    'error': str(e)
+                })
+        else:
+            # Evaluate initial scores only (or extract from markdown)
+            try:
+                result = evaluate_review_auto_metric(entry, use_initial_scores=False, strict_mode=strict_mode)
+                results.append(result)
+            except Exception as e:
+                print(f"Error evaluating entry {entry.get('id', 'unknown')}: {e}")
+                results.append({
+                    'id': entry.get('id', 'unknown'),
+                    'error': str(e)
+                })
+
+    # Calculate statistics
+    valid_results = [r for r in results if 'error' not in r]
+    mse_results = [r for r in valid_results if r.get('overall_error') is not None]
+
+    # Separate refined and initial results for the refined format
+    refined_results = [r for r in valid_results if r.get('score_type') == 'refined']
+    initial_results = [r for r in valid_results if r.get('score_type') == 'initial']
+    auto_results = [r for r in valid_results if r.get('score_type') == 'auto' or r.get('score_type') is None]
+
+    summary = {
+        'total_entries': len(results),
+        'valid_entries': len(valid_results),
+        'mse_entries': len(mse_results),
+        'refined_results_count': len(refined_results),
+        'initial_results_count': len(initial_results),
+        'auto_results_count': len(auto_results)
+    }
+
+    # Calculate MSE/MAE statistics.
+    # For the refined format, only refined results feed the overall statistics
+    # (avoids double counting); for other formats, all results are used.
+    if refined_format_count > 0:
+        stats_results = [r for r in refined_results if r.get('overall_error') is not None]
+    else:
+        stats_results = mse_results
+
+    dimensions = ['soundness', 'presentation', 'confidence', 'rating']
+
+    if stats_results:
+        mse_stats = {}
+        mae_stats = {}
+
+        for dim in dimensions:
+            mse_list = [r.get(f'{dim}_mse') for r in stats_results if r.get(f'{dim}_mse') is not None]
+            mae_list = [r.get(f'{dim}_mae') for r in stats_results if r.get(f'{dim}_mae') is not None]
+
+            mse_clean = [x for x in mse_list if not (isinstance(x, float) and math.isnan(x))]
+            mae_clean = [x for x in mae_list if not (isinstance(x, float) and math.isnan(x))]
+
+            if mse_clean:
+                mse_stats[dim] = {
+                    'mean': sum(mse_clean) / len(mse_clean),
+                    'count': len(mse_clean)
+                }
+            if mae_clean:
+                mae_stats[dim] = {
+                    'mean': sum(mae_clean) / len(mae_clean),
+                    'count': len(mae_clean)
+                }
+
+        overall_errors = [r.get('overall_error') for r in stats_results if r.get('overall_error') is not None]
+        overall_clean = [x for x in overall_errors if not (isinstance(x, float) and math.isnan(x))]
+
+        if overall_clean:
+            summary['overall_error'] = {
+                'mean': sum(overall_clean) / len(overall_clean),
+                'count': len(overall_clean)
+            }
+
+        summary['mse_statistics'] = mse_stats
+        summary['mae_statistics'] = mae_stats
+
+    # Calculate separate statistics for the refined and initial results
+    if refined_results:
+        refined_mse_results = [r for r in refined_results if r.get('overall_error') is not None]
+        if refined_mse_results:
+            refined_mse_stats = {}
+            refined_mae_stats = {}
+            for dim in dimensions:
+                mse_list = [r.get(f'{dim}_mse') for r in refined_mse_results if r.get(f'{dim}_mse') is not None]
+                mae_list = [r.get(f'{dim}_mae') for r in refined_mse_results if r.get(f'{dim}_mae') is not None]
+                mse_clean = [x for x in mse_list if not (isinstance(x, float) and math.isnan(x))]
+                mae_clean = [x for x in mae_list if not (isinstance(x, float) and math.isnan(x))]
+                if mse_clean:
+                    refined_mse_stats[dim] = {'mean': sum(mse_clean) / len(mse_clean), 'count': len(mse_clean)}
+                if mae_clean:
+                    refined_mae_stats[dim] = {'mean': sum(mae_clean) / len(mae_clean), 'count': len(mae_clean)}
+            summary['refined_mse_statistics'] = refined_mse_stats
+            summary['refined_mae_statistics'] = refined_mae_stats
+
+    if initial_results:
+        initial_mse_results = [r for r in initial_results if r.get('overall_error') is not None]
+        if initial_mse_results:
+            initial_mse_stats = {}
+            initial_mae_stats = {}
+            for dim in dimensions:
+                mse_list = [r.get(f'{dim}_mse') for r in initial_mse_results if r.get(f'{dim}_mse') is not None]
+                mae_list = [r.get(f'{dim}_mae') for r in initial_mse_results if r.get(f'{dim}_mae') is not None]
+                mse_clean = [x for x in mse_list if not (isinstance(x, float) and math.isnan(x))]
+                mae_clean = [x for x in mae_list if not (isinstance(x, float) and math.isnan(x))]
+                if mse_clean:
+                    initial_mse_stats[dim] = {'mean': sum(mse_clean) / len(mse_clean), 'count': len(mse_clean)}
+                if mae_clean:
+                    initial_mae_stats[dim] = {'mean': sum(mae_clean) / len(mae_clean), 'count': len(mae_clean)}
1339
+ summary['initial_mse_statistics'] = initial_mse_stats
1340
+ summary['initial_mae_statistics'] = initial_mae_stats
1341
+
1342
+ # Calculate Spearman correlations
1343
+ def filter_valid_pairs(true_list, pred_list):
1344
+ filtered_true = []
1345
+ filtered_pred = []
1346
+ for t, p in zip(true_list, pred_list):
1347
+ if (t is not None and p is not None and
1348
+ not (isinstance(t, float) and math.isnan(t)) and
1349
+ not (isinstance(p, float) and math.isnan(p))):
1350
+ filtered_true.append(t)
1351
+ filtered_pred.append(p)
1352
+ return filtered_true, filtered_pred
1353
+
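The `filter_valid_pairs` helper drops any (ground-truth, prediction) pair in which either side is missing or NaN before a correlation is computed. A minimal standalone sketch (the helper is re-declared here so the snippet runs on its own):

```python
import math

def filter_valid_pairs(true_list, pred_list):
    # Keep only index-aligned pairs where both values are present and not NaN
    filtered_true, filtered_pred = [], []
    for t, p in zip(true_list, pred_list):
        if (t is not None and p is not None and
                not (isinstance(t, float) and math.isnan(t)) and
                not (isinstance(p, float) and math.isnan(p))):
            filtered_true.append(t)
            filtered_pred.append(p)
    return filtered_true, filtered_pred

gt = [5, None, 3, float('nan'), 4]
pred = [4, 2, None, 3, 5]
print(filter_valid_pairs(gt, pred))  # ([5, 4], [4, 5])
```

Filtering pairwise (rather than each list independently) keeps the two lists index-aligned, which `spearmanr` requires.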
1354
+     # Calculate Spearman correlations
+     # For refined format, calculate separately for refined and initial, and use refined for overall
+     # For other formats, use all results
+     if refined_format_count > 0:
+         # Calculate refined spearman correlations
+         refined_spearman_stats = {}
+         dimensions = ['soundness', 'presentation', 'confidence', 'rating']
+         for dim in dimensions:
+             true_values = [r.get(f'gt_{dim}') for r in refined_results]
+             pred_values = [r.get(f'model_{dim}') for r in refined_results]
+             true_clean, pred_clean = filter_valid_pairs(true_values, pred_values)
+
+             if len(true_clean) >= 2 and len(pred_clean) >= 2:
+                 try:
+                     corr, _ = spearmanr(true_clean, pred_clean)
+                     if not math.isnan(corr):
+                         refined_spearman_stats[dim] = {
+                             'correlation': corr,
+                             'count': len(true_clean)
+                         }
+                 except Exception:
+                     pass
+
+         # Calculate initial spearman correlations
+         initial_spearman_stats = {}
+         for dim in dimensions:
+             true_values = [r.get(f'gt_{dim}') for r in initial_results]
+             pred_values = [r.get(f'model_{dim}') for r in initial_results]
+             true_clean, pred_clean = filter_valid_pairs(true_values, pred_values)
+
+             if len(true_clean) >= 2 and len(pred_clean) >= 2:
+                 try:
+                     corr, _ = spearmanr(true_clean, pred_clean)
+                     if not math.isnan(corr):
+                         initial_spearman_stats[dim] = {
+                             'correlation': corr,
+                             'count': len(true_clean)
+                         }
+                 except Exception:
+                     pass
+
+         # Use refined for overall statistics (avoid double counting)
+         summary['spearman_correlations'] = refined_spearman_stats
+         summary['refined_spearman_correlations'] = refined_spearman_stats
+         summary['initial_spearman_correlations'] = initial_spearman_stats
+     else:
+         # Original/other formats: use all results
+         correlation_results = valid_results
+         spearman_stats = {}
+         dimensions = ['soundness', 'presentation', 'confidence', 'rating']
+         for dim in dimensions:
+             true_values = [r.get(f'gt_{dim}') for r in correlation_results]
+             pred_values = [r.get(f'model_{dim}') for r in correlation_results]
+             true_clean, pred_clean = filter_valid_pairs(true_values, pred_values)
+
+             if len(true_clean) >= 2 and len(pred_clean) >= 2:
+                 try:
+                     corr, _ = spearmanr(true_clean, pred_clean)
+                     if not math.isnan(corr):
+                         spearman_stats[dim] = {
+                             'correlation': corr,
+                             'count': len(true_clean)
+                         }
+                 except Exception:
+                     pass
+
+         summary['spearman_correlations'] = spearman_stats
+
+     # Calculate Decision metrics
+     # For refined format, calculate separately for refined and initial, and use refined for overall
+     # For other formats, use all results
+     if refined_format_count > 0:
+         # Calculate refined decision metrics
+         refined_decision_results = [r for r in refined_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
+         if refined_decision_results:
+             true_decisions = []
+             pred_decisions = []
+             decision_acc = []
+
+             for r in refined_decision_results:
+                 gt_decision = str(r.get('gt_decision', '')).lower().strip()
+                 pred_decision = str(r.get('model_decision', '')).lower().strip()
+
+                 if 'accept' in pred_decision:
+                     pred_binary = 1
+                 else:
+                     pred_binary = 0
+
+                 if 'accept' in gt_decision:
+                     gt_binary = 1
+                 else:
+                     gt_binary = 0
+
+                 true_decisions.append(gt_binary)
+                 pred_decisions.append(pred_binary)
+
+                 if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
+                     decision_acc.append(1.0)
+                 else:
+                     decision_acc.append(0.0)
+
+             if decision_acc:
+                 decision_accuracy = sum(decision_acc) / len(decision_acc)
+                 try:
+                     _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
+                     refined_decision_metrics = {
+                         'accuracy': decision_accuracy,
+                         'f1_macro': f1_score,
+                         'count': len(decision_acc)
+                     }
+                 except Exception:
+                     refined_decision_metrics = {
+                         'accuracy': decision_accuracy,
+                         'count': len(decision_acc)
+                     }
+                 summary['refined_decision_metrics'] = refined_decision_metrics
+                 summary['decision_metrics'] = refined_decision_metrics  # Use refined for overall
+
+         # Calculate initial decision metrics
+         initial_decision_results = [r for r in initial_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
+         if initial_decision_results:
+             true_decisions = []
+             pred_decisions = []
+             decision_acc = []
+
+             for r in initial_decision_results:
+                 gt_decision = str(r.get('gt_decision', '')).lower().strip()
+                 pred_decision = str(r.get('model_decision', '')).lower().strip()
+
+                 if 'accept' in pred_decision:
+                     pred_binary = 1
+                 else:
+                     pred_binary = 0
+
+                 if 'accept' in gt_decision:
+                     gt_binary = 1
+                 else:
+                     gt_binary = 0
+
+                 true_decisions.append(gt_binary)
+                 pred_decisions.append(pred_binary)
+
+                 if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
+                     decision_acc.append(1.0)
+                 else:
+                     decision_acc.append(0.0)
+
+             if decision_acc:
+                 decision_accuracy = sum(decision_acc) / len(decision_acc)
+                 try:
+                     _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
+                     initial_decision_metrics = {
+                         'accuracy': decision_accuracy,
+                         'f1_macro': f1_score,
+                         'count': len(decision_acc)
+                     }
+                 except Exception:
+                     initial_decision_metrics = {
+                         'accuracy': decision_accuracy,
+                         'count': len(decision_acc)
+                     }
+                 summary['initial_decision_metrics'] = initial_decision_metrics
+     else:
+         # Original/other formats: use all results
+         decision_results = [r for r in valid_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
+         if decision_results:
+             true_decisions = []
+             pred_decisions = []
+             decision_acc = []
+
+             for r in decision_results:
+                 gt_decision = str(r.get('gt_decision', '')).lower().strip()
+                 pred_decision = str(r.get('model_decision', '')).lower().strip()
+
+                 if 'accept' in pred_decision:
+                     pred_binary = 1
+                 else:
+                     pred_binary = 0
+
+                 if 'accept' in gt_decision:
+                     gt_binary = 1
+                 else:
+                     gt_binary = 0
+
+                 true_decisions.append(gt_binary)
+                 pred_decisions.append(pred_binary)
+
+                 if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
+                     decision_acc.append(1.0)
+                 else:
+                     decision_acc.append(0.0)
+
+             if decision_acc:
+                 decision_accuracy = sum(decision_acc) / len(decision_acc)
+                 try:
+                     _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
+                     summary['decision_metrics'] = {
+                         'accuracy': decision_accuracy,
+                         'f1_macro': f1_score,
+                         'count': len(decision_acc)
+                     }
+                 except Exception:
+                     summary['decision_metrics'] = {
+                         'accuracy': decision_accuracy,
+                         'count': len(decision_acc)
+                     }
+
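The decision scoring above binarizes free-text decisions by substring match on `'accept'` and then computes accuracy plus macro F1. A self-contained sketch of that mapping and the same sklearn call (toy decision strings only):

```python
from sklearn.metrics import precision_recall_fscore_support

def binarize_decision(decision):
    # Mirror the script's rule: any decision mentioning "accept" maps to 1
    return 1 if 'accept' in str(decision).lower().strip() else 0

gt = ['Accept (poster)', 'Reject', 'Accept (oral)', 'Reject']
pred = ['accept', 'accept', 'accept', 'reject']
true_bin = [binarize_decision(d) for d in gt]
pred_bin = [binarize_decision(d) for d in pred]

accuracy = sum(t == p for t, p in zip(true_bin, pred_bin)) / len(true_bin)
_, _, f1_macro, _ = precision_recall_fscore_support(true_bin, pred_bin, average='macro')
print(accuracy)  # 0.75
```

Substring matching makes the metric robust to venue-specific phrasings like "Accept (poster)" versus "accept".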
1561
+     # Calculate Pairwise comparison
+     # For refined format, only use refined results (avoid double counting)
+     # For other formats, use all results
+     if refined_format_count > 0:
+         pairwise_results = refined_results
+     else:
+         pairwise_results = valid_results
+
+     paper_scores = []
+     for r in pairwise_results:
+         if (r.get('gt_rating') is not None and r.get('model_rating') is not None) or \
+            (r.get('gt_soundness') is not None and r.get('model_soundness') is not None):
+             paper_scores.append({
+                 'true_rating': r.get('gt_rating'),
+                 'pred_rating': r.get('model_rating'),
+                 'true_soundness': r.get('gt_soundness'),
+                 'pred_soundness': r.get('model_soundness'),
+                 'true_presentation': r.get('gt_presentation'),
+                 'pred_presentation': r.get('model_presentation'),
+                 'true_confidence': r.get('gt_confidence'),
+                 'pred_confidence': r.get('model_confidence')
+             })
+
+     if len(paper_scores) >= 2:
+         pairwise_accuracies = calculate_pairwise_accuracies(paper_scores)
+         summary['pairwise_accuracies'] = pairwise_accuracies
+
+     return results, summary
+
+
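The pairwise comparison asks: for every pair of papers, does the model order them the same way the ground truth does on a given dimension? `calculate_pairwise_accuracies` is defined elsewhere in the file; the sketch below is a simplified, hypothetical single-dimension version of the idea, not the script's exact implementation:

```python
from itertools import combinations

def pairwise_rank_accuracy(scores, true_key, pred_key):
    # Fraction of paper pairs whose relative order agrees between
    # ground-truth and predicted scores (tied pairs are skipped).
    agree, total = 0, 0
    for a, b in combinations(scores, 2):
        dt = a[true_key] - b[true_key]
        dp = a[pred_key] - b[pred_key]
        if dt == 0 or dp == 0:
            continue
        total += 1
        if (dt > 0) == (dp > 0):
            agree += 1
    return agree / total if total else None

papers = [
    {'true_rating': 6, 'pred_rating': 5},
    {'true_rating': 4, 'pred_rating': 6},
    {'true_rating': 8, 'pred_rating': 7},
]
print(pairwise_rank_accuracy(papers, 'true_rating', 'pred_rating'))
```

Unlike MSE/MAE, this ranking view is insensitive to a model that is systematically harsh or lenient but still orders papers correctly.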
1591
+ # ============================================================================
+ # Main Function
+ # ============================================================================
+
+ def parse_args():
+     """Parse command line arguments."""
+     parser = argparse.ArgumentParser(description="Unified evaluation script for semantic and auto-metric evaluation")
+
+     # Input paths
+     parser.add_argument("--rubrics_path", type=str, required=True,
+                         help="Path to eval_rubrics.json file (from 1_generate_review_based_rubrics.py)")
+     parser.add_argument("--reviews_path", type=str, required=True,
+                         help="Path to JSON file with model reviews (contains pred_fast_mode)")
+
+     # Evaluation mode
+     parser.add_argument("--mode", type=str, choices=["semantic", "auto_metric", "both"], default="both",
+                         help="Evaluation mode: semantic (LLM-based), auto_metric (rule-based), or both")
+
+     # Output paths
+     parser.add_argument("--semantic_output", type=str, default=None,
+                         help="Path to output JSON file for semantic evaluation results (required if mode is semantic or both)")
+     parser.add_argument("--auto_metric_output", type=str, default=None,
+                         help="Path to output JSON file for auto-metric evaluation results (required if mode is auto_metric or both)")
+
+     # Semantic evaluation settings
+     parser.add_argument("--yaml_path", type=str, default=None,
+                         help="Path to prompts.yaml file (required for semantic evaluation)")
+     parser.add_argument("--config_path", type=str, default=None,
+                         help="Path to configs.yaml file (required for semantic evaluation)")
+
+     # Multi-threading
+     parser.add_argument("--max_workers", type=int, default=None,
+                         help="Maximum number of worker threads for semantic evaluation (default: 5)")
+
+     # Strict mode (normalize scores to discrete scales)
+     parser.add_argument("--strict_mode", action="store_true", default=False,
+                         help="Enable strict mode: normalize scores to discrete scales before computing metrics (default: False)")
+
+     # Input format override
+     parser.add_argument("--input_format", type=str, choices=['auto', 'refined', 'original'], default='auto',
+                         help="Manually specify input JSON format: 'refined' (has scores and initial_scores), 'original' (has model_prediction), or 'auto' for auto-detection (default: 'auto')")
+
+     return parser.parse_args()
+
+
+ def main():
+     """Main execution function."""
+     args = parse_args()
+
+     script_dir = os.path.dirname(os.path.abspath(__file__))
+
+     # Resolve paths
+     rubrics_path = args.rubrics_path
+     if not os.path.isabs(rubrics_path):
+         rubrics_path = os.path.join(script_dir, rubrics_path)
+
+     reviews_path = args.reviews_path
+     if not os.path.isabs(reviews_path):
+         reviews_path = os.path.join(script_dir, reviews_path)
+
+     max_workers = args.max_workers or int(os.getenv("MAX_WORKERS", "5"))
+
+     # Validate mode and output paths
+     if args.mode in ["semantic", "both"]:
+         if not args.semantic_output:
+             raise ValueError("--semantic_output is required when mode is 'semantic' or 'both'")
+         if not args.yaml_path:
+             raise ValueError("--yaml_path is required for semantic evaluation")
+         if not args.config_path:
+             raise ValueError("--config_path is required for semantic evaluation")
+
+     if args.mode in ["auto_metric", "both"]:
+         if not args.auto_metric_output:
+             raise ValueError("--auto_metric_output is required when mode is 'auto_metric' or 'both'")
+
+     # Check if files exist
+     if not os.path.exists(rubrics_path):
+         raise FileNotFoundError(f"Rubrics file not found: {rubrics_path}")
+     if not os.path.exists(reviews_path):
+         raise FileNotFoundError(f"Reviews file not found: {reviews_path}")
+
+     # Load data
+     print(f"Loading rubrics from {rubrics_path}...")
+     rubrics_data = load_rubrics_json(rubrics_path)
+     print(f"Loaded {len(rubrics_data)} rubrics entries")
+
+     print(f"Loading model reviews from {reviews_path}...")
+     if args.input_format != 'auto':
+         print(f"Using manually specified format: {args.input_format}")
+     else:
+         print("Auto-detecting input format...")
+     reviews_dict = load_model_reviews_json(reviews_path, format_override=args.input_format if args.input_format != 'auto' else None)
+     print(f"Loaded {len(reviews_dict)} model reviews")
+
+     # Combine rubrics and reviews
+     print("Combining rubrics and reviews...")
+     evaluation_data = combine_rubrics_and_reviews(rubrics_data, reviews_dict)
+     print(f"Prepared {len(evaluation_data)} entries for evaluation")
+
+     # Run evaluations based on mode
+     if args.mode in ["semantic", "both"]:
+         # Resolve semantic evaluation paths
+         yaml_path = args.yaml_path
+         if not os.path.isabs(yaml_path):
+             yaml_path = os.path.join(script_dir, yaml_path)
+
+         config_path = args.config_path
+         if not os.path.isabs(config_path):
+             config_path = os.path.join(script_dir, config_path)
+
+         if not os.path.exists(yaml_path):
+             raise FileNotFoundError(f"YAML file not found: {yaml_path}")
+         if not os.path.exists(config_path):
+             raise FileNotFoundError(f"Config file not found: {config_path}")
+
+         # Load prompt template
+         print(f"Loading prompt template from {yaml_path}...")
+         prompt_template = load_prompt_template(yaml_path)
+         if not prompt_template:
+             raise ValueError("Could not find 'v1_evaluator_prompt' in YAML file")
+
+         # Initialize LLM service
+         print(f"Loading LLM configuration from {config_path}...")
+         llm_config = load_llm_config(config_path)
+         llm_service = create_llm_service_from_config(llm_config)
+         mode = llm_config.get('mode', 'gpt')
+         print(f"LLM service initialized (mode: {mode})")
+         if hasattr(llm_service, 'model_name'):
+             print(f"Using model: {llm_service.model_name}")
+
+         # Run semantic evaluation
+         semantic_results, semantic_summary = run_semantic_evaluation(
+             evaluation_data, prompt_template, llm_service, max_workers
+         )
+
+         # Save semantic results
+         semantic_output = args.semantic_output
+         if not os.path.isabs(semantic_output):
+             semantic_output = os.path.join(script_dir, semantic_output)
+
+         output_dir = os.path.dirname(semantic_output)
+         os.makedirs(output_dir, exist_ok=True)
+
+         with open(semantic_output, 'w', encoding='utf-8') as f:
+             json.dump(semantic_results, f, ensure_ascii=False, indent=2)
+         print(f"\nSemantic evaluation results saved to {semantic_output}")
+
+         # Save semantic summary
+         semantic_summary_path = semantic_output.replace('.json', '_summary.json')
+         with open(semantic_summary_path, 'w', encoding='utf-8') as f:
+             json.dump(semantic_summary, f, ensure_ascii=False, indent=2)
+         print(f"Semantic evaluation summary saved to {semantic_summary_path}")
+
+         # Print semantic summary
+         print("\n" + "="*80)
+         print("SEMANTIC EVALUATION SUMMARY")
+         print("="*80)
+         print(f"Total entries: {semantic_summary['total_entries']}")
+         print(f"Valid entries: {semantic_summary['valid_entries']}")
+         print(f"Failed entries: {semantic_summary['failed_entries']}")
+         if 'overall_score' in semantic_summary:
+             score = semantic_summary['overall_score']
+             print(f"\nOverall Score:")
+             print(f"  Mean: {score['mean']:.2f}")
+             print(f"  Min: {score['min']:.2f}")
+             print(f"  Max: {score['max']:.2f}")
+
+     if args.mode in ["auto_metric", "both"]:
+         # Run auto-metric evaluation
+         auto_metric_results, auto_metric_summary = run_auto_metric_evaluation(
+             evaluation_data,
+             strict_mode=args.strict_mode
+         )
+
+         # Save auto-metric results
+         auto_metric_output = args.auto_metric_output
+         if not os.path.isabs(auto_metric_output):
+             auto_metric_output = os.path.join(script_dir, auto_metric_output)
+
+         output_dir = os.path.dirname(auto_metric_output)
+         os.makedirs(output_dir, exist_ok=True)
+
+         with open(auto_metric_output, 'w', encoding='utf-8') as f:
+             json.dump(auto_metric_results, f, ensure_ascii=False, indent=2)
+         print(f"\nAuto-metric evaluation results saved to {auto_metric_output}")
+
+         # Save auto-metric summary
+         auto_metric_summary_path = auto_metric_output.replace('.json', '_summary.json')
+         with open(auto_metric_summary_path, 'w', encoding='utf-8') as f:
+             json.dump(auto_metric_summary, f, ensure_ascii=False, indent=2)
+         print(f"Auto-metric evaluation summary saved to {auto_metric_summary_path}")
+
+         # Print auto-metric summary
+         print("\n" + "="*80)
+         print("AUTO-METRIC EVALUATION SUMMARY")
+         print("="*80)
+         print(f"Total entries: {auto_metric_summary['total_entries']}")
+         print(f"Valid entries: {auto_metric_summary['valid_entries']}")
+         print(f"MSE entries: {auto_metric_summary['mse_entries']}")
+
+         if 'mse_statistics' in auto_metric_summary:
+             print("\nMSE Statistics:")
+             for dim, stats in auto_metric_summary['mse_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         if 'mae_statistics' in auto_metric_summary:
+             print("\nMAE Statistics:")
+             for dim, stats in auto_metric_summary['mae_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         # Print refined and initial statistics if available
+         if 'refined_mse_statistics' in auto_metric_summary:
+             print("\nRefined Scores - MSE Statistics:")
+             for dim, stats in auto_metric_summary['refined_mse_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         if 'refined_mae_statistics' in auto_metric_summary:
+             print("\nRefined Scores - MAE Statistics:")
+             for dim, stats in auto_metric_summary['refined_mae_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         if 'initial_mse_statistics' in auto_metric_summary:
+             print("\nInitial Scores - MSE Statistics:")
+             for dim, stats in auto_metric_summary['initial_mse_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         if 'initial_mae_statistics' in auto_metric_summary:
+             print("\nInitial Scores - MAE Statistics:")
+             for dim, stats in auto_metric_summary['initial_mae_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         if 'spearman_correlations' in auto_metric_summary:
+             print("\nSpearman Correlations:")
+             for dim, stats in auto_metric_summary['spearman_correlations'].items():
+                 print(f"  {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
+
+         # Print refined and initial spearman correlations if available
+         if 'refined_spearman_correlations' in auto_metric_summary:
+             print("\nRefined Scores - Spearman Correlations:")
+             for dim, stats in auto_metric_summary['refined_spearman_correlations'].items():
+                 print(f"  {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
+
+         if 'initial_spearman_correlations' in auto_metric_summary:
+             print("\nInitial Scores - Spearman Correlations:")
+             for dim, stats in auto_metric_summary['initial_spearman_correlations'].items():
+                 print(f"  {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
+
+         if 'decision_metrics' in auto_metric_summary:
+             dm = auto_metric_summary['decision_metrics']
+             print(f"\nDecision Metrics:")
+             print(f"  Accuracy: {dm['accuracy']:.4f} (n={dm['count']})")
+             if 'f1_macro' in dm:
+                 print(f"  F1 (macro): {dm['f1_macro']:.4f}")
+
+         # Print refined and initial decision metrics if available
+         if 'refined_decision_metrics' in auto_metric_summary:
+             print("\nRefined Scores - Decision Metrics:")
+             rdm = auto_metric_summary['refined_decision_metrics']
+             print(f"  Accuracy: {rdm['accuracy']:.4f} (n={rdm['count']})")
+             if 'f1_macro' in rdm:
+                 print(f"  F1 (macro): {rdm['f1_macro']:.4f}")
+
+         if 'initial_decision_metrics' in auto_metric_summary:
+             print("\nInitial Scores - Decision Metrics:")
+             idm = auto_metric_summary['initial_decision_metrics']
+             print(f"  Accuracy: {idm['accuracy']:.4f} (n={idm['count']})")
+             if 'f1_macro' in idm:
+                 print(f"  F1 (macro): {idm['f1_macro']:.4f}")
+
+     print("\n" + "="*80)
+     print("EVALUATION COMPLETE")
+     print("="*80)
+
+
+ if __name__ == "__main__":
+     main()
src/evaluator/2_evaluate_aiscientist.py ADDED
@@ -0,0 +1,1866 @@
1
+ """
2
+ Unified evaluation script for semantic (LLM-based) and auto_metric (rule-based) evaluation.
3
+
4
+ This script:
5
+ 1. Reads eval_rubrics.json (from 1_generate_review_based_rubrics.py) containing rubrics for each paper
6
+ 2. Reads input JSON file containing model reviews (supports multiple formats)
7
+ 3. Supports three evaluation modes:
8
+ - semantic: LLM-based rubrics evaluation (from 2_evaluate_direct.py)
9
+ - auto_metric: Rule-based metrics evaluation (from 3_rule_evaluate.py)
10
+ - both: Run both evaluations separately
11
+ 4. Supports strict mode: normalize scores to discrete scales before computing metrics (--strict_mode)
12
+ 5. Outputs separate JSON files for results and summaries
13
+
14
+ Usage:
15
+ # Semantic evaluation only
16
+ python 2_evaluate.py \
17
+ --rubrics_path eval_rubrics.json \
18
+ --reviews_path model_reviews.json \
19
+ --mode semantic \
20
+ --yaml_path prompts.yaml \
21
+ --config_path configs.yaml \
22
+ --semantic_output semantic_results.json \
23
+ --max_workers 5
24
+
25
+ # Auto-metric evaluation only
26
+ python 2_evaluate.py \
27
+ --rubrics_path eval_rubrics.json \
28
+ --reviews_path model_reviews.json \
29
+ --mode auto_metric \
30
+ --auto_metric_output auto_metric_results.json
31
+
32
+ # Auto-metric evaluation with strict mode (normalize scores to discrete scales)
33
+ python 2_evaluate.py \
34
+ --rubrics_path eval_rubrics.json \
35
+ --reviews_path model_reviews.json \
36
+ --mode auto_metric \
37
+ --auto_metric_output auto_metric_results.json \
38
+ --strict_mode
39
+
40
+ # Auto-metric evaluation with manually specified input format (refined)
41
+ python 2_evaluate.py \
42
+ --rubrics_path eval_rubrics.json \
43
+ --reviews_path model_reviews.json \
44
+ --mode auto_metric \
45
+ --auto_metric_output auto_metric_results.json \
46
+ --input_format refined
47
+
48
+ # Auto-metric evaluation with manually specified input format (original)
49
+ python 2_evaluate.py \
50
+ --rubrics_path eval_rubrics.json \
51
+ --reviews_path ours.json \
52
+ --mode auto_metric \
53
+ --auto_metric_output auto_metric_results.json \
54
+ --input_format original
55
+
56
+ # Both evaluations
57
+ python 2_evaluate.py \
58
+ --rubrics_path eval_rubrics.json \
59
+ --reviews_path model_reviews.json \
60
+ --mode both \
61
+ --yaml_path prompts.yaml \
62
+ --config_path configs.yaml \
63
+ --semantic_output semantic_results.json \
64
+ --auto_metric_output auto_metric_results.json \
65
+ --max_workers 32
66
+ """
67
+ from __future__ import annotations
68
+
69
+ import json
70
+ import os
71
+ import sys
72
+ import argparse
73
+ import yaml
74
+ import math
75
+ import re
76
+ from typing import Dict, List, Any, Optional
77
+ from concurrent.futures import ThreadPoolExecutor, as_completed
78
+ from tqdm import tqdm
79
+ from itertools import combinations
80
+ from scipy.stats import spearmanr
81
+ from sklearn.metrics import precision_recall_fscore_support
82
+
83
+ # Add parent directory to path
84
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
85
+ # Import parse_llm_response from local llm_service module
86
+ import llm_service as local_llm_service
87
+ parse_llm_response = local_llm_service.parse_llm_response
88
+
89
+ # Import from shared/utils for gpt/vllm support
90
+ project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
91
+ if project_root not in sys.path:
92
+ sys.path.insert(0, project_root)
93
+
94
+ from shared.utils.llm_service import LLMService
95
+ from shared.utils.vllm_service import VLLMService
96
+ from shared.utils.gpt_service import GPTService
97
+ sys.path.insert(0, os.path.join(project_root, 'shared', 'utils'))
98
+ from json_parser import parse_review_markdown
99
+
100
+
101
+ def convert_ai_researcher(review: dict) -> tuple:
+     """
+     Convert a review from the ai-researcher format to the unified review system format.
+
+     Returns:
+         Tuple of (formatted_review_text, meta_review_dict)
+     """
105
+
106
+ summary = review["Summary"]
107
+ strengths = "\n".join(f"- {s}" for s in review["Strengths"])
108
+ weaknesses = "\n".join(f"- {w}" for w in review["Weaknesses"])
109
+
110
+ # scores
111
+ originality = review["Originality"]
112
+ quality = review["Quality"]
113
+ clarity = review["Clarity"]
114
+ significance = review["Significance"]
115
+
116
+ questions = "\n".join(f"- {q}" for q in review["Questions"])
117
+ limitations = "\n".join(f"- {l}" for l in review["Limitations"])
118
+ ethical_concerns = review["Ethical Concerns"]
119
+
120
+     # additional reviewer-assigned scores
121
+ soundness = review["Soundness"]
122
+ presentation = review["Presentation"]
123
+ contribution = review["Contribution"]
124
+ overall = review["Overall"]
125
+ confidence = review["Confidence"]
126
+
127
+ # final decision
128
+ decision = review["Decision"]
129
+
130
+ meta_review = {
131
+ "rating": overall,
132
+ "soundness": soundness,
133
+ "presentation": presentation,
134
+ "contribution": contribution,
135
+ "confidence": confidence,
136
+ "decision": decision.lower().strip(),
137
+ }
138
+
139
+ return f"Summary: {summary}\nStrengths: {strengths}\nWeaknesses: {weaknesses}\nOriginality: {originality}\nQuality: {quality}\nClarity: {clarity}\nSignificance: {significance}\nQuestions: {questions}\nLimitations: {limitations}\nEthical Concerns: {ethical_concerns}\nSoundness: {soundness}\nPresentation: {presentation}\nContribution: {contribution}\nOverall: {overall}\nConfidence: {confidence}\nDecision: {decision}", meta_review
140
+
141
+
142
+ def convert_agenticreview(review_text: str) -> tuple:
143
+ """
144
+ Convert the review text from agenticreview format to unified review system format.
145
+
146
+ The agenticreview format has text like:
147
+ "Overall rating: 5\n\nSignificance and novelty: ..."
148
+
149
+ Args:
150
+ review_text: Raw review text string
151
+
152
+ Returns:
153
+ Tuple of (formatted_review_text, meta_review_dict)
154
+ """
155
+ # Extract rating from "Overall rating: x" format
156
+ rating = None
157
+ rating_match = re.search(r'Overall\s+rating\s*[:=]\s*(\d+\.?\d*)', review_text, re.IGNORECASE)
158
+ if rating_match:
159
+ try:
160
+ rating = float(rating_match.group(1))
161
+ except (ValueError, IndexError):
162
+ pass
163
+
164
+ # If not found, try alternative patterns
165
+ if rating is None:
166
+ rating_match = re.search(r'(?:rating|score)\s*[:=]\s*(\d+\.?\d*)', review_text, re.IGNORECASE)
167
+ if rating_match:
168
+ try:
169
+ rating = float(rating_match.group(1))
170
+ except (ValueError, IndexError):
171
+ pass
172
+
173
+ # Try to extract from parse_review_markdown as fallback
174
+ if rating is None:
175
+ try:
176
+ parsed = parse_review_markdown(review_text)
177
+ rating = parsed.get('rating')
178
+ except Exception:
179
+ pass
180
+
181
+ # Create meta_review dict - agenticreview only has rating, no other scores
182
+ meta_review = {
183
+ "rating": rating,
184
+ "soundness": None,
185
+ "presentation": None,
186
+ "contribution": None,
187
+ "confidence": None,
188
+ "decision": None,
189
+ }
190
+
191
+ # Return the review text as-is (it's already in a readable format)
192
+ return review_text, meta_review
193
+
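The "Overall rating" extraction above can be checked in isolation; this is a minimal sketch using the same regex pattern as the function (example review text is made up):

```python
import re

# Same primary pattern as convert_agenticreview: capture the numeric rating
pat = re.compile(r'Overall\s+rating\s*[:=]\s*(\d+\.?\d*)', re.IGNORECASE)
m = pat.search("Overall rating: 5\n\nSignificance and novelty: ...")
rating = float(m.group(1))
print(rating)  # -> 5.0
```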
194
+
195
+ class ReviewProcessor:
196
+ """Handles the extraction and processing of reviews from different sources."""
197
+
198
+ @staticmethod
199
+ def extract_review_content(pred_context):
200
+ """
201
+ Extract the review content from the prediction context.
202
+
203
+ Args:
204
+ pred_context: Raw prediction data that contains the review
205
+
206
+ Returns:
207
+ str: Extracted review content
208
+ """
209
+ try:
210
+ # First attempt to extract from boxed format
211
+ return pred_context.split(r'\boxed_review{')[-1].split('\n}')[0]
212
+ except Exception:
213
+ # Alternative extraction if the first method fails
214
+ if isinstance(pred_context, dict) and 'output' in pred_context:
215
+ return pred_context['output'].split(r'\boxed_review{')[-1].split('\n}')[0]
216
+ else:
217
+ # Return as is if extraction fails
218
+ return pred_context
219
+
220
+
221
+ # ============================================================================
222
+ # Semantic Evaluation Functions (from 2_evaluate_direct.py)
223
+ # ============================================================================
224
+
225
+ def load_prompt_template(yaml_path: str) -> str:
226
+ """Load the evaluator prompt from YAML file."""
227
+ with open(yaml_path, 'r', encoding='utf-8') as f:
228
+ prompts = yaml.safe_load(f)
229
+ return prompts.get('v1_evaluator_prompt', '')
230
+
231
+
232
+ def build_evaluation_prompt(
233
+ rubrics: List[Dict[str, Any]],
234
+ paper_content: str,
235
+ review: str,
236
+ prompt_template: str
237
+ ) -> str:
238
+ """Build the evaluation prompt by replacing placeholders."""
239
+ rubrics_json = json.dumps(rubrics, indent=4, ensure_ascii=False)
240
+ prompt = prompt_template.replace('{rubrics_json}', rubrics_json)
241
+ prompt = prompt.replace('<<paper_content>>', paper_content)
242
+ prompt = prompt.replace('<<review>>', review)
243
+ return prompt
244
+
245
+
246
+ def calculate_weighted_scores(
247
+ raw_scores: Dict[str, Dict[str, Any]],
248
+ rubrics: List[Dict[str, Any]]
249
+ ) -> Dict[str, float]:
250
+ """Calculate weighted scores for each rubric."""
251
+ rubric_weights = {r['title']: r['weight'] for r in rubrics}
252
+ weighted_scores = {}
253
+
254
+ for rubric_title, rubric_data in raw_scores.items():
255
+ if rubric_title not in rubric_weights:
256
+ continue
257
+
258
+ rubric_score = rubric_data.get('score', 0)
259
+ if isinstance(rubric_score, str):
260
+ try:
261
+ rubric_score = int(rubric_score)
262
+ except ValueError:
263
+ rubric_score = 0
264
+
265
+ if rubric_score not in [0, 1]:
266
+ rubric_score = 1 if rubric_score > 0 else 0
267
+
268
+ weight = rubric_weights[rubric_title]
269
+ weighted_scores[rubric_title] = rubric_score * weight
270
+
271
+ return weighted_scores
272
+
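A toy run of the weighting rule implemented above (rubric titles and weights are invented for illustration): binary rubric scores are clamped to 0/1, multiplied by per-rubric weights, and rubrics absent from the rubric list are skipped.

```python
# Hypothetical rubrics and raw scores, mirroring calculate_weighted_scores
rubrics = [{"title": "Clarity", "weight": 2.0}, {"title": "Novelty", "weight": 3.0}]
raw = {"Clarity": {"score": 1}, "Novelty": {"score": "0"}, "Unknown": {"score": 1}}

weights = {r["title"]: r["weight"] for r in rubrics}
out = {}
for title, data in raw.items():
    if title not in weights:
        continue  # rubric not in the rubric list: skipped
    s = data.get("score", 0)
    s = int(s) if isinstance(s, str) else s  # tolerate string scores
    s = 1 if s > 0 else 0                    # clamp to binary
    out[title] = s * weights[title]
print(out)  # -> {'Clarity': 2.0, 'Novelty': 0.0}
```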
273
+
274
+ def calculate_scores(raw_scores: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
275
+ """Calculate scores for each rubric."""
276
+ scores = {}
277
+ for rubric_title, rubric_data in raw_scores.items():
278
+ scores[rubric_title] = rubric_data.get('score', 0)
279
+ return scores
280
+
281
+
282
+ def evaluate_review_semantic(
283
+ entry: Dict[str, Any],
284
+ paper_content: str,
285
+ prompt_template: str,
286
+ llm_service: LLMService
287
+ ) -> Dict[str, Any]:
288
+ """Evaluate a single review using article-specific rubrics."""
289
+ entry_id = entry.get('id', 'unknown')
290
+ rubrics = entry.get('rubrics', [])
291
+ model_review = entry.get('model_review', '')
292
+
293
+ if not rubrics:
294
+ return {
295
+ 'id': entry_id,
296
+ 'raw_scores': {},
297
+ 'weighted_scores': {},
298
+ 'total_score': 0.0,
299
+ 'error': 'No valid rubrics found',
300
+ 'raw_response': ''
301
+ }
302
+
303
+ # Build prompt
304
+ prompt = build_evaluation_prompt(rubrics, paper_content, model_review, prompt_template)
305
+
306
+ # Call LLM
307
+ try:
308
+ messages = [{"role": "user", "content": prompt}]
309
+ response = llm_service.generate(messages=messages)
310
+
311
+         # Parse response
+         raw_scores = parse_llm_response(response)
+         # Note: calculate_scores keeps the raw scores unweighted; use
+         # calculate_weighted_scores instead to apply per-rubric weights.
+         weighted_scores = calculate_scores(raw_scores)
+         total_score = sum(weighted_scores.values())
315
+
316
+ return {
317
+ 'id': entry_id,
318
+ 'raw_scores': raw_scores,
319
+ 'weighted_scores': weighted_scores,
320
+ 'total_score': total_score,
321
+ 'raw_response': response
322
+ }
323
+ except Exception as e:
324
+ print(f"[ERROR] Error evaluating review {entry_id}: {e}")
325
+ return {
326
+ 'id': entry_id,
327
+ 'raw_scores': {},
328
+ 'weighted_scores': {},
329
+ 'total_score': 0.0,
330
+ 'error': str(e),
331
+ 'raw_response': ''
332
+ }
333
+
334
+
335
+ def calculate_per_rubric_statistics(
336
+ valid_results: List[Dict[str, Any]],
337
+ rubric_titles: List[str]
338
+ ) -> Dict[str, Dict[str, float]]:
339
+ """Calculate per-rubric statistics from evaluation results."""
340
+ rubric_scores = {title: [] for title in rubric_titles}
341
+
342
+ for result in valid_results:
343
+ weighted_scores = result.get('weighted_scores', {})
344
+ if not isinstance(weighted_scores, dict):
345
+ continue
346
+
347
+ for rubric_title in rubric_titles:
348
+ if rubric_title in weighted_scores:
349
+ score = weighted_scores[rubric_title]
350
+ if isinstance(score, str):
351
+ try:
352
+ score = float(score)
353
+ except ValueError:
354
+ continue
355
+ elif isinstance(score, (int, float)):
356
+ score = float(score)
357
+ else:
358
+ continue
359
+ rubric_scores[rubric_title].append(score)
360
+
361
+ per_rubric_stats = {}
362
+ for rubric_title in rubric_titles:
363
+ scores = rubric_scores[rubric_title]
364
+ if not scores:
365
+ continue
366
+
367
+ mean_score = sum(scores) / len(scores)
368
+ min_score = min(scores)
369
+ max_score = max(scores)
370
+ count = len(scores)
371
+
372
+ if rubric_title == "False or Contradictory Claims":
373
+ pass_count = sum(1 for s in scores if s >= 0)
374
+ else:
375
+ pass_count = sum(1 for s in scores if s >= 1)
376
+ pass_rate = pass_count / count if count > 0 else 0.0
377
+
378
+ per_rubric_stats[rubric_title] = {
379
+ 'mean': mean_score,
380
+ 'min': min_score,
381
+ 'max': max_score,
382
+ 'count': count,
383
+ 'pass_rate': pass_rate
384
+ }
385
+
386
+ return per_rubric_stats
387
+
388
+
389
+ # ============================================================================
390
+ # Auto-Metric Evaluation Functions (from 3_rule_evaluate.py)
391
+ # ============================================================================
392
+
393
+ def extract_scores_from_review(review_text: str) -> Dict[str, Any]:
394
+ """Extract numeric scores and decision from a review markdown text."""
395
+ if not review_text:
396
+ return {'soundness': None, 'presentation': None, 'rating': None, 'confidence': None, 'decision': None}
397
+
398
+ try:
399
+ parsed = parse_review_markdown(review_text)
400
+ decision = parsed.get('decision', '')
401
+ if decision:
402
+ decision_lower = decision.lower().strip()
403
+ if 'accept' in decision_lower:
404
+ decision = 'accept'
405
+ elif 'reject' in decision_lower:
406
+ decision = 'reject'
407
+ elif 'undecided' in decision_lower:
408
+ decision = 'undecided'
409
+ else:
410
+ decision = decision_lower
411
+ else:
412
+ decision = None
413
+
414
+ return {
415
+ 'soundness': parsed.get('soundness'),
416
+ 'presentation': parsed.get('presentation'),
417
+ 'rating': parsed.get('rating'),
418
+ 'confidence': parsed.get('confidence'),
419
+ 'decision': decision
420
+ }
421
+ except Exception as e:
422
+ print(f"Warning: Failed to parse review text: {e}")
423
+ return {'soundness': None, 'presentation': None, 'rating': None, 'confidence': None, 'decision': None}
424
+
425
+
426
+ def calculate_mse(predicted: float, ground_truth: float) -> Optional[float]:
427
+ """Calculate Mean Squared Error for a single value."""
428
+ if predicted is None or ground_truth is None:
429
+ return None
430
+ return (predicted - ground_truth) ** 2
431
+
432
+
433
+ def calculate_mae(predicted: float, ground_truth: float) -> Optional[float]:
434
+ """Calculate Mean Absolute Error for a single value."""
435
+ if predicted is None or ground_truth is None:
436
+ return None
437
+ return abs(predicted - ground_truth)
438
+
439
+
440
+ def normalize_to_discrete_scale(score: Optional[float], scale_type: str) -> Optional[float]:
441
+ """
442
+ Normalize a float score to the nearest discrete value based on scale type.
443
+ Uses round-half-up tie-breaking (e.g., 3.5 rounds to 4, 1.5 rounds to 2).
444
+
445
+ Args:
446
+ score: The float score to normalize (can be None)
447
+ scale_type: Either '0-5' for 0-5 scale (discrete: 0,1,2,3,4,5)
448
+ or '0-10' for 0-10 scale (discrete: 0,2,4,6,8,10)
449
+
450
+ Returns:
451
+ Normalized discrete score, or None if input is None
452
+ """
453
+ if score is None:
454
+ return None
455
+
456
+ try:
457
+ score = float(score)
458
+ except (ValueError, TypeError):
459
+ return None
460
+
461
+ if scale_type == '0-5':
462
+ # Discrete values: 0, 1, 2, 3, 4, 5
463
+ discrete_values = [0, 1, 2, 3, 4, 5]
464
+ # Clamp to valid range
465
+ score = max(0, min(5, score))
466
+ # Find nearest discrete value, with round-half-up tie-breaking
467
+ # For ties, prefer the higher value
468
+ best_value = None
469
+ best_distance = float('inf')
470
+ for val in discrete_values:
471
+ distance = abs(val - score)
472
+ if distance < best_distance:
473
+ best_distance = distance
474
+ best_value = val
475
+ elif distance == best_distance and val > best_value:
476
+ # Tie-breaking: prefer higher value (round-half-up)
477
+ best_value = val
478
+ return best_value
479
+ elif scale_type == '0-10':
480
+ # Discrete values: 0, 2, 4, 6, 8, 10
481
+ discrete_values = [0, 2, 4, 6, 8, 10]
482
+ # Clamp to valid range
483
+ score = max(0, min(10, score))
484
+ # Find nearest discrete value, with round-half-up tie-breaking
485
+ best_value = None
486
+ best_distance = float('inf')
487
+ for val in discrete_values:
488
+ distance = abs(val - score)
489
+ if distance < best_distance:
490
+ best_distance = distance
491
+ best_value = val
492
+ elif distance == best_distance and val > best_value:
493
+ # Tie-breaking: prefer higher value (round-half-up)
494
+ best_value = val
495
+ return best_value
496
+ else:
497
+ raise ValueError(f"Unknown scale_type: {scale_type}. Must be '0-5' or '0-10'")
498
+
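The snapping rule above can be exercised with concrete values; this is a small sketch with a hypothetical helper name `snap` (not part of the script):

```python
def snap(score, values):
    # nearest discrete value; ties broken upward (round-half-up)
    return min(values, key=lambda v: (abs(v - score), -v))

print(snap(3.5, [0, 1, 2, 3, 4, 5]))   # 0-5 scale: tie between 3 and 4 -> 4
print(snap(1.0, [0, 2, 4, 6, 8, 10]))  # 0-10 even scale: tie between 0 and 2 -> 2
print(snap(7.2, [0, 2, 4, 6, 8, 10]))  # nearest even value -> 8
```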
499
+
500
+ def normalize_scores_dict(scores: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
501
+ """
502
+ Normalize all scores in a dictionary to their appropriate discrete scales.
503
+
504
+ Args:
505
+ scores: Dictionary with keys 'soundness', 'presentation', 'rating', 'confidence'
506
+
507
+ Returns:
508
+ Dictionary with normalized scores
509
+ """
510
+ normalized = {}
511
+
512
+ # soundness, presentation, confidence use 0-5 scale
513
+ for key in ['soundness', 'presentation', 'confidence']:
514
+ normalized[key] = normalize_to_discrete_scale(scores.get(key), '0-5')
515
+
516
+ # rating uses 0-10 scale
517
+ normalized['rating'] = normalize_to_discrete_scale(scores.get('rating'), '0-10')
518
+
519
+ return normalized
520
+
521
+
522
+ def calculate_score_metrics(
523
+ model_scores: Dict[str, float],
524
+ ground_truth_scores: Dict[str, float],
525
+ normalize: bool = False
526
+ ) -> Dict[str, Any]:
527
+ """
528
+ Calculate MSE and MAE metrics for each scoring dimension.
529
+
530
+ Args:
531
+ model_scores: Dictionary with model scores
532
+ ground_truth_scores: Dictionary with ground truth scores
533
+ normalize: If True, normalize scores to discrete scales before computing metrics
534
+
535
+ Returns:
536
+ Dictionary with MSE, MAE metrics and optionally normalized scores
537
+ """
538
+ dimensions = ['soundness', 'presentation', 'rating', 'confidence']
539
+
540
+ # Normalize scores to discrete scales if requested
541
+ if normalize:
542
+ model_scores_normalized = normalize_scores_dict(model_scores)
543
+ gt_scores_normalized = normalize_scores_dict(ground_truth_scores)
544
+ else:
545
+ model_scores_normalized = model_scores
546
+ gt_scores_normalized = ground_truth_scores
547
+
548
+ mse_values = {}
549
+ mae_values = {}
550
+ valid_count = 0
551
+
552
+ for dim in dimensions:
553
+ # Use normalized scores for metric calculation
554
+ mse = calculate_mse(model_scores_normalized.get(dim), gt_scores_normalized.get(dim))
555
+ mae = calculate_mae(model_scores_normalized.get(dim), gt_scores_normalized.get(dim))
556
+ mse_values[f'{dim}_mse'] = mse
557
+ mae_values[f'{dim}_mae'] = mae
558
+ if mse is not None:
559
+ valid_count += 1
560
+
561
+ overall_error = sum([v for v in mse_values.values() if v is not None])
562
+
563
+ result = {
564
+ **mse_values,
565
+ **mae_values,
566
+ 'overall_error': overall_error if valid_count > 0 else None,
567
+ 'valid_dimensions': valid_count
568
+ }
569
+
570
+ # Include normalized scores in result for transparency (only if normalize=True)
571
+ if normalize:
572
+ result['model_scores_normalized'] = model_scores_normalized
573
+ result['gt_scores_normalized'] = gt_scores_normalized
574
+
575
+ return result
576
+
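A hand check of the per-dimension error bookkeeping above: each dimension gets a squared error, `None` when either side is missing, and the non-`None` values are summed into `overall_error` (scores below are made up):

```python
dims = ("soundness", "presentation", "rating", "confidence")
model = {"soundness": 3, "presentation": None, "rating": 8, "confidence": 4}
gt = {"soundness": 4, "presentation": 3, "rating": 6, "confidence": 4}

# Per-dimension squared error; missing values propagate as None
mse = {d: (model[d] - gt[d]) ** 2 if model[d] is not None and gt[d] is not None else None
       for d in dims}
overall_error = sum(v for v in mse.values() if v is not None)
print(mse, overall_error)  # presentation stays None; overall_error = 1 + 4 + 0 = 5
```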
577
+
578
+ def normalize_score_value(value):
579
+ """Normalize score value to float, handling string representations."""
580
+ if value is None:
581
+ return None
582
+ if isinstance(value, (int, float)):
583
+ return float(value)
584
+ if isinstance(value, str):
585
+         # Try to extract a numeric value from the string (e.g., "2.75" -> 2.75);
+         # re is already imported at module level
+         try:
+             match = re.search(r'(\d+\.?\d*)', value)
+             if match:
+                 return float(match.group(1))
+         except Exception:
+             pass
593
+ return None
594
+
595
+
596
+ def normalize_decision(decision):
597
+ """Normalize decision string to standard format."""
598
+ if decision is None:
599
+ return None
600
+ decision_lower = str(decision).lower().strip()
601
+ if 'accept' in decision_lower:
602
+ return 'accept'
603
+ elif 'reject' in decision_lower:
604
+ return 'reject'
605
+ elif 'undecided' in decision_lower:
606
+ return 'undecided'
607
+ else:
608
+ return decision_lower
609
+
610
+
611
+ def extract_scores_from_dict(scores_dict: Dict[str, Any]) -> Dict[str, Any]:
612
+ """
613
+ Extract scores from a structured dictionary (scores or initial_scores format).
614
+
615
+ Args:
616
+ scores_dict: Dict containing scores (e.g., {'rating': 5.75, 'soundness': '2.75', ...})
617
+
618
+ Returns:
619
+ Dict with normalized scores: {'soundness', 'presentation', 'rating', 'confidence', 'decision'}
620
+ """
621
+ if not scores_dict:
622
+ return {
623
+ 'soundness': None,
624
+ 'presentation': None,
625
+ 'rating': None,
626
+ 'confidence': None,
627
+ 'decision': None
628
+ }
629
+
630
+ return {
631
+ 'soundness': normalize_score_value(scores_dict.get('soundness')),
632
+ 'presentation': normalize_score_value(scores_dict.get('presentation')),
633
+ 'rating': normalize_score_value(scores_dict.get('rating')),
634
+ 'confidence': normalize_score_value(scores_dict.get('confidence')),
635
+ 'decision': normalize_decision(scores_dict.get('decision'))
636
+ }
637
+
638
+
639
+ def evaluate_review_auto_metric(entry: Dict[str, Any], use_initial_scores: bool = False, strict_mode: bool = False) -> Dict[str, Any]:
640
+ """
641
+ Evaluate a single entry by extracting scores and calculating metrics.
642
+
643
+ Args:
644
+ entry: Evaluation entry containing model_review, scores, initial_scores, etc.
645
+ use_initial_scores: If True, use initial_scores instead of refined scores (for refined format)
646
+
647
+ Returns:
648
+ Dict containing evaluation metrics
649
+ """
650
+ entry_id = entry.get('id', 'unknown')
651
+ model_review = entry.get('model_review', '')
652
+ format_type = entry.get('format', 'unknown')
653
+
654
+ # Extract scores based on format
655
+ model_scores = {}
656
+ model_decision = None
657
+
658
+ if format_type == 'refined' and not use_initial_scores:
659
+ # Use refined scores from structured data
660
+ scores_dict = entry.get('scores', {})
661
+ model_data = extract_scores_from_dict(scores_dict)
662
+ model_scores = {
663
+ 'soundness': model_data.get('soundness'),
664
+ 'presentation': model_data.get('presentation'),
665
+ 'rating': model_data.get('rating'),
666
+ 'confidence': model_data.get('confidence')
667
+ }
668
+ model_decision = model_data.get('decision')
669
+ elif format_type == 'refined' and use_initial_scores:
670
+ # Use initial scores from structured data
671
+ initial_scores_dict = entry.get('initial_scores', {})
672
+ model_data = extract_scores_from_dict(initial_scores_dict)
673
+ model_scores = {
674
+ 'soundness': model_data.get('soundness'),
675
+ 'presentation': model_data.get('presentation'),
676
+ 'rating': model_data.get('rating'),
677
+ 'confidence': model_data.get('confidence')
678
+ }
679
+ model_decision = model_data.get('decision')
680
+ elif format_type == 'original':
681
+ # Use initial scores from structured data
682
+ initial_scores_dict = entry.get('initial_scores', {})
683
+ model_data = extract_scores_from_dict(initial_scores_dict)
684
+ model_scores = {
685
+ 'soundness': model_data.get('soundness'),
686
+ 'presentation': model_data.get('presentation'),
687
+ 'rating': model_data.get('rating'),
688
+ 'confidence': model_data.get('confidence')
689
+ }
690
+ model_decision = model_data.get('decision')
691
+
692
+ # Fallback: If confidence is missing from structured data, try to extract from review text
693
+ # (meta_review may not have confidence field, but review text might)
694
+ if model_scores.get('confidence') is None and model_review:
695
+ try:
696
+ review_data = extract_scores_from_review(model_review)
697
+ if review_data.get('confidence') is not None:
698
+ model_scores['confidence'] = review_data.get('confidence')
699
+ except Exception:
700
+ pass # Keep confidence as None if extraction fails
701
+ else:
702
+ # Fallback: extract from markdown review text
703
+ model_data = extract_scores_from_review(model_review)
704
+ model_scores = {
705
+ 'soundness': model_data.get('soundness'),
706
+ 'presentation': model_data.get('presentation'),
707
+ 'rating': model_data.get('rating'),
708
+ 'confidence': model_data.get('confidence')
709
+ }
710
+ model_decision = model_data.get('decision')
711
+
712
+ # Get ground truth scores from golden_review ONLY
713
+ # Ground truth must ONLY come from golden_review, never from model output
714
+ # If extraction fails, leave fields as None (do not use model_review as fallback)
715
+ ground_truth_review = entry.get('golden_review', '')
716
+ ground_truth_scores = {}
717
+ gt_decision = None
718
+
719
+ if not ground_truth_review:
720
+ print(f"Warning: No golden_review found for entry {entry_id}. Ground truth scores will be empty.")
721
+ else:
722
+ try:
723
+ # Extract scores from golden_review markdown text
724
+ gt_data = extract_scores_from_review(ground_truth_review)
725
+ if not gt_data:
726
+ print(f"Warning: Failed to parse golden_review for entry {entry_id}. Ground truth scores will be empty.")
727
+ else:
728
+ ground_truth_scores = {
729
+ 'soundness': gt_data.get('soundness'),
730
+ 'presentation': gt_data.get('presentation'),
731
+ 'rating': gt_data.get('rating'),
732
+ 'confidence': gt_data.get('confidence')
733
+ }
734
+ gt_decision = normalize_decision(gt_data.get('decision'))
735
+ # Note: If any field is None, it stays None - we do NOT use model_review as fallback
736
+ # Using model output as ground truth would inflate evaluation scores
737
+ except Exception as e:
738
+ print(f"Warning: Failed to extract scores from golden_review for {entry_id}: {e}")
739
+ print(f" Ground truth scores will be empty. Error: {str(e)}")
740
+
741
+ # Calculate MSE and MAE metrics (with optional normalization in strict mode)
742
+ score_metrics = calculate_score_metrics(model_scores, ground_truth_scores, normalize=strict_mode)
743
+
744
+ # Calculate decision accuracy
745
+ decision_match = False
746
+ decision_accuracy = None
747
+ if model_decision is not None and gt_decision is not None:
748
+ model_decision_normalized = normalize_decision(model_decision)
749
+ decision_match = (model_decision_normalized == gt_decision)
750
+ decision_accuracy = 1.0 if decision_match else 0.0
751
+
752
+ result = {
753
+ 'id': entry_id,
754
+ 'format': format_type,
755
+ 'model_soundness': model_scores.get('soundness'),
756
+ 'model_presentation': model_scores.get('presentation'),
757
+ 'model_rating': model_scores.get('rating'),
758
+ 'model_confidence': model_scores.get('confidence'),
759
+ 'model_decision': model_decision,
760
+ 'gt_soundness': ground_truth_scores.get('soundness'),
761
+ 'gt_presentation': ground_truth_scores.get('presentation'),
762
+ 'gt_rating': ground_truth_scores.get('rating'),
763
+ 'gt_confidence': ground_truth_scores.get('confidence'),
764
+ 'gt_decision': gt_decision,
765
+ 'decision_match': decision_match,
766
+ 'decision_accuracy': decision_accuracy,
767
+ **score_metrics
768
+ }
769
+
770
+ # Add prefix to indicate which scores were used
771
+ if format_type == 'refined':
772
+ if use_initial_scores:
773
+ result['score_type'] = 'initial'
774
+ else:
775
+ result['score_type'] = 'refined'
776
+ else:
777
+ result['score_type'] = 'auto'
778
+
779
+ return result
780
+
781
+
782
+ def calculate_pairwise_accuracies(paper_scores: List[Dict[str, float]]) -> Dict[str, float]:
783
+ """Calculate pairwise accuracy for each metric by comparing rankings."""
784
+ if len(paper_scores) < 2:
785
+ return {}
786
+
787
+ total_valid_pairs = {'rating': 0, 'soundness': 0, 'presentation': 0, 'confidence': 0}
788
+ correct_pairs = {'rating': 0, 'soundness': 0, 'presentation': 0, 'confidence': 0}
789
+
790
+ for paper1, paper2 in combinations(paper_scores, 2):
791
+ # Check rating ranking
792
+ if (paper1.get('true_rating') is not None and paper2.get('true_rating') is not None and
793
+ paper1.get('pred_rating') is not None and paper2.get('pred_rating') is not None):
794
+ total_valid_pairs['rating'] += 1
795
+ true_order = paper1['true_rating'] > paper2['true_rating']
796
+ pred_order = paper1['pred_rating'] > paper2['pred_rating']
797
+ if true_order == pred_order:
798
+ correct_pairs['rating'] += 1
799
+
800
+         # Same ordering comparison for the remaining score dimensions
802
+ for metric in ['soundness', 'presentation', 'confidence']:
803
+ true_key = f'true_{metric}'
804
+ pred_key = f'pred_{metric}'
805
+ if (paper1.get(true_key) is not None and paper2.get(true_key) is not None and
806
+ paper1.get(pred_key) is not None and paper2.get(pred_key) is not None):
807
+ total_valid_pairs[metric] += 1
808
+ true_order = paper1[true_key] > paper2[true_key]
809
+ pred_order = paper1[pred_key] > paper2[pred_key]
810
+ if true_order == pred_order:
811
+ correct_pairs[metric] += 1
812
+
813
+ pairwise_accuracies = {
814
+ metric: correct_pairs[metric] / total_valid_pairs[metric] if total_valid_pairs[metric] > 0 else 0.0
815
+ for metric in ['rating', 'soundness', 'presentation', 'confidence']
816
+ }
817
+
818
+ return pairwise_accuracies
819
+
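The pairwise idea above can be checked on a toy set: a pair counts as correct when the predicted ordering of two papers matches the ground-truth ordering (ratings below are invented):

```python
from itertools import combinations

papers = [
    {"true_rating": 8, "pred_rating": 6},
    {"true_rating": 4, "pred_rating": 5},
    {"true_rating": 2, "pred_rating": 7},
]
correct = total = 0
for a, b in combinations(papers, 2):
    total += 1
    # ordering agreement between ground truth and prediction
    correct += (a["true_rating"] > b["true_rating"]) == (a["pred_rating"] > b["pred_rating"])
print(correct, total)  # only the (8,4) vs (6,5) pair is ordered correctly -> 1 of 3
```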
820
+
821
+ # ============================================================================
822
+ # Data Loading Functions
823
+ # ============================================================================
824
+
825
+ def load_rubrics_json(rubrics_path: str) -> Dict[str, Dict[str, Any]]:
826
+ """Load rubrics JSON and create lookup by id."""
827
+ with open(rubrics_path, 'r', encoding='utf-8') as f:
828
+ data = json.load(f)
829
+
830
+ if isinstance(data, list):
831
+ return {item['id']: item for item in data}
832
+ elif isinstance(data, dict):
833
+ return data
834
+ else:
835
+ raise ValueError(f"Invalid rubrics JSON format: expected list or dict, got {type(data)}")
836
+
837
+
838
+ def load_model_reviews_json(reviews_path: str, format_override: Optional[str] = None) -> Dict[str, Dict[str, Any]]:
839
+ """
840
+ Load model reviews JSON and extract reviews by id.
841
+
842
+ Supports two input formats:
843
+ 1. Refined format: Contains 'scores' and 'initial_scores' fields (from refinement pipeline)
844
+ 2. Original format: Contains 'model_prediction' with 'meta_review' and 'decision' (like ours.json)
845
+
846
+ Args:
847
+ reviews_path: Path to JSON file containing model reviews
848
+ format_override: Optional format override ('refined', 'original', or None for auto-detect)
849
+
850
+ Returns:
851
+ Dict mapping paper_id to dict containing:
852
+ - 'review': review text (markdown)
853
+ - 'scores': refined scores dict (if available)
854
+ - 'initial_scores': initial scores dict (if available)
855
+ - 'format': 'refined' or 'original'
856
+ """
857
+ with open(reviews_path, 'r', encoding='utf-8') as f:
858
+ data = json.load(f)
859
+
860
+ if isinstance(data, dict):
861
+ data = list(data.values())
862
+
863
+ reviews_dict = {}
864
+ for item in data:
865
+ item_id = None
866
+ review_text = ''
867
+ scores = None
868
+ initial_scores = None
869
+ format_type = None
870
+
871
+ # Use format override if provided, otherwise auto-detect
872
+ if format_override and format_override != 'auto':
873
+ # Force use specified format
874
+ if format_override == 'refined':
875
+ item_id = item.get('paper_id') or item.get('id')
876
+ if not item_id:
877
+ continue
878
+ format_type = 'refined'
879
+ review_text = item.get('review_markdown', '') or item.get('review', '')
880
+ scores = item.get('scores', {})
881
+ initial_scores = item.get('initial_scores', {})
882
+ elif format_override == 'original':
883
+ item_id = item.get('id')
884
+ if not item_id:
885
+ continue
886
+ format_type = 'original'
887
+ model_prediction = item.get('model_prediction', {})
888
+ meta_review = model_prediction.get('meta_review', {})
889
+ review_text = meta_review.get('content', '') or model_prediction.get('raw_text', '')
890
+ initial_scores = {
891
+ 'rating': meta_review.get('rating'),
892
+ 'soundness': meta_review.get('soundness'),
893
+ 'presentation': meta_review.get('presentation'),
894
+ 'contribution': meta_review.get('contribution'),
895
+ 'decision': model_prediction.get('decision'),
896
+ }
897
+ else:
898
+ raise ValueError(f"Unknown format_override: {format_override}. Must be 'refined', 'original', or 'auto'")
899
+ else:
900
+ # Auto-detect format
901
+ if "paper_id" in item:
902
+ # Refined format (from refinement pipeline)
903
+ item_id = item.get('paper_id')
904
+ if not item_id:
905
+ continue
906
+
907
+ # Check if this is refined format (has scores and initial_scores)
908
+ if 'scores' in item and 'initial_scores' in item:
909
+ format_type = 'refined'
910
+ review_text = item.get('review_markdown', '') or item.get('review', '')
911
+ scores = item.get('scores', {})
912
+ initial_scores = item.get('initial_scores', {})
913
+ else:
914
+ # Standard format with paper_id
915
+ format_type = 'standard'
916
+ review_text = item.get('review_markdown', '') or item.get('review', '')
917
+ elif "model_prediction" in item:
918
+ # Original format (like ours.json)
919
+ item_id = item.get('id')
920
+ if not item_id:
921
+ continue
922
+
923
+ format_type = 'original'
924
+ model_prediction = item.get('model_prediction', {})
925
+
926
+ review_text = model_prediction.get('raw_text', '')
927
+
928
+ if review_text is None:
929
+ continue
930
+
931
+ # Detect format: agenticreview has raw_text as string with "Overall rating: x"
932
+ # ai_researcher format has raw_text as dict or JSON string with structured fields
933
+ is_agenticreview = False
934
+ if isinstance(review_text, str):
935
+ # Check if it's a JSON string (ai_researcher format)
936
+ try:
937
+ parsed_json = json.loads(review_text)
938
+ if isinstance(parsed_json, dict) and any(key in parsed_json for key in ["Summary", "Strengths", "Overall", "Decision"]):
939
+ # It's ai_researcher format
940
+ review_text = parsed_json
941
+ review_text, meta_review = convert_ai_researcher(review_text)
942
+ else:
943
+ # It's agenticreview format (plain text with "Overall rating: x")
944
+ is_agenticreview = True
945
+ except (json.JSONDecodeError, TypeError):
946
+ # Not JSON, check if it contains "Overall rating:" pattern
947
+ if re.search(r'Overall\s+rating\s*[:=]', review_text, re.IGNORECASE):
948
+ is_agenticreview = True
949
+ else:
950
+ # Try to parse as ai_researcher anyway
951
+ try:
952
+ review_text = json.loads(review_text)
953
+ review_text, meta_review = convert_ai_researcher(review_text)
954
+                             except Exception:
955
+ review_text = 'Empty Review'
956
+ meta_review = {}
957
+ elif isinstance(review_text, dict):
958
+ # It's ai_researcher format (dict)
959
+ review_text, meta_review = convert_ai_researcher(review_text)
960
+ else:
961
+ review_text = 'Empty Review'
962
+ meta_review = {}
963
+
964
+ # Handle agenticreview format
965
+ if is_agenticreview:
966
+ review_text, meta_review = convert_agenticreview(review_text)
967
+
968
+ # Extract initial scores
969
+ # Use meta_review as primary source (from convert_ai_researcher or convert_agenticreview)
970
+ # Fallback to model_prediction.get('decision') if not in meta_review
971
+ initial_scores = {
972
+ 'rating': meta_review.get('rating'),
973
+ 'soundness': meta_review.get('soundness'),
974
+ 'presentation': meta_review.get('presentation'),
975
+ 'contribution': meta_review.get('contribution'),
976
+ 'confidence': meta_review.get('confidence'),
977
+ 'decision': meta_review.get('decision') or model_prediction.get('decision'),
978
+ }
979
+ else:
980
+ # Legacy format (pred_fast_mode)
981
+ item_id = item.get('id')
982
+ if not item_id:
983
+ continue
984
+
985
+ format_type = 'legacy'
986
+ review_dict = item.get('pred_fast_mode', {})
987
+ if isinstance(review_dict, dict):
988
+ review_text = review_dict.get('raw_text', '')
989
+ else:
990
+ review_text = str(review_dict)
991
+
992
+ # Extract review content from the review text field
993
+ try:
994
+ if review_text:
995
+ # extracted_review = ReviewProcessor.extract_review_content(review_text)
996
+ extracted_review = review_text
997
+ else:
998
+ extracted_review = ''
999
+
1000
+ reviews_dict[item_id] = {
1001
+ 'review': extracted_review,
1002
+ 'scores': scores,
1003
+ 'initial_scores': initial_scores,
1004
+ 'format': format_type
1005
+ }
1006
+ except Exception as e:
1007
+ print(f"[WARN] Failed to extract review for {item_id}: {e}")
1008
+ continue
1009
+
1010
+ return reviews_dict
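The string-vs-JSON branching above can be distilled into a small classifier. A minimal sketch (the helper name `detect_review_format` is ours, not part of the script):

```python
import json
import re

def detect_review_format(raw_text: str) -> str:
    """Classify a raw review string: 'ai_researcher' if it is a JSON object
    with structured fields, 'agenticreview' if it is plain text carrying an
    'Overall rating:' line, else 'unknown'."""
    try:
        parsed = json.loads(raw_text)
        if isinstance(parsed, dict) and any(
            k in parsed for k in ("Summary", "Strengths", "Overall", "Decision")
        ):
            return "ai_researcher"
    except (json.JSONDecodeError, TypeError):
        pass
    if re.search(r"Overall\s+rating\s*[:=]", raw_text, re.IGNORECASE):
        return "agenticreview"
    return "unknown"

print(detect_review_format('{"Summary": "...", "Decision": "Accept"}'))  # ai_researcher
print(detect_review_format("Great paper. Overall rating: 7"))            # agenticreview
```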
+
+
+ def combine_rubrics_and_reviews(
+     rubrics_data: Dict[str, Dict[str, Any]],
+     reviews_dict: Dict[str, Dict[str, Any]]
+ ) -> List[Dict[str, Any]]:
+     """
+     Combine rubrics and reviews into evaluation entries.
+
+     Args:
+         rubrics_data: Dict mapping paper_id to rubric entry
+         reviews_dict: Dict mapping paper_id to a dict containing 'review', 'scores', 'initial_scores', 'format'
+
+     Returns:
+         List of evaluation entries with model_review, scores, initial_scores, and format info
+     """
+     combined = []
+     missing_reviews = []
+
+     for paper_id, rubric_entry in rubrics_data.items():
+         review_data = reviews_dict.get(paper_id)
+         if not review_data or not review_data.get('review'):
+             missing_reviews.append(paper_id)
+             continue
+
+         entry = {
+             'id': paper_id,
+             'paper_context': rubric_entry.get('paper_context', ''),
+             'decision': rubric_entry.get('decision', ''),
+             'golden_review': rubric_entry.get('golden_review', ''),
+             'rubrics': rubric_entry.get('rubrics', []),
+             'model_review': review_data.get('review', ''),
+             'scores': review_data.get('scores'),  # Refined scores (if available)
+             'initial_scores': review_data.get('initial_scores'),  # Initial scores (if available)
+             'format': review_data.get('format', 'unknown')  # Format type
+         }
+         combined.append(entry)
+
+     if missing_reviews:
+         print(f"[WARN] {len(missing_reviews)} papers have no model review; skipping them")
+
+     return combined
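The merge above is a plain join of two dicts keyed by paper id, dropping papers without a usable review. A stripped-down sketch of that pattern (`combine_by_id` is an illustrative name):

```python
def combine_by_id(rubrics: dict, reviews: dict) -> list:
    """Pair each rubric entry with its model review by paper id,
    skipping papers that have no (or an empty) review."""
    combined, missing = [], []
    for paper_id, rubric in rubrics.items():
        review = reviews.get(paper_id)
        if not review or not review.get("review"):
            missing.append(paper_id)
            continue
        combined.append({"id": paper_id, **rubric, **review})
    return combined

rubrics = {"p1": {"rubrics": []}, "p2": {"rubrics": []}}
reviews = {"p1": {"review": "solid work", "format": "original"}}
print(combine_by_id(rubrics, reviews))  # one entry, for "p1" only
```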
+
+
+ # ============================================================================
+ # LLM Service Configuration
+ # ============================================================================
+
+ def load_llm_config(config_path: str) -> Dict[str, Any]:
+     """Load LLM configuration from a YAML file."""
+     with open(config_path, 'r', encoding='utf-8') as f:
+         config = yaml.safe_load(f)
+     return config
+
+
+ def create_llm_service_from_config(config: Dict[str, Any]) -> LLMService:
+     """Create an LLM service from configuration."""
+     mode = config.get('mode', 'gpt').lower()
+
+     if mode == 'gpt':
+         gpt_config = config.get('gpt', {})
+         api_key = gpt_config.get('api_key') or os.getenv('OPENAI_API_KEY')
+         if not api_key:
+             raise ValueError("GPT mode requires api_key in config.yaml or the OPENAI_API_KEY environment variable")
+
+         service = GPTService(
+             api_key=api_key,
+             model_name=gpt_config.get('model_name', 'gpt-4o'),
+             base_url=gpt_config.get('base_url'),
+             timeout=gpt_config.get('timeout', 300)
+         )
+         return service
+
+     elif mode == 'vllm':
+         vllm_config = config.get('vllm', {})
+         service = VLLMService(
+             base_url=vllm_config.get('base_url', 'http://localhost:8000/v1'),
+             api_key=vllm_config.get('api_key', 'dummy-key'),
+             model_name=vllm_config.get('model_name'),
+             timeout=vllm_config.get('timeout', 300),
+             max_concurrent_requests=vllm_config.get('max_concurrent_requests', 64),
+             max_retries=vllm_config.get('max_retries', 3),
+             retry_delay=vllm_config.get('retry_delay', 1.0),
+             retry_backoff=vllm_config.get('retry_backoff', 2.0)
+         )
+         return service
+
+     else:
+         raise ValueError(f"Unknown mode: {mode}. Must be 'gpt' or 'vllm'")
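Taken together, the `.get()` calls above imply a config file shaped roughly like the following. This is a sketch reconstructed from those calls; the values are illustrative, not taken from the repo:

```yaml
mode: vllm            # 'gpt' or 'vllm'
gpt:
  api_key: null       # falls back to the OPENAI_API_KEY environment variable
  model_name: gpt-4o
  base_url: null
  timeout: 300
vllm:
  base_url: http://localhost:8000/v1
  api_key: dummy-key
  model_name: my-served-model
  timeout: 300
  max_concurrent_requests: 64
  max_retries: 3
  retry_delay: 1.0
  retry_backoff: 2.0
```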
+
+
+ # ============================================================================
+ # Main Evaluation Functions
+ # ============================================================================
+
+ def run_semantic_evaluation(
+     evaluation_data: List[Dict[str, Any]],
+     prompt_template: str,
+     llm_service: LLMService,
+     max_workers: int
+ ) -> tuple:
+     """Run semantic evaluation and return results and summary."""
+     print(f"\n{'='*80}")
+     print("RUNNING SEMANTIC EVALUATION")
+     print(f"{'='*80}")
+     print(f"Evaluating {len(evaluation_data)} reviews using {max_workers} workers...")
+
+     results = []
+     with ThreadPoolExecutor(max_workers=max_workers) as executor:
+         future_to_entry = {
+             executor.submit(
+                 evaluate_review_semantic,
+                 entry,
+                 entry['paper_context'],
+                 prompt_template,
+                 llm_service
+             ): entry
+             for entry in evaluation_data
+         }
+
+         for future in tqdm(as_completed(future_to_entry), total=len(evaluation_data), desc="Semantic evaluation"):
+             try:
+                 result = future.result()
+                 results.append(result)
+             except Exception as e:
+                 entry = future_to_entry[future]
+                 print(f"\n[ERROR] Failed to process entry {entry.get('id', 'unknown')}: {e}")
+                 results.append({
+                     'id': entry.get('id', 'unknown'),
+                     'raw_scores': {},
+                     'weighted_scores': {},
+                     'total_score': 0.0,
+                     'error': str(e),
+                     'raw_response': ''
+                 })
+
+     # Calculate statistics
+     valid_results = [r for r in results if 'error' not in r and r.get('weighted_scores')]
+     review_scores = [r.get('total_score', 0.0) for r in valid_results]
+
+     summary = {
+         'total_entries': len(results),
+         'valid_entries': len(valid_results),
+         'failed_entries': len(results) - len(valid_results)
+     }
+
+     if review_scores:
+         summary['overall_score'] = {
+             'mean': sum(review_scores) / len(review_scores),
+             'min': min(review_scores),
+             'max': max(review_scores)
+         }
+
+     # Calculate per-rubric statistics (extract rubric titles from the first entry)
+     if evaluation_data and evaluation_data[0].get('rubrics'):
+         rubric_titles = [r['title'] for r in evaluation_data[0]['rubrics']]
+         per_rubric_stats = calculate_per_rubric_statistics(valid_results, rubric_titles)
+         summary['per_rubric_statistics'] = per_rubric_stats
+
+     return results, summary
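The fan-out pattern above (submit one future per entry, collect with `as_completed`, degrade failures into error records instead of aborting) can be shown in isolation. A minimal sketch with a hypothetical `work` callable standing in for `evaluate_review_semantic`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fan_out(entries, work, max_workers=4):
    """Submit one task per entry; collect results as they finish and
    turn per-task failures into error records rather than raising."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(work, e): e for e in entries}
        for future in as_completed(futures):
            entry = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                results.append({"id": entry.get("id", "unknown"), "error": str(exc)})
    return results

def score(entry):
    # Stand-in for the real per-review evaluator.
    if entry["id"] == "bad":
        raise ValueError("boom")
    return {"id": entry["id"], "total_score": 1.0}

out = fan_out([{"id": "a"}, {"id": "bad"}], score)
```

Because results arrive in completion order, downstream code should key on `id` rather than position.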
+
+
+ def run_auto_metric_evaluation(
+     evaluation_data: List[Dict[str, Any]],
+     strict_mode: bool = False
+ ) -> tuple:
+     """
+     Run auto-metric evaluation and return results and summary.
+
+     For the refined format (has scores and initial_scores), evaluates both:
+     - Refined scores
+     - Initial scores
+
+     For the original format (only initial_scores), evaluates:
+     - Initial scores only
+
+     Returns:
+         Tuple of (results_list, summary_dict)
+         - results_list: List of evaluation results (may contain both refined and initial results for the refined format)
+         - summary_dict: Summary statistics
+     """
+     print(f"\n{'='*80}")
+     print("RUNNING AUTO-METRIC EVALUATION")
+     print(f"{'='*80}")
+     print(f"Evaluating {len(evaluation_data)} entries...")
+
+     # Detect format types
+     refined_format_count = sum(1 for e in evaluation_data if e.get('format') == 'refined')
+     original_format_count = sum(1 for e in evaluation_data if e.get('format') == 'original')
+
+     if refined_format_count > 0:
+         print(f"Detected {refined_format_count} entries in refined format (will evaluate both refined and initial scores)")
+     if original_format_count > 0:
+         print(f"Detected {original_format_count} entries in original format (will evaluate initial scores only)")
+
+     results = []
+     for entry in tqdm(evaluation_data, desc="Auto-metric evaluation"):
+         format_type = entry.get('format', 'unknown')
+
+         if format_type == 'refined':
+             # Evaluate both refined scores and initial scores
+             try:
+                 entry_id = entry.get('id', 'unknown')
+
+                 # Evaluate refined scores
+                 refined_result = evaluate_review_auto_metric(entry, use_initial_scores=False, strict_mode=strict_mode)
+                 refined_result['paper_id'] = entry_id  # Keep original paper_id
+                 refined_result['id'] = f"{entry_id}_refined"
+                 results.append(refined_result)
+
+                 # Evaluate initial scores
+                 initial_result = evaluate_review_auto_metric(entry, use_initial_scores=True, strict_mode=strict_mode)
+                 initial_result['paper_id'] = entry_id  # Keep original paper_id
+                 initial_result['id'] = f"{entry_id}_initial"
+                 results.append(initial_result)
+             except Exception as e:
+                 print(f"Error evaluating entry {entry.get('id', 'unknown')}: {e}")
+                 results.append({
+                     'id': entry.get('id', 'unknown'),
+                     'error': str(e)
+                 })
+         else:
+             # Evaluate initial scores only (or extract from markdown)
+             try:
+                 result = evaluate_review_auto_metric(entry, use_initial_scores=False, strict_mode=strict_mode)
+                 results.append(result)
+             except Exception as e:
+                 print(f"Error evaluating entry {entry.get('id', 'unknown')}: {e}")
+                 results.append({
+                     'id': entry.get('id', 'unknown'),
+                     'error': str(e)
+                 })
+
+     # Calculate statistics
+     valid_results = [r for r in results if 'error' not in r]
+     mse_results = [r for r in valid_results if r.get('overall_error') is not None]
+
+     # Separate refined and initial results for the refined format
+     refined_results = [r for r in valid_results if r.get('score_type') == 'refined']
+     initial_results = [r for r in valid_results if r.get('score_type') == 'initial']
+     auto_results = [r for r in valid_results if r.get('score_type') == 'auto' or r.get('score_type') is None]
+
+     summary = {
+         'total_entries': len(results),
+         'valid_entries': len(valid_results),
+         'mse_entries': len(mse_results),
+         'refined_results_count': len(refined_results),
+         'initial_results_count': len(initial_results),
+         'auto_results_count': len(auto_results)
+     }
+
+     # Calculate MSE/MAE statistics.
+     # For the refined format, only use refined results for overall statistics (avoid double counting);
+     # for other formats, use all results.
+     if refined_format_count > 0:
+         stats_results = [r for r in refined_results if r.get('overall_error') is not None]
+     else:
+         stats_results = mse_results
+
+     if stats_results:
+         dimensions = ['soundness', 'presentation', 'confidence', 'rating']
+         mse_stats = {}
+         mae_stats = {}
+
+         for dim in dimensions:
+             mse_list = [r.get(f'{dim}_mse') for r in stats_results if r.get(f'{dim}_mse') is not None]
+             mae_list = [r.get(f'{dim}_mae') for r in stats_results if r.get(f'{dim}_mae') is not None]
+
+             mse_clean = [x for x in mse_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+             mae_clean = [x for x in mae_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+
+             if mse_clean:
+                 mse_stats[dim] = {
+                     'mean': sum(mse_clean) / len(mse_clean),
+                     'count': len(mse_clean)
+                 }
+             if mae_clean:
+                 mae_stats[dim] = {
+                     'mean': sum(mae_clean) / len(mae_clean),
+                     'count': len(mae_clean)
+                 }
+
+         overall_errors = [r.get('overall_error') for r in stats_results if r.get('overall_error') is not None]
+         overall_clean = [x for x in overall_errors if x is not None and not (isinstance(x, float) and math.isnan(x))]
+
+         if overall_clean:
+             summary['overall_error'] = {
+                 'mean': sum(overall_clean) / len(overall_clean),
+                 'count': len(overall_clean)
+             }
+
+         summary['mse_statistics'] = mse_stats
+         summary['mae_statistics'] = mae_stats
+
+     # Calculate separate statistics for refined and initial results
+     if refined_results:
+         refined_mse_results = [r for r in refined_results if r.get('overall_error') is not None]
+         if refined_mse_results:
+             refined_mse_stats = {}
+             refined_mae_stats = {}
+             for dim in ['soundness', 'presentation', 'confidence', 'rating']:
+                 mse_list = [r.get(f'{dim}_mse') for r in refined_mse_results if r.get(f'{dim}_mse') is not None]
+                 mae_list = [r.get(f'{dim}_mae') for r in refined_mse_results if r.get(f'{dim}_mae') is not None]
+                 mse_clean = [x for x in mse_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+                 mae_clean = [x for x in mae_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+                 if mse_clean:
+                     refined_mse_stats[dim] = {'mean': sum(mse_clean) / len(mse_clean), 'count': len(mse_clean)}
+                 if mae_clean:
+                     refined_mae_stats[dim] = {'mean': sum(mae_clean) / len(mae_clean), 'count': len(mae_clean)}
+             summary['refined_mse_statistics'] = refined_mse_stats
+             summary['refined_mae_statistics'] = refined_mae_stats
+
+     if initial_results:
+         initial_mse_results = [r for r in initial_results if r.get('overall_error') is not None]
+         if initial_mse_results:
+             initial_mse_stats = {}
+             initial_mae_stats = {}
+             for dim in ['soundness', 'presentation', 'confidence', 'rating']:
+                 mse_list = [r.get(f'{dim}_mse') for r in initial_mse_results if r.get(f'{dim}_mse') is not None]
+                 mae_list = [r.get(f'{dim}_mae') for r in initial_mse_results if r.get(f'{dim}_mae') is not None]
+                 mse_clean = [x for x in mse_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+                 mae_clean = [x for x in mae_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+                 if mse_clean:
+                     initial_mse_stats[dim] = {'mean': sum(mse_clean) / len(mse_clean), 'count': len(mse_clean)}
+                 if mae_clean:
+                     initial_mae_stats[dim] = {'mean': sum(mae_clean) / len(mae_clean), 'count': len(mae_clean)}
+             summary['initial_mse_statistics'] = initial_mse_stats
+             summary['initial_mae_statistics'] = initial_mae_stats
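Every mean in the statistics above first drops `None` and NaN entries. That recurring filter-then-average step, extracted into one helper for illustration (`nan_safe_mean` is our name, not the script's):

```python
import math

def nan_safe_mean(values):
    """Mean over entries that are neither None nor NaN; None when nothing survives."""
    clean = [v for v in values
             if v is not None and not (isinstance(v, float) and math.isnan(v))]
    if not clean:
        return None
    return sum(clean) / len(clean)

print(nan_safe_mean([1.0, None, float("nan"), 3.0]))  # 2.0
```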
+
+     # Filter out (ground truth, prediction) pairs where either value is None or NaN
+     def filter_valid_pairs(true_list, pred_list):
+         filtered_true = []
+         filtered_pred = []
+         for t, p in zip(true_list, pred_list):
+             if (t is not None and p is not None and
+                     not (isinstance(t, float) and math.isnan(t)) and
+                     not (isinstance(p, float) and math.isnan(p))):
+                 filtered_true.append(t)
+                 filtered_pred.append(p)
+         return filtered_true, filtered_pred
+
+     # Calculate Spearman correlations.
+     # For the refined format, calculate separately for refined and initial, and use refined for overall;
+     # for other formats, use all results.
+     def compute_spearman_stats(correlation_results):
+         """Per-dimension Spearman correlation between ground-truth and predicted scores."""
+         stats = {}
+         for dim in ['soundness', 'presentation', 'confidence', 'rating']:
+             true_values = [r.get(f'gt_{dim}') for r in correlation_results]
+             pred_values = [r.get(f'model_{dim}') for r in correlation_results]
+             true_clean, pred_clean = filter_valid_pairs(true_values, pred_values)
+             if len(true_clean) >= 2 and len(pred_clean) >= 2:
+                 try:
+                     corr, _ = spearmanr(true_clean, pred_clean)
+                     if not math.isnan(corr):
+                         stats[dim] = {
+                             'correlation': corr,
+                             'count': len(true_clean)
+                         }
+                 except Exception:
+                     pass
+         return stats
+
+     if refined_format_count > 0:
+         refined_spearman_stats = compute_spearman_stats(refined_results)
+         initial_spearman_stats = compute_spearman_stats(initial_results)
+         # Use refined for overall statistics (avoid double counting)
+         summary['spearman_correlations'] = refined_spearman_stats
+         summary['refined_spearman_correlations'] = refined_spearman_stats
+         summary['initial_spearman_correlations'] = initial_spearman_stats
+     else:
+         # Original/other formats: use all results
+         summary['spearman_correlations'] = compute_spearman_stats(valid_results)
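The key detail in the pair filtering above is that ground truth and predictions must be dropped at the same index positions, otherwise the correlation is computed over misaligned lists. A standalone illustration of that invariant:

```python
import math

def filter_valid_pairs(true_list, pred_list):
    """Drop index positions where either side is None or NaN, keeping the
    remaining ground-truth/prediction values aligned."""
    pairs = [
        (t, p) for t, p in zip(true_list, pred_list)
        if t is not None and p is not None
        and not (isinstance(t, float) and math.isnan(t))
        and not (isinstance(p, float) and math.isnan(p))
    ]
    true_clean = [t for t, _ in pairs]
    pred_clean = [p for _, p in pairs]
    return true_clean, pred_clean

# Only index 0 survives: the others have a None or NaN on one side.
print(filter_valid_pairs([4, None, 3, float("nan")], [5, 2, None, 1]))  # ([4], [5])
```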
1421
+
1422
+ # Calculate Decision metrics
1423
+ # For refined format, calculate separately for refined and initial, and use refined for overall
1424
+ # For other formats, use all results
1425
+ if refined_format_count > 0:
1426
+ # Calculate refined decision metrics
1427
+ refined_decision_results = [r for r in refined_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
1428
+ if refined_decision_results:
1429
+ true_decisions = []
1430
+ pred_decisions = []
1431
+ decision_acc = []
1432
+
1433
+ for r in refined_decision_results:
1434
+ gt_decision = str(r.get('gt_decision', '')).lower().strip()
1435
+ pred_decision = str(r.get('model_decision', '')).lower().strip()
1436
+
1437
+ if 'accept' in pred_decision:
1438
+ pred_binary = 1
1439
+ else:
1440
+ pred_binary = 0
1441
+
1442
+ if 'accept' in gt_decision:
1443
+ gt_binary = 1
1444
+ else:
1445
+ gt_binary = 0
1446
+
1447
+ true_decisions.append(gt_binary)
1448
+ pred_decisions.append(pred_binary)
1449
+
1450
+ if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
1451
+ decision_acc.append(1.0)
1452
+ else:
1453
+ decision_acc.append(0.0)
1454
+
1455
+ if decision_acc:
1456
+ decision_accuracy = sum(decision_acc) / len(decision_acc)
1457
+ try:
1458
+ _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
1459
+ refined_decision_metrics = {
1460
+ 'accuracy': decision_accuracy,
1461
+ 'f1_macro': f1_score,
1462
+ 'count': len(decision_acc)
1463
+ }
1464
+ except Exception:
1465
+ refined_decision_metrics = {
1466
+ 'accuracy': decision_accuracy,
1467
+ 'count': len(decision_acc)
1468
+ }
1469
+ summary['refined_decision_metrics'] = refined_decision_metrics
1470
+ summary['decision_metrics'] = refined_decision_metrics # Use refined for overall
1471
+
1472
+ # Calculate initial decision metrics
1473
+ initial_decision_results = [r for r in initial_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
1474
+ if initial_decision_results:
1475
+ true_decisions = []
1476
+ pred_decisions = []
1477
+ decision_acc = []
1478
+
1479
+ for r in initial_decision_results:
1480
+ gt_decision = str(r.get('gt_decision', '')).lower().strip()
1481
+ pred_decision = str(r.get('model_decision', '')).lower().strip()
1482
+
1483
+ if 'accept' in pred_decision:
1484
+ pred_binary = 1
1485
+ else:
1486
+ pred_binary = 0
1487
+
1488
+ if 'accept' in gt_decision:
1489
+ gt_binary = 1
1490
+ else:
1491
+ gt_binary = 0
1492
+
1493
+ true_decisions.append(gt_binary)
1494
+ pred_decisions.append(pred_binary)
1495
+
1496
+ if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
1497
+ decision_acc.append(1.0)
1498
+ else:
1499
+ decision_acc.append(0.0)
1500
+
1501
+ if decision_acc:
1502
+ decision_accuracy = sum(decision_acc) / len(decision_acc)
1503
+ try:
1504
+ _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
1505
+ initial_decision_metrics = {
1506
+ 'accuracy': decision_accuracy,
1507
+ 'f1_macro': f1_score,
1508
+ 'count': len(decision_acc)
1509
+ }
1510
+ except Exception:
1511
+ initial_decision_metrics = {
1512
+ 'accuracy': decision_accuracy,
1513
+ 'count': len(decision_acc)
1514
+ }
1515
+ summary['initial_decision_metrics'] = initial_decision_metrics
1516
+ else:
1517
+ # Original/other formats: use all results
1518
+ decision_results = [r for r in valid_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
1519
+ if decision_results:
1520
+ true_decisions = []
1521
+ pred_decisions = []
1522
+ decision_acc = []
1523
+
1524
+ for r in decision_results:
1525
+ gt_decision = str(r.get('gt_decision', '')).lower().strip()
1526
+ pred_decision = str(r.get('model_decision', '')).lower().strip()
1527
+
1528
+ if 'accept' in pred_decision:
1529
+ pred_binary = 1
1530
+ else:
1531
+ pred_binary = 0
1532
+
1533
+ if 'accept' in gt_decision:
1534
+ gt_binary = 1
1535
+ else:
1536
+ gt_binary = 0
1537
+
1538
+ true_decisions.append(gt_binary)
1539
+ pred_decisions.append(pred_binary)
1540
+
1541
+ if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
1542
+ decision_acc.append(1.0)
1543
+ else:
1544
+ decision_acc.append(0.0)
1545
+
1546
+ if decision_acc:
1547
+ decision_accuracy = sum(decision_acc) / len(decision_acc)
1548
+ try:
1549
+ _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
1550
+ summary['decision_metrics'] = {
1551
+ 'accuracy': decision_accuracy,
1552
+ 'f1_macro': f1_score,
1553
+ 'count': len(decision_acc)
1554
+ }
1555
+ except Exception:
1556
+ summary['decision_metrics'] = {
1557
+ 'accuracy': decision_accuracy,
1558
+ 'count': len(decision_acc)
1559
+ }
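The decision handling above reduces free-form decision strings to a binary label by substring test, plus a looser agreement check for accuracy. Isolated for illustration (both helper names are ours):

```python
def binarize_decision(decision) -> int:
    """Map a free-form decision string to 1 (accept) / 0 (reject or other),
    mirroring the substring test used above."""
    return 1 if 'accept' in str(decision).lower().strip() else 0

def decisions_agree(pred: str, gt: str) -> bool:
    """Exact match, or both sides mention 'accept', or both mention 'reject'."""
    pred, gt = pred.lower().strip(), gt.lower().strip()
    return pred == gt or ('accept' in pred and 'accept' in gt) or ('reject' in pred and 'reject' in gt)

print(binarize_decision("Accept (poster)"))  # 1
print(decisions_agree("Reject", "reject"))   # True
```

Note the substring test treats any unrecognized string as a reject, which is why ground-truth decisions outside the accept/reject vocabulary still land in the binary confusion matrix.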
+
+     # Calculate pairwise comparison.
+     # For the refined format, only use refined results (avoid double counting);
+     # for other formats, use all results.
+     if refined_format_count > 0:
+         pairwise_results = refined_results
+     else:
+         pairwise_results = valid_results
+
+     paper_scores = []
+     for r in pairwise_results:
+         if (r.get('gt_rating') is not None and r.get('model_rating') is not None) or \
+            (r.get('gt_soundness') is not None and r.get('model_soundness') is not None):
+             paper_scores.append({
+                 'true_rating': r.get('gt_rating'),
+                 'pred_rating': r.get('model_rating'),
+                 'true_soundness': r.get('gt_soundness'),
+                 'pred_soundness': r.get('model_soundness'),
+                 'true_presentation': r.get('gt_presentation'),
+                 'pred_presentation': r.get('model_presentation'),
+                 'true_confidence': r.get('gt_confidence'),
+                 'pred_confidence': r.get('model_confidence')
+             })
+
+     if len(paper_scores) >= 2:
+         pairwise_accuracies = calculate_pairwise_accuracies(paper_scores)
+         summary['pairwise_accuracies'] = pairwise_accuracies
+
+     return results, summary
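`calculate_pairwise_accuracies` is defined elsewhere in the repo; as a sketch of what a pairwise ranking metric over `paper_scores` typically computes (our assumption about its semantics, not a copy of the real implementation): the fraction of paper pairs whose predicted ordering matches the ground-truth ordering.

```python
from itertools import combinations

def pairwise_rank_accuracy(scores, true_key="true_rating", pred_key="pred_rating"):
    """Fraction of paper pairs whose predicted ordering matches the
    ground-truth ordering; pairs with missing values or ground-truth
    ties are skipped. Illustrative stand-in only."""
    correct = total = 0
    for a, b in combinations(scores, 2):
        if None in (a[true_key], b[true_key], a[pred_key], b[pred_key]):
            continue
        if a[true_key] == b[true_key]:  # ground-truth tie: no ordering to test
            continue
        total += 1
        correct += (a[true_key] > b[true_key]) == (a[pred_key] > b[pred_key])
    return correct / total if total else None

papers = [
    {"true_rating": 6, "pred_rating": 7},
    {"true_rating": 3, "pred_rating": 4},
    {"true_rating": 8, "pred_rating": 5},
]
print(pairwise_rank_accuracy(papers))
```

On this toy input two of the three pairs are ordered correctly; the (6, 8) pair is inverted by the predictions.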
+
+
+ # ============================================================================
+ # Main Function
+ # ============================================================================
+
+ def parse_args():
+     """Parse command line arguments."""
+     parser = argparse.ArgumentParser(description="Unified evaluation script for semantic and auto-metric evaluation")
+
+     # Input paths
+     parser.add_argument("--rubrics_path", type=str, required=True,
+                         help="Path to eval_rubrics.json file (from 1_generate_review_based_rubrics.py)")
+     parser.add_argument("--reviews_path", type=str, required=True,
+                         help="Path to JSON file with model reviews (contains pred_fast_mode)")
+
+     # Evaluation mode
+     parser.add_argument("--mode", type=str, choices=["semantic", "auto_metric", "both"], default="both",
+                         help="Evaluation mode: semantic (LLM-based), auto_metric (rule-based), or both")
+
+     # Output paths
+     parser.add_argument("--semantic_output", type=str, default=None,
+                         help="Path to output JSON file for semantic evaluation results (required if mode is semantic or both)")
+     parser.add_argument("--auto_metric_output", type=str, default=None,
+                         help="Path to output JSON file for auto-metric evaluation results (required if mode is auto_metric or both)")
+
+     # Semantic evaluation settings
+     parser.add_argument("--yaml_path", type=str, default=None,
+                         help="Path to prompts.yaml file (required for semantic evaluation)")
+     parser.add_argument("--config_path", type=str, default=None,
+                         help="Path to config.yaml file (required for semantic evaluation)")
+
+     # Multi-threading
+     parser.add_argument("--max_workers", type=int, default=None,
+                         help="Maximum number of worker threads for semantic evaluation (default: 5)")
+
+     # Strict mode (normalize scores to discrete scales)
+     parser.add_argument("--strict_mode", action="store_true", default=False,
+                         help="Enable strict mode: normalize scores to discrete scales before computing metrics (default: False)")
+
+     # Input format override
+     parser.add_argument("--input_format", type=str, choices=['auto', 'refined', 'original'], default='auto',
+                         help="Manually specify input JSON format: 'refined' (has scores and initial_scores), 'original' (has model_prediction), or 'auto' for auto-detection (default: 'auto')")
+
+     return parser.parse_args()
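A minimal mirror of the parser above, showing how the required flags, choices, and the `--strict_mode` toggle behave (only a subset of the real arguments, driven here with an explicit argv list for demonstration):

```python
import argparse

parser = argparse.ArgumentParser(description="Unified evaluation script")
parser.add_argument("--rubrics_path", required=True)
parser.add_argument("--reviews_path", required=True)
parser.add_argument("--mode", choices=["semantic", "auto_metric", "both"], default="both")
parser.add_argument("--strict_mode", action="store_true")

args = parser.parse_args([
    "--rubrics_path", "eval_rubrics.json",
    "--reviews_path", "ours.json",
    "--mode", "auto_metric",
])
print(args.mode, args.strict_mode)  # auto_metric False
```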
+
+
+ def main():
+     """Main execution function."""
+     args = parse_args()
+
+     script_dir = os.path.dirname(os.path.abspath(__file__))
+
+     # Resolve paths
+     rubrics_path = args.rubrics_path
+     if not os.path.isabs(rubrics_path):
+         rubrics_path = os.path.join(script_dir, rubrics_path)
+
+     reviews_path = args.reviews_path
+     if not os.path.isabs(reviews_path):
+         reviews_path = os.path.join(script_dir, reviews_path)
+
+     max_workers = args.max_workers or int(os.getenv("MAX_WORKERS", "5"))
+
+     # Validate mode and output paths
+     if args.mode in ["semantic", "both"]:
+         if not args.semantic_output:
+             raise ValueError("--semantic_output is required when mode is 'semantic' or 'both'")
+         if not args.yaml_path:
+             raise ValueError("--yaml_path is required for semantic evaluation")
+         if not args.config_path:
+             raise ValueError("--config_path is required for semantic evaluation")
+
+     if args.mode in ["auto_metric", "both"]:
+         if not args.auto_metric_output:
+             raise ValueError("--auto_metric_output is required when mode is 'auto_metric' or 'both'")
+
+     # Check that the input files exist
+     if not os.path.exists(rubrics_path):
+         raise FileNotFoundError(f"Rubrics file not found: {rubrics_path}")
+     if not os.path.exists(reviews_path):
+         raise FileNotFoundError(f"Reviews file not found: {reviews_path}")
+
+     # Load data
+     print(f"Loading rubrics from {rubrics_path}...")
+     rubrics_data = load_rubrics_json(rubrics_path)
+     print(f"Loaded {len(rubrics_data)} rubrics entries")
+
+     print(f"Loading model reviews from {reviews_path}...")
+     if args.input_format != 'auto':
+         print(f"Using manually specified format: {args.input_format}")
+     else:
+         print("Auto-detecting input format...")
+     reviews_dict = load_model_reviews_json(reviews_path, format_override=args.input_format if args.input_format != 'auto' else None)
+     print(f"Loaded {len(reviews_dict)} model reviews")
+
+     # Combine rubrics and reviews
+     print("Combining rubrics and reviews...")
+     evaluation_data = combine_rubrics_and_reviews(rubrics_data, reviews_dict)
+     print(f"Prepared {len(evaluation_data)} entries for evaluation")
+
+     # Run evaluations based on mode
+     if args.mode in ["semantic", "both"]:
+         # Resolve semantic evaluation paths
+         yaml_path = args.yaml_path
+         if not os.path.isabs(yaml_path):
+             yaml_path = os.path.join(script_dir, yaml_path)
+
+         config_path = args.config_path
+         if not os.path.isabs(config_path):
+             config_path = os.path.join(script_dir, config_path)
+
+         if not os.path.exists(yaml_path):
+             raise FileNotFoundError(f"YAML file not found: {yaml_path}")
+         if not os.path.exists(config_path):
+             raise FileNotFoundError(f"Config file not found: {config_path}")
+
+         # Load prompt template
+         print(f"Loading prompt template from {yaml_path}...")
+         prompt_template = load_prompt_template(yaml_path)
+         if not prompt_template:
+             raise ValueError("Could not find 'v1_evaluator_prompt' in YAML file")
+
+         # Initialize LLM service
+         print(f"Loading LLM configuration from {config_path}...")
+         llm_config = load_llm_config(config_path)
+         llm_service = create_llm_service_from_config(llm_config)
+         mode = llm_config.get('mode', 'gpt')
+         print(f"LLM service initialized (mode: {mode})")
+         if hasattr(llm_service, 'model_name'):
+             print(f"Using model: {llm_service.model_name}")
+
+         # Run semantic evaluation
+         semantic_results, semantic_summary = run_semantic_evaluation(
+             evaluation_data, prompt_template, llm_service, max_workers
+         )
+
+         # Save semantic results
+         semantic_output = args.semantic_output
+         if not os.path.isabs(semantic_output):
+             semantic_output = os.path.join(script_dir, semantic_output)
+
+         output_dir = os.path.dirname(semantic_output)
+         if output_dir:
+             os.makedirs(output_dir, exist_ok=True)
+
+         with open(semantic_output, 'w', encoding='utf-8') as f:
+             json.dump(semantic_results, f, ensure_ascii=False, indent=2)
+         print(f"\nSemantic evaluation results saved to {semantic_output}")
+
+         # Save semantic summary
+         semantic_summary_path = semantic_output.replace('.json', '_summary.json')
+         with open(semantic_summary_path, 'w', encoding='utf-8') as f:
+             json.dump(semantic_summary, f, ensure_ascii=False, indent=2)
+         print(f"Semantic evaluation summary saved to {semantic_summary_path}")
+
+         # Print semantic summary
+         print("\n" + "="*80)
+         print("SEMANTIC EVALUATION SUMMARY")
+         print("="*80)
+         print(f"Total entries: {semantic_summary['total_entries']}")
+         print(f"Valid entries: {semantic_summary['valid_entries']}")
+         print(f"Failed entries: {semantic_summary['failed_entries']}")
+         if 'overall_score' in semantic_summary:
+             score = semantic_summary['overall_score']
+             print("\nOverall Score:")
+             print(f"  Mean: {score['mean']:.2f}")
+             print(f"  Min: {score['min']:.2f}")
+             print(f"  Max: {score['max']:.2f}")
1757
+
1758
+ if args.mode in ["auto_metric", "both"]:
1759
+ # Run auto-metric evaluation
1760
+ auto_metric_results, auto_metric_summary = run_auto_metric_evaluation(
1761
+ evaluation_data,
1762
+ strict_mode=args.strict_mode
1763
+ )
1764
+
1765
+ # Save auto-metric results
1766
+ auto_metric_output = args.auto_metric_output
1767
+ if not os.path.isabs(auto_metric_output):
1768
+ auto_metric_output = os.path.join(script_dir, auto_metric_output)
1769
+
1770
+ output_dir = os.path.dirname(auto_metric_output)
1771
+ os.makedirs(output_dir, exist_ok=True)
1772
+
1773
+ with open(auto_metric_output, 'w', encoding='utf-8') as f:
1774
+ json.dump(auto_metric_results, f, ensure_ascii=False, indent=2)
1775
+ print(f"\nAuto-metric evaluation results saved to {auto_metric_output}")
1776
+
1777
+ # Save auto-metric summary
1778
+ auto_metric_summary_path = auto_metric_output.replace('.json', '_summary.json')
1779
+ with open(auto_metric_summary_path, 'w', encoding='utf-8') as f:
1780
+ json.dump(auto_metric_summary, f, ensure_ascii=False, indent=2)
1781
+ print(f"Auto-metric evaluation summary saved to {auto_metric_summary_path}")
1782
+
1783
+ # Print auto-metric summary
1784
+ print("\n" + "="*80)
1785
+ print("AUTO-METRIC EVALUATION SUMMARY")
1786
+ print("="*80)
1787
+ print(f"Total entries: {auto_metric_summary['total_entries']}")
1788
+ print(f"Valid entries: {auto_metric_summary['valid_entries']}")
1789
+ print(f"MSE entries: {auto_metric_summary['mse_entries']}")
1790
+
1791
+ if 'mse_statistics' in auto_metric_summary:
1792
+ print("\nMSE Statistics:")
1793
+ for dim, stats in auto_metric_summary['mse_statistics'].items():
1794
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1795
+
1796
+ if 'mae_statistics' in auto_metric_summary:
1797
+ print("\nMAE Statistics:")
1798
+ for dim, stats in auto_metric_summary['mae_statistics'].items():
1799
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1800
+
1801
+ # Print refined and initial statistics if available
1802
+ if 'refined_mse_statistics' in auto_metric_summary:
1803
+ print("\nRefined Scores - MSE Statistics:")
1804
+ for dim, stats in auto_metric_summary['refined_mse_statistics'].items():
1805
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1806
+
1807
+ if 'refined_mae_statistics' in auto_metric_summary:
1808
+ print("\nRefined Scores - MAE Statistics:")
1809
+ for dim, stats in auto_metric_summary['refined_mae_statistics'].items():
1810
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1811
+
1812
+ if 'initial_mse_statistics' in auto_metric_summary:
1813
+ print("\nInitial Scores - MSE Statistics:")
1814
+ for dim, stats in auto_metric_summary['initial_mse_statistics'].items():
1815
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1816
+
1817
+ if 'initial_mae_statistics' in auto_metric_summary:
1818
+ print("\nInitial Scores - MAE Statistics:")
1819
+ for dim, stats in auto_metric_summary['initial_mae_statistics'].items():
1820
+ print(f" {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
1821
+
1822
+ if 'spearman_correlations' in auto_metric_summary:
1823
+ print("\nSpearman Correlations:")
1824
+ for dim, stats in auto_metric_summary['spearman_correlations'].items():
1825
+ print(f" {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
1826
+
1827
+ # Print refined and initial spearman correlations if available
1828
+ if 'refined_spearman_correlations' in auto_metric_summary:
1829
+ print("\nRefined Scores - Spearman Correlations:")
1830
+ for dim, stats in auto_metric_summary['refined_spearman_correlations'].items():
1831
+ print(f" {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
1832
+
1833
+ if 'initial_spearman_correlations' in auto_metric_summary:
1834
+ print("\nInitial Scores - Spearman Correlations:")
1835
+ for dim, stats in auto_metric_summary['initial_spearman_correlations'].items():
1836
+ print(f" {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
1837
+
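The Spearman numbers printed above can be sanity-checked by hand. The sketch below reimplements the rank correlation in pure Python on a toy, invented pair of score lists (the script itself uses `scipy.stats.spearmanr`); the closed-form `1 - 6*d2/(n*(n^2-1))` formula used here is only valid when there are no tied ranks.

```python
# Pure-Python rank-correlation sketch (toy data; no ties, so the
# closed-form Spearman formula applies).
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

pred = [6, 4, 8, 2]
truth = [5, 3, 10, 1]
n = len(pred)
d2 = sum((a - b) ** 2 for a, b in zip(ranks(pred), ranks(truth)))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
# identical orderings -> rho == 1.0
```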
1838
+ if 'decision_metrics' in auto_metric_summary:
1839
+ dm = auto_metric_summary['decision_metrics']
1840
+ print(f"\nDecision Metrics:")
1841
+ print(f" Accuracy: {dm['accuracy']:.4f} (n={dm['count']})")
1842
+ if 'f1_macro' in dm:
1843
+ print(f" F1 (macro): {dm['f1_macro']:.4f}")
1844
+
1845
+ # Print refined and initial decision metrics if available
1846
+ if 'refined_decision_metrics' in auto_metric_summary:
1847
+ print("\nRefined Scores - Decision Metrics:")
1848
+ rdm = auto_metric_summary['refined_decision_metrics']
1849
+ print(f" Accuracy: {rdm['accuracy']:.4f} (n={rdm['count']})")
1850
+ if 'f1_macro' in rdm:
1851
+ print(f" F1 (macro): {rdm['f1_macro']:.4f}")
1852
+
1853
+ if 'initial_decision_metrics' in auto_metric_summary:
1854
+ print("\nInitial Scores - Decision Metrics:")
1855
+ idm = auto_metric_summary['initial_decision_metrics']
1856
+ print(f" Accuracy: {idm['accuracy']:.4f} (n={idm['count']})")
1857
+ if 'f1_macro' in idm:
1858
+ print(f" F1 (macro): {idm['f1_macro']:.4f}")
1859
+
1860
+ print("\n" + "="*80)
1861
+ print("EVALUATION COMPLETE")
1862
+ print("="*80)
1863
+
1864
+
1865
+ if __name__ == "__main__":
1866
+ main()
src/evaluator/2_evaluate_cyclereviewer.py ADDED
@@ -0,0 +1,1837 @@
1
+ """
2
+ Unified evaluation script for semantic (LLM-based) and auto_metric (rule-based) evaluation of CycleReviewer-format reviews.
3
+
4
+ This script:
5
+ 1. Reads eval_rubrics.json (from 1_generate_review_based_rubrics.py) containing rubrics for each paper
6
+ 2. Reads input JSON file containing model reviews (supports multiple formats)
7
+ 3. Supports three evaluation modes:
8
+ - semantic: LLM-based rubrics evaluation (from 2_evaluate_direct.py)
9
+ - auto_metric: Rule-based metrics evaluation (from 3_rule_evaluate.py)
10
+ - both: Run both evaluations separately
11
+ 4. Supports strict mode: normalize scores to discrete scales before computing metrics (--strict_mode)
12
+ 5. Outputs separate JSON files for results and summaries
13
+
14
+ Usage:
15
+ # Semantic evaluation only
16
+ python 2_evaluate_cyclereviewer.py \
17
+ --rubrics_path eval_rubrics.json \
18
+ --reviews_path model_reviews.json \
19
+ --mode semantic \
20
+ --yaml_path prompts.yaml \
21
+ --config_path configs.yaml \
22
+ --semantic_output semantic_results.json \
23
+ --max_workers 5
24
+
25
+ # Auto-metric evaluation only
26
+ python 2_evaluate_cyclereviewer.py \
27
+ --rubrics_path eval_rubrics.json \
28
+ --reviews_path model_reviews.json \
29
+ --mode auto_metric \
30
+ --auto_metric_output auto_metric_results.json
31
+
32
+ # Auto-metric evaluation with strict mode (normalize scores to discrete scales)
33
+ python 2_evaluate_cyclereviewer.py \
34
+ --rubrics_path eval_rubrics.json \
35
+ --reviews_path model_reviews.json \
36
+ --mode auto_metric \
37
+ --auto_metric_output auto_metric_results.json \
38
+ --strict_mode
39
+
40
+ # Auto-metric evaluation with manually specified input format (refined)
41
+ python 2_evaluate_cyclereviewer.py \
42
+ --rubrics_path eval_rubrics.json \
43
+ --reviews_path model_reviews.json \
44
+ --mode auto_metric \
45
+ --auto_metric_output auto_metric_results.json \
46
+ --input_format refined
47
+
48
+ # Auto-metric evaluation with manually specified input format (original)
49
+ python 2_evaluate_cyclereviewer.py \
50
+ --rubrics_path eval_rubrics.json \
51
+ --reviews_path ours.json \
52
+ --mode auto_metric \
53
+ --auto_metric_output auto_metric_results.json \
54
+ --input_format original
55
+
56
+ # Both evaluations
57
+ python 2_evaluate_cyclereviewer.py \
58
+ --rubrics_path eval_rubrics.json \
59
+ --reviews_path model_reviews.json \
60
+ --mode both \
61
+ --yaml_path prompts.yaml \
62
+ --config_path configs.yaml \
63
+ --semantic_output semantic_results.json \
64
+ --auto_metric_output auto_metric_results.json \
65
+ --max_workers 32
66
+ """
67
+ from __future__ import annotations
68
+
69
+ import json
70
+ import os
71
+ import sys
72
+ import argparse
73
+ import yaml
74
+ import math
75
+ import re
76
+ from typing import Dict, List, Any, Optional
77
+ from concurrent.futures import ThreadPoolExecutor, as_completed
78
+ from tqdm import tqdm
79
+ from itertools import combinations
80
+ from scipy.stats import spearmanr
81
+ from sklearn.metrics import precision_recall_fscore_support
82
+
83
+ # Add parent directory to path
84
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
85
+ # Import parse_llm_response from local llm_service module
86
+ import llm_service as local_llm_service
87
+ parse_llm_response = local_llm_service.parse_llm_response
88
+
89
+ # Import from shared/utils for gpt/vllm support
90
+ project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
91
+ if project_root not in sys.path:
92
+ sys.path.insert(0, project_root)
93
+
94
+ from shared.utils.llm_service import LLMService
95
+ from shared.utils.vllm_service import VLLMService
96
+ from shared.utils.gpt_service import GPTService
97
+ sys.path.insert(0, os.path.join(project_root, 'shared', 'utils'))
98
+ from json_parser import parse_review_markdown
99
+
100
+ def convert_cyclereviewer(review_text: str) -> tuple:
101
+ """
102
+ Convert the review text from cyclereviewer format to unified review system format.
103
+
104
+ The cyclereviewer format has markdown sections like:
105
+ "## Rating\n\n3: reject, not good enough\n\n## Confidence\n\n4: You are confident...\n\n"
106
+
107
+ Args:
108
+ review_text: Raw review text string (markdown format)
109
+
110
+ Returns:
111
+ Tuple of (formatted_review_text, meta_review_dict)
112
+ """
113
+ # Use parse_review_markdown to extract scores from markdown sections
114
+ parsed = {}
115
+ try:
116
+ parsed = parse_review_markdown(review_text)
117
+ except Exception:
118
+ pass
119
+
120
+ # Extract rating - can be from "## Rating\n\n3: reject..." or "## Score: 6: ..."
121
+ rating = parsed.get('rating')
122
+ if rating is None:
123
+ # Try to extract from "## Rating\n\n3: reject..." format
124
+ rating_match = re.search(r'##\s*Rating\s*\n\n\s*(\d+\.?\d*)\s*:', review_text, re.IGNORECASE | re.MULTILINE)
125
+ if rating_match:
126
+ try:
127
+ rating = float(rating_match.group(1))
128
+ except (ValueError, IndexError):
129
+ pass
130
+
131
+ # Try "## Score: 6: ..." format
132
+ if rating is None:
133
+ score_match = re.search(r'##\s*Score\s*:\s*(\d+\.?\d*)\s*:', review_text, re.IGNORECASE | re.MULTILINE)
134
+ if score_match:
135
+ try:
136
+ rating = float(score_match.group(1))
137
+ except (ValueError, IndexError):
138
+ pass
139
+
140
+ # Extract confidence
141
+ confidence = parsed.get('confidence')
142
+ if confidence is None:
143
+ # Try to extract from "## Confidence\n\n4: You are confident..." format
144
+ confidence_match = re.search(r'##\s*Confidence\s*\n\n\s*(\d+\.?\d*)\s*:', review_text, re.IGNORECASE | re.MULTILINE)
145
+ if confidence_match:
146
+ try:
147
+ confidence = float(confidence_match.group(1))
148
+ except (ValueError, IndexError):
149
+ pass
150
+
151
+ # Extract decision from rating text (e.g., "3: reject, not good enough")
152
+ decision = None
153
+ if rating is not None:
154
+ # Look for decision in rating section
155
+ rating_section_match = re.search(r'##\s*Rating\s*\n\n(.*?)(?=\n##|$)', review_text, re.IGNORECASE | re.DOTALL)
156
+ if rating_section_match:
157
+ rating_content = rating_section_match.group(1)
158
+ # Extract decision from text like "3: reject, not good enough"
159
+ decision_match = re.search(r':\s*(accept|reject|undecided)', rating_content, re.IGNORECASE)
160
+ if decision_match:
161
+ decision = decision_match.group(1).lower()
162
+
163
+ # Also try Score section
164
+ if decision is None:
165
+ score_section_match = re.search(r'##\s*Score\s*:\s*\d+\s*:\s*(.*?)(?=\n##|$)', review_text, re.IGNORECASE | re.DOTALL)
166
+ if score_section_match:
167
+ score_content = score_section_match.group(1)
168
+ decision_match = re.search(r'(accept|reject|undecided)', score_content, re.IGNORECASE)
169
+ if decision_match:
170
+ decision = decision_match.group(1).lower()
171
+
172
+ # Extract soundness, presentation from parsed data or markdown
173
+ soundness = parsed.get('soundness')
174
+ presentation = parsed.get('presentation')
175
+ contribution = parsed.get('contribution')
176
+
177
+ # Create meta_review dict
178
+ meta_review = {
179
+ "rating": rating,
180
+ "soundness": soundness,
181
+ "presentation": presentation,
182
+ "contribution": contribution,
183
+ "confidence": confidence,
184
+ "decision": decision,
185
+ }
186
+
187
+ # Return the review text as-is (it's already in markdown format)
188
+ return review_text, meta_review
189
+
190
+
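The Rating-section regex used above can be exercised on a tiny made-up snippet; the pattern below is copied from `convert_cyclereviewer`, and the review text is invented for illustration.

```python
import re

# Toy CycleReviewer-style markdown snippet (invented for this sketch).
text = "## Rating\n\n3: reject, not good enough\n\n## Confidence\n\n4: You are confident"

m = re.search(r'##\s*Rating\s*\n\n\s*(\d+\.?\d*)\s*:', text,
              re.IGNORECASE | re.MULTILINE)
rating = float(m.group(1)) if m else None
# -> 3.0
```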
191
+ class ReviewProcessor:
192
+ """Handles the extraction and processing of reviews from different sources."""
193
+
194
+ @staticmethod
195
+ def extract_review_content(pred_context):
196
+ """
197
+ Extract the review content from the prediction context.
198
+
199
+ Args:
200
+ pred_context: Raw prediction data that contains the review
201
+
202
+ Returns:
203
+ str: Extracted review content
204
+ """
205
+ try:
206
+ # First attempt to extract from boxed format
207
+ return pred_context.split(r'\boxed_review{')[-1].split('\n}')[0]
208
+ except Exception:
209
+ # Alternative extraction if the first method fails
210
+ if isinstance(pred_context, dict) and 'output' in pred_context:
211
+ return pred_context['output'].split(r'\boxed_review{')[-1].split('\n}')[0]
212
+ else:
213
+ # Return as is if extraction fails
214
+ return pred_context
215
+
216
+
217
+ # ============================================================================
218
+ # Semantic Evaluation Functions (from 2_evaluate_direct.py)
219
+ # ============================================================================
220
+
221
+ def load_prompt_template(yaml_path: str) -> str:
222
+ """Load the evaluator prompt from YAML file."""
223
+ with open(yaml_path, 'r', encoding='utf-8') as f:
224
+ prompts = yaml.safe_load(f)
225
+ return prompts.get('v1_evaluator_prompt', '')
226
+
227
+
228
+ def build_evaluation_prompt(
229
+ rubrics: List[Dict[str, Any]],
230
+ paper_content: str,
231
+ review: str,
232
+ prompt_template: str
233
+ ) -> str:
234
+ """Build the evaluation prompt by replacing placeholders."""
235
+ rubrics_json = json.dumps(rubrics, indent=4, ensure_ascii=False)
236
+ prompt = prompt_template.replace('{rubrics_json}', rubrics_json)
237
+ prompt = prompt.replace('<<paper_content>>', paper_content)
238
+ prompt = prompt.replace('<<review>>', review)
239
+ return prompt
240
+
241
+
242
+ def calculate_weighted_scores(
243
+ raw_scores: Dict[str, Dict[str, Any]],
244
+ rubrics: List[Dict[str, Any]]
245
+ ) -> Dict[str, float]:
246
+ """Calculate weighted scores for each rubric."""
247
+ rubric_weights = {r['title']: r['weight'] for r in rubrics}
248
+ weighted_scores = {}
249
+
250
+ for rubric_title, rubric_data in raw_scores.items():
251
+ if rubric_title not in rubric_weights:
252
+ continue
253
+
254
+ rubric_score = rubric_data.get('score', 0)
255
+ if isinstance(rubric_score, str):
256
+ try:
257
+ rubric_score = int(rubric_score)
258
+ except ValueError:
259
+ rubric_score = 0
260
+
261
+ if rubric_score not in [0, 1]:
262
+ rubric_score = 1 if rubric_score > 0 else 0
263
+
264
+ weight = rubric_weights[rubric_title]
265
+ weighted_scores[rubric_title] = rubric_score * weight
266
+
267
+ return weighted_scores
268
+
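A tiny worked example of the weighting logic above (rubric titles and weights are invented; note that rubrics absent from the weight map are skipped, exactly as in `calculate_weighted_scores`):

```python
# Binary rubric scores multiplied by per-rubric weights; unknown titles dropped.
rubrics = [{'title': 'Clarity', 'weight': 2.0}, {'title': 'Novelty', 'weight': 1.0}]
raw = {'Clarity': {'score': 1}, 'Novelty': {'score': 0}, 'Unknown': {'score': 1}}

weights = {r['title']: r['weight'] for r in rubrics}
weighted = {t: d['score'] * weights[t] for t, d in raw.items() if t in weights}
total = sum(weighted.values())
# weighted == {'Clarity': 2.0, 'Novelty': 0.0}; total == 2.0
```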
269
+
270
+ def calculate_scores(raw_scores: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
271
+ """Calculate scores for each rubric."""
272
+ scores = {}
273
+ for rubric_title, rubric_data in raw_scores.items():
274
+ scores[rubric_title] = rubric_data.get('score', 0)
275
+ return scores
276
+
277
+
278
+ def evaluate_review_semantic(
279
+ entry: Dict[str, Any],
280
+ paper_content: str,
281
+ prompt_template: str,
282
+ llm_service: LLMService
283
+ ) -> Dict[str, Any]:
284
+ """Evaluate a single review using article-specific rubrics."""
285
+ entry_id = entry.get('id', 'unknown')
286
+ rubrics = entry.get('rubrics', [])
287
+ model_review = entry.get('model_review', '')
288
+
289
+ if not rubrics:
290
+ return {
291
+ 'id': entry_id,
292
+ 'raw_scores': {},
293
+ 'weighted_scores': {},
294
+ 'total_score': 0.0,
295
+ 'error': 'No valid rubrics found',
296
+ 'raw_response': ''
297
+ }
298
+
299
+ # Build prompt
300
+ prompt = build_evaluation_prompt(rubrics, paper_content, model_review, prompt_template)
301
+
302
+ # Call LLM
303
+ try:
304
+ messages = [{"role": "user", "content": prompt}]
305
+ response = llm_service.generate(messages=messages)
306
+
307
+ # Parse response
308
+ raw_scores = parse_llm_response(response)
309
+ weighted_scores = calculate_scores(raw_scores)
310
+ total_score = sum(weighted_scores.values())
311
+
312
+ return {
313
+ 'id': entry_id,
314
+ 'raw_scores': raw_scores,
315
+ 'weighted_scores': weighted_scores,
316
+ 'total_score': total_score,
317
+ 'raw_response': response
318
+ }
319
+ except Exception as e:
320
+ print(f"[ERROR] Error evaluating review {entry_id}: {e}")
321
+ return {
322
+ 'id': entry_id,
323
+ 'raw_scores': {},
324
+ 'weighted_scores': {},
325
+ 'total_score': 0.0,
326
+ 'error': str(e),
327
+ 'raw_response': ''
328
+ }
329
+
330
+
331
+ def calculate_per_rubric_statistics(
332
+ valid_results: List[Dict[str, Any]],
333
+ rubric_titles: List[str]
334
+ ) -> Dict[str, Dict[str, float]]:
335
+ """Calculate per-rubric statistics from evaluation results."""
336
+ rubric_scores = {title: [] for title in rubric_titles}
337
+
338
+ for result in valid_results:
339
+ weighted_scores = result.get('weighted_scores', {})
340
+ if not isinstance(weighted_scores, dict):
341
+ continue
342
+
343
+ for rubric_title in rubric_titles:
344
+ if rubric_title in weighted_scores:
345
+ score = weighted_scores[rubric_title]
346
+ if isinstance(score, str):
347
+ try:
348
+ score = float(score)
349
+ except ValueError:
350
+ continue
351
+ elif isinstance(score, (int, float)):
352
+ score = float(score)
353
+ else:
354
+ continue
355
+ rubric_scores[rubric_title].append(score)
356
+
357
+ per_rubric_stats = {}
358
+ for rubric_title in rubric_titles:
359
+ scores = rubric_scores[rubric_title]
360
+ if not scores:
361
+ continue
362
+
363
+ mean_score = sum(scores) / len(scores)
364
+ min_score = min(scores)
365
+ max_score = max(scores)
366
+ count = len(scores)
367
+
368
+ if rubric_title == "False or Contradictory Claims":
369
+ pass_count = sum(1 for s in scores if s >= 0)
370
+ else:
371
+ pass_count = sum(1 for s in scores if s >= 1)
372
+ pass_rate = pass_count / count if count > 0 else 0.0
373
+
374
+ per_rubric_stats[rubric_title] = {
375
+ 'mean': mean_score,
376
+ 'min': min_score,
377
+ 'max': max_score,
378
+ 'count': count,
379
+ 'pass_rate': pass_rate
380
+ }
381
+
382
+ return per_rubric_stats
383
+
384
+
385
+ # ============================================================================
386
+ # Auto-Metric Evaluation Functions (from 3_rule_evaluate.py)
387
+ # ============================================================================
388
+
389
+ def extract_scores_from_review(review_text: str) -> Dict[str, Any]:
390
+ """Extract numeric scores and decision from a review markdown text."""
391
+ if not review_text:
392
+ return {'soundness': None, 'presentation': None, 'rating': None, 'confidence': None, 'decision': None}
393
+
394
+ try:
395
+ parsed = parse_review_markdown(review_text)
396
+ decision = parsed.get('decision', '')
397
+ if decision:
398
+ decision_lower = decision.lower().strip()
399
+ if 'accept' in decision_lower:
400
+ decision = 'accept'
401
+ elif 'reject' in decision_lower:
402
+ decision = 'reject'
403
+ elif 'undecided' in decision_lower:
404
+ decision = 'undecided'
405
+ else:
406
+ decision = decision_lower
407
+ else:
408
+ decision = None
409
+
410
+ return {
411
+ 'soundness': parsed.get('soundness'),
412
+ 'presentation': parsed.get('presentation'),
413
+ 'rating': parsed.get('rating'),
414
+ 'confidence': parsed.get('confidence'),
415
+ 'decision': decision
416
+ }
417
+ except Exception as e:
418
+ print(f"Warning: Failed to parse review text: {e}")
419
+ return {'soundness': None, 'presentation': None, 'rating': None, 'confidence': None, 'decision': None}
420
+
421
+
422
+ def calculate_mse(predicted: float, ground_truth: float) -> Optional[float]:
423
+ """Calculate Mean Squared Error for a single value."""
424
+ if predicted is None or ground_truth is None:
425
+ return None
426
+ return (predicted - ground_truth) ** 2
427
+
428
+
429
+ def calculate_mae(predicted: float, ground_truth: float) -> Optional[float]:
430
+ """Calculate Mean Absolute Error for a single value."""
431
+ if predicted is None or ground_truth is None:
432
+ return None
433
+ return abs(predicted - ground_truth)
434
+
435
+
436
+ def normalize_to_discrete_scale(score: Optional[float], scale_type: str) -> Optional[float]:
437
+ """
438
+ Normalize a float score to the nearest discrete value based on scale type.
439
+ Uses round-half-up tie-breaking (e.g., 3.5 rounds to 4, 1.5 rounds to 2).
440
+
441
+ Args:
442
+ score: The float score to normalize (can be None)
443
+ scale_type: Either '0-5' for 0-5 scale (discrete: 0,1,2,3,4,5)
444
+ or '0-10' for 0-10 scale (discrete: 0,2,4,6,8,10)
445
+
446
+ Returns:
447
+ Normalized discrete score, or None if input is None
448
+ """
449
+ if score is None:
450
+ return None
451
+
452
+ try:
453
+ score = float(score)
454
+ except (ValueError, TypeError):
455
+ return None
456
+
457
+ if scale_type == '0-5':
458
+ # Discrete values: 0, 1, 2, 3, 4, 5
459
+ discrete_values = [0, 1, 2, 3, 4, 5]
460
+ # Clamp to valid range
461
+ score = max(0, min(5, score))
462
+ # Find nearest discrete value, with round-half-up tie-breaking
463
+ # For ties, prefer the higher value
464
+ best_value = None
465
+ best_distance = float('inf')
466
+ for val in discrete_values:
467
+ distance = abs(val - score)
468
+ if distance < best_distance:
469
+ best_distance = distance
470
+ best_value = val
471
+ elif distance == best_distance and val > best_value:
472
+ # Tie-breaking: prefer higher value (round-half-up)
473
+ best_value = val
474
+ return best_value
475
+ elif scale_type == '0-10':
476
+ # Discrete values: 0, 2, 4, 6, 8, 10
477
+ discrete_values = [0, 2, 4, 6, 8, 10]
478
+ # Clamp to valid range
479
+ score = max(0, min(10, score))
480
+ # Find nearest discrete value, with round-half-up tie-breaking
481
+ best_value = None
482
+ best_distance = float('inf')
483
+ for val in discrete_values:
484
+ distance = abs(val - score)
485
+ if distance < best_distance:
486
+ best_distance = distance
487
+ best_value = val
488
+ elif distance == best_distance and val > best_value:
489
+ # Tie-breaking: prefer higher value (round-half-up)
490
+ best_value = val
491
+ return best_value
492
+ else:
493
+ raise ValueError(f"Unknown scale_type: {scale_type}. Must be '0-5' or '0-10'")
494
+
495
+
496
+ def normalize_scores_dict(scores: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
497
+ """
498
+ Normalize all scores in a dictionary to their appropriate discrete scales.
499
+
500
+ Args:
501
+ scores: Dictionary with keys 'soundness', 'presentation', 'rating', 'confidence'
502
+
503
+ Returns:
504
+ Dictionary with normalized scores
505
+ """
506
+ normalized = {}
507
+
508
+ # soundness, presentation, confidence use 0-5 scale
509
+ for key in ['soundness', 'presentation', 'confidence']:
510
+ normalized[key] = normalize_to_discrete_scale(scores.get(key), '0-5')
511
+
512
+ # rating uses 0-10 scale
513
+ normalized['rating'] = normalize_to_discrete_scale(scores.get('rating'), '0-10')
514
+
515
+ return normalized
516
+
517
+
518
+ def calculate_score_metrics(
519
+ model_scores: Dict[str, float],
520
+ ground_truth_scores: Dict[str, float],
521
+ normalize: bool = False
522
+ ) -> Dict[str, Any]:
523
+ """
524
+ Calculate MSE and MAE metrics for each scoring dimension.
525
+
526
+ Args:
527
+ model_scores: Dictionary with model scores
528
+ ground_truth_scores: Dictionary with ground truth scores
529
+ normalize: If True, normalize scores to discrete scales before computing metrics
530
+
531
+ Returns:
532
+ Dictionary with MSE, MAE metrics and optionally normalized scores
533
+ """
534
+ dimensions = ['soundness', 'presentation', 'rating', 'confidence']
535
+
536
+ # Normalize scores to discrete scales if requested
537
+ if normalize:
538
+ model_scores_normalized = normalize_scores_dict(model_scores)
539
+ gt_scores_normalized = normalize_scores_dict(ground_truth_scores)
540
+ else:
541
+ model_scores_normalized = model_scores
542
+ gt_scores_normalized = ground_truth_scores
543
+
544
+ mse_values = {}
545
+ mae_values = {}
546
+ valid_count = 0
547
+
548
+ for dim in dimensions:
549
+ # Use normalized scores for metric calculation
550
+ mse = calculate_mse(model_scores_normalized.get(dim), gt_scores_normalized.get(dim))
551
+ mae = calculate_mae(model_scores_normalized.get(dim), gt_scores_normalized.get(dim))
552
+ mse_values[f'{dim}_mse'] = mse
553
+ mae_values[f'{dim}_mae'] = mae
554
+ if mse is not None:
555
+ valid_count += 1
556
+
557
+ overall_error = sum([v for v in mse_values.values() if v is not None])
558
+
559
+ result = {
560
+ **mse_values,
561
+ **mae_values,
562
+ 'overall_error': overall_error if valid_count > 0 else None,
563
+ 'valid_dimensions': valid_count
564
+ }
565
+
566
+ # Include normalized scores in result for transparency (only if normalize=True)
567
+ if normalize:
568
+ result['model_scores_normalized'] = model_scores_normalized
569
+ result['gt_scores_normalized'] = gt_scores_normalized
570
+
571
+ return result
572
+
573
+
574
+ def normalize_score_value(value):
575
+ """Normalize score value to float, handling string representations."""
576
+ if value is None:
577
+ return None
578
+ if isinstance(value, (int, float)):
579
+ return float(value)
580
+ if isinstance(value, str):
581
+ # Try to extract numeric value from string (e.g., "2.75" -> 2.75)
582
+ try:
583
+ import re
584
+ match = re.search(r'(\d+\.?\d*)', value)
585
+ if match:
586
+ return float(match.group(1))
587
+ except Exception:
588
+ pass
589
+ return None
590
+
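The string-to-float coercion above tolerates values like `"2.75"` or `"3: reject"`. A self-contained behavior sketch (reimplemented here rather than imported; inputs are invented):

```python
import re

# Coerce numeric-ish values to float; pull the first number out of strings.
def coerce(value):
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        m = re.search(r'(\d+\.?\d*)', value)
        return float(m.group(1)) if m else None
    return None
```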
591
+
592
+ def normalize_decision(decision):
593
+ """Normalize decision string to standard format."""
594
+ if decision is None:
595
+ return None
596
+ decision_lower = str(decision).lower().strip()
597
+ if 'accept' in decision_lower:
598
+ return 'accept'
599
+ elif 'reject' in decision_lower:
600
+ return 'reject'
601
+ elif 'undecided' in decision_lower:
602
+ return 'undecided'
603
+ else:
604
+ return decision_lower
605
+
+
+ def extract_scores_from_dict(scores_dict: Dict[str, Any]) -> Dict[str, Any]:
+     """
+     Extract scores from a structured dictionary (scores or initial_scores format).
+
+     Args:
+         scores_dict: Dict containing scores (e.g., {'rating': 5.75, 'soundness': '2.75', ...})
+
+     Returns:
+         Dict with normalized scores: {'soundness', 'presentation', 'rating', 'confidence', 'decision'}
+     """
+     if not scores_dict:
+         return {
+             'soundness': None,
+             'presentation': None,
+             'rating': None,
+             'confidence': None,
+             'decision': None
+         }
+
+     return {
+         'soundness': normalize_score_value(scores_dict.get('soundness')),
+         'presentation': normalize_score_value(scores_dict.get('presentation')),
+         'rating': normalize_score_value(scores_dict.get('rating')),
+         'confidence': normalize_score_value(scores_dict.get('confidence')),
+         'decision': normalize_decision(scores_dict.get('decision'))
+     }
+
+
+ def evaluate_review_auto_metric(entry: Dict[str, Any], use_initial_scores: bool = False, strict_mode: bool = False) -> Dict[str, Any]:
+     """
+     Evaluate a single entry by extracting scores and calculating metrics.
+
+     Args:
+         entry: Evaluation entry containing model_review, scores, initial_scores, etc.
+         use_initial_scores: If True, use initial_scores instead of refined scores (refined format only)
+         strict_mode: If True, normalize scores before computing MSE/MAE
+
+     Returns:
+         Dict containing evaluation metrics
+     """
+     entry_id = entry.get('id', 'unknown')
+     model_review = entry.get('model_review', '')
+     format_type = entry.get('format', 'unknown')
+
+     # Extract scores based on format
+     model_scores = {}
+     model_decision = None
+
+     if format_type == 'refined' and not use_initial_scores:
+         # Use refined scores from structured data
+         scores_dict = entry.get('scores', {})
+         model_data = extract_scores_from_dict(scores_dict)
+         model_scores = {
+             'soundness': model_data.get('soundness'),
+             'presentation': model_data.get('presentation'),
+             'rating': model_data.get('rating'),
+             'confidence': model_data.get('confidence')
+         }
+         model_decision = model_data.get('decision')
+     elif format_type == 'refined' and use_initial_scores:
+         # Use initial scores from structured data
+         initial_scores_dict = entry.get('initial_scores', {})
+         model_data = extract_scores_from_dict(initial_scores_dict)
+         model_scores = {
+             'soundness': model_data.get('soundness'),
+             'presentation': model_data.get('presentation'),
+             'rating': model_data.get('rating'),
+             'confidence': model_data.get('confidence')
+         }
+         model_decision = model_data.get('decision')
+     elif format_type == 'original':
+         # Use initial scores from structured data
+         initial_scores_dict = entry.get('initial_scores', {})
+         model_data = extract_scores_from_dict(initial_scores_dict)
+         model_scores = {
+             'soundness': model_data.get('soundness'),
+             'presentation': model_data.get('presentation'),
+             'rating': model_data.get('rating'),
+             'confidence': model_data.get('confidence')
+         }
+         model_decision = model_data.get('decision')
+
+         # Fallback: if confidence is missing from the structured data, try to extract it from the
+         # review text (meta_review may lack a confidence field, but the review text might have one)
+         if model_scores.get('confidence') is None and model_review:
+             try:
+                 review_data = extract_scores_from_review(model_review)
+                 if review_data.get('confidence') is not None:
+                     model_scores['confidence'] = review_data.get('confidence')
+             except Exception:
+                 pass  # Keep confidence as None if extraction fails
+     else:
+         # Fallback: extract everything from the markdown review text
+         model_data = extract_scores_from_review(model_review)
+         model_scores = {
+             'soundness': model_data.get('soundness'),
+             'presentation': model_data.get('presentation'),
+             'rating': model_data.get('rating'),
+             'confidence': model_data.get('confidence')
+         }
+         model_decision = model_data.get('decision')
+
+     # Ground truth scores must come from golden_review ONLY, never from model output.
+     # If extraction fails, the fields stay None (no fallback to model_review).
+     ground_truth_review = entry.get('golden_review', '')
+     ground_truth_scores = {}
+     gt_decision = None
+
+     if not ground_truth_review:
+         print(f"Warning: No golden_review found for entry {entry_id}. Ground truth scores will be empty.")
+     else:
+         try:
+             # Extract scores from the golden_review markdown text
+             gt_data = extract_scores_from_review(ground_truth_review)
+             if not gt_data:
+                 print(f"Warning: Failed to parse golden_review for entry {entry_id}. Ground truth scores will be empty.")
+             else:
+                 ground_truth_scores = {
+                     'soundness': gt_data.get('soundness'),
+                     'presentation': gt_data.get('presentation'),
+                     'rating': gt_data.get('rating'),
+                     'confidence': gt_data.get('confidence')
+                 }
+                 gt_decision = normalize_decision(gt_data.get('decision'))
+                 # Note: any field that is None stays None. Using model output as
+                 # ground truth would inflate evaluation scores.
+         except Exception as e:
+             print(f"Warning: Failed to extract scores from golden_review for {entry_id}: {e}")
+             print(f"  Ground truth scores will be empty. Error: {str(e)}")
+
+     # Calculate MSE and MAE metrics (with optional normalization in strict mode)
+     score_metrics = calculate_score_metrics(model_scores, ground_truth_scores, normalize=strict_mode)
+
+     # Calculate decision accuracy
+     decision_match = False
+     decision_accuracy = None
+     if model_decision is not None and gt_decision is not None:
+         model_decision_normalized = normalize_decision(model_decision)
+         decision_match = (model_decision_normalized == gt_decision)
+         decision_accuracy = 1.0 if decision_match else 0.0
+
+     result = {
+         'id': entry_id,
+         'format': format_type,
+         'model_soundness': model_scores.get('soundness'),
+         'model_presentation': model_scores.get('presentation'),
+         'model_rating': model_scores.get('rating'),
+         'model_confidence': model_scores.get('confidence'),
+         'model_decision': model_decision,
+         'gt_soundness': ground_truth_scores.get('soundness'),
+         'gt_presentation': ground_truth_scores.get('presentation'),
+         'gt_rating': ground_truth_scores.get('rating'),
+         'gt_confidence': ground_truth_scores.get('confidence'),
+         'gt_decision': gt_decision,
+         'decision_match': decision_match,
+         'decision_accuracy': decision_accuracy,
+         **score_metrics
+     }
+
+     # Record which scores were used
+     if format_type == 'refined':
+         result['score_type'] = 'initial' if use_initial_scores else 'refined'
+     else:
+         result['score_type'] = 'auto'
+
+     return result
+
+
+ def calculate_pairwise_accuracies(paper_scores: List[Dict[str, float]]) -> Dict[str, float]:
+     """Calculate pairwise accuracy for each metric by comparing rankings."""
+     if len(paper_scores) < 2:
+         return {}
+
+     total_valid_pairs = {'rating': 0, 'soundness': 0, 'presentation': 0, 'confidence': 0}
+     correct_pairs = {'rating': 0, 'soundness': 0, 'presentation': 0, 'confidence': 0}
+
+     for paper1, paper2 in combinations(paper_scores, 2):
+         # Check rating ranking
+         if (paper1.get('true_rating') is not None and paper2.get('true_rating') is not None and
+                 paper1.get('pred_rating') is not None and paper2.get('pred_rating') is not None):
+             total_valid_pairs['rating'] += 1
+             true_order = paper1['true_rating'] > paper2['true_rating']
+             pred_order = paper1['pred_rating'] > paper2['pred_rating']
+             if true_order == pred_order:
+                 correct_pairs['rating'] += 1
+
+         # Same logic for the remaining dimensions
+         for metric in ['soundness', 'presentation', 'confidence']:
+             true_key = f'true_{metric}'
+             pred_key = f'pred_{metric}'
+             if (paper1.get(true_key) is not None and paper2.get(true_key) is not None and
+                     paper1.get(pred_key) is not None and paper2.get(pred_key) is not None):
+                 total_valid_pairs[metric] += 1
+                 true_order = paper1[true_key] > paper2[true_key]
+                 pred_order = paper1[pred_key] > paper2[pred_key]
+                 if true_order == pred_order:
+                     correct_pairs[metric] += 1
+
+     pairwise_accuracies = {
+         metric: correct_pairs[metric] / total_valid_pairs[metric] if total_valid_pairs[metric] > 0 else 0.0
+         for metric in ['rating', 'soundness', 'presentation', 'confidence']
+     }
+
+     return pairwise_accuracies
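The ranking comparison above reduces to one reusable helper per metric. Here is a self-contained sketch with hypothetical paper dicts carrying the `true_`/`pred_` keys the function expects:

```python
# Sketch of pairwise ranking accuracy over illustrative paper score dicts.
from itertools import combinations

def pairwise_accuracy(papers, key):
    """Fraction of paper pairs whose predicted ordering matches the true ordering."""
    total = correct = 0
    for a, b in combinations(papers, 2):
        ta, tb = a.get(f'true_{key}'), b.get(f'true_{key}')
        pa, pb = a.get(f'pred_{key}'), b.get(f'pred_{key}')
        if None in (ta, tb, pa, pb):
            continue  # skip pairs with missing scores
        total += 1
        if (ta > tb) == (pa > pb):
            correct += 1
    return correct / total if total else 0.0

papers = [
    {'true_rating': 6.0, 'pred_rating': 5.0},
    {'true_rating': 4.0, 'pred_rating': 6.0},
    {'true_rating': 8.0, 'pred_rating': 7.0},
]
print(pairwise_accuracy(papers, 'rating'))  # -> 0.6666666666666666
```

Note that ties are not treated specially: a tie in both the true and predicted values makes both `>` comparisons False and therefore counts as a correct pair, matching the comparison in the function above.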
+
+
+ # ============================================================================
+ # Data Loading Functions
+ # ============================================================================
+
+ def load_rubrics_json(rubrics_path: str) -> Dict[str, Dict[str, Any]]:
+     """Load the rubrics JSON and build a lookup keyed by paper id."""
+     with open(rubrics_path, 'r', encoding='utf-8') as f:
+         data = json.load(f)
+
+     if isinstance(data, list):
+         return {item['id']: item for item in data}
+     elif isinstance(data, dict):
+         return data
+     else:
+         raise ValueError(f"Invalid rubrics JSON format: expected list or dict, got {type(data)}")
+
+
+ def load_model_reviews_json(reviews_path: str, format_override: Optional[str] = None) -> Dict[str, Dict[str, Any]]:
+     """
+     Load model reviews JSON and extract reviews by id.
+
+     Supports two input formats:
+     1. Refined format: contains 'scores' and 'initial_scores' fields (from the refinement pipeline)
+     2. Original format: contains 'model_prediction' with 'meta_review' and 'decision' (like ours.json)
+
+     Args:
+         reviews_path: Path to JSON file containing model reviews
+         format_override: Optional format override ('refined', 'original', or None for auto-detect)
+
+     Returns:
+         Dict mapping paper_id to a dict containing:
+         - 'review': review text (markdown)
+         - 'scores': refined scores dict (if available)
+         - 'initial_scores': initial scores dict (if available)
+         - 'format': 'refined' or 'original'
+     """
+     with open(reviews_path, 'r', encoding='utf-8') as f:
+         data = json.load(f)
+
+     if isinstance(data, dict):
+         data = list(data.values())
+
+     reviews_dict = {}
+     for item in data:
+         item_id = None
+         review_text = ''
+         scores = None
+         initial_scores = None
+         format_type = None
+
+         # Use the format override if provided, otherwise auto-detect
+         if format_override and format_override != 'auto':
+             # Force the specified format
+             if format_override == 'refined':
+                 item_id = item.get('paper_id') or item.get('id')
+                 if not item_id:
+                     continue
+                 format_type = 'refined'
+                 review_text = item.get('review_markdown', '') or item.get('review', '')
+                 scores = item.get('scores', {})
+                 initial_scores = item.get('initial_scores', {})
+             elif format_override == 'original':
+                 item_id = item.get('id')
+                 if not item_id:
+                     continue
+                 format_type = 'original'
+                 model_prediction = item.get('model_prediction', {})
+                 meta_review = model_prediction.get('meta_review', {})
+                 review_text = meta_review.get('content', '') or model_prediction.get('raw_text', '')
+                 initial_scores = {
+                     'rating': meta_review.get('rating'),
+                     'soundness': meta_review.get('soundness'),
+                     'presentation': meta_review.get('presentation'),
+                     'contribution': meta_review.get('contribution'),
+                     'decision': model_prediction.get('decision'),
+                 }
+             else:
+                 raise ValueError(f"Unknown format_override: {format_override}. Must be 'refined', 'original', or 'auto'")
+         else:
+             # Auto-detect format
+             if "paper_id" in item:
+                 # Refined format (from the refinement pipeline)
+                 item_id = item.get('paper_id')
+                 if not item_id:
+                     continue
+
+                 # Check whether this is the refined format (has scores and initial_scores)
+                 if 'scores' in item and 'initial_scores' in item:
+                     format_type = 'refined'
+                     review_text = item.get('review_markdown', '') or item.get('review', '')
+                     scores = item.get('scores', {})
+                     initial_scores = item.get('initial_scores', {})
+                 else:
+                     # Standard format with paper_id
+                     format_type = 'standard'
+                     review_text = item.get('review_markdown', '') or item.get('review', '')
+             elif "model_prediction" in item:
+                 # Original format (like ours.json) or cyclereviewer format
+                 item_id = item.get('id')
+                 if not item_id:
+                     continue
+
+                 format_type = 'original'
+                 model_prediction = item.get('model_prediction', {})
+                 meta_review = model_prediction.get('meta_review', {})
+
+                 # Extract review content (prefer meta_review.content, fall back to raw_text)
+                 review_text = meta_review.get('content', '') or model_prediction.get('raw_text', '')
+
+                 # Detect cyclereviewer format: raw_text is a markdown string containing
+                 # "## Rating" or "## Score:" patterns
+                 is_cyclereviewer = False
+                 if isinstance(review_text, str) and review_text:
+                     if (re.search(r'##\s*(Rating|Score)\s*:', review_text, re.IGNORECASE) or
+                             re.search(r'##\s*Rating\s*\n\n\s*\d+\s*:', review_text, re.IGNORECASE | re.MULTILINE)):
+                         is_cyclereviewer = True
+
+                 # Handle cyclereviewer format
+                 if is_cyclereviewer:
+                     review_text, meta_review = convert_cyclereviewer(review_text)
+
+                 # Extract initial scores.
+                 # Use meta_review as the primary source (from convert_cyclereviewer or the original
+                 # meta_review), falling back to model_prediction.get('decision') if not in meta_review.
+                 initial_scores = {
+                     'rating': meta_review.get('rating'),
+                     'soundness': meta_review.get('soundness'),
+                     'presentation': meta_review.get('presentation'),
+                     'contribution': meta_review.get('contribution'),
+                     'confidence': meta_review.get('confidence'),
+                     'decision': meta_review.get('decision') or model_prediction.get('decision'),
+                 }
+             else:
+                 # Legacy format (pred_fast_mode)
+                 item_id = item.get('id')
+                 if not item_id:
+                     continue
+
+                 format_type = 'legacy'
+                 review_dict = item.get('pred_fast_mode', {})
+                 if isinstance(review_dict, dict):
+                     review_text = review_dict
+                 else:
+                     review_text = str(review_dict)
+
+         # Extract review content from the review text field
+         try:
+             if review_text:
+                 extracted_review = ReviewProcessor.extract_review_content(review_text)
+             else:
+                 extracted_review = ''
+
+             reviews_dict[item_id] = {
+                 'review': extracted_review,
+                 'scores': scores,
+                 'initial_scores': initial_scores,
+                 'format': format_type
+             }
+         except Exception as e:
+             print(f"[WARN] Failed to extract review for {item_id}: {e}")
+             continue
+
+     return reviews_dict
+
+
+ def combine_rubrics_and_reviews(
+     rubrics_data: Dict[str, Dict[str, Any]],
+     reviews_dict: Dict[str, Dict[str, Any]]
+ ) -> List[Dict[str, Any]]:
+     """
+     Combine rubrics and reviews into evaluation entries.
+
+     Args:
+         rubrics_data: Dict mapping paper_id to rubric entry
+         reviews_dict: Dict mapping paper_id to a dict containing 'review', 'scores', 'initial_scores', 'format'
+
+     Returns:
+         List of evaluation entries with model_review, scores, initial_scores, and format info
+     """
+     combined = []
+     missing_reviews = []
+
+     for paper_id, rubric_entry in rubrics_data.items():
+         review_data = reviews_dict.get(paper_id)
+         if not review_data or not review_data.get('review'):
+             missing_reviews.append(paper_id)
+             continue
+
+         entry = {
+             'id': paper_id,
+             'paper_context': rubric_entry.get('paper_context', ''),
+             'decision': rubric_entry.get('decision', ''),
+             'golden_review': rubric_entry.get('golden_review', ''),
+             'rubrics': rubric_entry.get('rubrics', []),
+             'model_review': review_data.get('review', ''),
+             'scores': review_data.get('scores'),  # Refined scores (if available)
+             'initial_scores': review_data.get('initial_scores'),  # Initial scores (if available)
+             'format': review_data.get('format', 'unknown')  # Format type
+         }
+         combined.append(entry)
+
+     if missing_reviews:
+         print(f"[WARN] {len(missing_reviews)} papers have no model review, skipping them")
+
+     return combined
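The join above keeps only papers present in both lookups with a non-empty review. A small self-contained sketch of the same pattern, using illustrative data rather than the real rubric schema:

```python
# Sketch of the rubric/review join: inner join on paper id, skipping empty reviews.
def combine(rubrics, reviews):
    combined, missing = [], []
    for paper_id, rubric in rubrics.items():
        review = reviews.get(paper_id)
        if not review or not review.get('review'):
            missing.append(paper_id)  # no usable model review for this paper
            continue
        combined.append({'id': paper_id, **rubric, 'model_review': review['review']})
    return combined, missing

rubrics = {'p1': {'decision': 'accept'}, 'p2': {'decision': 'reject'}}
reviews = {'p1': {'review': 'Solid work.'}, 'p2': {'review': ''}}
combined, missing = combine(rubrics, reviews)
print([e['id'] for e in combined], missing)  # -> ['p1'] ['p2']
```

Iterating over the rubrics (the ground-truth side) rather than the reviews means extra reviews without rubrics are silently ignored, while rubrics without reviews are counted and reported.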
+
+
+ # ============================================================================
+ # LLM Service Configuration
+ # ============================================================================
+
+ def load_llm_config(config_path: str) -> Dict[str, Any]:
+     """Load the LLM configuration from a YAML file."""
+     with open(config_path, 'r', encoding='utf-8') as f:
+         config = yaml.safe_load(f)
+     return config
+
+
+ def create_llm_service_from_config(config: Dict[str, Any]) -> LLMService:
+     """Create an LLM service from the configuration."""
+     mode = config.get('mode', 'gpt').lower()
+
+     if mode == 'gpt':
+         gpt_config = config.get('gpt', {})
+         api_key = gpt_config.get('api_key') or os.getenv('OPENAI_API_KEY')
+         if not api_key:
+             raise ValueError("GPT mode requires api_key in the config YAML or the OPENAI_API_KEY environment variable")
+
+         service = GPTService(
+             api_key=api_key,
+             model_name=gpt_config.get('model_name', 'gpt-4o'),
+             base_url=gpt_config.get('base_url'),
+             timeout=gpt_config.get('timeout', 300)
+         )
+         return service
+
+     elif mode == 'vllm':
+         vllm_config = config.get('vllm', {})
+         service = VLLMService(
+             base_url=vllm_config.get('base_url', 'http://localhost:8000/v1'),
+             api_key=vllm_config.get('api_key', 'dummy-key'),
+             model_name=vllm_config.get('model_name'),
+             timeout=vllm_config.get('timeout', 300),
+             max_concurrent_requests=vllm_config.get('max_concurrent_requests', 64),
+             max_retries=vllm_config.get('max_retries', 3),
+             retry_delay=vllm_config.get('retry_delay', 1.0),
+             retry_backoff=vllm_config.get('retry_backoff', 2.0)
+         )
+         return service
+
+     else:
+         raise ValueError(f"Unknown mode: {mode}. Must be 'gpt' or 'vllm'")
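The factory above dispatches on the `mode` key of the loaded config. A minimal sketch of that dispatch pattern, with stand-in classes because `GPTService` and `VLLMService` are project code:

```python
# Sketch of mode-based service dispatch; DummyGPT/DummyVLLM are stand-ins.
class DummyGPT:
    def __init__(self, model_name):
        self.model_name = model_name

class DummyVLLM:
    def __init__(self, base_url):
        self.base_url = base_url

def create_service(config):
    mode = config.get('mode', 'gpt').lower()
    if mode == 'gpt':
        return DummyGPT(config.get('gpt', {}).get('model_name', 'gpt-4o'))
    if mode == 'vllm':
        return DummyVLLM(config.get('vllm', {}).get('base_url', 'http://localhost:8000/v1'))
    raise ValueError(f"Unknown mode: {mode}. Must be 'gpt' or 'vllm'")

svc = create_service({'mode': 'vllm', 'vllm': {'base_url': 'http://host:8001/v1'}})
print(type(svc).__name__, svc.base_url)  # -> DummyVLLM http://host:8001/v1
```

Defaulting every `.get()` lookup keeps a partially filled YAML usable, at the cost of silently falling back when a key is misspelled.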
+
+
+ # ============================================================================
+ # Main Evaluation Functions
+ # ============================================================================
+
+ def run_semantic_evaluation(
+     evaluation_data: List[Dict[str, Any]],
+     prompt_template: str,
+     llm_service: LLMService,
+     max_workers: int
+ ) -> tuple:
+     """Run semantic evaluation and return results and summary."""
+     print(f"\n{'='*80}")
+     print("RUNNING SEMANTIC EVALUATION")
+     print(f"{'='*80}")
+     print(f"Evaluating {len(evaluation_data)} reviews using {max_workers} workers...")
+
+     results = []
+     with ThreadPoolExecutor(max_workers=max_workers) as executor:
+         future_to_entry = {
+             executor.submit(
+                 evaluate_review_semantic,
+                 entry,
+                 entry['paper_context'],
+                 prompt_template,
+                 llm_service
+             ): entry
+             for entry in evaluation_data
+         }
+
+         for future in tqdm(as_completed(future_to_entry), total=len(evaluation_data), desc="Semantic evaluation"):
+             try:
+                 result = future.result()
+                 results.append(result)
+             except Exception as e:
+                 entry = future_to_entry[future]
+                 print(f"\n[ERROR] Failed to process entry {entry.get('id', 'unknown')}: {e}")
+                 results.append({
+                     'id': entry.get('id', 'unknown'),
+                     'raw_scores': {},
+                     'weighted_scores': {},
+                     'total_score': 0.0,
+                     'error': str(e),
+                     'raw_response': ''
+                 })
+
+     # Calculate statistics
+     valid_results = [r for r in results if 'error' not in r and r.get('weighted_scores')]
+     review_scores = [r.get('total_score', 0.0) for r in valid_results]
+
+     summary = {
+         'total_entries': len(results),
+         'valid_entries': len(valid_results),
+         'failed_entries': len(results) - len(valid_results)
+     }
+
+     if review_scores:
+         summary['overall_score'] = {
+             'mean': sum(review_scores) / len(review_scores),
+             'min': min(review_scores),
+             'max': max(review_scores)
+         }
+
+     # Calculate per-rubric statistics (rubric titles are taken from the first entry)
+     if evaluation_data and evaluation_data[0].get('rubrics'):
+         rubric_titles = [r['title'] for r in evaluation_data[0]['rubrics']]
+         per_rubric_stats = calculate_per_rubric_statistics(valid_results, rubric_titles)
+         summary['per_rubric_statistics'] = per_rubric_stats
+
+     return results, summary
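The submit/as_completed pattern above, one future per entry with exceptions turned into error records, can be sketched self-contained (the `score_entry` worker here is a stand-in for the LLM call):

```python
# Sketch of the thread-pool fan-out/fan-in with per-task error capture.
from concurrent.futures import ThreadPoolExecutor, as_completed

def score_entry(entry):
    """Stand-in worker: fails for one entry to show the error path."""
    if entry['id'] == 'bad':
        raise RuntimeError("boom")
    return {'id': entry['id'], 'total_score': entry['value'] * 2}

entries = [{'id': 'a', 'value': 1}, {'id': 'bad', 'value': 0}, {'id': 'b', 'value': 3}]
results = []
with ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to its input so failures can be attributed.
    future_to_entry = {executor.submit(score_entry, e): e for e in entries}
    for future in as_completed(future_to_entry):
        try:
            results.append(future.result())
        except Exception as exc:
            results.append({'id': future_to_entry[future]['id'], 'error': str(exc)})

valid = [r for r in results if 'error' not in r]
print(sorted(r['id'] for r in valid))  # -> ['a', 'b']
```

Because `as_completed` yields futures in completion order, results arrive unordered; the `future_to_entry` mapping is what lets a failure be tied back to the entry that caused it, exactly as in the function above.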
+
+
+ def run_auto_metric_evaluation(
+     evaluation_data: List[Dict[str, Any]],
+     strict_mode: bool = False
+ ) -> tuple:
+     """
+     Run auto-metric evaluation and return results and summary.
+
+     For the refined format (has scores and initial_scores), evaluates both:
+     - refined scores
+     - initial scores
+
+     For the original format (only initial_scores), evaluates:
+     - initial scores only
+
+     Returns:
+         Tuple of (results_list, summary_dict)
+         - results_list: list of evaluation results (may contain both refined and initial results for the refined format)
+         - summary_dict: summary statistics
+     """
+     print(f"\n{'='*80}")
+     print("RUNNING AUTO-METRIC EVALUATION")
+     print(f"{'='*80}")
+     print(f"Evaluating {len(evaluation_data)} entries...")
+
+     # Detect format types
+     refined_format_count = sum(1 for e in evaluation_data if e.get('format') == 'refined')
+     original_format_count = sum(1 for e in evaluation_data if e.get('format') == 'original')
+
+     if refined_format_count > 0:
+         print(f"Detected {refined_format_count} entries in refined format (will evaluate both refined and initial scores)")
+     if original_format_count > 0:
+         print(f"Detected {original_format_count} entries in original format (will evaluate initial scores only)")
+
+     results = []
+     for entry in tqdm(evaluation_data, desc="Auto-metric evaluation"):
+         format_type = entry.get('format', 'unknown')
+
+         if format_type == 'refined':
+             # Evaluate both refined scores and initial scores
+             try:
+                 entry_id = entry.get('id', 'unknown')
+
+                 # Refined scores
+                 refined_result = evaluate_review_auto_metric(entry, use_initial_scores=False, strict_mode=strict_mode)
+                 refined_result['paper_id'] = entry_id  # Keep the original paper_id
+                 refined_result['id'] = f"{entry_id}_refined"
+                 results.append(refined_result)
+
+                 # Initial scores
+                 initial_result = evaluate_review_auto_metric(entry, use_initial_scores=True, strict_mode=strict_mode)
+                 initial_result['paper_id'] = entry_id  # Keep the original paper_id
+                 initial_result['id'] = f"{entry_id}_initial"
+                 results.append(initial_result)
+             except Exception as e:
+                 print(f"Error evaluating entry {entry.get('id', 'unknown')}: {e}")
+                 results.append({
+                     'id': entry.get('id', 'unknown'),
+                     'error': str(e)
+                 })
+         else:
+             # Evaluate initial scores only (or extract from markdown)
+             try:
+                 result = evaluate_review_auto_metric(entry, use_initial_scores=False, strict_mode=strict_mode)
+                 results.append(result)
+             except Exception as e:
+                 print(f"Error evaluating entry {entry.get('id', 'unknown')}: {e}")
+                 results.append({
+                     'id': entry.get('id', 'unknown'),
+                     'error': str(e)
+                 })
+
+     # Calculate statistics
+     valid_results = [r for r in results if 'error' not in r]
+     mse_results = [r for r in valid_results if r.get('overall_error') is not None]
+
+     # Separate refined and initial results (refined format only)
+     refined_results = [r for r in valid_results if r.get('score_type') == 'refined']
+     initial_results = [r for r in valid_results if r.get('score_type') == 'initial']
+     auto_results = [r for r in valid_results if r.get('score_type') == 'auto' or r.get('score_type') is None]
+
+     summary = {
+         'total_entries': len(results),
+         'valid_entries': len(valid_results),
+         'mse_entries': len(mse_results),
+         'refined_results_count': len(refined_results),
+         'initial_results_count': len(initial_results),
+         'auto_results_count': len(auto_results)
+     }
+
+     # Calculate MSE/MAE statistics.
+     # For the refined format, use only the refined results for the overall statistics
+     # (to avoid double counting); for other formats, use all results.
+     if refined_format_count > 0:
+         stats_results = [r for r in refined_results if r.get('overall_error') is not None]
+     else:
+         stats_results = mse_results
+
+     if stats_results:
+         dimensions = ['soundness', 'presentation', 'confidence', 'rating']
+         mse_stats = {}
+         mae_stats = {}
+
+         for dim in dimensions:
+             mse_list = [r.get(f'{dim}_mse') for r in stats_results if r.get(f'{dim}_mse') is not None]
+             mae_list = [r.get(f'{dim}_mae') for r in stats_results if r.get(f'{dim}_mae') is not None]
+
+             mse_clean = [x for x in mse_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+             mae_clean = [x for x in mae_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+
+             if mse_clean:
+                 mse_stats[dim] = {
+                     'mean': sum(mse_clean) / len(mse_clean),
+                     'count': len(mse_clean)
+                 }
+             if mae_clean:
+                 mae_stats[dim] = {
+                     'mean': sum(mae_clean) / len(mae_clean),
+                     'count': len(mae_clean)
+                 }
+
+         overall_errors = [r.get('overall_error') for r in stats_results if r.get('overall_error') is not None]
+         overall_clean = [x for x in overall_errors if x is not None and not (isinstance(x, float) and math.isnan(x))]
+
+         if overall_clean:
+             summary['overall_error'] = {
+                 'mean': sum(overall_clean) / len(overall_clean),
+                 'count': len(overall_clean)
+             }
+
+         summary['mse_statistics'] = mse_stats
+         summary['mae_statistics'] = mae_stats
+
+     # Calculate separate statistics for refined and initial results
+     dimensions = ['soundness', 'presentation', 'confidence', 'rating']  # (re)defined so this block works even when no overall stats were computed
+     if refined_results:
+         refined_mse_results = [r for r in refined_results if r.get('overall_error') is not None]
+         if refined_mse_results:
+             refined_mse_stats = {}
+             refined_mae_stats = {}
+             for dim in dimensions:
+                 mse_list = [r.get(f'{dim}_mse') for r in refined_mse_results if r.get(f'{dim}_mse') is not None]
+                 mae_list = [r.get(f'{dim}_mae') for r in refined_mse_results if r.get(f'{dim}_mae') is not None]
+                 mse_clean = [x for x in mse_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+                 mae_clean = [x for x in mae_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+                 if mse_clean:
+                     refined_mse_stats[dim] = {'mean': sum(mse_clean) / len(mse_clean), 'count': len(mse_clean)}
+                 if mae_clean:
+                     refined_mae_stats[dim] = {'mean': sum(mae_clean) / len(mae_clean), 'count': len(mae_clean)}
+             summary['refined_mse_statistics'] = refined_mse_stats
+             summary['refined_mae_statistics'] = refined_mae_stats
+
+     if initial_results:
+         initial_mse_results = [r for r in initial_results if r.get('overall_error') is not None]
+         if initial_mse_results:
+             initial_mse_stats = {}
+             initial_mae_stats = {}
+             for dim in dimensions:
+                 mse_list = [r.get(f'{dim}_mse') for r in initial_mse_results if r.get(f'{dim}_mse') is not None]
+                 mae_list = [r.get(f'{dim}_mae') for r in initial_mse_results if r.get(f'{dim}_mae') is not None]
+                 mse_clean = [x for x in mse_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+                 mae_clean = [x for x in mae_list if x is not None and not (isinstance(x, float) and math.isnan(x))]
+                 if mse_clean:
+                     initial_mse_stats[dim] = {'mean': sum(mse_clean) / len(mse_clean), 'count': len(mse_clean)}
+                 if mae_clean:
+                     initial_mae_stats[dim] = {'mean': sum(mae_clean) / len(mae_clean), 'count': len(mae_clean)}
+             summary['initial_mse_statistics'] = initial_mse_stats
+             summary['initial_mae_statistics'] = initial_mae_stats
+
+     # Helper for the Spearman correlations: drop pairs where either value is missing or NaN
+     def filter_valid_pairs(true_list, pred_list):
+         filtered_true = []
+         filtered_pred = []
+         for t, p in zip(true_list, pred_list):
+             if (t is not None and p is not None and
+                     not (isinstance(t, float) and math.isnan(t)) and
+                     not (isinstance(p, float) and math.isnan(p))):
+                 filtered_true.append(t)
+                 filtered_pred.append(p)
+         return filtered_true, filtered_pred
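The None/NaN filtering in `filter_valid_pairs` matters because a single NaN would otherwise propagate into the correlation. A self-contained sketch of the same filter:

```python
# Sketch of pairwise None/NaN filtering before computing a correlation.
import math

def filter_valid_pairs(true_list, pred_list):
    """Keep only index-aligned pairs where both values are present and finite."""
    filtered_true, filtered_pred = [], []
    for t, p in zip(true_list, pred_list):
        if (t is not None and p is not None and
                not (isinstance(t, float) and math.isnan(t)) and
                not (isinstance(p, float) and math.isnan(p))):
            filtered_true.append(t)
            filtered_pred.append(p)
    return filtered_true, filtered_pred

true_vals = [3.0, None, 4.0, math.nan, 5.0]
pred_vals = [2.5, 3.0, None, 4.0, 4.5]
print(filter_valid_pairs(true_vals, pred_vals))  # -> ([3.0, 5.0], [2.5, 4.5])
```

Filtering the two lists jointly (rather than separately) keeps them index-aligned, which is what a rank correlation over paired observations requires.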
+
+     # Calculate Spearman correlations.
+     # For the refined format, calculate separately for refined and initial scores and use
+     # the refined correlations for the overall statistics; for other formats, use all results.
+     if refined_format_count > 0:
+         # Refined Spearman correlations
+         refined_spearman_stats = {}
+         dimensions = ['soundness', 'presentation', 'confidence', 'rating']
+         for dim in dimensions:
+             true_values = [r.get(f'gt_{dim}') for r in refined_results]
+             pred_values = [r.get(f'model_{dim}') for r in refined_results]
+             true_clean, pred_clean = filter_valid_pairs(true_values, pred_values)
+
+             if len(true_clean) >= 2 and len(pred_clean) >= 2:
+                 try:
+                     corr, _ = spearmanr(true_clean, pred_clean)
+                     if not math.isnan(corr):
+                         refined_spearman_stats[dim] = {
+                             'correlation': corr,
+                             'count': len(true_clean)
+                         }
+                 except Exception:
+                     pass
+
+         # Initial Spearman correlations
+         initial_spearman_stats = {}
+         for dim in dimensions:
+             true_values = [r.get(f'gt_{dim}') for r in initial_results]
+             pred_values = [r.get(f'model_{dim}') for r in initial_results]
+             true_clean, pred_clean = filter_valid_pairs(true_values, pred_values)
+
+             if len(true_clean) >= 2 and len(pred_clean) >= 2:
+                 try:
+                     corr, _ = spearmanr(true_clean, pred_clean)
+                     if not math.isnan(corr):
+                         initial_spearman_stats[dim] = {
+                             'correlation': corr,
+                             'count': len(true_clean)
+                         }
+                 except Exception:
+                     pass
+
+         # Use the refined correlations for the overall statistics (avoid double counting)
+         summary['spearman_correlations'] = refined_spearman_stats
+         summary['refined_spearman_correlations'] = refined_spearman_stats
+         summary['initial_spearman_correlations'] = initial_spearman_stats
+     else:
+         # Original/other formats: use all results
+         correlation_results = valid_results
+         spearman_stats = {}
+         dimensions = ['soundness', 'presentation', 'confidence', 'rating']
+         for dim in dimensions:
+             true_values = [r.get(f'gt_{dim}') for r in correlation_results]
+             pred_values = [r.get(f'model_{dim}') for r in correlation_results]
+             true_clean, pred_clean = filter_valid_pairs(true_values, pred_values)
+
+             if len(true_clean) >= 2 and len(pred_clean) >= 2:
+                 try:
+                     corr, _ = spearmanr(true_clean, pred_clean)
+                     if not math.isnan(corr):
+                         spearman_stats[dim] = {
+                             'correlation': corr,
+                             'count': len(true_clean)
+                         }
+                 except Exception:
+                     pass
+
+         summary['spearman_correlations'] = spearman_stats
+
1392
+     # Calculate Decision metrics
+     # For refined format, calculate separately for refined and initial, and use refined for overall
+     # For other formats, use all results
+     if refined_format_count > 0:
+         # Calculate refined decision metrics
+         refined_decision_results = [r for r in refined_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
+         if refined_decision_results:
+             true_decisions = []
+             pred_decisions = []
+             decision_acc = []
+
+             for r in refined_decision_results:
+                 gt_decision = str(r.get('gt_decision', '')).lower().strip()
+                 pred_decision = str(r.get('model_decision', '')).lower().strip()
+
+                 if 'accept' in pred_decision:
+                     pred_binary = 1
+                 else:
+                     pred_binary = 0
+
+                 if 'accept' in gt_decision:
+                     gt_binary = 1
+                 else:
+                     gt_binary = 0
+
+                 true_decisions.append(gt_binary)
+                 pred_decisions.append(pred_binary)
+
+                 if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
+                     decision_acc.append(1.0)
+                 else:
+                     decision_acc.append(0.0)
+
+             if decision_acc:
+                 decision_accuracy = sum(decision_acc) / len(decision_acc)
+                 try:
+                     _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
+                     refined_decision_metrics = {
+                         'accuracy': decision_accuracy,
+                         'f1_macro': f1_score,
+                         'count': len(decision_acc)
+                     }
+                 except Exception:
+                     refined_decision_metrics = {
+                         'accuracy': decision_accuracy,
+                         'count': len(decision_acc)
+                     }
+                 summary['refined_decision_metrics'] = refined_decision_metrics
+                 summary['decision_metrics'] = refined_decision_metrics  # Use refined for overall
+
+         # Calculate initial decision metrics
+         initial_decision_results = [r for r in initial_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
+         if initial_decision_results:
+             true_decisions = []
+             pred_decisions = []
+             decision_acc = []
+
+             for r in initial_decision_results:
+                 gt_decision = str(r.get('gt_decision', '')).lower().strip()
+                 pred_decision = str(r.get('model_decision', '')).lower().strip()
+
+                 if 'accept' in pred_decision:
+                     pred_binary = 1
+                 else:
+                     pred_binary = 0
+
+                 if 'accept' in gt_decision:
+                     gt_binary = 1
+                 else:
+                     gt_binary = 0
+
+                 true_decisions.append(gt_binary)
+                 pred_decisions.append(pred_binary)
+
+                 if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
+                     decision_acc.append(1.0)
+                 else:
+                     decision_acc.append(0.0)
+
+             if decision_acc:
+                 decision_accuracy = sum(decision_acc) / len(decision_acc)
+                 try:
+                     _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
+                     initial_decision_metrics = {
+                         'accuracy': decision_accuracy,
+                         'f1_macro': f1_score,
+                         'count': len(decision_acc)
+                     }
+                 except Exception:
+                     initial_decision_metrics = {
+                         'accuracy': decision_accuracy,
+                         'count': len(decision_acc)
+                     }
+                 summary['initial_decision_metrics'] = initial_decision_metrics
+     else:
+         # Original/other formats: use all results
+         decision_results = [r for r in valid_results if r.get('gt_decision') is not None and r.get('model_decision') is not None]
+         if decision_results:
+             true_decisions = []
+             pred_decisions = []
+             decision_acc = []
+
+             for r in decision_results:
+                 gt_decision = str(r.get('gt_decision', '')).lower().strip()
+                 pred_decision = str(r.get('model_decision', '')).lower().strip()
+
+                 if 'accept' in pred_decision:
+                     pred_binary = 1
+                 else:
+                     pred_binary = 0
+
+                 if 'accept' in gt_decision:
+                     gt_binary = 1
+                 else:
+                     gt_binary = 0
+
+                 true_decisions.append(gt_binary)
+                 pred_decisions.append(pred_binary)
+
+                 if pred_decision == gt_decision or ('accept' in pred_decision and 'accept' in gt_decision) or ('reject' in pred_decision and 'reject' in gt_decision):
+                     decision_acc.append(1.0)
+                 else:
+                     decision_acc.append(0.0)
+
+             if decision_acc:
+                 decision_accuracy = sum(decision_acc) / len(decision_acc)
+                 try:
+                     _, _, f1_score, _ = precision_recall_fscore_support(true_decisions, pred_decisions, average='macro')
+                     summary['decision_metrics'] = {
+                         'accuracy': decision_accuracy,
+                         'f1_macro': f1_score,
+                         'count': len(decision_acc)
+                     }
+                 except Exception:
+                     summary['decision_metrics'] = {
+                         'accuracy': decision_accuracy,
+                         'count': len(decision_acc)
+                     }
+
+     # Calculate Pairwise comparison
+     # For refined format, only use refined results (avoid double counting)
+     # For other formats, use all results
+     if refined_format_count > 0:
+         pairwise_results = refined_results
+     else:
+         pairwise_results = valid_results
+
+     paper_scores = []
+     for r in pairwise_results:
+         if (r.get('gt_rating') is not None and r.get('model_rating') is not None) or \
+            (r.get('gt_soundness') is not None and r.get('model_soundness') is not None):
+             paper_scores.append({
+                 'true_rating': r.get('gt_rating'),
+                 'pred_rating': r.get('model_rating'),
+                 'true_soundness': r.get('gt_soundness'),
+                 'pred_soundness': r.get('model_soundness'),
+                 'true_presentation': r.get('gt_presentation'),
+                 'pred_presentation': r.get('model_presentation'),
+                 'true_confidence': r.get('gt_confidence'),
+                 'pred_confidence': r.get('model_confidence')
+             })
+
+     if len(paper_scores) >= 2:
+         pairwise_accuracies = calculate_pairwise_accuracies(paper_scores)
+         summary['pairwise_accuracies'] = pairwise_accuracies
+
+     return results, summary
+
+
+ # ============================================================================
+ # Main Function
+ # ============================================================================
+
+ def parse_args():
+     """Parse command line arguments."""
+     parser = argparse.ArgumentParser(description="Unified evaluation script for semantic and auto-metric evaluation")
+
+     # Input paths
+     parser.add_argument("--rubrics_path", type=str, required=True,
+                         help="Path to eval_rubrics.json file (from 1_generate_review_based_rubrics.py)")
+     parser.add_argument("--reviews_path", type=str, required=True,
+                         help="Path to JSON file with model reviews (contains pred_fast_mode)")
+
+     # Evaluation mode
+     parser.add_argument("--mode", type=str, choices=["semantic", "auto_metric", "both"], default="both",
+                         help="Evaluation mode: semantic (LLM-based), auto_metric (rule-based), or both")
+
+     # Output paths
+     parser.add_argument("--semantic_output", type=str, default=None,
+                         help="Path to output JSON file for semantic evaluation results (required if mode is semantic or both)")
+     parser.add_argument("--auto_metric_output", type=str, default=None,
+                         help="Path to output JSON file for auto-metric evaluation results (required if mode is auto_metric or both)")
+
+     # Semantic evaluation settings
+     parser.add_argument("--yaml_path", type=str, default=None,
+                         help="Path to prompts.yaml file (required for semantic evaluation)")
+     parser.add_argument("--config_path", type=str, default=None,
+                         help="Path to configs.yaml file (required for semantic evaluation)")
+
+     # Multi-threading
+     parser.add_argument("--max_workers", type=int, default=None,
+                         help="Maximum number of worker threads for semantic evaluation (default: 5)")
+
+     # Strict mode (normalize scores to discrete scales)
+     parser.add_argument("--strict_mode", action="store_true", default=False,
+                         help="Enable strict mode: normalize scores to discrete scales before computing metrics (default: False)")
+
+     # Input format override
+     parser.add_argument("--input_format", type=str, choices=['auto', 'refined', 'original'], default='auto',
+                         help="Manually specify input JSON format: 'refined' (has scores and initial_scores), 'original' (has model_prediction), or 'auto' for auto-detection (default: 'auto')")
+
+     return parser.parse_args()
+
+
+ def main():
+     """Main execution function."""
+     args = parse_args()
+
+     script_dir = os.path.dirname(os.path.abspath(__file__))
+
+     # Resolve paths
+     rubrics_path = args.rubrics_path
+     if not os.path.isabs(rubrics_path):
+         rubrics_path = os.path.join(script_dir, rubrics_path)
+
+     reviews_path = args.reviews_path
+     if not os.path.isabs(reviews_path):
+         reviews_path = os.path.join(script_dir, reviews_path)
+
+     max_workers = args.max_workers or int(os.getenv("MAX_WORKERS", "5"))
+
+     # Validate mode and output paths
+     if args.mode in ["semantic", "both"]:
+         if not args.semantic_output:
+             raise ValueError("--semantic_output is required when mode is 'semantic' or 'both'")
+         if not args.yaml_path:
+             raise ValueError("--yaml_path is required for semantic evaluation")
+         if not args.config_path:
+             raise ValueError("--config_path is required for semantic evaluation")
+
+     if args.mode in ["auto_metric", "both"]:
+         if not args.auto_metric_output:
+             raise ValueError("--auto_metric_output is required when mode is 'auto_metric' or 'both'")
+
+     # Check if files exist
+     if not os.path.exists(rubrics_path):
+         raise FileNotFoundError(f"Rubrics file not found: {rubrics_path}")
+     if not os.path.exists(reviews_path):
+         raise FileNotFoundError(f"Reviews file not found: {reviews_path}")
+
+     # Load data
+     print(f"Loading rubrics from {rubrics_path}...")
+     rubrics_data = load_rubrics_json(rubrics_path)
+     print(f"Loaded {len(rubrics_data)} rubrics entries")
+
+     print(f"Loading model reviews from {reviews_path}...")
+     if args.input_format != 'auto':
+         print(f"Using manually specified format: {args.input_format}")
+     else:
+         print("Auto-detecting input format...")
+     reviews_dict = load_model_reviews_json(reviews_path, format_override=args.input_format if args.input_format != 'auto' else None)
+     print(f"Loaded {len(reviews_dict)} model reviews")
+
+     # Combine rubrics and reviews
+     print("Combining rubrics and reviews...")
+     evaluation_data = combine_rubrics_and_reviews(rubrics_data, reviews_dict)
+     print(f"Prepared {len(evaluation_data)} entries for evaluation")
+
+     # Run evaluations based on mode
+     if args.mode in ["semantic", "both"]:
+         # Resolve semantic evaluation paths
+         yaml_path = args.yaml_path
+         if not os.path.isabs(yaml_path):
+             yaml_path = os.path.join(script_dir, yaml_path)
+
+         config_path = args.config_path
+         if not os.path.isabs(config_path):
+             config_path = os.path.join(script_dir, config_path)
+
+         if not os.path.exists(yaml_path):
+             raise FileNotFoundError(f"YAML file not found: {yaml_path}")
+         if not os.path.exists(config_path):
+             raise FileNotFoundError(f"Config file not found: {config_path}")
+
+         # Load prompt template
+         print(f"Loading prompt template from {yaml_path}...")
+         prompt_template = load_prompt_template(yaml_path)
+         if not prompt_template:
+             raise ValueError("Could not find 'v1_evaluator_prompt' in YAML file")
+
+         # Initialize LLM service
+         print(f"Loading LLM configuration from {config_path}...")
+         llm_config = load_llm_config(config_path)
+         llm_service = create_llm_service_from_config(llm_config)
+         mode = llm_config.get('mode', 'gpt')
+         print(f"LLM service initialized (mode: {mode})")
+         if hasattr(llm_service, 'model_name'):
+             print(f"Using model: {llm_service.model_name}")
+
+         # Run semantic evaluation
+         semantic_results, semantic_summary = run_semantic_evaluation(
+             evaluation_data, prompt_template, llm_service, max_workers
+         )
+
+         # Save semantic results
+         semantic_output = args.semantic_output
+         if not os.path.isabs(semantic_output):
+             semantic_output = os.path.join(script_dir, semantic_output)
+
+         output_dir = os.path.dirname(semantic_output)
+         os.makedirs(output_dir, exist_ok=True)
+
+         with open(semantic_output, 'w', encoding='utf-8') as f:
+             json.dump(semantic_results, f, ensure_ascii=False, indent=2)
+         print(f"\nSemantic evaluation results saved to {semantic_output}")
+
+         # Save semantic summary
+         semantic_summary_path = semantic_output.replace('.json', '_summary.json')
+         with open(semantic_summary_path, 'w', encoding='utf-8') as f:
+             json.dump(semantic_summary, f, ensure_ascii=False, indent=2)
+         print(f"Semantic evaluation summary saved to {semantic_summary_path}")
+
+         # Print semantic summary
+         print("\n" + "="*80)
+         print("SEMANTIC EVALUATION SUMMARY")
+         print("="*80)
+         print(f"Total entries: {semantic_summary['total_entries']}")
+         print(f"Valid entries: {semantic_summary['valid_entries']}")
+         print(f"Failed entries: {semantic_summary['failed_entries']}")
+         if 'overall_score' in semantic_summary:
+             score = semantic_summary['overall_score']
+             print(f"\nOverall Score:")
+             print(f"  Mean: {score['mean']:.2f}")
+             print(f"  Min: {score['min']:.2f}")
+             print(f"  Max: {score['max']:.2f}")
+
+     if args.mode in ["auto_metric", "both"]:
+         # Run auto-metric evaluation
+         auto_metric_results, auto_metric_summary = run_auto_metric_evaluation(
+             evaluation_data,
+             strict_mode=args.strict_mode
+         )
+
+         # Save auto-metric results
+         auto_metric_output = args.auto_metric_output
+         if not os.path.isabs(auto_metric_output):
+             auto_metric_output = os.path.join(script_dir, auto_metric_output)
+
+         output_dir = os.path.dirname(auto_metric_output)
+         os.makedirs(output_dir, exist_ok=True)
+
+         with open(auto_metric_output, 'w', encoding='utf-8') as f:
+             json.dump(auto_metric_results, f, ensure_ascii=False, indent=2)
+         print(f"\nAuto-metric evaluation results saved to {auto_metric_output}")
+
+         # Save auto-metric summary
+         auto_metric_summary_path = auto_metric_output.replace('.json', '_summary.json')
+         with open(auto_metric_summary_path, 'w', encoding='utf-8') as f:
+             json.dump(auto_metric_summary, f, ensure_ascii=False, indent=2)
+         print(f"Auto-metric evaluation summary saved to {auto_metric_summary_path}")
+
+         # Print auto-metric summary
+         print("\n" + "="*80)
+         print("AUTO-METRIC EVALUATION SUMMARY")
+         print("="*80)
+         print(f"Total entries: {auto_metric_summary['total_entries']}")
+         print(f"Valid entries: {auto_metric_summary['valid_entries']}")
+         print(f"MSE entries: {auto_metric_summary['mse_entries']}")
+
+         if 'mse_statistics' in auto_metric_summary:
+             print("\nMSE Statistics:")
+             for dim, stats in auto_metric_summary['mse_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         if 'mae_statistics' in auto_metric_summary:
+             print("\nMAE Statistics:")
+             for dim, stats in auto_metric_summary['mae_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         # Print refined and initial statistics if available
+         if 'refined_mse_statistics' in auto_metric_summary:
+             print("\nRefined Scores - MSE Statistics:")
+             for dim, stats in auto_metric_summary['refined_mse_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         if 'refined_mae_statistics' in auto_metric_summary:
+             print("\nRefined Scores - MAE Statistics:")
+             for dim, stats in auto_metric_summary['refined_mae_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         if 'initial_mse_statistics' in auto_metric_summary:
+             print("\nInitial Scores - MSE Statistics:")
+             for dim, stats in auto_metric_summary['initial_mse_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         if 'initial_mae_statistics' in auto_metric_summary:
+             print("\nInitial Scores - MAE Statistics:")
+             for dim, stats in auto_metric_summary['initial_mae_statistics'].items():
+                 print(f"  {dim.capitalize()}: Mean={stats['mean']:.4f}, Count={stats['count']}")
+
+         if 'spearman_correlations' in auto_metric_summary:
+             print("\nSpearman Correlations:")
+             for dim, stats in auto_metric_summary['spearman_correlations'].items():
+                 print(f"  {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
+
+         # Print refined and initial spearman correlations if available
+         if 'refined_spearman_correlations' in auto_metric_summary:
+             print("\nRefined Scores - Spearman Correlations:")
+             for dim, stats in auto_metric_summary['refined_spearman_correlations'].items():
+                 print(f"  {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
+
+         if 'initial_spearman_correlations' in auto_metric_summary:
+             print("\nInitial Scores - Spearman Correlations:")
+             for dim, stats in auto_metric_summary['initial_spearman_correlations'].items():
+                 print(f"  {dim.capitalize()}: {stats['correlation']:.4f} (n={stats['count']})")
+
+         if 'decision_metrics' in auto_metric_summary:
+             dm = auto_metric_summary['decision_metrics']
+             print(f"\nDecision Metrics:")
+             print(f"  Accuracy: {dm['accuracy']:.4f} (n={dm['count']})")
+             if 'f1_macro' in dm:
+                 print(f"  F1 (macro): {dm['f1_macro']:.4f}")
+
+         # Print refined and initial decision metrics if available
+         if 'refined_decision_metrics' in auto_metric_summary:
+             print("\nRefined Scores - Decision Metrics:")
+             rdm = auto_metric_summary['refined_decision_metrics']
+             print(f"  Accuracy: {rdm['accuracy']:.4f} (n={rdm['count']})")
+             if 'f1_macro' in rdm:
+                 print(f"  F1 (macro): {rdm['f1_macro']:.4f}")
+
+         if 'initial_decision_metrics' in auto_metric_summary:
+             print("\nInitial Scores - Decision Metrics:")
+             idm = auto_metric_summary['initial_decision_metrics']
+             print(f"  Accuracy: {idm['accuracy']:.4f} (n={idm['count']})")
+             if 'f1_macro' in idm:
+                 print(f"  F1 (macro): {idm['f1_macro']:.4f}")
+
+     print("\n" + "="*80)
+     print("EVALUATION COMPLETE")
+     print("="*80)
+
+
+ if __name__ == "__main__":
+     main()
+
src/evaluator/configs.yaml ADDED
@@ -0,0 +1,38 @@
+ # LLM Service Configuration for Rubric Generation
+ # Choose one mode: "gpt" or "vllm"
+ mode: "vllm"  # Options: "gpt" or "vllm"
+
+ # GPT API Configuration (used when mode="gpt")
+ gpt:
+   api_key: "your-api-key-here"  # Replace with your actual OpenAI API key or set OPENAI_API_KEY env var
+   model_name: "gpt-4o"  # Options: gpt-4o, gpt-4-turbo, gpt-3.5-turbo, gpt-5, etc.
+   base_url: null  # Default: https://api.openai.com/v1
+   timeout: 300
+
+   # Default sampling parameters
+   temperature: 0.7
+   top_p: 0.95
+   max_tokens: 16384
+   presence_penalty: 0.0
+
+ # vLLM Service Configuration (used when mode="vllm")
+ vllm:
+   base_url: "http://localhost:8000/"  # vLLM server base URL
+   api_key: "dummy-key"  # Not used for local vLLM, but required by OpenAI client
+   model_name: "openai/gpt-oss-120b"  # Model name on vLLM server
+   timeout: 300
+
+   # Rate limiting: Maximum concurrent requests to vLLM server
+   max_concurrent_requests: 64
+
+   # Retry configuration for server errors
+   max_retries: 3
+   retry_delay: 1.0
+   retry_backoff: 2.0
+
+   # Default sampling parameters
+   temperature: 0.7
+   top_p: 0.8
+   top_k: 20
+   max_tokens: 16384
+   presence_penalty: 0.0