igriv commited on
Commit
1ea9c72
·
verified ·
1 Parent(s): acf0d28

Update validator app

Browse files
Files changed (9) hide show
  1. README.md +98 -12
  2. app.py +38 -0
  3. compile_all_pdfs.py +154 -0
  4. latex_compiler.py +252 -0
  5. packages.txt +6 -0
  6. requirements.txt +8 -0
  7. run_parallel.py +248 -0
  8. universal_validator.py +697 -0
  9. validator_gui.py +646 -0
README.md CHANGED
@@ -1,12 +1,98 @@
1
- ---
2
- title: Math Validator
3
- emoji: 🏆
4
- colorFrom: blue
5
- colorTo: red
6
- sdk: gradio
7
- sdk_version: 5.42.0
8
- app_file: app.py
9
- pinned: false
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Math Question Validator with OpenAI o3
2
+
3
+ A Python tool for validating mathematical questions and answers using OpenAI's o3 model, with automatic reconciliation and quality assessment.
4
+
5
+ ## Features
6
+
7
+ - **Automated Answer Validation**: Uses OpenAI o3 model to solve math problems
8
+ - **Quality Assessment**: Evaluates question clarity, difficulty, and pedagogical value
9
+ - **Smart Reconciliation**: Generates detailed LaTeX documents comparing different solutions
10
+ - **Batch Processing**: Handles large datasets with progress tracking
11
+ - **File-based Output**: Avoids truncation issues with cloud storage by saving outputs as separate files
12
+
13
+ ## Setup
14
+
15
+ ### Prerequisites
16
+ - Python 3.8+
17
+ - OpenAI API key with o3 access
18
+ - MiKTeX (optional, for PDF compilation)
19
+
20
+ ### Installation
21
+
22
+ 1. Clone the repository:
23
+ ```bash
24
+ git clone https://github.com/YOUR_USERNAME/validator.git
25
+ cd validator
26
+ ```
27
+
28
+ 2. Install dependencies:
29
+ ```bash
30
+ pip install -r requirements.txt
31
+ ```
32
+
33
+ 3. Create `.env` file with your OpenAI API key:
34
+ ```
35
+ OPENAI_API_KEY=your_key_here
36
+ ```
37
+
38
+ ## Usage
39
+
40
+ ### Basic Validation
41
+ ```bash
42
+ python universal_validator.py
43
+ ```
44
+
45
+ This will:
46
+ 1. Load questions from the Excel file
47
+ 2. Filter for math/statistics questions
48
+ 3. Assess each question's quality
49
+ 4. Generate o3 model answers
50
+ 5. Compare with reference answers
51
+ 6. Create LaTeX reconciliation documents for mismatches
52
+
53
+ ### Compile LaTeX to PDF
54
+ ```bash
55
+ python compile_all_pdfs.py
56
+ ```
57
+
58
+ ## Output Structure
59
+
60
+ ```
61
+ validation_results/
62
+ └── run_YYYYMMDD_HHMMSS/
63
+ ├── manifest.json # Index of all results
64
+ ├── model_answers/ # Full model responses
65
+ │ └── q_XXXX_answer.txt
66
+ ├── latex_documents/ # Reconciliation documents
67
+ │ └── q_XXXX_reconciliation.tex
68
+ └── compiled_pdfs/ # Compiled PDFs (if generated)
69
+ └── q_XXXX_reconciliation.pdf
70
+ ```
71
+
72
+ ## File Naming Convention
73
+
74
+ Files are named using Excel row numbers for easy cross-reference:
75
+ - `q_0116_reconciliation.tex` → Excel row 116
76
+ - `q_0117_answer.txt` → Excel row 117
77
+
78
+ ## Models Used
79
+
80
+ - **o3**: Primary model for solving mathematical problems
81
+ - **gpt-4o**: Quality assessment and question evaluation
82
+
83
+ ## Excel Output Columns
84
+
85
+ - `model_answer_file`: Path to model's answer
86
+ - `answer_match`: MATCH/DIFFERENT/ERROR
87
+ - `latex_file`: Path to reconciliation document
88
+ - `quality_rating`: excellent/good/fair/poor
89
+ - `difficulty_level`: too_easy/appropriate/too_hard/unclear
90
+ - `quality_comment`: Detailed assessment
91
+
92
+ ## License
93
+
94
+ [Your chosen license]
95
+
96
+ ## Author
97
+
98
+ [Your name]
app.py ADDED
@@ -0,0 +1,38 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/usr/bin/env python
"""
Hugging Face Spaces app for Math Validator.

Main entry point for HF Spaces deployment: builds the Gradio interface
from validator_gui and launches it with Spaces-compatible settings.
"""

import os
import sys

# Ensure sibling modules are importable regardless of the working
# directory HF Spaces uses when it starts the app.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

# Import and run the GUI
from validator_gui import ValidatorGUI

def main():
    """Build the Gradio interface and launch it for HF Spaces."""
    gui = ValidatorGUI()
    interface = gui.create_interface()

    # Launch with HF Spaces settings.
    # NOTE: `cache_examples` is an Interface-constructor option, not a
    # launch() option — passing it to launch() raises TypeError on
    # current Gradio, so it was removed here.
    interface.launch(
        server_name="0.0.0.0",  # Required for HF Spaces
        server_port=7860,       # Default HF Spaces port
        share=False,            # Sharing handled by HF
    )

if __name__ == "__main__":
    # Check if we're in HF Spaces (the platform sets SPACE_ID).
    if os.getenv("SPACE_ID"):
        print("Running in Hugging Face Spaces")
        print("Note: Set your API keys in the Spaces Settings > Variables and secrets")
    else:
        print("Running locally")

    main()
compile_all_pdfs.py ADDED
@@ -0,0 +1,154 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python
2
+ """
3
+ Batch compile all LaTeX reconciliation documents to PDFs
4
+ Can be run after validation to generate all PDFs at once
5
+ """
6
+
7
+ import os
8
+ import sys
9
+ import argparse
10
+ from pathlib import Path
11
+ from latex_compiler import compile_latex_batch, check_latex_available
12
+ import time
13
+
14
def find_tex_files(base_dir="validation_results"):
    """Recursively collect .tex source files under *base_dir*.

    Returns a list of path strings; prints a warning and returns an
    empty list when the directory does not exist.
    """
    root = Path(base_dir)

    if not root.exists():
        print(f"Directory not found: {base_dir}")
        return []

    # Keep only genuine sources: skip names that embed an auxiliary
    # extension (e.g. "foo.aux.tex" left behind by a previous run).
    skip_markers = ('.aux', '.log', '.out')
    return [
        str(candidate)
        for candidate in root.rglob("*.tex")
        if not any(marker in candidate.name for marker in skip_markers)
    ]
30
+
31
def compile_validation_pdfs(run_dir=None, max_workers=4):
    """
    Compile all LaTeX files from a validation run.

    Args:
        run_dir: Specific run directory, or None to auto-select the most
            recently modified "run_*" directory under validation_results/.
        max_workers: Number of parallel compile workers.

    Prints progress and a summary; returns None in all cases.
    """
    if not check_latex_available():
        print("Error: pdflatex not installed")
        print("Install with:")
        print(" Linux: apt-get install texlive-latex-base")
        print(" Windows: Install MiKTeX")
        print(" macOS: brew install --cask mactex")
        return

    # Resolve which run directory to process.
    if run_dir:
        base_dir = run_dir
    else:
        # No explicit run given: pick the latest run_* directory.
        base_path = Path("validation_results")
        if not base_path.exists():
            print("No validation_results directory found")
            return

        runs = [d for d in base_path.iterdir() if d.is_dir() and d.name.startswith("run_")]
        if not runs:
            print("No validation runs found")
            return

        # "Latest" = most recently modified, not lexicographic name order.
        latest_run = max(runs, key=lambda x: x.stat().st_mtime)
        base_dir = str(latest_run)
        print(f"Using latest run: {latest_run.name}")

    # The validator writes reconciliation documents into latex_documents/.
    latex_dir = Path(base_dir) / "latex_documents"
    if not latex_dir.exists():
        print(f"No latex_documents directory in {base_dir}")
        return

    tex_files = list(latex_dir.glob("*.tex"))
    if not tex_files:
        print(f"No .tex files found in {latex_dir}")
        return

    print(f"Found {len(tex_files)} LaTeX files to compile")

    # Skip documents that already have a matching PDF (same stem),
    # unless the user explicitly asks for a full recompile.
    existing_pdfs = list(latex_dir.glob("*.pdf"))
    if existing_pdfs:
        print(f" ({len(existing_pdfs)} PDFs already exist)")

        pdf_names = {f.stem for f in existing_pdfs}
        new_tex = [f for f in tex_files if f.stem not in pdf_names]

        if new_tex:
            print(f" Compiling {len(new_tex)} new PDFs...")
            tex_files = new_tex
        else:
            print(" All PDFs already compiled")

            recompile = input("Recompile all? (y/N): ").strip().lower()
            if recompile != 'y':
                return

    # Compile in parallel.
    print(f"\nCompiling with {max_workers} parallel workers...")
    start_time = time.time()

    results = compile_latex_batch(
        [str(f) for f in tex_files],
        output_dir=str(latex_dir),
        max_workers=max_workers,
        timeout=30
    )

    # Summary: results maps tex path -> (success, pdf_path, error).
    elapsed = time.time() - start_time
    successful = sum(1 for r in results.values() if r[0])
    failed = len(results) - successful

    print(f"\n{'='*60}")
    print(f"Compilation complete in {elapsed:.1f} seconds")
    print(f" Successful: {successful}")
    print(f" Failed: {failed}")

    if failed > 0:
        print("\nFailed files:")
        for tex_file, (success, _, error) in results.items():
            if not success:
                print(f" - {Path(tex_file).name}: {error[:50]}...")

    print(f"\nPDFs saved to: {latex_dir}")
129
+
130
def main():
    """Command-line entry point: parse options and compile PDFs."""
    parser = argparse.ArgumentParser(description='Compile LaTeX reconciliation documents to PDFs')
    parser.add_argument('--run-dir', help='Specific run directory (default: latest)')
    parser.add_argument('--workers', type=int, default=4, help='Number of parallel workers')
    parser.add_argument('--all', action='store_true', help='Compile all runs, not just latest')

    args = parser.parse_args()

    if not args.all:
        # Single run: either the one requested or the latest.
        compile_validation_pdfs(args.run_dir, args.workers)
        return

    # --all: walk every run_* directory under validation_results/.
    base_path = Path("validation_results")
    if not base_path.exists():
        return

    runs = [d for d in base_path.iterdir() if d.is_dir() and d.name.startswith("run_")]
    print(f"Found {len(runs)} validation runs")

    for run in runs:
        print(f"\n{'='*60}")
        print(f"Processing: {run.name}")
        print('='*60)
        compile_validation_pdfs(str(run), args.workers)

if __name__ == "__main__":
    main()
latex_compiler.py ADDED
@@ -0,0 +1,252 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python
2
+ """
3
+ Async LaTeX compilation handler
4
+ Works efficiently on Linux/HF Spaces with forking
5
+ Falls back to sequential on Windows
6
+ """
7
+
8
+ import os
9
+ import sys
10
+ import subprocess
11
+ import platform
12
+ from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
13
+ from pathlib import Path
14
+ import time
15
+
16
def is_linux():
    """Return True on POSIX-like systems (Linux or macOS), else False."""
    system_name = platform.system()
    return system_name == 'Linux' or system_name == 'Darwin'
19
+
20
def compile_latex_file(tex_path, output_dir=None, timeout=30):
    """
    Compile a single LaTeX file to PDF with pdflatex.

    Args:
        tex_path: Path to .tex file (str or Path)
        output_dir: Output directory (str or Path; default: same as tex file)
        timeout: Compilation timeout in seconds

    Returns:
        tuple: (success: bool, pdf_path: str or None, error_msg: str or None)
    """
    tex_path = Path(tex_path)
    if not tex_path.exists():
        return False, None, f"File not found: {tex_path}"

    # Normalize to Path: callers (e.g. compile_latex_batch via
    # compile_all_pdfs) pass a plain string, and `str / str` has no `/`
    # operator — the original crashed with TypeError in that case.
    output_dir = Path(output_dir) if output_dir else tex_path.parent
    pdf_path = output_dir / tex_path.with_suffix('.pdf').name

    # Remove a stale PDF so a leftover file is never mistaken for success.
    if pdf_path.exists():
        try:
            pdf_path.unlink()
        except OSError:
            # Best-effort: pdflatex will try to overwrite it anyway.
            pass

    # Compile command: non-interactive, stop on first error.
    cmd = [
        'pdflatex',
        '-interaction=nonstopmode',
        '-halt-on-error',
        f'-output-directory={output_dir}',
        str(tex_path)
    ]

    try:
        # Run compilation from the source directory so relative inputs
        # (\input, images) resolve.
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
            cwd=str(tex_path.parent)
        )

        # Success is judged by the PDF existing, not by the return code.
        if pdf_path.exists():
            return True, str(pdf_path), None

        # Extract the first error block from the pdflatex transcript
        # (pdflatex reports errors on stdout, prefixed with '!').
        error_msg = "Compilation failed"
        if result.stdout:
            lines = result.stdout.split('\n')
            for i, line in enumerate(lines):
                if 'Error' in line or '!' in line[:2]:
                    error_msg = '\n'.join(lines[i:i+5])
                    break
        return False, None, error_msg

    except subprocess.TimeoutExpired:
        return False, None, f"Timeout after {timeout} seconds"
    except FileNotFoundError:
        return False, None, "pdflatex not found - install texlive"
    except Exception as e:
        return False, None, str(e)
85
+
86
def compile_latex_batch(tex_files, output_dir=None, max_workers=4, timeout=30):
    """
    Compile multiple LaTeX files in parallel.

    Args:
        tex_files: List of .tex file paths
        output_dir: Output directory for PDFs
        max_workers: Number of parallel workers
        timeout: Timeout per file

    Returns:
        dict: {tex_path: (success, pdf_path, error_msg)}
    """
    results = {}

    if not tex_files:
        return results

    # Processes give true parallelism on POSIX; threads are the portable
    # fallback on Windows (less efficient but works).
    if is_linux():
        pool_cls = ProcessPoolExecutor
        print(f"Using process-based parallelism ({max_workers} workers)")
    else:
        pool_cls = ThreadPoolExecutor
        print(f"Using thread-based parallelism ({max_workers} workers)")

    with pool_cls(max_workers=max_workers) as pool:
        # Fan out one compile task per source file.
        pending = {
            pool.submit(compile_latex_file, src, output_dir, timeout): src
            for src in tex_files
        }

        # Gather results in submission order; a small grace period is
        # allowed on top of the per-file compile timeout.
        for future, src in pending.items():
            try:
                ok, pdf_path, error = future.result(timeout=timeout+5)
                results[src] = (ok, pdf_path, error)

                if ok:
                    print(f" ✓ Compiled: {Path(src).name}")
                else:
                    print(f" ✗ Failed: {Path(src).name}")

            except Exception as e:
                results[src] = (False, None, str(e))
                print(f" ✗ Error: {Path(src).name}: {e}")

    return results
137
+
138
def compile_latex_async(tex_path, output_dir=None, callback=None):
    """
    Compile LaTeX file asynchronously (fire-and-forget)

    Args:
        tex_path: Path to .tex file
        output_dir: Output directory
        callback: Optional callback function(success, pdf_path, error)
    """
    if is_linux():
        # On Linux/macOS, fork a child process so the caller returns
        # immediately.
        # NOTE(review): the parent never calls os.waitpid(), so finished
        # children linger as zombies until the parent exits — confirm this
        # is acceptable for long-running validator processes.
        pid = os.fork()
        if pid == 0:
            # Child process: run the compile, then hard-exit. os._exit(0)
            # skips atexit handlers and flushing inherited from the parent,
            # which is intentional for a forked worker.
            try:
                success, pdf_path, error = compile_latex_file(tex_path, output_dir)
                if callback:
                    callback(success, pdf_path, error)
            finally:
                os._exit(0)
        else:
            # Parent process continues immediately
            print(f" → Compiling {Path(tex_path).name} in background (PID: {pid})")
    else:
        # On Windows there is no fork; a daemon thread gives fire-and-forget
        # semantics (the thread dies with the process, possibly mid-compile).
        from threading import Thread

        def compile_thread():
            # Runs compile_latex_file off the caller's thread, then invokes
            # the callback (if any) from this worker thread.
            success, pdf_path, error = compile_latex_file(tex_path, output_dir)
            if callback:
                callback(success, pdf_path, error)

        thread = Thread(target=compile_thread, daemon=True)
        thread.start()
        print(f" → Compiling {Path(tex_path).name} in background thread")
173
+
174
def check_latex_available():
    """Return True if a working `pdflatex` binary is on PATH.

    Runs `pdflatex --version` with a short timeout and prints the
    detected version line on success.
    """
    try:
        result = subprocess.run(
            ['pdflatex', '--version'],
            capture_output=True,
            text=True,
            timeout=5
        )
    # Narrowed from a bare `except:` (which also swallowed
    # KeyboardInterrupt): missing binary, hung binary, or OS failure.
    except (FileNotFoundError, subprocess.TimeoutExpired, OSError):
        return False

    if result.returncode == 0:
        # Report which TeX distribution answered.
        for line in result.stdout.split('\n'):
            if 'TeX' in line:
                print(f"LaTeX available: {line.strip()}")
                return True
    return False
192
+
193
+ # Integration with universal_validator.py
194
def setup_async_latex_compilation():
    """
    Setup async LaTeX compilation for the validator
    Returns a function that can be used to compile LaTeX files
    """
    if not check_latex_available():
        print("Warning: LaTeX not available, PDF compilation disabled")
        return None

    def compile_reconciliation(tex_path):
        """Compile reconciliation document asynchronously"""
        # Named callback instead of an inline lambda; same output string.
        def report(success, _pdf_path, _error):
            outcome = 'Success' if success else 'Failed'
            print(f" [PDF] {outcome}: {Path(tex_path).name}")

        compile_latex_async(tex_path, callback=report)

    return compile_reconciliation
211
+
212
if __name__ == "__main__":
    # Self-test: report platform capabilities, then round-trip a tiny
    # document through pdflatex if it is installed.
    import tempfile

    print("Testing LaTeX compilation...")
    print(f"Platform: {platform.system()}")
    print(f"Async support: {'Yes' if is_linux() else 'Limited (Windows)'}")

    if check_latex_available():
        # Create a minimal test document. delete=False so pdflatex can
        # reopen the file by name after this handle is closed.
        with tempfile.NamedTemporaryFile(mode='w', suffix='.tex', delete=False) as f:
            f.write(r"""\documentclass{article}
\begin{document}
\title{Test Document}
\author{Validator}
\maketitle
This is a test: $x^2 + y^2 = z^2$
\end{document}""")
            test_file = f.name

        print(f"\nCompiling test file: {test_file}")
        success, pdf_path, error = compile_latex_file(test_file)

        if success:
            print(f"✓ Success! PDF created: {pdf_path}")
            print(f" Size: {os.path.getsize(pdf_path)} bytes")
        else:
            print(f"✗ Failed: {error}")

        # Clean up the temp .tex and any produced PDF; best-effort only
        # (pdf_path is None on failure, hence the guard).
        try:
            os.unlink(test_file)
            if pdf_path and os.path.exists(pdf_path):
                os.unlink(pdf_path)
        except:
            pass
    else:
        print("✗ LaTeX not installed")
        print(" On Linux: apt-get install texlive-latex-base")
        print(" On Windows: Install MiKTeX")
        print(" On macOS: brew install --cask mactex")
packages.txt ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ texlive-latex-base
2
+ texlive-latex-extra
3
+ texlive-fonts-recommended
4
+ texlive-fonts-extra
5
+ texlive-latex-recommended
6
+ texlive-xetex
requirements.txt ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ gradio>=4.0.0
2
+ pandas>=2.0.0
3
+ openpyxl>=3.1.0
4
+ python-dotenv>=1.0.0
5
+ openai>=1.0.0
6
+ requests>=2.31.0
7
+ tqdm>=4.65.0
8
+ httpx>=0.24.0
run_parallel.py ADDED
@@ -0,0 +1,248 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python
2
+ """
3
+ Run validator in parallel across multiple processes
4
+ """
5
+
6
+ import subprocess
7
+ import sys
8
+ import os
9
+ import argparse
10
+ import math
11
+ from concurrent.futures import ProcessPoolExecutor, as_completed
12
+ import pandas as pd
13
+
14
def run_validator_range(args):
    """Run universal_validator.py as a subprocess over one question range.

    Args:
        args: Tuple of (excel_file, solver, reconciler, start, end, images,
            batch_size, output_base, compile_latex) — packed into one tuple
            so the function can be dispatched through an executor map.

    Returns:
        tuple: (start, end, status, error_text) where status is one of
        "success", "failed", or "error".
    """
    excel_file, solver, reconciler, start, end, images, batch_size, output_base, compile_latex = args

    # Each range writes to its own file so parallel workers never clash.
    range_output = output_base.replace('.xlsx', f'_p{start}_{end}.xlsx')

    cmd = [
        sys.executable, "universal_validator.py",
        excel_file,
        "--model", solver,
        "--reconciliation-model", reconciler,
        "--images", images,
        "--start", str(start),
        "--end", str(end),
        "--batch-size", str(batch_size),
        "--output", range_output
    ]

    if compile_latex:
        cmd.append("--compile-latex")

    print(f"[PARALLEL] Starting process for questions {start+1}-{end}...")

    try:
        # Capture combined stdout/stderr and re-print each line with a
        # per-process prefix so interleaved output stays attributable.
        # (The original comment claimed output was NOT captured — it is.)
        process = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            encoding='utf-8',
            errors='replace',
            bufsize=1  # line-buffered so lines stream as they are produced
        )

        # Stream output line by line, keeping a copy for error reporting.
        output_lines = []
        for line in process.stdout:
            print(f"[P{start//100+1}] {line.rstrip()}")
            output_lines.append(line)

        process.wait()

        if process.returncode == 0:
            print(f"[PARALLEL] Completed range {start+1}-{end}")
            return (start, end, "success", "")
        else:
            error_msg = "".join(output_lines[-20:])  # Last 20 lines
            print(f"[FAIL] Failed range {start+1}-{end}")
            return (start, end, "failed", error_msg)

    except Exception as e:
        print(f"[ERROR] Error in range {start+1}-{end}: {e}")
        return (start, end, "error", str(e))
72
+
73
def main():
    """CLI entry point: split the question set into ranges and validate
    each range in a separate subprocess, then merge the per-range files."""
    parser = argparse.ArgumentParser(description='Run validator in parallel')
    parser.add_argument('file', help='Excel file to process')
    parser.add_argument('--num-processes', type=int, default=4,
                        help='Number of parallel processes (default: 4)')
    parser.add_argument('--solver', default='o3-mini',
                        help='Solver model (default: o3-mini)')
    parser.add_argument('--reconciler', default='gpt-4o',
                        help='Reconciliation model (default: gpt-4o)')
    parser.add_argument('--images', default='when_needed',
                        help='Image handling (default: when_needed)')
    parser.add_argument('--batch-size', type=int, default=5,
                        help='Questions per batch (default: 5)')
    parser.add_argument('--questions-per-process', type=int, default=100,
                        help='Questions per process (default: 100)')
    parser.add_argument('--output', type=str, default=None,
                        help='Output filename for merged results')
    parser.add_argument('--start-range', type=int, default=0,
                        help='Start of question range')
    parser.add_argument('--end-range', type=int, default=None,
                        help='End of question range')
    parser.add_argument('--compile-latex', action='store_true',
                        help='Compile LaTeX files to PDF')

    args = parser.parse_args()

    # Count total questions (read only to size the ranges; the subprocesses
    # re-read the file themselves).
    print(f"Loading {args.file} to count questions...")
    df = pd.read_excel(args.file, sheet_name='Data')

    # Filter for math questions by subject keyword (case-insensitive regex).
    if 'raw_subject' in df.columns:
        math_filter = df['raw_subject'].str.lower().str.contains(
            'math|statistic|calculus|algebra|geometry|trigonometry',
            na=False, regex=True
        )
        df = df[math_filter]

    # Apply range if specified (positional slice over the filtered frame).
    if args.start_range > 0 or args.end_range:
        start_idx = args.start_range
        end_idx = args.end_range if args.end_range else len(df)
        df = df.iloc[start_idx:end_idx]
        print(f"Processing range: questions {start_idx+1} to {end_idx}")

    total_questions = len(df)
    print(f"Found {total_questions} math questions to process")

    # Calculate ranges.
    # NOTE(review): max() makes --questions-per-process a *floor* — asking
    # for fewer questions per process than an even split is ignored;
    # confirm that is the intended semantics.
    questions_per_process = max(args.questions_per_process, math.ceil(total_questions / args.num_processes))
    num_processes = min(args.num_processes, math.ceil(total_questions / questions_per_process))

    # Generate output base filename (timestamped unless supplied).
    if args.output:
        output_base = args.output
    else:
        from datetime import datetime
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        base_name = os.path.basename(args.file).replace('.xlsx', '')
        output_base = f"{base_name}_validated_{timestamp}_parallel.xlsx"

    # Build one argument tuple per subprocess (see run_validator_range).
    ranges = []
    base_start = args.start_range if args.start_range else 0

    for i in range(num_processes):
        start = base_start + i * questions_per_process
        end = min(base_start + (i + 1) * questions_per_process, base_start + total_questions)
        if start < base_start + total_questions:
            ranges.append((
                args.file,
                args.solver,
                args.reconciler,
                start,
                end,
                args.images,
                args.batch_size,
                output_base,
                args.compile_latex
            ))

    print(f"\nWill run {len(ranges)} parallel processes:")
    for i, (_, _, _, start, end, _, _, _, _) in enumerate(ranges, 1):
        print(f" Process {i}: questions {start+1}-{end}")

    # Skip confirmation in GUI mode (when output is specified).
    if not args.output:
        confirm = input("\nProceed? (Y/n): ").strip().lower()
        if confirm == 'n':
            print("Cancelled")
            return

    # Run in parallel; each worker shells out to universal_validator.py.
    print(f"\nStarting {len(ranges)} parallel processes...")

    with ProcessPoolExecutor(max_workers=num_processes) as executor:
        futures = {executor.submit(run_validator_range, r): r for r in ranges}

        completed = 0
        failed = []

        for future in as_completed(futures):
            completed += 1
            start, end, status, error = future.result()

            if status != "success":
                failed.append((start, end, error))

            print(f"Progress: {completed}/{len(ranges)} processes completed")

    # Summary
    print("\n" + "="*60)
    print("PARALLEL VALIDATION COMPLETE")
    print("="*60)

    if failed:
        print(f"\nFailed ranges ({len(failed)}):")
        for start, end, error in failed:
            print(f" {start}-{end}: {error[:100]}")
        print("\nRerun these ranges individually to retry")
    else:
        print("\nAll ranges completed successfully!")

    # Merge results from all processes into a single workbook.
    print("\nMerging results from all processes...")
    merge_results(args.file, output_base, ranges)

    # Clean up the intermediate per-range files produced by the workers.
    for _, _, _, start, end, _, _, _, _ in ranges:
        temp_file = output_base.replace('.xlsx', f'_p{start}_{end}.xlsx')
        if os.path.exists(temp_file):
            os.remove(temp_file)
            print(f" Cleaned up: {temp_file}")

    print(f"\nFinal results saved to: {output_base}")
    print(f"Results from {len(ranges)} processes have been merged")
208
+
209
def merge_results(original_file, output_file, ranges):
    """Merge results from parallel processes into a single file"""
    import pandas as pd

    # Load original data; result columns are copied onto this frame.
    original_df = pd.read_excel(original_file, sheet_name='Data')

    # Process each range file and update the dataframe.
    # NOTE(review): rows are addressed by the same absolute index in both
    # temp_df and original_df — this assumes each per-range output file
    # preserves the full sheet length/order; verify against the writer in
    # universal_validator.py.
    for _, _, _, start, end, _, _, _, _ in ranges:
        temp_file = output_file.replace('.xlsx', f'_p{start}_{end}.xlsx')
        if os.path.exists(temp_file):
            try:
                temp_df = pd.read_excel(temp_file, sheet_name='Data')
                # Copy only the validator-produced columns for this range.
                for idx in range(start, min(end, len(temp_df))):
                    if idx < len(original_df):
                        for col in ['model_answer_file', 'answer_match', 'latex_file',
                                    'quality_rating', 'difficulty_level', 'quality_comment']:
                            if col in temp_df.columns:
                                original_df.at[idx, col] = temp_df.at[idx, col]
                print(f" Merged results from questions {start+1}-{end}")
            except Exception as e:
                # Best-effort merge: a bad/partial range file is reported
                # but does not abort the overall merge.
                print(f" Warning: Could not merge {temp_file}: {e}")

    # Save merged results.
    with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
        original_df.to_excel(writer, sheet_name='Data', index=False)

        # Copy other sheets if they exist (best-effort; any read failure
        # simply leaves the output with only the Data sheet).
        try:
            xl = pd.ExcelFile(original_file)
            for sheet_name in xl.sheet_names:
                if sheet_name != 'Data':
                    df = pd.read_excel(original_file, sheet_name=sheet_name)
                    df.to_excel(writer, sheet_name=sheet_name, index=False)
        except:
            pass

if __name__ == "__main__":
    main()
universal_validator.py ADDED
@@ -0,0 +1,697 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pandas as pd
2
+ import os
3
+ from dotenv import load_dotenv
4
+ import time
5
+ from typing import Dict, Any, Optional, List
6
+ import re
7
+ from datetime import datetime
8
+ import json
9
+ from tqdm import tqdm
10
+ import base64
11
+ import requests
12
+ from io import BytesIO
13
+
14
+ load_dotenv()
15
+
16
+ class UniversalMathValidator:
17
+ """Universal validator that can handle different Excel formats and API providers"""
18
+
19
+ def __init__(self, excel_file: str, provider: str = "openai", include_images: str = "when_needed",
20
+ solver_model: str = None, reconciliation_model: str = None):
21
+ """
22
+ Initialize validator
23
+
24
+ Args:
25
+ excel_file: Path to Excel file
26
+ provider: "openai" or "openrouter"
27
+ include_images: "always", "never", or "when_needed"
28
+ solver_model: Model for solving questions
29
+ reconciliation_model: Model for reconciliation
30
+ """
31
+ self.excel_file = excel_file
32
+ self.include_images = include_images
33
+
34
+ # Determine provider based on models
35
+ # If any model requires OpenRouter, use OpenRouter for everything
36
+ openrouter_prefixes = ["anthropic/", "x-ai/", "google/", "meta-llama/", "mistral/", "openai/"]
37
+ openai_models = ["o3-mini", "gpt-4o", "gpt-5", "gpt-5-mini", "gpt-5-nano", "gpt-4-turbo"]
38
+
39
+ # Check if any model needs OpenRouter (has a prefix or is not an OpenAI model)
40
+ solver_needs_or = solver_model and (
41
+ any(solver_model.startswith(p) for p in openrouter_prefixes) or
42
+ solver_model not in openai_models
43
+ )
44
+ recon_needs_or = reconciliation_model and (
45
+ any(reconciliation_model.startswith(p) for p in openrouter_prefixes) or
46
+ reconciliation_model not in openai_models
47
+ )
48
+
49
+ needs_openrouter = solver_needs_or or recon_needs_or
50
+
51
+ # Override provider if OpenRouter is needed
52
+ if needs_openrouter:
53
+ self.provider = "openrouter"
54
+ if provider == "openai":
55
+ print("Note: Using OpenRouter for all models since non-OpenAI model specified")
56
+ else:
57
+ self.provider = provider
58
+
59
+ # Store original model names for later prefixing if needed
60
+ self.solver_model_input = solver_model
61
+ self.reconciliation_model_input = reconciliation_model
62
+
63
+ self.df = None
64
+ self.output_file = None # Will be set later
65
+ self.compile_latex = False # Will be set from args
66
+
67
+ # Detect file format
68
+ self.file_format = self._detect_format()
69
+
70
+ # Create directories for outputs
71
+ self.base_dir = "validation_results"
72
+ self.timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
73
+ self.run_dir = os.path.join(self.base_dir, f"run_{self.timestamp}")
74
+ self.latex_dir = os.path.join(self.run_dir, "latex_documents")
75
+ self.answers_dir = os.path.join(self.run_dir, "model_answers")
76
+
77
+ os.makedirs(self.latex_dir, exist_ok=True)
78
+ os.makedirs(self.answers_dir, exist_ok=True)
79
+
80
+ # Initialize API client
81
+ if self.provider == "openai":
82
+ from openai import OpenAI
83
+ import httpx
84
+ # Set 5 minute timeout for GPT-5 models which can be very slow
85
+ self.client = OpenAI(
86
+ api_key=os.getenv('OPENAI_API_KEY'),
87
+ timeout=httpx.Timeout(300.0, connect=10.0) # 300 second timeout, 10 second connect
88
+ )
89
+ # Default models for OpenAI
90
+ self.model = self.solver_model_input or "o3-mini"
91
+ self.reconciliation_model = self.reconciliation_model_input or "gpt-4o"
92
+ self.assessment_model = "gpt-4o"
93
+ elif self.provider == "openrouter":
94
+ import httpx
95
+ self.client = self._setup_openrouter()
96
+
97
+ # Helper to add openai/ prefix if needed
98
+ def format_for_openrouter(model_name):
99
+ if not model_name:
100
+ return None
101
+ # If already has a prefix, use as-is
102
+ if "/" in model_name:
103
+ return model_name
104
+ # If it's an OpenAI model, add prefix
105
+ openai_models = ["o3-mini", "gpt-4o", "gpt-5", "gpt-5-mini", "gpt-5-nano", "gpt-4-turbo"]
106
+ if model_name in openai_models:
107
+ return f"openai/{model_name}"
108
+ # Otherwise assume it needs no prefix (for backwards compatibility)
109
+ return model_name
110
+
111
+ # Format models for OpenRouter
112
+ self.model = format_for_openrouter(self.solver_model_input) or "openai/o3-mini"
113
+ self.reconciliation_model = format_for_openrouter(self.reconciliation_model_input) or "openai/gpt-4o"
114
+ self.assessment_model = "openai/gpt-4o"
115
+
116
+ # System prompts
117
+ self.system_prompt_answer = """You are a highly skilled mathematics graduate student.
118
+ Solve the following problem step by step.
119
+ IMPORTANT: First show your complete reasoning and work.
120
+ Then clearly state the final answer.
121
+ Your response should include both the reasoning process and the final answer."""
122
+
123
+ self.system_prompt_assess = """You are an experienced mathematics educator. Evaluate mathematical questions."""
124
+
125
+ self.system_prompt_reconcile = """You are a graduate student who produces detailed justifications in LaTeX format.
126
+ You excel at analyzing mathematical solutions and identifying potential errors.
127
+ Your output should be a complete LaTeX document that can be compiled directly."""
128
+
129
+ # Create manifest
130
+ self.manifest_file = os.path.join(self.run_dir, "manifest.json")
131
+ self.manifest = {
132
+ "timestamp": self.timestamp,
133
+ "source_file": excel_file,
134
+ "file_format": self.file_format,
135
+ "provider": provider,
136
+ "model": self.model,
137
+ "questions": {}
138
+ }
139
+
140
+ def _detect_format(self) -> str:
141
+ """Detect which format the Excel file uses"""
142
+ xl = pd.ExcelFile(self.excel_file)
143
+
144
+ # Check for specific sheets
145
+ if 'rationale_images' in xl.sheet_names:
146
+ return "HLE_B3" # HLE_Verified_B3 format
147
+ elif 'model_responses' in xl.sheet_names:
148
+ return "HLE_335" # HLE_335 format
149
+ else:
150
+ return "unknown"
151
+
152
+ def _setup_openrouter(self):
153
+ """Setup OpenRouter client"""
154
+ from openai import OpenAI
155
+ import httpx
156
+
157
+ # OpenRouter uses OpenAI-compatible API
158
+ client = OpenAI(
159
+ base_url="https://openrouter.ai/api/v1",
160
+ api_key=os.getenv('OPENROUTER_API_KEY'),
161
+ timeout=httpx.Timeout(300.0, connect=10.0), # Same timeout as OpenAI
162
+ default_headers={
163
+ "HTTP-Referer": "https://github.com/yourusername/validator",
164
+ "X-Title": "Math Validator"
165
+ }
166
+ )
167
+ return client
168
+
169
+ def load_data(self):
170
+ """Load and normalize data based on file format"""
171
+ if self.file_format == "HLE_B3":
172
+ # Load HLE_Verified_B3 format
173
+ self.df = pd.read_excel(self.excel_file, sheet_name='Data')
174
+
175
+ # Normalize column names
176
+ self.df['task_name'] = self.df.get('id', '')
177
+ self.df['answer type'] = self.df.get('answer_type', 'exactMatch')
178
+
179
+ # Create image mapping from file_url column (question images)
180
+ self.image_mapping = {}
181
+ if 'file_url' in self.df.columns:
182
+ for idx, row in self.df.iterrows():
183
+ if pd.notna(row.get('file_url')) and pd.notna(row.get('id')):
184
+ self.image_mapping[row['id']] = row['file_url']
185
+ print(f"Loaded {len(self.image_mapping)} question images from file_url column")
186
+
187
+ # Also load rationale images if needed (these are for rationales, not questions)
188
+ try:
189
+ rationale_images = pd.read_excel(self.excel_file, sheet_name='rationale_images')
190
+ # Don't overwrite question images with rationale images
191
+ rationale_mapping = dict(zip(rationale_images['ID'], rationale_images['gcp']))
192
+ print(f"Found {len(rationale_mapping)} rationale images (not used for questions)")
193
+ except:
194
+ pass
195
+
196
+ elif self.file_format == "HLE_335":
197
+ # Load HLE_335 format
198
+ self.df = pd.read_excel(self.excel_file, sheet_name='Data')
199
+ self.image_mapping = {}
200
+ else:
201
+ # Generic format - assume Data sheet exists
202
+ self.df = pd.read_excel(self.excel_file, sheet_name='Data')
203
+ self.image_mapping = {}
204
+
205
+ # Filter for math questions but KEEP ORIGINAL INDICES
206
+ if 'raw_subject' in self.df.columns:
207
+ math_filter = self.df['raw_subject'].str.lower().str.contains(
208
+ 'math|statistic|calculus|algebra|geometry|trigonometry',
209
+ na=False, regex=True
210
+ )
211
+ # Keep original indices by not resetting them
212
+ self.df = self.df[math_filter] # Don't use .copy() with reset indices
213
+
214
+ # Add result columns
215
+ self.df['model_answer_file'] = ''
216
+ self.df['answer_match'] = ''
217
+ self.df['latex_file'] = ''
218
+ self.df['quality_rating'] = ''
219
+ self.df['difficulty_level'] = ''
220
+ self.df['quality_comment'] = ''
221
+
222
+ print(f"Loaded {len(self.df)} math/statistics questions from {self.file_format} format")
223
+ return self.df
224
+
225
+ def _get_image_for_question(self, row) -> Optional[str]:
226
+ """Get image URL or path for a question if needed"""
227
+ if self.include_images == "never":
228
+ return None
229
+
230
+ # Check if question has an image reference
231
+ question_id = row.get('id') or row.get('task_name')
232
+ question_text = str(row.get('question', '')).lower()
233
+
234
+ # Check if question mentions an image
235
+ has_image_reference = any(keyword in question_text for keyword in [
236
+ "image", "figure", "diagram", "picture", "attached",
237
+ "graph", "plot", "shown", "below", "above"
238
+ ])
239
+
240
+ if self.include_images == "always" or (
241
+ self.include_images == "when_needed" and has_image_reference
242
+ ):
243
+ # First check file_url column directly (primary source for question images)
244
+ if 'file_url' in row and pd.notna(row['file_url']):
245
+ return row['file_url']
246
+
247
+ # Then try to get image from mapping
248
+ if question_id in self.image_mapping:
249
+ return self.image_mapping[question_id]
250
+
251
+ # Finally check for generic image column
252
+ if 'image' in row and pd.notna(row['image']):
253
+ return row['image']
254
+
255
+ # Log warning if image was expected but not found
256
+ if has_image_reference:
257
+ original_idx = row.name if hasattr(row, 'name') else 'unknown'
258
+ print(f" [WARNING] Question {original_idx} mentions image but none found (ID: {question_id[:20]}...)")
259
+
260
+ return None
261
+
262
+ def _encode_image(self, image_url: str) -> Optional[str]:
263
+ """Download and encode image as base64"""
264
+ try:
265
+ response = requests.get(image_url, timeout=10)
266
+ if response.status_code == 200:
267
+ return base64.b64encode(response.content).decode('utf-8')
268
+ except:
269
+ pass
270
+ return None
271
+
272
    def get_model_answer(self, question: str, image_url: Optional[str] = None, attempt: int = 1) -> Optional[str]:
        """Get answer from model with optional image support.

        Args:
            question: Problem statement sent as the user message.
            image_url: Optional image location; only attached for the OpenAI
                provider, and only when it is an http(s) URL.
            attempt: 1-based retry counter; retries up to 3 times with
                exponential backoff (2**attempt seconds) via recursion.

        Returns:
            The model's answer text, or None after 3 failed attempts.
        """
        try:
            messages = [
                {"role": "system", "content": self.system_prompt_answer}
            ]

            # Build user message with optional image
            if image_url and self.provider == "openai":
                # OpenAI vision format
                user_content = [
                    {"type": "text", "text": question}
                ]

                # Local file paths are silently skipped - only URLs are sent.
                if image_url.startswith('http'):
                    user_content.append({
                        "type": "image_url",
                        "image_url": {"url": image_url}
                    })

                messages.append({"role": "user", "content": user_content})
            else:
                # Text-only or OpenRouter (handle differently if needed)
                messages.append({"role": "user", "content": question})

            # Make API call
            # Check the original model name (before prefixing) for special handling
            # Handle case where solver_model_input might not be set
            if hasattr(self, 'solver_model_input'):
                original_model = self.solver_model_input or self.model
            else:
                original_model = self.model

            # Reasoning models reject `temperature` and use
            # `max_completion_tokens` instead of `max_tokens`.
            if original_model in ["o3-mini", "gpt-5", "gpt-5-mini", "gpt-5-nano"]:
                # Use higher token limit for GPT-5 and o3 models to allow for reasoning
                if original_model == "o3-mini":
                    max_tokens = 10000
                elif original_model in ["gpt-5", "gpt-5-mini", "gpt-5-nano"]:
                    max_tokens = 8000  # Increased for reasoning + answer
                else:
                    # NOTE(review): unreachable - the outer `in` check already
                    # covers every model the two branches above handle.
                    max_tokens = 3000

                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=messages,
                    max_completion_tokens=max_tokens
                )
            else:
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=messages,
                    temperature=0.1,
                    max_tokens=2000
                )

            return response.choices[0].message.content.strip()

        except Exception as e:
            error_msg = str(e)
            if "timeout" in error_msg.lower():
                print(f" [TIMEOUT] Timeout getting model answer (attempt {attempt}/3)")
            else:
                print(f" [ERROR] Error getting model answer (attempt {attempt}): {e}")

            if attempt < 3:
                time.sleep(2 ** attempt)
                return self.get_model_answer(question, image_url, attempt + 1)

            print(f" [ERROR] Failed after 3 attempts")
            return None
342
+
343
    def generate_reconciliation_latex(self, question: str, model_answer: str,
                                      reference_answer: str, rationale: str = None, attempt: int = 1) -> str:
        """Generate LaTeX reconciliation document for mismatched answers.

        Asks the reconciliation model for a complete, compilable LaTeX
        document comparing the model's answer with the reference answer.
        Retries up to 3 times with exponential backoff; returns None when all
        attempts fail.
        """
        prompt = f"""Compare and reconcile these two answers to the following problem.

PROBLEM:
{question}

MODEL'S ANSWER:
{model_answer}

REFERENCE ANSWER:
{reference_answer}

REFERENCE RATIONALE:
{rationale if pd.notna(rationale) else "Not provided"}

Please create a complete LaTeX document that:
1. States the problem
2. Shows the model's approach and solution
3. Shows the reference approach and solution
4. Analyzes where any differences or errors might occur
5. Provides your assessment of which answer is correct and why

The document should be properly formatted with sections and mathematical notation.
Begin with \\documentclass and end with \\end{{document}}."""

        try:
            # Handle GPT-5 models parameter differences
            messages = [
                {"role": "system", "content": self.system_prompt_reconcile},
                {"role": "user", "content": prompt}
            ]

            # Use the configured reconciliation model
            reconciliation_model = self.reconciliation_model

            # Check the original model name (before prefixing) for special handling
            # Handle case where reconciliation_model_input might not be set
            if hasattr(self, 'reconciliation_model_input'):
                original_recon = self.reconciliation_model_input or reconciliation_model
            else:
                original_recon = reconciliation_model

            # Check if reconciliation model needs special handling
            if original_recon in ["gpt-5", "gpt-5-mini", "gpt-5-nano"]:
                # GPT-5 models don't support temperature
                response = self.client.chat.completions.create(
                    model=reconciliation_model,
                    messages=messages,
                    max_completion_tokens=8000  # Allow longer for reconciliation
                )
            elif original_recon in ["o3-mini"]:
                response = self.client.chat.completions.create(
                    model=reconciliation_model,
                    messages=messages,
                    max_completion_tokens=10000
                )
            else:
                # Standard models (gpt-4o, claude, etc.)
                response = self.client.chat.completions.create(
                    model=reconciliation_model,
                    messages=messages,
                    temperature=0.3,
                    max_tokens=3000
                )

            return response.choices[0].message.content.strip()
        except Exception as e:
            error_msg = str(e)
            if "timeout" in error_msg.lower():
                print(f" [TIMEOUT] Timeout generating reconciliation (attempt {attempt}/3)")
            else:
                print(f" [ERROR] Error generating reconciliation (attempt {attempt}): {e}")

            if attempt < 3:
                time.sleep(2 ** attempt)  # Exponential backoff
                return self.generate_reconciliation_latex(question, model_answer, reference_answer, rationale, attempt + 1)

            print(f" [ERROR] Failed to generate reconciliation after 3 attempts")
            return None
424
+
425
    def process_questions(self, start_idx: int = 0, batch_size: int = 5):
        """Process questions with progress tracking.

        For each row of ``self.df``: get a model answer, write it to a file,
        compare it against the reference answer (exact string match, then a
        first-number fallback), and on mismatch generate (and optionally
        compile) a LaTeX reconciliation document. Results are written into
        ``self.df`` and saved to Excel after every batch.

        Args:
            start_idx: Positional index of the first row to process.
            batch_size: Rows per batch; ``save_results()`` runs after each.
        """
        total = len(self.df)

        with tqdm(total=total, desc="Overall Progress", position=0, leave=True) as pbar_main:
            pbar_main.update(start_idx)

            for i in range(start_idx, total, batch_size):
                batch_end = min(i + batch_size, total)
                print(f"\n{'='*60}")
                print(f"Processing batch: questions {i+1} to {batch_end} of {total}")
                print(f"Using {self.provider} with model {self.model}")
                print(f"{'='*60}")

                # Inner bar counts 3 ticks per question (answer / compare / record).
                batch_size_actual = batch_end - i
                with tqdm(total=batch_size_actual * 3, desc="Current Batch", position=1, leave=False) as pbar_batch:
                    for idx in range(i, batch_end):
                        row = self.df.iloc[idx]
                        # Original row label is used for file names and .at writes.
                        original_idx = self.df.index[idx]

                        # Get question and check for image
                        question = row['question']
                        image_url = self._get_image_for_question(row)

                        if image_url:
                            print(f" Including image for question {original_idx}")

                        # Get model answer
                        print(f" Question {original_idx}: Getting answer from {self.model}...")
                        model_answer = self.get_model_answer(question, image_url)

                        # The [OK]/[FAIL]/[MATCH]/[MISMATCH] tags below are
                        # parsed by the GUI's progress tracker - keep them stable.
                        if model_answer:
                            print(f" [OK] Got answer ({len(model_answer)} chars)")
                        else:
                            print(f" [FAIL] Failed to get answer")
                        pbar_batch.update(1)

                        if model_answer:
                            # Save model answer to file
                            question_id = f"q_{original_idx:04d}"
                            answer_filename = f"{question_id}_answer.txt"
                            answer_path = os.path.join(self.answers_dir, answer_filename)

                            with open(answer_path, 'w', encoding='utf-8') as f:
                                f.write(f"Question: {question}\n\n")
                                f.write(f"Model Answer: {model_answer}\n")

                            self.df.at[original_idx, 'model_answer_file'] = answer_filename

                            # Check if answer matches reference
                            reference_answer = str(row.get('correct_answer', row.get('answer', '')))

                            # Simple string matching (could be enhanced)
                            model_norm = str(model_answer).strip().lower()
                            ref_norm = str(reference_answer).strip().lower()

                            # Check for exact match or numerical equivalence
                            match = (model_norm == ref_norm)
                            if not match and reference_answer:
                                # Try extracting numbers for comparison:
                                # only the FIRST number of each text is compared.
                                import re
                                model_nums = re.findall(r'-?\d+\.?\d*', model_norm)
                                ref_nums = re.findall(r'-?\d+\.?\d*', ref_norm)
                                if model_nums and ref_nums:
                                    match = (model_nums[0] == ref_nums[0])

                            self.df.at[original_idx, 'answer_match'] = 'Yes' if match else 'No'

                            # Print match result for GUI tracking
                            if match:
                                print(f" [MATCH] Answer matches reference")
                            else:
                                print(f" [MISMATCH] Answer differs from reference")

                            # Generate LaTeX reconciliation if mismatch
                            if not match and reference_answer:
                                print(f" Generating reconciliation for question {original_idx}")
                                rationale = row.get('rationale', '')
                                latex_doc = self.generate_reconciliation_latex(
                                    question, model_answer, reference_answer, rationale
                                )

                                # Only save LaTeX if generation was successful
                                if latex_doc:
                                    latex_filename = f"{question_id}_reconciliation.tex"
                                    latex_path = os.path.join(self.latex_dir, latex_filename)
                                    with open(latex_path, 'w', encoding='utf-8') as f:
                                        f.write(latex_doc)

                                    self.df.at[original_idx, 'latex_file'] = latex_filename

                                    # Compile LaTeX if requested
                                    if self.compile_latex:
                                        # Try async compilation first (better on Linux/HF Spaces)
                                        try:
                                            from latex_compiler import compile_latex_async, is_linux
                                            if is_linux():
                                                # Async compilation on Linux - doesn't block
                                                compile_latex_async(
                                                    latex_path,
                                                    self.latex_dir,
                                                    callback=lambda s, p, e: None  # Silent callback
                                                )
                                                print(f" [PDF] Compiling in background: {latex_filename}")
                                            else:
                                                # Fallback to synchronous on Windows
                                                import subprocess
                                                pdf_path = latex_path.replace('.tex', '.pdf')
                                                result = subprocess.run(
                                                    ['pdflatex', '-interaction=nonstopmode', '-output-directory', self.latex_dir, latex_path],
                                                    capture_output=True,
                                                    timeout=30
                                                )
                                                if os.path.exists(pdf_path):
                                                    print(f" [OK] Compiled to PDF: {os.path.basename(pdf_path)}")
                                        except ImportError:
                                            # latex_compiler.py not available, use old method
                                            try:
                                                import subprocess
                                                pdf_path = latex_path.replace('.tex', '.pdf')
                                                result = subprocess.run(
                                                    ['pdflatex', '-interaction=nonstopmode', '-output-directory', self.latex_dir, latex_path],
                                                    capture_output=True,
                                                    timeout=30
                                                )
                                                if os.path.exists(pdf_path):
                                                    print(f" [OK] Compiled to PDF: {os.path.basename(pdf_path)}")
                                            except Exception as e:
                                                print(f" Warning: Could not compile LaTeX: {e}")
                                        except Exception as e:
                                            print(f" Warning: Could not compile LaTeX: {e}")
                                else:
                                    print(f" Failed to generate reconciliation after retries")
                                    self.df.at[original_idx, 'latex_file'] = 'GENERATION_ERROR'

                            pbar_batch.update(2)
                        else:
                            self.df.at[original_idx, 'model_answer_file'] = 'ERROR'
                            self.df.at[original_idx, 'answer_match'] = 'ERROR'
                            pbar_batch.update(2)

                        pbar_main.update(1)
                        time.sleep(0.5)  # Rate limiting

                # Persist after every batch so an interrupted run loses at
                # most one batch of results.
                self.save_results()
                print(f"\nBatch complete. Progress saved to {self.output_file}")

                if batch_end < total:
                    time.sleep(5)
574
+
575
+ def save_results(self):
576
+ """Save results back to Excel"""
577
+ with pd.ExcelWriter(self.output_file, engine='openpyxl') as writer:
578
+ original = pd.ExcelFile(self.excel_file)
579
+
580
+ for sheet_name in original.sheet_names:
581
+ if sheet_name == 'Data':
582
+ original_df = pd.read_excel(self.excel_file, sheet_name='Data')
583
+
584
+ # Update only processed rows
585
+ for idx in self.df.index:
586
+ for col in ['model_answer_file', 'answer_match', 'latex_file',
587
+ 'quality_rating', 'difficulty_level', 'quality_comment']:
588
+ if col in self.df.columns:
589
+ original_df.at[idx, col] = self.df.at[idx, col]
590
+
591
+ original_df.to_excel(writer, sheet_name=sheet_name, index=False)
592
+ else:
593
+ df_other = pd.read_excel(self.excel_file, sheet_name=sheet_name)
594
+ df_other.to_excel(writer, sheet_name=sheet_name, index=False)
595
+
596
+ def run(self):
597
+ """Main execution"""
598
+ # Set default output file if not already set
599
+ if not self.output_file:
600
+ from datetime import datetime
601
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
602
+ base_name = os.path.basename(self.excel_file).replace('.xlsx', '')
603
+ self.output_file = f"{base_name}_validated_{timestamp}.xlsx"
604
+
605
+ print(f"Starting Universal Math Validator")
606
+ print(f" File: {self.excel_file}")
607
+ print(f" Format: {self.file_format}")
608
+ print(f" Provider: {self.provider}")
609
+ print(f" Model: {self.model}")
610
+ print(f" Image handling: {self.include_images}")
611
+ print(f" Output: {self.output_file}")
612
+ print("=" * 60)
613
+
614
+ self.load_data()
615
+ self.process_questions()
616
+
617
+ # Calculate and display summary statistics
618
+ if 'answer_match' in self.df.columns:
619
+ total = len(self.df)
620
+ correct = (self.df['answer_match'] == 'Yes').sum()
621
+ incorrect = (self.df['answer_match'] == 'No').sum()
622
+ errors = (self.df['answer_match'] == 'ERROR').sum()
623
+
624
+ print("\n" + "="*60)
625
+ print("VALIDATION COMPLETE")
626
+ print("="*60)
627
+ print(f"\nTotal questions processed: {total}")
628
+ print(f"Correct answers: {correct} ({correct/total*100:.1f}%)")
629
+ print(f"Incorrect answers: {incorrect} ({incorrect/total*100:.1f}%)")
630
+ if errors > 0:
631
+ print(f"Errors: {errors}")
632
+
633
+ # Count LaTeX files generated
634
+ latex_count = (self.df['latex_file'] != '').sum()
635
+ if latex_count > 0:
636
+ print(f"\nLaTeX reconciliation documents generated: {latex_count}")
637
+ print(f"Location: {self.latex_dir}")
638
+
639
+ print(f"\nResults saved to: {self.output_file}")
640
+ print(f"Model answers saved to: {self.answers_dir}")
641
+ else:
642
+ print("\nValidation Complete!")
643
+ print(f"Results saved to: {self.output_file}")
644
+
645
+
646
if __name__ == "__main__":
    import argparse

    # Command-line entry point: validate one Excel file, optionally a
    # sub-range of questions (used by the parallel runner).
    parser = argparse.ArgumentParser(description='Universal Math Question Validator')
    parser.add_argument('file', help='Excel file to process')
    parser.add_argument('--provider', choices=['openai', 'openrouter'], default='openai',
                        help='API provider to use')
    parser.add_argument('--model', help='Model for solving questions (default: o3-mini)')
    parser.add_argument('--reconciliation-model', help='Model for reconciliation (default: gpt-4o)')
    parser.add_argument('--images', choices=['always', 'never', 'when_needed'],
                        default='when_needed', help='When to include images')
    parser.add_argument('--start', type=int, default=0, help='Start from question index')
    parser.add_argument('--end', type=int, default=None, help='End at question index (for parallel processing)')
    parser.add_argument('--batch-size', type=int, default=5, help='Number of questions per batch')
    parser.add_argument('--output', type=str, default=None, help='Output filename (default: auto-generated)')
    parser.add_argument('--compile-latex', action='store_true', help='Compile LaTeX files to PDF')

    args = parser.parse_args()

    validator = UniversalMathValidator(
        excel_file=args.file,
        provider=args.provider,
        include_images=args.images,
        solver_model=args.model,
        reconciliation_model=args.reconciliation_model
    )

    # Set output filename if provided
    if args.output:
        validator.output_file = args.output
    else:
        # Generate default filename with timestamp; partial runs get a
        # question-range suffix so parallel workers don't collide.
        from datetime import datetime
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        base_name = os.path.basename(args.file).replace('.xlsx', '')
        if args.start > 0 or args.end:
            range_str = f"_q{args.start+1}_q{args.end}" if args.end else f"_from_q{args.start+1}"
        else:
            range_str = ""
        validator.output_file = f"{base_name}_validated_{timestamp}{range_str}.xlsx"

    # Set LaTeX compilation flag
    validator.compile_latex = args.compile_latex

    # Handle parallel processing by limiting range
    if args.end:
        validator.load_data()
        # Filter to specific range for parallel processing
        validator.df = validator.df.iloc[args.start:args.end]
        validator.process_questions(start_idx=0, batch_size=args.batch_size)
    else:
        # NOTE(review): run() calls process_questions() with default
        # arguments, so --batch-size (and --start without --end) is
        # effectively ignored on this path - confirm whether intended.
        validator.run()
validator_gui.py ADDED
@@ -0,0 +1,646 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python
2
+ """
3
+ Gradio Web Interface for Math Validator
4
+ """
5
+
6
+ import gradio as gr
7
+ import pandas as pd
8
+ import os
9
+ import subprocess
10
+ import sys
11
+ import json
12
+ from datetime import datetime
13
+ import threading
14
+ import queue
15
+ import time
16
+ from dotenv import load_dotenv
17
+
18
+ # Load environment variables from .env file
19
+ load_dotenv()
20
+
21
+ class ValidatorGUI:
22
    def __init__(self):
        """Initialize GUI state, progress counters, and the model catalogs."""
        # Handle to the spawned validator subprocess (None until started).
        self.process = None
        # Queue for subprocess output lines - presumably consumed by the run
        # loop later in this file; verify against run_validation.
        self.output_queue = queue.Queue()
        self.is_running = False
        self.total_questions = 0
        self.math_questions = 0

        # Progress tracking (updated by parse_progress_line)
        self.questions_processed = 0
        self.correct_answers = 0
        self.incorrect_answers = 0
        self.timeouts = 0
        self.errors = 0

        # Model options
        self.openai_models = [
            "o3-mini",
            "gpt-4o",
            "gpt-5",
            "gpt-5-mini",
            "gpt-5-nano",
            "gpt-4-turbo"
        ]

        self.openrouter_models = [
            # Anthropic Claude 4 Series (NEW)
            "anthropic/claude-4-opus",
            "anthropic/claude-4-sonnet",

            # Anthropic Claude 3.5 Series
            "anthropic/claude-3.5-sonnet",
            "anthropic/claude-3-5-sonnet-20241022",
            "anthropic/claude-3-opus",
            "anthropic/claude-3-haiku",

            # xAI Grok Series (including Grok 4)
            "x-ai/grok-4",
            "x-ai/grok-2",
            "x-ai/grok-2-1212",

            # DeepSeek Reasoning Models (NEW)
            "deepseek/deepseek-r1",
            "deepseek/deepseek-v3",
            "deepseek/deepseek-chat",

            # Google Gemini
            "google/gemini-2.0-pro",
            "google/gemini-2.0-flash",
            "google/gemini-pro-1.5",
            "google/gemini-flash-1.5",

            # Baidu ERNIE (NEW)
            "baidu/ernie-4.0-turbo-8k",
            "baidu/ernie-bot-4",

            # Meta Llama
            "meta-llama/llama-3.2-405b",
            "meta-llama/llama-3.1-405b-instruct",

            # Mistral
            "mistralai/mistral-large",
            "mistralai/mixtral-8x22b-instruct"
        ]

        # Combined catalog used to populate the model dropdowns.
        self.all_models = self.openai_models + self.openrouter_models
+
88
+ def get_excel_files(self):
89
+ """Get list of Excel files in current directory"""
90
+ files = [f for f in os.listdir('.') if f.endswith('.xlsx') and not f.endswith('_validated.xlsx')]
91
+ return files
92
+
93
+ def analyze_file(self, file_path):
94
+ """Analyze Excel file and return summary and question count"""
95
+ if not file_path:
96
+ return "No file selected", 0, 0
97
+
98
+ try:
99
+ df = pd.read_excel(file_path, sheet_name='Data')
100
+
101
+ # Store total questions
102
+ self.total_questions = len(df)
103
+
104
+ # Count math questions
105
+ if 'raw_subject' in df.columns:
106
+ math_filter = df['raw_subject'].str.lower().str.contains(
107
+ 'math|statistic|calculus|algebra|geometry|trigonometry',
108
+ na=False, regex=True
109
+ )
110
+ self.math_questions = math_filter.sum()
111
+ else:
112
+ self.math_questions = len(df)
113
+
114
+ # Check for images
115
+ image_count = 0
116
+ if 'file_url' in df.columns:
117
+ image_count = df['file_url'].notna().sum()
118
+
119
+ summary = f"""### File Analysis
120
+
121
+ **File:** {os.path.basename(file_path)}
122
+ **Total rows:** {self.total_questions}
123
+ **Math questions:** {self.math_questions}
124
+ **Questions with images:** {image_count}
125
+
126
+ **Columns found:** {', '.join(df.columns[:10])}{'...' if len(df.columns) > 10 else ''}
127
+
128
+ **Estimated processing time:**
129
+ - Serial: ~{self.math_questions * 30 // 60} minutes
130
+ - Parallel (4 processes): ~{self.math_questions * 30 // (60 * 4)} minutes
131
+ """
132
+ return summary, self.total_questions, self.math_questions
133
+
134
+ except Exception as e:
135
+ return f"Error analyzing file: {str(e)}", 0, 0
136
+
137
+ def validate_config(self, file_path, solver_model, recon_model, num_processes, batch_size):
138
+ """Validate configuration before running"""
139
+ errors = []
140
+
141
+ if not file_path or not os.path.exists(file_path):
142
+ errors.append("Please select a valid Excel file")
143
+
144
+ if not solver_model:
145
+ errors.append("Please select a solver model")
146
+
147
+ if not recon_model:
148
+ errors.append("Please select a reconciliation model")
149
+
150
+ # Check API keys
151
+ needs_openai = solver_model in self.openai_models or recon_model in self.openai_models
152
+ needs_openrouter = solver_model in self.openrouter_models or recon_model in self.openrouter_models
153
+
154
+ if needs_openai and not os.getenv('OPENAI_API_KEY'):
155
+ errors.append("OPENAI_API_KEY not found in environment")
156
+
157
+ if needs_openrouter and not os.getenv('OPENROUTER_API_KEY'):
158
+ errors.append("OPENROUTER_API_KEY not found in environment")
159
+
160
+ return errors
161
+
162
+ def generate_output_filename(self, file_path, start_q, end_q):
163
+ """Generate output filename with timestamp and range"""
164
+ base_name = os.path.basename(file_path).replace('.xlsx', '')
165
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
166
+
167
+ if start_q is not None and end_q is not None and (start_q > 0 or end_q < self.math_questions):
168
+ # Add range to filename
169
+ range_str = f"_q{start_q+1}_q{end_q}"
170
+ else:
171
+ range_str = "_full"
172
+
173
+ return f"{base_name}_validated_{timestamp}{range_str}.xlsx"
174
+
175
+ def parse_progress_line(self, line):
176
+ """Parse output line for progress information"""
177
+ # Parse based on the new [TAG] format
178
+ line_lower = line.lower()
179
+
180
+ if "[ok] got answer" in line_lower and "chars" in line_lower:
181
+ self.questions_processed += 1
182
+ elif "[fail] failed to get answer" in line_lower:
183
+ self.errors += 1
184
+ self.questions_processed += 1 # Still count as processed
185
+ elif "[match]" in line_lower:
186
+ self.correct_answers += 1
187
+ elif "[mismatch]" in line_lower:
188
+ self.incorrect_answers += 1
189
+ elif "[timeout]" in line_lower:
190
+ self.timeouts += 1
191
+ elif "[error]" in line_lower:
192
+ if "failed after" in line_lower:
193
+ self.errors += 1
194
+ elif "[warning]" in line_lower:
195
+ # Just a warning, not an error
196
+ pass
197
+ elif "question" in line_lower and "getting answer from" in line_lower:
198
+ # This indicates a question is starting to be processed
199
+ pass
200
+
201
+ # Also parse parallel processing output
202
+ elif "starting process for questions" in line_lower:
203
+ # Parallel process starting
204
+ pass
205
+ elif "completed range" in line_lower:
206
+ # Parallel process completed a range
207
+ import re
208
+ # Try to extract question count from "Completed range X-Y"
209
+ match = re.search(r'range (\d+)-(\d+)', line_lower)
210
+ if match:
211
+ start, end = int(match.group(1)), int(match.group(2))
212
+ # This is approximate since we don't know exact results
213
+ self.questions_processed = max(self.questions_processed, end)
214
+
215
+ def get_progress_stats(self):
216
+ """Get formatted progress statistics"""
217
+ if self.questions_processed == 0:
218
+ return "Waiting for processing to start..."
219
+
220
+ accuracy = (self.correct_answers / self.questions_processed * 100) if self.questions_processed > 0 else 0
221
+
222
+ return f"""**Progress Stats:**
223
+ - Processed: {self.questions_processed}
224
+ - Correct: {self.correct_answers} ({accuracy:.1f}%)
225
+ - Incorrect: {self.incorrect_answers}
226
+ - Timeouts: {self.timeouts}
227
+ - Errors: {self.errors}
228
+ """
229
+
230
+ def run_validation(self, file_path, solver_model, recon_model, image_mode,
231
+ num_processes, batch_size, start_q, end_q, compile_latex, progress=gr.Progress()):
232
+ """Run the validation process"""
233
+
234
+ # Reset progress counters
235
+ self.questions_processed = 0
236
+ self.correct_answers = 0
237
+ self.incorrect_answers = 0
238
+ self.timeouts = 0
239
+ self.errors = 0
240
+
241
+ # Validate configuration
242
+ errors = self.validate_config(file_path, solver_model, recon_model, num_processes, batch_size)
243
+ if errors:
244
+ return f"### Configuration Errors\n" + "\n".join(f"- {e}" for e in errors), None, ""
245
+
246
+ self.is_running = True
247
+ output_log = []
248
+
249
+ # Generate output filename
250
+ output_file = self.generate_output_filename(file_path, start_q, end_q)
251
+ output_path = os.path.join(os.path.dirname(file_path), output_file)
252
+
253
+ try:
254
+ # Prepare command
255
+ base_cmd = [
256
+ sys.executable, "universal_validator.py", file_path,
257
+ "--model", solver_model,
258
+ "--reconciliation-model", recon_model,
259
+ "--images", image_mode,
260
+ "--batch-size", str(batch_size),
261
+ "--output", output_path
262
+ ]
263
+
264
+ # Add range parameters if specified
265
+ if start_q is not None and start_q >= 0:
266
+ base_cmd.extend(["--start", str(start_q)])
267
+ if end_q is not None and end_q > 0:
268
+ base_cmd.extend(["--end", str(end_q)])
269
+
270
+ # Add LaTeX compilation flag if requested
271
+ if compile_latex:
272
+ base_cmd.append("--compile-latex")
273
+
274
+ # Use parallel processing for larger ranges
275
+ if num_processes > 1 and (end_q - start_q) > 20:
276
+ cmd = [
277
+ sys.executable, "run_parallel.py", file_path,
278
+ "--num-processes", str(num_processes),
279
+ "--solver", solver_model,
280
+ "--reconciler", recon_model,
281
+ "--images", image_mode,
282
+ "--batch-size", str(batch_size),
283
+ "--output", output_path,
284
+ "--start-range", str(start_q),
285
+ "--end-range", str(end_q)
286
+ ]
287
+ if compile_latex:
288
+ cmd.append("--compile-latex")
289
+ print(f"[GUI] Using parallel processing with {num_processes} processes")
290
+ else:
291
+ # Use single process for small ranges
292
+ cmd = base_cmd
293
+ if num_processes > 1 and (end_q - start_q) <= 20:
294
+ print(f"[GUI] Range too small for parallel processing, using single process")
295
+
296
+ # Start process
297
+ progress(0, desc="Starting validation...")
298
+ output_log.append(f"Running: {' '.join(cmd)}\n")
299
+ output_log.append(f"Output file: {output_path}\n")
300
+ output_log.append(f"Question range: {start_q+1} to {end_q}\n\n")
301
+
302
+ print(f"[GUI] Starting subprocess: {' '.join(cmd)}")
303
+
304
+ try:
305
+ self.process = subprocess.Popen(
306
+ cmd,
307
+ stdout=subprocess.PIPE,
308
+ stderr=subprocess.STDOUT,
309
+ text=True,
310
+ bufsize=1,
311
+ universal_newlines=True,
312
+ encoding='utf-8',
313
+ errors='replace'
314
+ )
315
+ print(f"[GUI] Process started with PID: {self.process.pid}")
316
+ except Exception as e:
317
+ error_msg = f"Failed to start validator: {str(e)}"
318
+ print(f"[GUI Error] {error_msg}")
319
+ return error_msg, None, ""
320
+
321
+ # Read output
322
+ lines_processed = 0
323
+ last_update_time = time.time()
324
+
325
+ while True:
326
+ line = self.process.stdout.readline()
327
+ if not line:
328
+ # Check if process is still running
329
+ if self.process.poll() is not None:
330
+ break
331
+ time.sleep(0.1)
332
+ continue
333
+
334
+ output_log.append(line)
335
+ self.parse_progress_line(line)
336
+
337
+ # Debug: Print every line to see what's happening
338
+ print(f"[GUI Debug] {line.strip()}")
339
+
340
+ # Update progress based on output
341
+ if "processing batch" in line.lower() or "question" in line.lower():
342
+ lines_processed += 1
343
+ if self.math_questions > 0 and self.questions_processed > 0:
344
+ actual_progress = min(self.questions_processed / (end_q - start_q), 1.0)
345
+ progress(actual_progress, desc=f"Processing question {self.questions_processed}/{end_q - start_q}")
346
+
347
+ # Yield intermediate results with stats every 2 seconds or every 5 lines
348
+ current_time = time.time()
349
+ if lines_processed % 5 == 0 or (current_time - last_update_time) > 2:
350
+ stats = self.get_progress_stats()
351
+ output_text = stats + "\n\n" + "="*60 + "\n" + "".join(output_log[-50:])
352
+ yield output_text, None, stats
353
+ last_update_time = current_time
354
+
355
+ self.process.wait()
356
+
357
+ # Get final results
358
+ final_stats = self.get_progress_stats()
359
+ output_text = f"### Validation Complete\n\n{final_stats}\n\n" + "="*60 + "\n\nFull Log:\n" + "".join(output_log[-200:])
360
+
361
+ # Check if output file exists
362
+ if os.path.exists(output_path):
363
+ return output_text, output_path, final_stats
364
+ else:
365
+ # Try original naming convention as fallback
366
+ fallback_path = file_path.replace('.xlsx', '_validated.xlsx')
367
+ if os.path.exists(fallback_path):
368
+ return output_text, fallback_path, final_stats
369
+ return output_text, None, final_stats
370
+
371
+ except Exception as e:
372
+ stats = self.get_progress_stats()
373
+ return f"Error: {str(e)}\n\n{stats}\n\n{''.join(output_log)}", None, stats
374
+ finally:
375
+ self.is_running = False
376
+ self.process = None
377
+
378
+ def stop_validation(self):
379
+ """Stop the running validation"""
380
+ if self.process:
381
+ self.process.terminate()
382
+ time.sleep(1)
383
+ if self.process.poll() is None:
384
+ self.process.kill()
385
+ return "Validation stopped"
386
+ return "No validation running"
387
+
388
    def create_interface(self):
        """Create and return the Gradio Blocks interface.

        Three tabs: "Validation" (file/model/range selection and the run
        controls), "Configuration" (static help plus an API-key status
        check evaluated once at build time), and "Results Analysis"
        (static usage notes).
        """

        with gr.Blocks(title="Math Validator", theme=gr.themes.Soft()) as interface:
            gr.Markdown("# Math Question Validator")
            gr.Markdown("Web interface for validating mathematical questions and answers")

            with gr.Tab("Validation"):
                with gr.Row():
                    with gr.Column(scale=1):
                        # File selection: choices come from scanning the
                        # working directory for .xlsx files
                        file_dropdown = gr.Dropdown(
                            choices=self.get_excel_files(),
                            label="Select Excel File",
                            value=self.get_excel_files()[0] if self.get_excel_files() else None
                        )

                        refresh_btn = gr.Button("🔄 Refresh Files", size="sm")

                        file_info = gr.Markdown("Select a file to see analysis")

                        # Question range selection (dynamically updated when
                        # a file is analyzed)
                        gr.Markdown("### Question Range")
                        with gr.Row():
                            start_question = gr.Number(
                                label="Start Question",
                                value=1,
                                minimum=1,
                                step=1,
                                info="First question to process"
                            )
                            end_question = gr.Number(
                                label="End Question",
                                value=100,
                                minimum=1,
                                step=1,
                                info="Last question to process"
                            )

                        use_all_questions = gr.Checkbox(
                            label="Process all questions",
                            value=True,
                            info="Uncheck to specify custom range"
                        )

                    with gr.Column(scale=2):
                        with gr.Row():
                            # Model selection; "(recommended)" suffixes are
                            # stripped before being passed to the validator
                            solver_dropdown = gr.Dropdown(
                                choices=["o3-mini (recommended)"] + self.all_models,
                                value="o3-mini (recommended)",
                                label="Solver Model",
                                info="Model for answering questions"
                            )

                            recon_dropdown = gr.Dropdown(
                                choices=["gpt-4o (recommended)"] + self.all_models,
                                value="gpt-4o (recommended)",
                                label="Reconciliation Model",
                                info="Model for comparing answers"
                            )

                        with gr.Row():
                            image_mode = gr.Radio(
                                choices=["when_needed", "always", "never"],
                                value="when_needed",
                                label="Image Handling",
                                info="When to include images with questions"
                            )

                            parallel_slider = gr.Slider(
                                minimum=1,
                                maximum=8,
                                value=1,
                                step=1,
                                label="Parallel Processes",
                                info="Number of concurrent processes (1 = serial)"
                            )

                            batch_slider = gr.Slider(
                                minimum=1,
                                maximum=20,
                                value=5,
                                step=1,
                                label="Batch Size",
                                info="Questions per batch"
                            )

                        # LaTeX compilation option
                        compile_latex = gr.Checkbox(
                            label="Compile LaTeX reconciliation documents to PDF",
                            value=False,
                            info="Requires pdflatex installed (slower but produces PDFs)"
                        )

                with gr.Row():
                    run_btn = gr.Button("▶️ Start Validation", variant="primary", size="lg")
                    stop_btn = gr.Button("⏹️ Stop", variant="stop", size="lg")

                # Output section with progress stats
                progress_stats = gr.Markdown("**Progress:** Waiting to start...")

                output_text = gr.Textbox(
                    label="Validation Output",
                    lines=20,
                    max_lines=30,
                    value="Click 'Start Validation' to begin..."
                )

                # Hidden until a results file exists to download
                output_file = gr.File(
                    label="Download Results",
                    visible=False
                )

                # Event handlers (defined inside the Blocks context so they
                # can close over self and the components above)
                def update_file_info(file_path):
                    # Re-analyze the chosen workbook; also resizes the
                    # end-question input to the file's question count
                    if file_path:
                        full_path = os.path.join(os.getcwd(), file_path)
                        summary, total, math_q = self.analyze_file(full_path)
                        # Update end question to match file
                        return summary, gr.Number(value=math_q, maximum=math_q)
                    return "No file selected", gr.Number(value=100)

                def refresh_files():
                    # Re-scan the working directory for .xlsx files
                    files = self.get_excel_files()
                    return gr.Dropdown(choices=files, value=files[0] if files else None)

                def clean_model_name(model):
                    # Remove "(recommended)" suffix if present
                    if "(recommended)" in model:
                        return model.split(" (")[0]
                    return model

                def toggle_range_inputs(use_all):
                    # Enable/disable range inputs based on checkbox
                    return gr.Number(interactive=not use_all), gr.Number(interactive=not use_all)

                def run_with_clean_models(file_path, solver, recon, images, parallel, batch,
                                          use_all, start_q, end_q, compile_tex):
                    # Generator handler: forwards run_validation's streaming
                    # updates to (output_text, output_file, progress_stats)
                    solver_clean = clean_model_name(solver)
                    recon_clean = clean_model_name(recon)

                    if file_path:
                        full_path = os.path.join(os.getcwd(), file_path)

                        # Adjust question range (convert to 0-indexed)
                        if use_all:
                            actual_start = 0
                            actual_end = self.math_questions
                        else:
                            actual_start = max(0, int(start_q) - 1)  # Convert to 0-indexed
                            actual_end = min(self.math_questions, int(end_q))

                        # Run validation with progress updates
                        for result in self.run_validation(
                            full_path, solver_clean, recon_clean, images, parallel, batch,
                            actual_start, actual_end, compile_tex
                        ):
                            if len(result) == 3:
                                result_text, result_file, stats = result
                                if result_file:
                                    # Reveal the download widget once a file exists
                                    yield result_text, gr.File(value=result_file, visible=True), stats
                                else:
                                    yield result_text, gr.File(visible=False), stats
                            else:
                                yield result[0], gr.File(visible=False), result[1] if len(result) > 1 else ""
                    else:
                        yield "No file selected", gr.File(visible=False), ""

                file_dropdown.change(update_file_info, inputs=[file_dropdown],
                                     outputs=[file_info, end_question])
                refresh_btn.click(refresh_files, outputs=[file_dropdown])

                # Toggle range inputs when checkbox changes
                use_all_questions.change(toggle_range_inputs, inputs=[use_all_questions],
                                         outputs=[start_question, end_question])

                run_btn.click(
                    run_with_clean_models,
                    inputs=[file_dropdown, solver_dropdown, recon_dropdown,
                            image_mode, parallel_slider, batch_slider,
                            use_all_questions, start_question, end_question, compile_latex],
                    outputs=[output_text, output_file, progress_stats]
                )

                stop_btn.click(self.stop_validation, outputs=[output_text])

            with gr.Tab("Configuration"):
                gr.Markdown("""
                ### API Configuration

                Make sure you have the required API keys set as environment variables:

                - **OPENAI_API_KEY**: Required for OpenAI models (o3-mini, GPT-5, GPT-4o)
                - **OPENROUTER_API_KEY**: Required for Claude, Grok, Gemini, and other models

                ### Model Recommendations

                **For best results:**
                - Solver: o3-mini (best accuracy)
                - Reconciliation: gpt-4o (fast and reliable)

                **For speed:**
                - Use 4-6 parallel processes
                - Batch size of 5-10

                **For GPT-5 testing:**
                - Use gpt-5-mini (faster than gpt-5)
                - Use gpt-4o for reconciliation (GPT-5 has timeout issues)
                """)

                # Check current configuration (evaluated once, when the
                # interface is built -- not refreshed at runtime)
                config_status = []
                if os.getenv('OPENAI_API_KEY'):
                    config_status.append("✅ OPENAI_API_KEY is set")
                else:
                    config_status.append("❌ OPENAI_API_KEY is not set")

                if os.getenv('OPENROUTER_API_KEY'):
                    config_status.append("✅ OPENROUTER_API_KEY is set")
                else:
                    config_status.append("❌ OPENROUTER_API_KEY is not set")

                gr.Markdown("### Current Status\n" + "\n".join(config_status))

            with gr.Tab("Results Analysis"):
                gr.Markdown("""
                ### How to Analyze Results

                After validation completes:

                1. **Download the validated Excel file** - Contains all results
                2. **Check the latex_documents folder** - Contains reconciliation documents
                3. **Run analysis scripts:**
                   - `python analyze_reconciliations.py` - Analyze which answers were vindicated
                   - `python summarize_results.py` - Get overall statistics

                ### Understanding Results

                - **answer_match = Yes**: Model answer matches reference
                - **answer_match = No**: Mismatch (see LaTeX reconciliation)
                - **latex_file**: Path to detailed reconciliation document
                - **model_answer_file**: Path to model's complete response
                """)

        return interface
634
+
635
def main():
    """Build the validator GUI and serve it locally on port 7860."""
    app = ValidatorGUI().create_interface()
    launch_opts = {
        "share": False,              # keep the app local-only
        "server_name": "127.0.0.1",
        "server_port": 7860,
        "inbrowser": True,           # pop open the default browser
    }
    app.launch(**launch_opts)


if __name__ == "__main__":
    main()