RoyAalekh committed
Commit d3a967e · 1 Parent(s): 4baabe1

moved to parquet from duckdb for raw data, updated readme
.dockerignore CHANGED
@@ -11,5 +11,4 @@ reports/*.pdf
 configs/*.secrets.*
 uv.lock
 code4change_analysis.egg-info
-!data/court_data.duckdb
 
Dockerfile CHANGED
@@ -1,20 +1,19 @@
 # syntax=docker/dockerfile:1
+
 FROM python:3.11-slim
 
+# Install minimal system dependencies
 RUN apt-get update \
-    && apt-get install -y --no-install-recommends curl git git-lfs libgomp1 \
+    && apt-get install -y --no-install-recommends curl libgomp1 \
     && rm -rf /var/lib/apt/lists/*
 
 WORKDIR /app
 
-# Install uv
 RUN curl -LsSf https://astral.sh/uv/install.sh | sh
 ENV PATH="/root/.local/bin:${PATH}"
 
 COPY . .
 
-RUN git lfs install && git lfs pull
-
 RUN uv venv .venv \
     && uv pip install --upgrade pip setuptools wheel \
     && uv pip install .
@@ -22,6 +21,9 @@ RUN uv venv .venv \
 ENV PATH="/app/.venv/bin:${PATH}"
 ENV PYTHONPATH="/app"
 
+# Health check commands
+RUN uv --version && python --version && which court-scheduler && which streamlit
+
 EXPOSE 8501
 
 CMD ["streamlit", "run", "scheduler/dashboard/app.py", "--server.port=8501", "--server.address=0.0.0.0"]
README.md CHANGED
@@ -13,10 +13,16 @@ Purpose-built for hackathon evaluation. This repository runs out of the box usin
 
 1. Install uv (see above) and ensure Python 3.11+ is available.
 2. Clone this repository.
-3. Make sure to put `ISDMHack_Cases_WPfinal.csv` and `ISDMHack_Hear.csv` in the `Data/` folder, or provide
-   `court_data.duckdb` there. Both in csv format, strictly named as shown.
-4. Launch the dashboard:
-
+3. Navigate to the repo root and activate uv:
+   ```bash
+   cd path/to/repo
+   uv activate
+   ```
+4. Install dependencies:
+   ```bash
+   uv install
+   ```
+5. Launch the dashboard:
 ```bash
 uv run streamlit run scheduler/dashboard/app.py
 ```
@@ -85,18 +91,10 @@ Then open http://localhost:8501.
 
 Notes for Windows CMD: use ^ for line continuation and replace ${PWD} with the full path.
 
-## Data (DuckDB-first)
-
-This repository uses a DuckDB snapshot as the canonical raw dataset.
+## Data (Parquet format)
 
-- Preferred source: `Data/court_data.duckdb` (tables: `cases`, `hearings`). If this file is present, the EDA step will load directly from it.
-- CSV fallback: If the DuckDB file is missing, place the two organizer CSVs in `Data/` with the exact names below and the EDA step will load them automatically:
-  - `ISDMHack_Cases_WPfinal.csv`
-  - `ISDMHack_Hear.csv`
+This repository uses a parquet data format for efficient loading and processing.
+Provided excel and csv files have been pre-converted to parquet and stored in the `Data/` folder.
 
 No manual pre-processing is required; launch the dashboard and click “Run EDA Pipeline.”
-
-## Notes
-
-- This submission intentionally focuses on the end-to-end demo path. Internal development notes, enhancements, and bug fix logs have been removed from the README.
-- uv is enforced by the dashboard for a consistent, reproducible environment.
docs/CONFIGURATION.md DELETED
@@ -1,44 +0,0 @@
-# Configuration Guide (Consolidated)
-
-This configuration reference has been intentionally simplified for the hackathon to keep the repository focused for judges and evaluators.
-
-For the end-to-end demo and instructions, see:
-- `docs/HACKATHON_SUBMISSION.md`
-
-Advanced usage help is available via the CLI:
-
-```bash
-uv run court-scheduler --help
-uv run court-scheduler generate --help
-uv run court-scheduler simulate --help
-uv run court-scheduler workflow --help
-```
-
-Note: uv is required for all commands.
-
-### Deprecating Parameters
-1. Move to config class first (keep old path working)
-2. Add deprecation warning
-3. Remove old path after one release cycle
-
-## Validation Rules
-
-All config classes validate in `__post_init__`:
-- Value ranges (0 < learning_rate <= 1)
-- Type consistency (convert strings to Path)
-- Cross-parameter constraints (max_gap >= min_gap)
-- Required file existence (rl_agent_path must exist)
-
-## Anti-Patterns
-
-**DON'T**:
-- Hardcode magic numbers in algorithms
-- Use module-level mutable globals
-- Mix domain constants with tunable parameters
-- Create "god config" with everything in one class
-
-**DO**:
-- Separate by lifecycle and ownership
-- Validate early (constructor time)
-- Use dataclasses for immutability
-- Provide sensible defaults with named presets
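The `__post_init__` validation rules described in the removed guide can be sketched with a small frozen dataclass. The class name and fields here are illustrative (taken from the guide's examples), not the project's actual config classes:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SchedulerConfig:
    """Illustrative config following the removed guide's validation rules."""

    learning_rate: float = 0.1
    min_gap: int = 1
    max_gap: int = 30
    output_dir: Path = Path("outputs")

    def __post_init__(self) -> None:
        # Value ranges: 0 < learning_rate <= 1
        if not 0 < self.learning_rate <= 1:
            raise ValueError(f"learning_rate must be in (0, 1], got {self.learning_rate}")
        # Cross-parameter constraints: max_gap >= min_gap
        if self.max_gap < self.min_gap:
            raise ValueError("max_gap must be >= min_gap")
        # Type consistency: convert strings to Path (frozen, so use object.__setattr__)
        if isinstance(self.output_dir, str):
            object.__setattr__(self, "output_dir", Path(self.output_dir))
```

Validating in the constructor means a bad value fails loudly at config-creation time rather than deep inside a simulation run.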
docs/DASHBOARD.md DELETED
@@ -1,17 +0,0 @@
-# Dashboard Guide (Consolidated)
-
-This document has been simplified for the hackathon. Please use the main guide:
-
-- See `docs/HACKATHON_SUBMISSION.md` for end-to-end demo instructions.
-
-Quick launch:
-
-```bash
-uv run streamlit run scheduler/dashboard/app.py
-# Then open http://localhost:8501
-```
-
-Data source:
-
-- Preferred: `Data/court_data.duckdb` (tables: `cases`, `hearings`).
-- Fallback: place `ISDMHack_Cases_WPfinal.csv` and `ISDMHack_Hear.csv` in `Data/` if the DuckDB file is not present.
docs/HACKATHON_SUBMISSION.md DELETED
@@ -1,294 +0,0 @@
-# Hackathon Submission Guide
-## Intelligent Court Scheduling System
-
-### Quick Start - Hackathon Demo
-
-**IMPORTANT**: The dashboard is fully self-contained. You only need:
-1. Preferred: `Data/court_data.duckdb` (included in this repo). Alternatively, place the two CSVs in `Data/` with exact names: `ISDMHack_Cases_WPfinal.csv` and `ISDMHack_Hear.csv`.
-2. This codebase
-3. Run the dashboard
-
-Everything else (EDA, parameters, visualizations, simulations) is generated on-demand through the dashboard.
-
-#### Launch Dashboard
-```bash
-# Start the dashboard
-uv run streamlit run scheduler/dashboard/app.py
-
-# Open browser to http://localhost:8501
-```
-
-**Complete Workflow Through Dashboard**:
-1. **First Time Setup**: Click "Run EDA Pipeline" on main page (processes raw data - takes 2-5 min)
-2. **Explore Data**: Navigate to "Data & Insights" to see 739K+ hearings analysis
-3. **Run Simulation**: Go to "Simulation Workflow" → generate cases → run simulation
-4. **Review Results**: Check "Cause Lists & Overrides" for judge override interface
-5. **Performance Analysis**: View "Analytics & Reports" for metrics comparison
-
-**No pre-processing required** — EDA automatically loads `Data/court_data.duckdb` when present; if missing, it falls back to `ISDMHack_Cases_WPfinal.csv` and `ISDMHack_Hear.csv` placed in `Data/`.
-
-### Docker Quick Start (no local Python needed)
-
-If you prefer a zero-setup run, use Docker. This is the recommended path for judges.
-
-1) Build the image (from the repository root):
-
-```bash
-docker build -t code4change-analysis .
-```
-
-2) Show CLI help (Windows PowerShell example):
-
-```powershell
-docker run --rm `
-  -v ${PWD}\Data:/app/Data `
-  -v ${PWD}\outputs:/app/outputs `
-  code4change-analysis court-scheduler --help
-```
-
-3) Run the Streamlit dashboard:
-
-```powershell
-docker run --rm -p 8501:8501 `
-  -v ${PWD}\Data:/app/Data `
-  -v ${PWD}\outputs:/app/outputs `
-  code4change-analysis `
-  streamlit run scheduler/dashboard/app.py --server.address=0.0.0.0
-```
-
-Then open http://localhost:8501.
-
-Notes:
-- Replace ${PWD} with the full path if using Windows CMD (use ^ for line continuation).
-- Mounting Data/ and outputs/ ensures inputs and generated artifacts persist on your host.
-
-#### Alternative: CLI Workflow (for scripting)
-```bash
-# Run complete pipeline: generate cases + simulate
-uv run court-scheduler workflow --cases 50000 --days 730
-```
-
-This executes:
-- EDA parameter extraction (if needed)
-- Case generation with realistic distributions
-- Multi-year simulation with policy comparison
-- Performance analysis and reporting
-
-#### Option 2: Quick Demo
-```bash
-# 90-day quick demo with 10,000 cases
-uv run court-scheduler workflow --cases 10000 --days 90
-```
-
-#### Option 3: Step-by-Step
-```bash
-# 1. Extract parameters from historical data
-uv run court-scheduler eda
-
-# 2. Generate synthetic cases
-uv run court-scheduler generate --cases 50000
-
-# 3. Run simulation
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy readiness
-```
-
-### What the Pipeline Does
-
-The comprehensive pipeline executes 6 automated steps:
-
-**Step 1: EDA & Parameter Extraction**
-- Analyzes 739K+ historical hearings
-- Extracts transition probabilities, duration statistics
-- Generates simulation parameters
-
-**Step 2: Data Generation**
-- Creates realistic synthetic case dataset
-- Configurable size (default: 50,000 cases)
-- Diverse case types and complexity levels
-
-**Step 3: 2-Year Simulation**
-- Runs 730-day court scheduling simulation
-- Compares scheduling policies (FIFO, age-based, readiness)
-- Tracks disposal rates, utilization, fairness metrics
-
-**Step 4: Daily Cause List Generation**
-- Generates production-ready daily cause lists
-- Exports for all simulation days
-- Court-room wise scheduling details
-
-**Step 5: Performance Analysis**
-- Comprehensive comparison reports
-- Performance visualizations
-- Statistical analysis of all metrics
-
-**Step 6: Executive Summary**
-- Hackathon-ready summary document
-- Key achievements and impact metrics
-- Deployment readiness checklist
-
-### Expected Output
-
-After completion, you'll find outputs under your selected run directory (created automatically; the dashboard uses outputs/simulation_runs by default):
-
-```
-outputs/simulation_runs/v<version>_<timestamp>/
-|-- pipeline_config.json   # Full configuration used
-|-- events.csv             # All scheduled events across days
-|-- metrics.csv            # Aggregate metrics for the run
-|-- daily_summaries.csv    # Per-day summary metrics
-|-- cause_lists/           # Generated daily cause lists (CSV)
-|   |-- YYYY-MM-DD.csv     # One file per simulation day
-|-- figures/               # Optional charts (when exported)
-```
-
-### Hackathon Winning Features
-
-#### 1. Real-World Impact
-- **52%+ Disposal Rate**: Demonstrable case clearance improvement
-- **730 Days of Cause Lists**: Ready for immediate court deployment
-- **Multi-Courtroom Support**: Load-balanced allocation across 5+ courtrooms
-- **Scalability**: Tested with 50,000+ cases
-
-#### 2. Technical Approach
-- Data-informed simulation calibrated from historical hearings
-- Multiple heuristic policies: FIFO, age-based, readiness-based
-- Readiness policy enforces bottleneck/ripeness constraints
-- Fairness metrics (e.g., Gini) and utilization tracking
-
-#### 3. Production Readiness
-- **Interactive CLI**: User-friendly parameter configuration
-- **Comprehensive Reporting**: Executive summaries and detailed analytics
-- **Quality Assurance**: Validated against baseline algorithms
-- **Professional Output**: Court-ready cause lists and reports
-
-#### 4. Judicial Integration
-- **Ripeness Classification**: Filters unready cases (40%+ efficiency gain)
-- **Fairness Metrics**: Low Gini coefficient for equitable distribution
-- **Transparency**: Explainable decision-making process
-- **Override Capability**: Complete judicial control maintained
-
-### Performance Benchmarks
-
-Compare policies by running multiple simulations (e.g., readiness vs FIFO vs age) and reviewing disposal rate, utilization, and fairness (Gini). The Analytics & Reports dashboard page can load and compare runs side-by-side.
-
-### Customization Options
-
-#### For Hackathon Judges
-```bash
-# Large-scale impressive demo
-uv run court-scheduler workflow --cases 100000 --days 730
-
-# With all policies compared
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy readiness
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy fifo
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy age
-```
-
-#### For Technical Evaluation
-Focus on repeatability and fairness by comparing multiple policies and seeds:
-```bash
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy readiness --seed 1
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy fifo --seed 1
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy age --seed 1
-```
-
-#### For Quick Demo/Testing
-```bash
-# Fast proof-of-concept
-uv run court-scheduler workflow --cases 10000 --days 90
-
-# Pre-configured:
-# - 10,000 cases
-# - 90 days simulation
-# - ~5-10 minutes runtime
-```
-
-### Tips for Winning Presentation
-
-1. **Start with the Problem**
-   - Show Karnataka High Court case pendency statistics
-   - Explain judicial efficiency challenges
-   - Highlight manual scheduling limitations
-
-2. **Demonstrate the Solution**
-   - Run the interactive pipeline live
-   - Display generated cause lists
-
-3. **Present the Results**
-   - Open EXECUTIVE_SUMMARY.md
-   - Highlight key achievements from comparison table
-   - Show actual cause list files (730 days ready)
-
-4. **Emphasize Innovation**
-   - Data-driven readiness-based scheduling (novel for this context)
-   - Production-ready from day 1 (practical)
-   - Scalable to entire court system (impactful)
-
-5. **Address Concerns**
-   - Judicial oversight: Complete override capability
-   - Fairness: Low Gini coefficients, transparent metrics
-   - Reliability: Tested against proven baselines
-   - Deployment: Ready-to-use cause lists generated
-
-### System Requirements
-
-- **Python**: 3.11+
-- **uv**: required to run commands and the dashboard
-- **Memory**: 8GB+ RAM (16GB recommended for 50K cases)
-- **Storage**: 2GB+ for full pipeline outputs
-- **Runtime**:
-  - Quick demo: 5-10 minutes
-  - Full 2-year sim (50K cases): 30-60 minutes
-  - Large-scale (100K cases): 1-2 hours
-
-### Troubleshooting
-
-**Issue**: Out of memory during simulation
-**Solution**: Reduce n_cases to 10,000-20,000 or increase system RAM
-
-**Issue**: EDA parameters not found
-**Solution**: Run `uv run court-scheduler eda` first
-
-**Issue**: Import errors
-**Solution**: Ensure UV environment is activated, run `uv sync`
-
-### Advanced Configuration
-
-For fine-tuned control, use configuration files:
-
-```bash
-# Create configs/ directory with TOML files
-# Example: configs/generate_config.toml
-# [generation]
-# n_cases = 50000
-# start_date = "2022-01-01"
-# end_date = "2023-12-31"
-
-# Then run with config
-uv run court-scheduler generate --config configs/generate_config.toml
-uv run court-scheduler simulate --config configs/simulate_config.toml
-```
-
-Or use command-line options:
-```bash
-# Full customization
-uv run court-scheduler workflow \
-  --cases 50000 \
-  --days 730 \
-  --start 2022-01-01 \
-  --end 2023-12-31 \
-  --output data/custom_run \
-  --seed 42
-```
-
-### Contact & Support
-
-For hackathon questions or technical support:
-- Check README.md for the system overview
-- See this guide (docs/HACKATHON_SUBMISSION.md) for end-to-end instructions
-
----
-
-**Good luck with your hackathon submission!**
-
-This system represents a pragmatic, data-driven approach to improving judicial efficiency. The combination of production-ready cause lists, proven performance metrics, and a transparent, judge-in-the-loop design positions this as a compelling winning submission.
eda/config.py CHANGED
@@ -11,9 +11,8 @@ from pathlib import Path
 PROJECT_ROOT = Path(__file__).resolve().parents[1]
 
 DATA_DIR = PROJECT_ROOT / "Data"
-DUCKDB_FILE = DATA_DIR / "court_data.duckdb"
-CASES_FILE = DATA_DIR / "ISDMHack_Cases_WPfinal.csv"
-HEAR_FILE = DATA_DIR / "ISDMHack_Hear.csv"
+CASE_FILE_PARQUET = DATA_DIR / "cases.parquet"
+HEARING_FILE_PARQUET = DATA_DIR / "hearings.parquet"
 
 # Default paths (used when EDA is run standalone)
 REPORTS_DIR = PROJECT_ROOT / "reports"
eda/load_clean.py CHANGED
@@ -14,10 +14,8 @@ from pathlib import Path
 import polars as pl
 
 from eda.config import (
-    CASES_FILE,
-    DUCKDB_FILE,
-    HEAR_FILE,
-    NULL_TOKENS,
+    CASE_FILE_PARQUET,
+    HEARING_FILE_PARQUET,
     RUN_TS,
     VERSION,
     _get_cases_parquet,
@@ -55,46 +53,23 @@ def _null_summary(df: pl.DataFrame, name: str) -> None:
         print(row)
 
 
-# -------------------------------------------------------------------
-# Main logic
-# -------------------------------------------------------------------
 def load_raw() -> tuple[pl.DataFrame, pl.DataFrame]:
-    try:
-        import duckdb
-
-        if not Path(DUCKDB_FILE).exists():
-            print(
-                f"DuckDB file not found at {Path(DUCKDB_FILE)}, skipping DuckDB load."
-            )
-            raise FileNotFoundError("DuckDB file not found.")
-        if DUCKDB_FILE.exists():
-            print(f"Loading raw data from DuckDB: {DUCKDB_FILE}")
-            conn = duckdb.connect(str(DUCKDB_FILE))
-            cases = pl.from_pandas(conn.execute("SELECT * FROM cases").df())
-            hearings = pl.from_pandas(conn.execute("SELECT * FROM hearings").df())
-            conn.close()
-            print(f"Cases shape: {cases.shape}")
-            print(f"Hearings shape: {hearings.shape}")
-            return cases, hearings
-    except Exception as e:
-        print(f"[WARN] DuckDB load failed ({e}), falling back to CSV...")
-    print("Loading raw data from CSVs (fallback)...")
-    if not CASES_FILE.exists() or not HEAR_FILE.exists():
-        raise FileNotFoundError("One or both CSV files are missing.")
-    cases = pl.read_csv(
-        CASES_FILE,
-        try_parse_dates=True,
-        null_values=NULL_TOKENS,
-        infer_schema_length=100_000,
-    )
-    hearings = pl.read_csv(
-        HEAR_FILE,
-        try_parse_dates=True,
-        null_values=NULL_TOKENS,
-        infer_schema_length=100_000,
-    )
+    cases_path = Path(CASE_FILE_PARQUET)
+    hearings_path = Path(HEARING_FILE_PARQUET)
+
+    if not (cases_path.exists() and hearings_path.exists()):
+        raise FileNotFoundError(
+            "Parquet files not found. Will not proceed with loading cleaned data."
+        )
+
+    print(f"Loading Parquet files:\n- {cases_path}\n- {hearings_path}")
+
+    cases = pl.read_parquet(cases_path)
+    hearings = pl.read_parquet(hearings_path)
+
     print(f"Cases shape: {cases.shape}")
     print(f"Hearings shape: {hearings.shape}")
+
     return cases, hearings
pyproject.toml CHANGED
@@ -19,16 +19,13 @@ dependencies = [
     "XlsxWriter>=3.2",
     "pyarrow>=17.0",
     "numpy>=2.0",
-    "networkx>=3.0",
     "ortools>=9.8",
     "pydantic>=2.0",
     "typer>=0.12",
     "simpy>=4.1",
     "scipy>=1.14",
-    "scikit-learn>=1.5",
     "streamlit>=1.28",
     "altair>=5.0",
-    "duckdb>=1.4.2",
 ]
 
 ########################################