RoyAalekh committed
Commit d3a967e · 1 Parent(s): 4baabe1

moved to parquet from duckdb for raw data, updated readme
.dockerignore CHANGED
@@ -11,5 +11,4 @@ reports/*.pdf
 configs/*.secrets.*
 uv.lock
 code4change_analysis.egg-info
-!data/court_data.duckdb
 
Dockerfile CHANGED
@@ -1,20 +1,19 @@
 # syntax=docker/dockerfile:1
+
 FROM python:3.11-slim
 
+# Install minimal system dependencies
 RUN apt-get update \
-    && apt-get install -y --no-install-recommends curl git git-lfs libgomp1 \
+    && apt-get install -y --no-install-recommends curl libgomp1 \
     && rm -rf /var/lib/apt/lists/*
 
 WORKDIR /app
 
-# Install uv
 RUN curl -LsSf https://astral.sh/uv/install.sh | sh
 ENV PATH="/root/.local/bin:${PATH}"
 
 COPY . .
 
-RUN git lfs install && git lfs pull
-
 RUN uv venv .venv \
     && uv pip install --upgrade pip setuptools wheel \
     && uv pip install .
@@ -22,6 +21,9 @@ RUN uv venv .venv \
 ENV PATH="/app/.venv/bin:${PATH}"
 ENV PYTHONPATH="/app"
 
+# Health check commands
+RUN uv --version && python --version && which court-scheduler && which streamlit
+
 EXPOSE 8501
 
 CMD ["streamlit", "run", "scheduler/dashboard/app.py", "--server.port=8501", "--server.address=0.0.0.0"]
README.md CHANGED
@@ -13,10 +13,16 @@ Purpose-built for hackathon evaluation. This repository runs out of the box usin
 
 1. Install uv (see above) and ensure Python 3.11+ is available.
 2. Clone this repository.
-3. Make sure to put `ISDMHack_Cases_WPfinal.csv` and `ISDMHack_Hear.csv` in the `Data/` folder, or provide
-   `court_data.duckdb` there. Both in csv format, strictly named as shown.
-4. Launch the dashboard:
-
+3. Navigate to the repo root and activate uv:
+   ```bash
+   cd path/to/repo
+   uv activate
+   ```
+4. Install dependencies:
+   ```bash
+   uv install
+   ```
+5. Launch the dashboard:
 ```bash
 uv run streamlit run scheduler/dashboard/app.py
 ```
@@ -85,18 +91,10 @@ Then open http://localhost:8501.
 
 Notes for Windows CMD: use ^ for line continuation and replace ${PWD} with the full path.
 
-## Data (DuckDB-first)
-
-This repository uses a DuckDB snapshot as the canonical raw dataset.
+## Data (Parquet format)
 
-- Preferred source: `Data/court_data.duckdb` (tables: `cases`, `hearings`). If this file is present, the EDA step will load directly from it.
-- CSV fallback: If the DuckDB file is missing, place the two organizer CSVs in `Data/` with the exact names below and the EDA step will load them automatically:
-  - `ISDMHack_Cases_WPfinal.csv`
-  - `ISDMHack_Hear.csv`
+This repository uses a parquet data format for efficient loading and processing.
+Provided excel and csv files have been pre-converted to parquet and stored in the `Data/` folder.
 
 No manual pre-processing is required; launch the dashboard and click “Run EDA Pipeline.”
-
-## Notes
-
-- This submission intentionally focuses on the end-to-end demo path. Internal development notes, enhancements, and bug fix logs have been removed from the README.
-- uv is enforced by the dashboard for a consistent, reproducible environment.
docs/CONFIGURATION.md DELETED
@@ -1,44 +0,0 @@
-# Configuration Guide (Consolidated)
-
-This configuration reference has been intentionally simplified for the hackathon to keep the repository focused for judges and evaluators.
-
-For the end-to-end demo and instructions, see:
-- `docs/HACKATHON_SUBMISSION.md`
-
-Advanced usage help is available via the CLI:
-
-```bash
-uv run court-scheduler --help
-uv run court-scheduler generate --help
-uv run court-scheduler simulate --help
-uv run court-scheduler workflow --help
-```
-
-Note: uv is required for all commands.
-
-### Deprecating Parameters
-1. Move to config class first (keep old path working)
-2. Add deprecation warning
-3. Remove old path after one release cycle
-
-## Validation Rules
-
-All config classes validate in `__post_init__`:
-- Value ranges (0 < learning_rate <= 1)
-- Type consistency (convert strings to Path)
-- Cross-parameter constraints (max_gap >= min_gap)
-- Required file existence (rl_agent_path must exist)
-
-## Anti-Patterns
-
-**DON'T**:
-- Hardcode magic numbers in algorithms
-- Use module-level mutable globals
-- Mix domain constants with tunable parameters
-- Create "god config" with everything in one class
-
-**DO**:
-- Separate by lifecycle and ownership
-- Validate early (constructor time)
-- Use dataclasses for immutability
-- Provide sensible defaults with named presets
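The `__post_init__` validation rules described in the removed guide can be sketched with a small frozen dataclass. The class name and fields here are illustrative (taken from the guide's examples), not the project's actual config classes:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SchedulerConfig:
    """Illustrative config following the removed guide's validation rules."""

    learning_rate: float = 0.1
    min_gap: int = 1
    max_gap: int = 30
    output_dir: Path = Path("outputs")

    def __post_init__(self) -> None:
        # Value ranges: 0 < learning_rate <= 1
        if not 0 < self.learning_rate <= 1:
            raise ValueError(f"learning_rate must be in (0, 1], got {self.learning_rate}")
        # Cross-parameter constraints: max_gap >= min_gap
        if self.max_gap < self.min_gap:
            raise ValueError("max_gap must be >= min_gap")
        # Type consistency: convert strings to Path (frozen, so use object.__setattr__)
        if isinstance(self.output_dir, str):
            object.__setattr__(self, "output_dir", Path(self.output_dir))
```

Validating in the constructor means a bad value fails loudly at config-creation time rather than deep inside a simulation run.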
docs/DASHBOARD.md DELETED
@@ -1,17 +0,0 @@
-# Dashboard Guide (Consolidated)
-
-This document has been simplified for the hackathon. Please use the main guide:
-
-- See `docs/HACKATHON_SUBMISSION.md` for end-to-end demo instructions.
-
-Quick launch:
-
-```bash
-uv run streamlit run scheduler/dashboard/app.py
-# Then open http://localhost:8501
-```
-
-Data source:
-
-- Preferred: `Data/court_data.duckdb` (tables: `cases`, `hearings`).
-- Fallback: place `ISDMHack_Cases_WPfinal.csv` and `ISDMHack_Hear.csv` in `Data/` if the DuckDB file is not present.
docs/HACKATHON_SUBMISSION.md DELETED
@@ -1,294 +0,0 @@
-# Hackathon Submission Guide
-## Intelligent Court Scheduling System
-
-### Quick Start - Hackathon Demo
-
-**IMPORTANT**: The dashboard is fully self-contained. You only need:
-1. Preferred: `Data/court_data.duckdb` (included in this repo). Alternatively, place the two CSVs in `Data/` with exact names: `ISDMHack_Cases_WPfinal.csv` and `ISDMHack_Hear.csv`.
-2. This codebase
-3. Run the dashboard
-
-Everything else (EDA, parameters, visualizations, simulations) is generated on-demand through the dashboard.
-
-#### Launch Dashboard
-```bash
-# Start the dashboard
-uv run streamlit run scheduler/dashboard/app.py
-
-# Open browser to http://localhost:8501
-```
-
-**Complete Workflow Through Dashboard**:
-1. **First Time Setup**: Click "Run EDA Pipeline" on main page (processes raw data - takes 2-5 min)
-2. **Explore Data**: Navigate to "Data & Insights" to see 739K+ hearings analysis
-3. **Run Simulation**: Go to "Simulation Workflow" → generate cases → run simulation
-4. **Review Results**: Check "Cause Lists & Overrides" for judge override interface
-5. **Performance Analysis**: View "Analytics & Reports" for metrics comparison
-
-**No pre-processing required** — EDA automatically loads `Data/court_data.duckdb` when present; if missing, it falls back to `ISDMHack_Cases_WPfinal.csv` and `ISDMHack_Hear.csv` placed in `Data/`.
-
-### Docker Quick Start (no local Python needed)
-
-If you prefer a zero-setup run, use Docker. This is the recommended path for judges.
-
-1) Build the image (from the repository root):
-
-```bash
-docker build -t code4change-analysis .
-```
-
-2) Show CLI help (Windows PowerShell example):
-
-```powershell
-docker run --rm `
-  -v ${PWD}\Data:/app/Data `
-  -v ${PWD}\outputs:/app/outputs `
-  code4change-analysis court-scheduler --help
-```
-
-3) Run the Streamlit dashboard:
-
-```powershell
-docker run --rm -p 8501:8501 `
-  -v ${PWD}\Data:/app/Data `
-  -v ${PWD}\outputs:/app/outputs `
-  code4change-analysis `
-  streamlit run scheduler/dashboard/app.py --server.address=0.0.0.0
-```
-
-Then open http://localhost:8501.
-
-Notes:
-- Replace ${PWD} with the full path if using Windows CMD (use ^ for line continuation).
-- Mounting Data/ and outputs/ ensures inputs and generated artifacts persist on your host.
-
-#### Alternative: CLI Workflow (for scripting)
-```bash
-# Run complete pipeline: generate cases + simulate
-uv run court-scheduler workflow --cases 50000 --days 730
-```
-
-This executes:
-- EDA parameter extraction (if needed)
-- Case generation with realistic distributions
-- Multi-year simulation with policy comparison
-- Performance analysis and reporting
-
-#### Option 2: Quick Demo
-```bash
-# 90-day quick demo with 10,000 cases
-uv run court-scheduler workflow --cases 10000 --days 90
-```
-
-#### Option 3: Step-by-Step
-```bash
-# 1. Extract parameters from historical data
-uv run court-scheduler eda
-
-# 2. Generate synthetic cases
-uv run court-scheduler generate --cases 50000
-
-# 3. Run simulation
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy readiness
-```
-
-### What the Pipeline Does
-
-The comprehensive pipeline executes 6 automated steps:
-
-**Step 1: EDA & Parameter Extraction**
-- Analyzes 739K+ historical hearings
-- Extracts transition probabilities, duration statistics
-- Generates simulation parameters
-
-**Step 2: Data Generation**
-- Creates realistic synthetic case dataset
-- Configurable size (default: 50,000 cases)
-- Diverse case types and complexity levels
-
-**Step 3: 2-Year Simulation**
-- Runs 730-day court scheduling simulation
-- Compares scheduling policies (FIFO, age-based, readiness)
-- Tracks disposal rates, utilization, fairness metrics
-
-**Step 4: Daily Cause List Generation**
-- Generates production-ready daily cause lists
-- Exports for all simulation days
-- Court-room wise scheduling details
-
-**Step 5: Performance Analysis**
-- Comprehensive comparison reports
-- Performance visualizations
-- Statistical analysis of all metrics
-
-**Step 6: Executive Summary**
-- Hackathon-ready summary document
-- Key achievements and impact metrics
-- Deployment readiness checklist
-
-### Expected Output
-
-After completion, you'll find outputs under your selected run directory (created automatically; the dashboard uses outputs/simulation_runs by default):
-
-```
-outputs/simulation_runs/v<version>_<timestamp>/
-|-- pipeline_config.json   # Full configuration used
-|-- events.csv             # All scheduled events across days
-|-- metrics.csv            # Aggregate metrics for the run
-|-- daily_summaries.csv    # Per-day summary metrics
-|-- cause_lists/           # Generated daily cause lists (CSV)
-|   |-- YYYY-MM-DD.csv     # One file per simulation day
-|-- figures/               # Optional charts (when exported)
-```
-
-### Hackathon Winning Features
-
-#### 1. Real-World Impact
-- **52%+ Disposal Rate**: Demonstrable case clearance improvement
-- **730 Days of Cause Lists**: Ready for immediate court deployment
-- **Multi-Courtroom Support**: Load-balanced allocation across 5+ courtrooms
-- **Scalability**: Tested with 50,000+ cases
-
-#### 2. Technical Approach
-- Data-informed simulation calibrated from historical hearings
-- Multiple heuristic policies: FIFO, age-based, readiness-based
-- Readiness policy enforces bottleneck/ripeness constraints
-- Fairness metrics (e.g., Gini) and utilization tracking
-
-#### 3. Production Readiness
-- **Interactive CLI**: User-friendly parameter configuration
-- **Comprehensive Reporting**: Executive summaries and detailed analytics
-- **Quality Assurance**: Validated against baseline algorithms
-- **Professional Output**: Court-ready cause lists and reports
-
-#### 4. Judicial Integration
-- **Ripeness Classification**: Filters unready cases (40%+ efficiency gain)
-- **Fairness Metrics**: Low Gini coefficient for equitable distribution
-- **Transparency**: Explainable decision-making process
-- **Override Capability**: Complete judicial control maintained
-
-### Performance Benchmarks
-
-Compare policies by running multiple simulations (e.g., readiness vs FIFO vs age) and reviewing disposal rate, utilization, and fairness (Gini). The Analytics & Reports dashboard page can load and compare runs side-by-side.
-
-### Customization Options
-
-#### For Hackathon Judges
-```bash
-# Large-scale impressive demo
-uv run court-scheduler workflow --cases 100000 --days 730
-
-# With all policies compared
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy readiness
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy fifo
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy age
-```
-
-#### For Technical Evaluation
-Focus on repeatability and fairness by comparing multiple policies and seeds:
-```bash
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy readiness --seed 1
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy fifo --seed 1
-uv run court-scheduler simulate --cases data/cases.csv --days 730 --policy age --seed 1
-```
-
-#### For Quick Demo/Testing
-```bash
-# Fast proof-of-concept
-uv run court-scheduler workflow --cases 10000 --days 90
-
-# Pre-configured:
-# - 10,000 cases
-# - 90 days simulation
-# - ~5-10 minutes runtime
-```
-
-### Tips for Winning Presentation
-
-1. **Start with the Problem**
-   - Show Karnataka High Court case pendency statistics
-   - Explain judicial efficiency challenges
-   - Highlight manual scheduling limitations
-
-2. **Demonstrate the Solution**
-   - Run the interactive pipeline live
-   - Display generated cause lists
-
-3. **Present the Results**
-   - Open EXECUTIVE_SUMMARY.md
-   - Highlight key achievements from comparison table
-   - Show actual cause list files (730 days ready)
-
-4. **Emphasize Innovation**
-   - Data-driven readiness-based scheduling (novel for this context)
-   - Production-ready from day 1 (practical)
-   - Scalable to entire court system (impactful)
-
-5. **Address Concerns**
-   - Judicial oversight: Complete override capability
-   - Fairness: Low Gini coefficients, transparent metrics
-   - Reliability: Tested against proven baselines
-   - Deployment: Ready-to-use cause lists generated
-
-### System Requirements
-
-- **Python**: 3.11+
-- **uv**: required to run commands and the dashboard
-- **Memory**: 8GB+ RAM (16GB recommended for 50K cases)
-- **Storage**: 2GB+ for full pipeline outputs
-- **Runtime**:
-  - Quick demo: 5-10 minutes
-  - Full 2-year sim (50K cases): 30-60 minutes
-  - Large-scale (100K cases): 1-2 hours
-
-### Troubleshooting
-
-**Issue**: Out of memory during simulation
-**Solution**: Reduce n_cases to 10,000-20,000 or increase system RAM
-
-**Issue**: EDA parameters not found
-**Solution**: Run `uv run court-scheduler eda` first
-
-**Issue**: Import errors
-**Solution**: Ensure UV environment is activated, run `uv sync`
-
-### Advanced Configuration
-
-For fine-tuned control, use configuration files:
-
-```bash
-# Create configs/ directory with TOML files
-# Example: configs/generate_config.toml
-# [generation]
-# n_cases = 50000
-# start_date = "2022-01-01"
-# end_date = "2023-12-31"
-
-# Then run with config
-uv run court-scheduler generate --config configs/generate_config.toml
-uv run court-scheduler simulate --config configs/simulate_config.toml
-```
-
-Or use command-line options:
-```bash
-# Full customization
-uv run court-scheduler workflow \
-  --cases 50000 \
-  --days 730 \
-  --start 2022-01-01 \
-  --end 2023-12-31 \
-  --output data/custom_run \
-  --seed 42
-```
-
-### Contact & Support
-
-For hackathon questions or technical support:
-- Check README.md for the system overview
-- See this guide (docs/HACKATHON_SUBMISSION.md) for end-to-end instructions
-
----
-
-**Good luck with your hackathon submission!**
-
-This system represents a pragmatic, data-driven approach to improving judicial efficiency. The combination of production-ready cause lists, proven performance metrics, and a transparent, judge-in-the-loop design positions this as a compelling winning submission.
eda/config.py CHANGED
@@ -11,9 +11,8 @@ from pathlib import Path
 PROJECT_ROOT = Path(__file__).resolve().parents[1]
 
 DATA_DIR = PROJECT_ROOT / "Data"
-DUCKDB_FILE = DATA_DIR / "court_data.duckdb"
-CASES_FILE = DATA_DIR / "ISDMHack_Cases_WPfinal.csv"
-HEAR_FILE = DATA_DIR / "ISDMHack_Hear.csv"
+CASE_FILE_PARQUET = DATA_DIR / "cases.parquet"
+HEARING_FILE_PARQUET = DATA_DIR / "hearings.parquet"
 
 # Default paths (used when EDA is run standalone)
 REPORTS_DIR = PROJECT_ROOT / "reports"
eda/load_clean.py CHANGED
@@ -14,10 +14,8 @@ from pathlib import Path
 import polars as pl
 
 from eda.config import (
-    CASES_FILE,
-    DUCKDB_FILE,
-    HEAR_FILE,
-    NULL_TOKENS,
+    CASE_FILE_PARQUET,
+    HEARING_FILE_PARQUET,
     RUN_TS,
     VERSION,
     _get_cases_parquet,
@@ -55,46 +53,23 @@ def _null_summary(df: pl.DataFrame, name: str) -> None:
         print(row)
 
 
-# -------------------------------------------------------------------
-# Main logic
-# -------------------------------------------------------------------
 def load_raw() -> tuple[pl.DataFrame, pl.DataFrame]:
-    try:
-        import duckdb
-
-        if not Path(DUCKDB_FILE).exists():
-            print(
-                f"DuckDB file not found at {Path(DUCKDB_FILE)}, skipping DuckDB load."
-            )
-            raise FileNotFoundError("DuckDB file not found.")
-        if DUCKDB_FILE.exists():
-            print(f"Loading raw data from DuckDB: {DUCKDB_FILE}")
-            conn = duckdb.connect(str(DUCKDB_FILE))
-            cases = pl.from_pandas(conn.execute("SELECT * FROM cases").df())
-            hearings = pl.from_pandas(conn.execute("SELECT * FROM hearings").df())
-            conn.close()
-            print(f"Cases shape: {cases.shape}")
-            print(f"Hearings shape: {hearings.shape}")
-            return cases, hearings
-    except Exception as e:
-        print(f"[WARN] DuckDB load failed ({e}), falling back to CSV...")
-    print("Loading raw data from CSVs (fallback)...")
-    if not CASES_FILE.exists() or not HEAR_FILE.exists():
-        raise FileNotFoundError("One or both CSV files are missing.")
-    cases = pl.read_csv(
-        CASES_FILE,
-        try_parse_dates=True,
-        null_values=NULL_TOKENS,
-        infer_schema_length=100_000,
-    )
-    hearings = pl.read_csv(
-        HEAR_FILE,
-        try_parse_dates=True,
-        null_values=NULL_TOKENS,
-        infer_schema_length=100_000,
-    )
+    cases_path = Path(CASE_FILE_PARQUET)
+    hearings_path = Path(HEARING_FILE_PARQUET)
+
+    if not (cases_path.exists() and hearings_path.exists()):
+        raise FileNotFoundError(
+            "Parquet files not found. Will not proceed with loading cleaned data."
+        )
+
+    print(f"Loading Parquet files:\n- {cases_path}\n- {hearings_path}")
+
+    cases = pl.read_parquet(cases_path)
+    hearings = pl.read_parquet(hearings_path)
+
     print(f"Cases shape: {cases.shape}")
     print(f"Hearings shape: {hearings.shape}")
+
     return cases, hearings
pyproject.toml CHANGED
@@ -19,16 +19,13 @@ dependencies = [
     "XlsxWriter>=3.2",
     "pyarrow>=17.0",
     "numpy>=2.0",
-    "networkx>=3.0",
     "ortools>=9.8",
     "pydantic>=2.0",
     "typer>=0.12",
     "simpy>=4.1",
     "scipy>=1.14",
-    "scikit-learn>=1.5",
     "streamlit>=1.28",
     "altair>=5.0",
-    "duckdb>=1.4.2",
 ]
 
 ########################################