Rinil-Parmar
/

tpch-query-router-models

query-optimization

learned-cost-model

Model card Files Files and versions

Rinil-Parmar commited on Mar 13

Commit

2c5da9b

·

verified ·

1 Parent(s): f9c6d71

Create README.md

Files changed (1) hide show

README.md +77 -0

README.md ADDED Viewed

	@@ -0,0 +1,77 @@

+---
+library_name: scikit-learn
+tags:
+- sql
+- query-routing
+- query-optimization
+- learned-cost-model
+- sqlite
+- duckdb
+- tpch
+license: mit
+---
+# Cross‑Engine SQL Query Router (SQLite vs DuckDB)
+This model predicts which engine (**SQLite** or **DuckDB**) will run a SQL query faster, using a **learned cost model** trained on TPC‑H query variants.
+**Input:** SQL query → **Output:** predicted runtime for each engine + recommended engine
+---
+## Files in this repository
+- `model_sqlite.joblib` — regression model that predicts SQLite runtime (seconds)
+- `model_duckdb.joblib` — regression model that predicts DuckDB runtime (seconds)
+- `model_metadata.json` — metadata (feature list, training size, evaluation score, etc.)
+---
+## How it works (high level)
+1. Extract **25 structural features** from the query (joins, GROUP BY, subqueries, nesting depth, etc.)
+2. Predict runtime on **SQLite**
+3. Predict runtime on **DuckDB**
+4. Recommend the engine with the lower predicted runtime
+---
+## Recommended usage (best way)
+Use the full project repository which includes:
+- the **same feature extractor** used during training (`models/predict.py`)
+- a **Streamlit UI** (`app.py`)
+- optional **Live Test** (runs the query on actual local SQLite/DuckDB databases)
+Project repo: `Rinil-Parmar/cross-engine-learned-cost-model`
+Dataset repo: `Rinil-Parmar/tpch-query-routing-dataset`
+---
+## Minimal example (load models)
+```python
+import joblib
+sqlite_model = joblib.load("model_sqlite.joblib")
+duckdb_model  = joblib.load("model_duckdb.joblib")
+print(type(sqlite_model), type(duckdb_model))
+```
+To make predictions correctly, you must use the **same 25-feature extraction** and feature ordering as in training (use `models/predict.py` from the project repo).
+---
+## Limitations
+- Trained on **TPC‑H** templates with randomized parameters (TPC‑H‑like analytic queries).
+- Uses **query structure only** (no table statistics, indexes, cache state, hardware differences).
+- May not generalize well to OLTP workloads or very different schemas.
+---
+## License
+MIT