SchemaSage-SQL Model Card

This is the release model card for the SchemaSage-SQL project. A 10-step QLoRA smoke adapter, a 200-step safety-balanced adapter, and a larger 8k/600 follow-up adapter have been trained and uploaded to validate the cloud training and Hugging Face release path. The 200-step safety-balanced adapter is the public release baseline.

Model Details

Project name: SchemaSage-SQL
Model status: release baseline published; larger follow-up remains experimental
Base model: configurable through configs/model.yaml
Current default base model: Qwen/Qwen3-4B-Instruct-2507
Training method: supervised fine-tuning with LoRA or QLoRA
Release format: adapter-first, optional merged model later
Release adapter: rishhh/schemasage-sql-qwen3-4b-clean-balanced-200
Smoke adapter: rishhh/schemasage-sql-qwen3-4b-smoke

Intended Use

SchemaSage-SQL is intended for:

Research and prototyping around Text-to-SQL.
Portfolio demonstration of LLM data, training, evaluation, and deployment engineering.
Generating draft read-only analytical SQL from a schema and natural-language question.

Generated SQL should be reviewed before execution.

Out-of-Scope Use

SchemaSage-SQL is not intended for:

Autonomous execution against production databases.
Destructive database operations.
Legal, financial, medical, or compliance-critical decision making.
Credential extraction, filesystem access, or network exfiltration.
SQL generation without a provided schema.

Datasets

The current data pipeline uses two public Hugging Face datasets:

gretelai/synthetic_text_to_sql
b-mc2/sql-create-context

Cleaned and republished training data:

Split	Examples
Train	95,508
Validation	10,612
Test	5,324
Refusal examples created	10,862

Current processed split summary:

Split	Examples
Train	151,723
Validation	16,858
Test	5,248

Processing removed:

Empty or incomplete examples.
Exact duplicates.
Destructive SQL examples by default.
Sample INSERT rows from schema context where source datasets included data values.
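
The filters listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the field names (question, schema, sql), the keyword list, and the function names are all assumptions, and the regex-based INSERT stripping would mis-handle semicolons inside string literals.

```python
import re

# Hypothetical keyword list; the project's real filter may differ.
DESTRUCTIVE = re.compile(
    r"^\s*(DROP|DELETE|TRUNCATE|ALTER|UPDATE|INSERT)\b", re.IGNORECASE
)
# Matches an INSERT statement up to (and including) its terminating semicolon.
INSERT_STMT = re.compile(r"INSERT\s+INTO\b[^;]*;?", re.IGNORECASE)


def strip_insert_rows(schema: str) -> str:
    """Remove INSERT statements from schema context, keeping the DDL."""
    return INSERT_STMT.sub("", schema).strip()


def clean_examples(examples):
    """Yield cleaned examples, skipping empty, destructive, and duplicate rows."""
    seen = set()
    for ex in examples:
        question = (ex.get("question") or "").strip()
        schema = strip_insert_rows(ex.get("schema") or "")
        sql = (ex.get("sql") or "").strip()
        if not (question and schema and sql):
            continue  # empty or incomplete example
        if DESTRUCTIVE.match(sql):
            continue  # destructive SQL is dropped by default
        key = (question, schema, sql)
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        yield {"question": question, "schema": schema, "sql": sql}
```

Deduplication here runs after INSERT stripping, so two examples that differ only in their sample rows collapse into one; whether the real pipeline orders the steps this way is not stated.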

The validation split is a deterministic train holdout because the selected sources do not provide a native validation split.
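
One common way to get a deterministic holdout is to hash a stable example identifier and threshold the result, so the assignment does not depend on row order or random state. The sketch below assumes this approach; the function name, the use of SHA-256, and the 10% fraction (chosen only because it roughly matches the split tables above) are assumptions, not the project's confirmed method.

```python
import hashlib


def assign_split(example_id: str, holdout_fraction: float = 0.1) -> str:
    """Map an example ID to 'train' or 'validation' using a stable hash."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if bucket < holdout_fraction else "train"
```

Because the split depends only on the ID, re-running the pipeline or adding new rows never moves an existing example between train and validation.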

Training Procedure

Training is implemented in src/training/train_sft.py and has been run for smoke, release-baseline, and experimental adapters. A tiny local trainer smoke test was completed with sshleifer/tiny-gpt2; that artifact is not a release model and should not be used for model-quality claims.

A cloud QLoRA smoke run completed on Hugging Face Jobs:

Field	Value
Job ID	`6a0c94bd2dc5b1243da4ffee`
Hardware	`a10g-large`
Base model	`Qwen/Qwen3-4B-Instruct-2507`
Train rows	112
Eval rows	16
Steps	10
Final train loss	1.908
Final eval loss	1.188
Adapter repo	`rishhh/schemasage-sql-qwen3-4b-smoke`

Smoke adapter commit:

https://huggingface.co/rishhh/schemasage-sql-qwen3-4b-smoke/commit/6a8241d2f6cc3f0a917ba527f2f32b0cc4bf9933

This smoke adapter validates the training and upload path. It is not the release baseline.

Safety-Clean Balanced Adapter

The best current safety baseline is:

rishhh/schemasage-sql-qwen3-4b-clean-balanced-200

It was trained for 200 optimizer steps on rishhh/schemasage-sql-clean-text2sql with a deterministic 20% refusal mix in the short training sample. Evaluation on 64 cleaned held-out examples produced:

Metric	Value
Normalized exact match	0.2241
SQL parse validity	1.0000
Schema adherence rate	0.9828
Unsafe query rate	0.0000
Blocked refusal accuracy	1.0000
Mean latency seconds	5.5201

This adapter is the public release baseline because it kept perfect blocked-refusal accuracy on the re-evaluated held-out set.

The larger follow-up run completed as:

rishhh/schemasage-sql-qwen3-4b-clean-balanced-8k-600-v2

Completed settings: 8,192 cleaned train rows, 600 optimizer steps, 20% refusal mix, and 256 cleaned held-out evaluation examples.

The larger run is the stronger semantic candidate: it improved exact match, execution accuracy, and exact-or-execution-equivalent rate, but it missed one blocked refusal and scored slightly lower on parse validity and schema adherence. It remains the best comparison candidate, not the shipped baseline.
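
The deterministic 20% refusal mix used by both balanced runs could be assembled along the lines below. This is a sketch under assumptions: the function name, the seed value, and the sampling strategy are hypothetical, and only the 20% fraction comes from the runs described above.

```python
import random


def build_mixed_sample(sql_rows, refusal_rows, total, refusal_fraction=0.2, seed=42):
    """Return a reproducible training sample with a fixed share of refusals."""
    n_refusal = round(total * refusal_fraction)
    n_sql = total - n_refusal
    if n_sql > len(sql_rows) or n_refusal > len(refusal_rows):
        raise ValueError("not enough rows for the requested mix")
    rng = random.Random(seed)  # local RNG so global random state is untouched
    sample = rng.sample(sql_rows, n_sql) + rng.sample(refusal_rows, n_refusal)
    rng.shuffle(sample)  # interleave refusals with SQL examples
    return sample
```

With total=8192 this would yield 1,638 refusal rows and 6,554 SQL rows; the same seed always reproduces the same sample.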

Training command:

python -m src.training.train_sft --config configs/train_qlora.yaml

Smoke/dry-run validation command:

python -m src.training.train_sft --config configs/train_qlora.yaml --smoke-test --dry-run

For local development on Apple Silicon, use:

python -m src.training.train_sft --config configs/train_local_smoke.yaml --smoke-test --dry-run

Trainer-path validation command that has been run locally:

python -m src.training.train_sft --config configs/train_local_smoke.yaml --smoke-test

Evaluation

The full local reference-answer evaluation scores the reference SQL as if it were model output. It is a sanity check for the evaluator and the processed data, not a measure of trained-model performance.

Metric	Value
Exact match	1.0000
Normalized exact match	1.0000
SQL parse validity	0.9998
Schema adherence rate	0.8994
Hallucinated table rate	0.0494
Hallucinated column rate	0.0878
Unsafe query rate	0.0071
Execution accuracy	1.0000
Execution comparable examples	4,007

These metrics validate the evaluator and processed reference data. They should not be presented as model quality.
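
For intuition about what a hallucinated-table check measures, a deliberately crude version can be written with regular expressions. This sketch is an assumption-laden stand-in, not the project's evaluator: the function names are hypothetical, and a regex approach will flag CTE names and constructs such as EXTRACT(YEAR FROM col) as unknown tables, which a parser-based check would handle better.

```python
import re

# Table names declared in the schema (optional IF NOT EXISTS, optional quoting).
CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?[\"`\[]?(\w+)", re.IGNORECASE
)
# Table names referenced directly after FROM or JOIN in a query.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+[\"`\[]?(\w+)", re.IGNORECASE)


def schema_tables(schema: str) -> set[str]:
    """Collect lower-cased table names from CREATE TABLE statements."""
    return {name.lower() for name in CREATE_TABLE.findall(schema)}


def hallucinated_tables(sql: str, schema: str) -> set[str]:
    """Return referenced table names that the schema does not declare."""
    referenced = {name.lower() for name in TABLE_REF.findall(sql)}
    return referenced - schema_tables(schema)
```

An example counts toward the hallucinated table rate, in this simplified view, whenever the returned set is non-empty.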

The uploaded smoke adapter was also evaluated on 64 held-out examples from the gretelai/synthetic_text_to_sql test split. These metrics describe the 10-step smoke adapter only:

Metric	Value
Exact match	0.2188
Normalized exact match	0.2344
SQL parse validity	1.0000
Schema adherence rate	0.9688
Hallucinated table rate	0.0156
Hallucinated column rate	0.0312
Unsafe query rate	0.0000
Execution accuracy	0.8409
Execution comparable examples	44
Mean generated SQL length	11.80
Mean latency seconds	23.57
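
The gap between exact match and normalized exact match in the table above comes from surface differences such as casing, whitespace, and trailing semicolons. A minimal sketch of one possible normalization follows; the exact rules used by the project's evaluator are not documented here, so the function names and the choice of rules are assumptions. Note that lower-casing the whole string also lower-cases string literals, which a stricter normalizer would avoid.

```python
import re


def normalize_sql(sql: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lower-case."""
    sql = re.sub(r"\s+", " ", sql).strip().rstrip(";").strip()
    return sql.lower()


def exact_match(predicted: str, reference: str) -> bool:
    """Strict string equality."""
    return predicted == reference


def normalized_exact_match(predicted: str, reference: str) -> bool:
    """Equality after surface normalization."""
    return normalize_sql(predicted) == normalize_sql(reference)
```

Neither metric recognizes semantically equivalent queries that are written differently, which is why execution accuracy is reported alongside them.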

Evaluation artifacts:

evaluation/smoke_64/eval_results.json
evaluation/smoke_64/eval_report.md
evaluation/smoke_64/predictions.jsonl
evaluation/smoke_64/metrics_overview.svg
evaluation/smoke_64/risk_rates.svg

Run comparison report:

reports/run_comparison.md
reports/run_comparison/comparison_quality.svg
reports/run_comparison/comparison_risk_latency.svg

Release Baseline

The public release baseline is:

rishhh/schemasage-sql-qwen3-4b-clean-balanced-200

Its re-evaluated held-out metrics are summarized in:

reports/reparsed_clean_balanced_200.md
reports/reparsed_run_comparison.md
reports/run_comparison/comparison_quality.svg
reports/run_comparison/comparison_risk_latency.svg

The larger rishhh/schemasage-sql-qwen3-4b-clean-balanced-8k-600-v2 run improved semantic metrics but missed one blocked refusal and scored slightly lower on parse validity and schema adherence, so it remains an experimental comparison candidate rather than the release baseline.

Safety Policy

The project defaults to read-only analytical SQL.

The safety layer blocks or refuses:

DROP
DELETE
TRUNCATE
ALTER
UPDATE
INSERT
MERGE
REPLACE
CREATE DATABASE
CREATE USER
GRANT
REVOKE
EXEC and EXECUTE
CALL
COPY, LOAD, and UNLOAD
Multiple SQL statements
Prompt-injection-like requests to ignore safety or schema rules

Safety checks are implemented in src/inference/safety.py using lexical checks plus sqlglot parsing.
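
The lexical half of such a check might look like the sketch below. It is not the contents of src/inference/safety.py: the function name and return shape are assumptions, it uses only the standard library, and it omits the sqlglot parsing step entirely. A purely lexical check is conservative by design, so it also rejects harmless uses such as the REPLACE() string function or a blocked word inside a string literal.

```python
import re

# Keywords taken from the blocked list above.
BLOCKED = (
    "DROP", "DELETE", "TRUNCATE", "ALTER", "UPDATE", "INSERT", "MERGE",
    "REPLACE", "CREATE DATABASE", "CREATE USER", "GRANT", "REVOKE",
    "EXEC", "EXECUTE", "CALL", "COPY", "LOAD", "UNLOAD",
)
# Word-boundary match; multi-word keywords tolerate any whitespace between words.
BLOCKED_RE = re.compile(
    r"\b(?:" + "|".join(k.replace(" ", r"\s+") for k in BLOCKED) + r")\b",
    re.IGNORECASE,
)


def lexical_safety_check(sql: str) -> tuple[bool, str]:
    """Return (is_safe, reason) using keyword and statement-count checks only."""
    statements = [s for s in sql.split(";") if s.strip()]
    if len(statements) > 1:
        return False, "multiple statements"
    match = BLOCKED_RE.search(sql)
    if match:
        return False, f"blocked keyword: {match.group(0).upper()}"
    return True, "ok"
```

Identifiers such as last_update pass because the underscore prevents a word-boundary match, while a bare UPDATE statement is rejected.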

Limitations

The public release baseline is the 200-step safety-balanced adapter; the larger 8k/600 follow-up remains experimental.
The uploaded Qwen3 4B smoke adapter exists only to validate the cloud trainer and Hub upload path.
A tiny local smoke adapter exists only to validate the local trainer path.
Re-evaluated held-out predictions are published for the release baseline, but the larger follow-up still serves as the main comparison point for future improvement work.
The smoke adapter is undertrained; exact match is low despite good parse validity and schema adherence.
The model can continue generating prompt-like text after the canonical response, so inference uses canonical-response parsing and should add stricter stop criteria.
QLoRA training is expected to require CUDA-capable GPU hardware.
Apple Silicon is suitable for development, tests, data prep, dry-runs, and the Gradio app, but not the default 4-bit QLoRA path.
SQL equivalence is difficult; exact match is not enough to judge model quality.
SQLite execution evaluation skips dialect-incompatible examples.
Schema adherence checks can still miss complex aliasing or dialect-specific constructs.
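
The SQLite limitation above can be made concrete with a small sketch of execution-based comparison. The function names and the skip rule are assumptions rather than the project's evaluator: an example is treated as not comparable when the schema or the reference query fails in SQLite, and rows are compared without regard to order. How the real evaluator populates tables is not stated; with empty tables many different queries return identical empty results.

```python
import sqlite3


def run_query(schema: str, sql: str):
    """Run sql against a fresh in-memory database; return sorted rows or None on error."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None  # dialect-incompatible or otherwise invalid in SQLite
    finally:
        conn.close()


def execution_match(schema: str, predicted_sql: str, reference_sql: str):
    """Return True/False for comparable examples, or None when the example is skipped."""
    reference_rows = run_query(schema, reference_sql)
    if reference_rows is None:
        return None  # reference does not run in SQLite, so skip the example
    predicted_rows = run_query(schema, predicted_sql)
    return predicted_rows is not None and predicted_rows == reference_rows
```

Under this scheme, execution accuracy is the share of True results among non-None results, which is why the metric tables report a separate count of execution-comparable examples.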

Example Usage

python -m src.inference.generate_sql \
  --schema-file data/samples/schema.sql \
  --question "Which product categories had the highest revenue in Q4 2025?" \
  --dry-run

Model-backed inference requires a configured model or adapter path.

Hardware Notes

The repository was developed and validated locally on a MacBook Pro (M5 Pro) with Python 3.11. Full QLoRA training should be run on CUDA GPU hardware or in a managed GPU environment.

How to Improve

Improve execution-equivalence on hard joins and aggregations.
Add Spider-style held-out execution evaluation.
Expand refusal and unanswerable-question training examples.
Add batched inference and stop strings to reduce latency and trailing prompt continuation.
Keep the larger 8k/600 run as an experimental benchmark rather than the shipped baseline until it clearly wins on both quality and safety.
Publish final metrics only after evaluating generated predictions, not references.
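
As an illustration of the stop-string idea in the list above, generated text can be cut at the earliest occurrence of any stop marker during post-processing. The function name and the marker strings in the usage note are hypothetical; the project's actual prompt format is not described in this card.

```python
def truncate_at_stop(text: str, stop_strings) -> str:
    """Cut text at the earliest stop string, if any is present."""
    cut = len(text)
    for stop in stop_strings:
        if not stop:
            continue  # ignore empty markers, which would match at index 0
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()
```

For example, truncate_at_stop(output, ["\n### Question:", "\n### Schema:"]) would drop any prompt-like continuation that follows the SQL. Truncating after generation removes the trailing text but does not save latency; that would require passing stop criteria to the generation loop itself.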

License

Project code is MIT licensed. Dataset and base-model licenses must be reviewed before publishing trained artifacts.
