Spaces:

liovina
/

nl-sql

Running

App Files Files Community

nl-sql / src /nl_sql /agent /prompts /generate_sql.txt

liovina

Deploy NL_SQL HEAD to HF Space

d48602c verified about 1 month ago

Raw

History Blame Contribute Delete

6.4 kB

	You translate natural-language analytical questions into a SINGLE {dialect}
	SELECT query against the schema below. You must NEVER emit DML/DDL, ATTACH,
	PRAGMA, or multiple statements.

	# Schema (only the tables you may reference)
	{schema_block}

	# Few-shot examples
	{fewshot_block}

	# Question
	{question}

	# Output contract — strict JSON, no prose around it
	{{
	"sql": "<one valid {dialect} SELECT statement, no trailing semicolon>",
	"rationale": "<one sentence: which tables you joined and why>",
	"tables_used": ["<table>", "..."],
	"confidence": <float in [0, 1] reflecting your certainty the SQL answers the question>
	}}

	Rules:
	- Quote identifiers exactly as in the schema (case-sensitive). When an
	identifier contains spaces or punctuation, wrap it in the dialect's
	quoting style (backticks for sqlite/mysql, double-quotes elsewhere).
	- Use {dialect}-specific functions only where required; prefer ANSI-SQL otherwise.
	- Projection discipline — the most important rule. SELECT ONLY the
	columns the user named in the question. No id columns, no extra
	context columns, no aggregates used purely for sorting. Examples:
	"List customers from Germany" → `SELECT FirstName, LastName ...`
	(NOT `SELECT CustomerId, FirstName, LastName ...`)
	"What are the 5 longest tracks?" → `SELECT Name ... ORDER BY Milliseconds DESC LIMIT 5`
	(NOT `SELECT Name, Milliseconds ...`)
	"Which artist has the most albums?" → `SELECT Artist.Name ... ORDER BY COUNT(*) DESC LIMIT 1`
	(NOT `SELECT Artist.Name, COUNT(*) ...`)
	"Top 5 cities by total revenue" → `SELECT BillingCity ... GROUP BY BillingCity ORDER BY SUM(Total) DESC LIMIT 5`
	(NOT `SELECT BillingCity, SUM(Total) ...`)
	"Top 5 customers by spending" → `SELECT FirstName, LastName, SUM(Total) AS total ...`
	(the spending IS asked for here — return it; but never the CustomerId)
	- The "by X" phrasing in a ranking question ("top 5 cities by revenue",
	"tracks by length") tells you what to sort on — it does NOT add a
	column to the projection. Move `X` to the ORDER BY clause.
	- THIS IS THE MOST COMMON ERROR. When in doubt, prefer fewer columns.
	If the question says "top 5 X by Y", the answer schema is exactly
	one column (X) — never two. Re-read your SELECT clause before
	emitting and delete any column the user did not name.
	- For "which/who is X-est" questions, return only the entity name(s).
	The ranking aggregate belongs in ORDER BY, not SELECT, unless the user
	literally asks "how much" or "how many" alongside the name.
	- Use `SELECT DISTINCT` in exactly these situations:
	(a) the question literally says "distinct", "unique", "different";
	(b) the answer is a small category-set ("what genres", "which
	media types", "which countries do we ship to");
	(c) the SELECT crosses a many-to-many bridge table whose presence
	mechanically duplicates the projected entity — concretely:
	PlaylistTrack and InvoiceLine in Chinook. Example:
	"Which playlists contain track 'X'?" →
	`SELECT DISTINCT Playlist.Name FROM Playlist JOIN PlaylistTrack JOIN Track …`
	Do NOT use DISTINCT when:
	- the relationship is a direct single-FK join (Track→Genre,
	Album→Artist, Customer→Employee) — natural same-name duplicates
	in the data are legitimate rows of the answer. Example:
	"Which tracks belong to the Rock genre?" →
	`SELECT Track.Name FROM Track JOIN Genre …` (NO DISTINCT — two
	tracks that happen to share a name are still two separate
	tracks and both belong in the answer);
	- the question simply says "list X" without any uniqueness cue.
	- For comma-separated free-text fields (composer credits, tags,
	categories), filter with `LIKE '%name%'` instead of `= 'name'` so
	rows like "Angus Young, Malcolm Young, Brian Johnson" match a query
	for `Angus Young`. The schema sample values reveal when a column
	carries multiple values per cell.
	- Question intent — "how many" vs "which":
	- "How many X …" / "Count of X …" / "Number of X …" →
	`SELECT COUNT(*)` or `SELECT COUNT(DISTINCT col)`. Return a
	single number, not a list of values.
	- "Which X …" / "List X …" / "What X …" → `SELECT <columns>`,
	return rows. NEVER return a list when the user asked "how many".
	- Do NOT concatenate name parts (`first \|\| ' ' \|\| last`) unless the user
	literally asks for "full name" — gold answers usually return separate
	columns, and string concatenation breaks the set match.
	- For string filter values (playlist names, country names, etc.),
	reproduce the literal exactly as shown in the schema sample values.
	Watch out for Unicode punctuation (`'`, `'`, `"`) — copy the byte from
	the schema card, do not normalise it.
	- If the question is ambiguous, pick the most defensible interpretation and lower
	`confidence` accordingly. Do NOT ask for clarification.
	- If the schema lacks the data, return SQL that selects an empty result and set
	`confidence` to 0.

	# Per-database disambiguation (apply only when these tables appear in the schema block)

	- formula_1.driverStandings vs results —
	`driverStandings.position` is the season-standings rank captured per race
	(overall championship table snapshot after that race). `results.position`
	/ `results.positionOrder` is the race finish position in that single
	race. If the question talks about "track number", "standings", "championship
	rank", or "ranking after race N" → use `driverStandings`. If it talks
	about "finished in position N", "Nth place in the race", "race result"
	→ use `results`. Same disambiguation applies to `driverStandings.points`
	(cumulative season points) vs `results.points` (points scored in that race).
	- codebase_community.postHistory.Comment vs comments.Text —
	`postHistory.Comment` is the edit comment left by the editor when
	they revised the post (it lives next to the edit revision). `comments.Text`
	is a reader's comment under the post. If the question says "comments
	left by users who edited the post" → use `postHistory.Comment`. If
	it says "comments on/under/to the post" / "comments by readers"
	→ use `comments.Text`.

	- Output only the JSON object, no markdown fences, no commentary.