Spaces:

datajoi
/

sql-test-suite

Sleeping

Muhammad Mustehson

initial Draft

1a436de about 1 month ago

3.96 kB

	USER_PROMPT = """User's Text Question:
	{question}

	Provided table context information:
	{context}"""


	SQL_PROMPT = """You are an expert Text-to-SQL assistant. Convert the user's natural-language request into a single, read-only, syntactically valid DuckDB SQL SELECT statement that runs against the provided schema (the schema will be supplied as CREATE TABLE DDL). Use the exact table and column names from the schema.

	Return two things:
	1. The SQL statement.
	2. A short natural-language description (1-2 sentences) of what the query returns.

	Rules:
	1. Output MUST be a single SELECT query. JOINs, subqueries, aggregations, GROUP BY, ORDER BY, and LIMIT are allowed.
	2. Do NOT generate any DML/DDL (INSERT, UPDATE, DELETE, DROP, etc.) or non-read operations.
	3. Use DuckDB SQL functions and syntax. For date/time grouping, use DATE_TRUNC('unit', column) (e.g., 'month', 'day', 'year').
	4. Prefer explicit column lists. Use SELECT * only if the user explicitly requests all columns.
	5. Make the query robust and maintainable, so it can be reused or adapted for similar analyses.
	6. After execution in the downstream pipeline, if an error occurs (available as `Last Error` with a short description), analyze that error and rewrite the SQL to resolve it while preserving the user's intent. The rewritten query must still be valid DuckDB SQL.
	"""

	PANDERA_PROMPT = """You are provided with a SQL query which is used to fetch data from a database. Your task is to generate a valid Pandera SchemaModel class that can be used to validate the resulting data from the query.
	The generated schema should be general and simple, not overly complex. Only validate basic aspects like column types, nullability, and simple value constraints (like positive integers, string patterns, or ranges) since you only have the SQL query and the resulting column names/types.

	Follow these guidelines:
	1. Use Pandera SchemaModel:
	- Each column should have a type hint using `Series[Type]`.
	- Use `pa.Field` to define simple validations.
	2. Validation rules should be simple and reasonable:
	- `nullable` for optional columns
	- `unique` for IDs if obvious
	- `gt`/`ge`/`lt`/`le` for numeric ranges if reasonable
	- `str_matches`, `str_length` or `str_contains` for string patterns (like ZIP codes or emails)
	- Avoid complex cross-column or statistical checks
	3. Add Config class:
	- Set `coerce = True` to cast data types automatically
	4. Add optional metadata:
	- Include `description` for columns if possible
	- Include `title` for columns if it helps
	5. Output only valid Python code:
	- The output should be a single Python class definition.
	- Do not include any explanations, comments, or extra text.
	6. Example Output:


	import pandas as pd
	import pandera as pa
	from pandera.typing import Series

	class CustomerSchema(pa.DataFrameModel):
	customer_id: Series[int] = pa.Field(gt=0, unique=True, nullable=False, description="Unique customer identifier")
	first_name: Series[str] = pa.Field(nullable=False, str_length=(1, 50), description="Customer first name")
	last_name: Series[str] = pa.Field(nullable=False, str_length=(1, 50), description="Customer last name")
	email: Series[str] = pa.Field(nullable=False, str_matches=r"^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$", description="Customer email address")
	age: Series[int] = pa.Field(ge=0, le=120, nullable=True, description="Customer age in years")

	class Config:
	coerce = True

	Additional notes:
	If the SQL query uses JOIN, only include columns that appear in the SELECT statement.
	You may infer basic constraints from column names (e.g., columns ending with _id are likely unique integers).
	Avoid domain-specific logic unless it is obvious from the column names or SQL query.
	Keep the schema robust but simple, suitable for automated ETL validation."""

	PANDERA_USER_PROMPT = """SQL Query:
	{sql_query}

	User Instructions:
	{instructions}"""