---
license: apache-2.0
base_model:
- meta-llama/Llama-3.2-3B
library_name: peft
---
# Model Card for llama3-sql2plan
This model is a fine-tuned version of [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) using LoRA (Low-Rank Adaptation) for the task of generating PostgreSQL execution plans from SQL queries. The model takes SQL queries as input and outputs the corresponding PostgreSQL execution plan in JSON format.
## Model Details
### Model Description
This model is specifically designed to convert SQL queries into PostgreSQL execution plans. It was fine-tuned using Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters, allowing efficient training while maintaining the base model's capabilities.
- **Developed by:** Anirudh Bharadwaj
- **Model type:** Causal Language Model (Decoder-only)
- **Language(s) (NLP):** English (SQL and JSON)
- **License:** Apache 2.0
- **Finetuned from model:** [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)
### Model Sources
- **Repository:** [abharadwaj123/llama3-sql2plan](https://huggingface.co/abharadwaj123/llama3-sql2plan)
- **Base Model:** [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)
## Uses
### Direct Use
This model can be used directly to generate PostgreSQL execution plans from SQL queries. It is intended for:
- Database query optimization analysis
- Understanding query execution strategies
- Educational purposes for learning PostgreSQL query planning
- Database performance analysis tools
### Example Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# This repo hosts LoRA adapters; with `peft` installed, transformers loads
# the base Llama-3.2-3B model and applies the adapter automatically.
model_name = "abharadwaj123/llama3-sql2plan"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Wrap the SQL query in the same [QUERY]/[PLAN] template used during training.
sql_query = "SELECT * FROM users WHERE age > 25;"
prompt = (
    "Generate the PostgreSQL execution plan in JSON format for the SQL query.\n\n"
    "[QUERY]\n" + sql_query + "\n\n[PLAN]\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # Llama tokenizers define no pad token
    )

# Everything after the [PLAN] marker is the generated execution plan.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
plan = generated_text.split("[PLAN]\n")[-1].strip()
print(plan)
```
### Out-of-Scope Use
This model should not be used for:
- Generating actual executable SQL queries (it generates execution plans, not queries)
- Real-time database query execution
- Production database systems without proper validation
- Any use case requiring guaranteed accuracy of execution plans
## Bias, Risks, and Limitations
### Limitations
- **Accuracy**: The model generates execution plans based on training data patterns and may not always produce accurate or optimal plans for all SQL queries.
- **PostgreSQL-specific**: The model is trained specifically for PostgreSQL execution plans and may not be suitable for other database systems.
- **Training Data Scope**: The model was trained on a subset of Stack Overflow data (10,000 samples from ~16,332 available), which may not cover all SQL query patterns.
- **No Database Context**: The model does not have access to actual database schema, indexes, or statistics, which are crucial for accurate execution plan generation.
### Recommendations
Users should:
- Validate generated execution plans against actual PostgreSQL `EXPLAIN` output (a sketch follows this list)
- Not rely solely on this model for critical database optimization decisions
- Use this model as a tool for understanding and learning, not as a replacement for actual database query planning
- Be aware that execution plans may vary based on database configuration, schema, and data distribution
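To validate a generated plan, one option is to fetch the reference plan directly from PostgreSQL. A minimal sketch, assuming `psycopg2` is installed and a database is reachable; the connection string and the `users` table are placeholders:

```python
import json
import psycopg2

# Placeholder connection string; point this at your own database.
conn = psycopg2.connect("dbname=mydb user=postgres")
with conn.cursor() as cur:
    # EXPLAIN (FORMAT JSON) returns the planner's output as a JSON document.
    cur.execute("EXPLAIN (FORMAT JSON) SELECT * FROM users WHERE age > 25;")
    row = cur.fetchone()[0]
conn.close()

# Depending on driver typecasting, the plan arrives pre-decoded or as text.
reference_plan = row if isinstance(row, list) else json.loads(row)
print(json.dumps(reference_plan, indent=2))
```

The model's output can then be compared against this reference, bearing in mind that real plans depend on schema, indexes, statistics, and server configuration.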
## How to Get Started with the Model
The snippet under [Example Usage](#example-usage) above is a complete starting point: load the tokenizer and model, wrap the SQL query in the `[QUERY]`/`[PLAN]` prompt template, generate, and take the text after the `[PLAN]` marker as the plan.
## Training Details
### Training Data
The model was trained on a dataset derived from Stack Overflow data (`stackoverflow_n18147.csv`), containing SQL queries and their corresponding PostgreSQL execution plans in JSON format.
- **Total samples in dataset:** 18,147
- **Training samples used:** 10,000, sampled from the first 90% of the dataset (~16,332 rows)
- **Sampling method:** Random sampling with `random_state=42`
- **Data format:** SQL query text paired with PostgreSQL execution plan JSON
The training data was filtered to remove rows with missing `sql_text` or `plan_json` values.
### Training Procedure
#### Preprocessing
1. **Data Loading**: Loaded CSV file and filtered out rows with missing SQL text or plan JSON
2. **Data Splitting**: Used the first 90% of the dataset as the training pool, then randomly sampled 10,000 examples
3. **Formatting**: Each example was formatted with a prompt template:
```
Generate the PostgreSQL execution plan in JSON format for the SQL query.

[QUERY]
{sql_text}

[PLAN]
{plan_json}
```
4. **Tokenization**:
   - Maximum sequence length: 2048 tokens
   - Prompt tokens were masked in the labels (set to -100), so the loss is computed only on the plan portion; see the sketch below
   - Sequences were padded to `max_length`
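A minimal sketch of this preprocessing pipeline, assuming pandas and the base Llama tokenizer; the column names follow the dataset description above, while the helper and the exact masking logic are illustrative:

```python
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer.pad_token = tokenizer.eos_token  # Llama defines no pad token

MAX_LEN = 2048
TEMPLATE = (
    "Generate the PostgreSQL execution plan in JSON format for the SQL query.\n\n"
    "[QUERY]\n{sql_text}\n\n[PLAN]\n"
)

# Load, filter, take the first 90% as the training pool, then sample 10,000 rows.
df = pd.read_csv("stackoverflow_n18147.csv").dropna(subset=["sql_text", "plan_json"])
pool = df.iloc[: int(len(df) * 0.9)]
train_df = pool.sample(n=10_000, random_state=42)

def build_example(row):
    prompt = TEMPLATE.format(sql_text=row["sql_text"])
    tokens = tokenizer(prompt + row["plan_json"], max_length=MAX_LEN,
                       padding="max_length", truncation=True)
    prompt_len = len(tokenizer(prompt)["input_ids"])
    # Mask the prompt and padding so the loss covers only the plan tokens.
    labels = [
        tok if i >= prompt_len and mask == 1 else -100
        for i, (tok, mask) in enumerate(zip(tokens["input_ids"],
                                            tokens["attention_mask"]))
    ]
    return {**tokens, "labels": labels}

examples = [build_example(row) for _, row in train_df.iterrows()]
```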
#### Training Hyperparameters
- **Training regime:** FP16 mixed precision training
- **LoRA Configuration:**
- `r`: 64
- `lora_alpha`: 32
- `lora_dropout`: 0.05
- `target_modules`: ["q_proj", "k_proj", "v_proj", "o_proj"]
- **Training Arguments:**
- `per_device_train_batch_size`: 2
- `gradient_accumulation_steps`: 8
- Effective batch size: 16 (2 × 8 via gradient accumulation)
- `learning_rate`: 1e-5
- `warmup_ratio`: 0.03
- `num_train_epochs`: 2
- `gradient_checkpointing`: True
- `logging_steps`: 20
- `save_strategy`: "epoch"
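These values map onto the standard `peft` and `transformers` APIs roughly as follows. This is a reconstruction from the hyperparameters above, not the original training script:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

training_args = TrainingArguments(
    output_dir="llama3-sql2plan",     # illustrative output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch size 16
    learning_rate=1e-5,
    warmup_ratio=0.03,
    num_train_epochs=2,
    gradient_checkpointing=True,
    logging_steps=20,
    save_strategy="epoch",
    fp16=True,                        # FP16 mixed precision
)
```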
#### Testing Data
The remaining 10% of the original dataset (~1,815 samples) was held out and could be used for evaluation.
## Model Examination
The model uses LoRA (Low-Rank Adaptation) fine-tuning, which allows efficient training by only updating a small number of parameters (low-rank matrices) while keeping the base model weights frozen. This approach:
- Reduces memory requirements during training
- Enables faster training compared to full fine-tuning
- Maintains the base model's general capabilities
- Allows easy merging of the adapters into the base model (see the sketch below)
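Merging uses the standard `peft` API; a minimal sketch:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
model = PeftModel.from_pretrained(base, "abharadwaj123/llama3-sql2plan")

# Fold the LoRA weights into the base weights and drop the adapter wrappers,
# yielding a plain model that no longer needs `peft` at inference time.
merged = model.merge_and_unload()
merged.save_pretrained("llama3-sql2plan-merged")  # illustrative output path
```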
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** Transformer-based decoder-only language model (Llama-3.2-3B)
- **Objective:** Causal language modeling, with prompt tokens excluded from the loss (labels set to -100)
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation) via PEFT
- **Base Model Parameters:** ~3 billion
- **Trainable Parameters:** Significantly reduced via LoRA (exact count depends on LoRA rank and target modules)
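The exact count can be read off the loaded adapter with the `peft` helper; a short sketch (note `is_trainable=True`, since adapters load frozen by default):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
model = PeftModel.from_pretrained(
    base, "abharadwaj123/llama3-sql2plan", is_trainable=True
)

# Prints trainable vs. total parameter counts and the trainable percentage.
model.print_trainable_parameters()
```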
## Model Card Authors
Anirudh Bharadwaj
## Model Card Contact
For questions or issues, please contact through the Hugging Face model repository: [abharadwaj123/llama3-sql2plan](https://huggingface.co/abharadwaj123/llama3-sql2plan)