---
license: apache-2.0
base_model:
- meta-llama/Llama-3.2-3B
library_name: peft
---
# Model Card for llama3-sql2plan
This model is a fine-tuned version of [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) using LoRA (Low-Rank Adaptation) for the task of generating PostgreSQL execution plans from SQL queries. The model takes SQL queries as input and outputs the corresponding PostgreSQL execution plan in JSON format.
## Model Details
### Model Description
This model is specifically designed to convert SQL queries into PostgreSQL execution plans. It was fine-tuned using Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters, allowing efficient training while maintaining the base model's capabilities.
- **Developed by:** Anirudh Bharadwaj
- **Model type:** Causal Language Model (Decoder-only)
- **Language(s) (NLP):** English (SQL and JSON)
- **License:** Apache 2.0
- **Finetuned from model:** [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)
### Model Sources
- **Repository:** [abharadwaj123/llama3-sql2plan](https://huggingface.co/abharadwaj123/llama3-sql2plan)
- **Base Model:** [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)
## Uses
### Direct Use
This model can be used directly to generate PostgreSQL execution plans from SQL queries. It is intended for:
- Database query optimization analysis
- Understanding query execution strategies
- Educational purposes for learning PostgreSQL query planning
- Database performance analysis tools
### Example Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# This repo hosts LoRA adapters; with `peft` installed, transformers loads
# the base Llama-3.2-3B model and applies the adapter automatically.
model_name = "abharadwaj123/llama3-sql2plan"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Wrap the SQL query in the same [QUERY]/[PLAN] template used during training.
sql_query = "SELECT * FROM users WHERE age > 25;"
prompt = (
    "Generate the PostgreSQL execution plan in JSON format for the SQL query.\n\n"
    "[QUERY]\n" + sql_query + "\n\n[PLAN]\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # Llama tokenizers define no pad token
    )

# Everything after the [PLAN] marker is the generated execution plan.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
plan = generated_text.split("[PLAN]\n")[-1].strip()
print(plan)
```
### Out-of-Scope Use
This model should not be used for:
- Generating actual executable SQL queries (it generates execution plans, not queries)
- Real-time database query execution
- Production database systems without proper validation
- Any use case requiring guaranteed accuracy of execution plans
## Bias, Risks, and Limitations
### Limitations
- **Accuracy**: The model generates execution plans based on training data patterns and may not always produce accurate or optimal plans for all SQL queries.
- **PostgreSQL-specific**: The model is trained specifically for PostgreSQL execution plans and may not be suitable for other database systems.
- **Training Data Scope**: The model was trained on a subset of Stack Overflow data (10,000 samples from ~16,332 available), which may not cover all SQL query patterns.
- **No Database Context**: The model does not have access to actual database schema, indexes, or statistics, which are crucial for accurate execution plan generation.
### Recommendations
Users should:
- Validate generated execution plans against actual PostgreSQL `EXPLAIN` output (a sketch follows this list)
- Not rely solely on this model for critical database optimization decisions
- Use this model as a tool for understanding and learning, not as a replacement for actual database query planning
- Be aware that execution plans may vary based on database configuration, schema, and data distribution
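To validate a generated plan, one option is to fetch the reference plan directly from PostgreSQL. A minimal sketch, assuming `psycopg2` is installed and a database is reachable; the connection string and the `users` table are placeholders:

```python
import json
import psycopg2

# Placeholder connection string; point this at your own database.
conn = psycopg2.connect("dbname=mydb user=postgres")
with conn.cursor() as cur:
    # EXPLAIN (FORMAT JSON) returns the planner's output as a JSON document.
    cur.execute("EXPLAIN (FORMAT JSON) SELECT * FROM users WHERE age > 25;")
    row = cur.fetchone()[0]
conn.close()

# Depending on driver typecasting, the plan arrives pre-decoded or as text.
reference_plan = row if isinstance(row, list) else json.loads(row)
print(json.dumps(reference_plan, indent=2))
```

The model's output can then be compared against this reference, bearing in mind that real plans depend on schema, indexes, statistics, and server configuration.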
## How to Get Started with the Model
The snippet under [Example Usage](#example-usage) above is a complete starting point: load the tokenizer and model, wrap the SQL query in the `[QUERY]`/`[PLAN]` prompt template, generate, and take the text after the `[PLAN]` marker as the plan.
## Training Details
### Training Data
The model was trained on a dataset derived from Stack Overflow data (`stackoverflow_n18147.csv`), containing SQL queries and their corresponding PostgreSQL execution plans in JSON format.
- **Total samples in dataset:** 18,147
- **Training samples used:** 10,000, sampled from the first 90% of the dataset (~16,332 rows)
- **Sampling method:** Random sampling with `random_state=42`
- **Data format:** SQL query text paired with PostgreSQL execution plan JSON
The training data was filtered to remove rows with missing `sql_text` or `plan_json` values.
### Training Procedure
#### Preprocessing
1. **Data Loading**: Loaded CSV file and filtered out rows with missing SQL text or plan JSON
2. **Data Splitting**: Used the first 90% of the dataset as the training pool, then randomly sampled 10,000 examples
3. **Formatting**: Each example was formatted with a prompt template:
```
Generate the PostgreSQL execution plan in JSON format for the SQL query.

[QUERY]
{sql_text}

[PLAN]
{plan_json}
```
4. **Tokenization**:
   - Maximum sequence length: 2048 tokens
   - Prompt tokens were masked in the labels (set to -100), so the loss is computed only on the plan portion; see the sketch below
   - Sequences were padded to `max_length`
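A minimal sketch of this preprocessing pipeline, assuming pandas and the base Llama tokenizer; the column names follow the dataset description above, while the helper and the exact masking logic are illustrative:

```python
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer.pad_token = tokenizer.eos_token  # Llama defines no pad token

MAX_LEN = 2048
TEMPLATE = (
    "Generate the PostgreSQL execution plan in JSON format for the SQL query.\n\n"
    "[QUERY]\n{sql_text}\n\n[PLAN]\n"
)

# Load, filter, take the first 90% as the training pool, then sample 10,000 rows.
df = pd.read_csv("stackoverflow_n18147.csv").dropna(subset=["sql_text", "plan_json"])
pool = df.iloc[: int(len(df) * 0.9)]
train_df = pool.sample(n=10_000, random_state=42)

def build_example(row):
    prompt = TEMPLATE.format(sql_text=row["sql_text"])
    tokens = tokenizer(prompt + row["plan_json"], max_length=MAX_LEN,
                       padding="max_length", truncation=True)
    prompt_len = len(tokenizer(prompt)["input_ids"])
    # Mask the prompt and padding so the loss covers only the plan tokens.
    labels = [
        tok if i >= prompt_len and mask == 1 else -100
        for i, (tok, mask) in enumerate(zip(tokens["input_ids"],
                                            tokens["attention_mask"]))
    ]
    return {**tokens, "labels": labels}

examples = [build_example(row) for _, row in train_df.iterrows()]
```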
#### Training Hyperparameters
- **Training regime:** FP16 mixed precision training
- **LoRA Configuration:**
- `r`: 64
- `lora_alpha`: 32
- `lora_dropout`: 0.05
- `target_modules`: ["q_proj", "k_proj", "v_proj", "o_proj"]
- **Training Arguments:**
- `per_device_train_batch_size`: 2
- `gradient_accumulation_steps`: 8
- Effective batch size: 16 (2 × 8 via gradient accumulation)
- `learning_rate`: 1e-5
- `warmup_ratio`: 0.03
- `num_train_epochs`: 2
- `gradient_checkpointing`: True
- `logging_steps`: 20
- `save_strategy`: "epoch"
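These values map onto the standard `peft` and `transformers` APIs roughly as follows. This is a reconstruction from the hyperparameters above, not the original training script:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

training_args = TrainingArguments(
    output_dir="llama3-sql2plan",     # illustrative output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch size 16
    learning_rate=1e-5,
    warmup_ratio=0.03,
    num_train_epochs=2,
    gradient_checkpointing=True,
    logging_steps=20,
    save_strategy="epoch",
    fp16=True,                        # FP16 mixed precision
)
```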
#### Testing Data
The remaining 10% of the original dataset (~1,815 samples) was held out and could be used for evaluation.
## Model Examination
The model uses LoRA (Low-Rank Adaptation) fine-tuning, which allows efficient training by only updating a small number of parameters (low-rank matrices) while keeping the base model weights frozen. This approach:
- Reduces memory requirements during training
- Enables faster training compared to full fine-tuning
- Maintains the base model's general capabilities
- Allows easy merging of the adapters into the base model (see the sketch below)
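Merging uses the standard `peft` API; a minimal sketch:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
model = PeftModel.from_pretrained(base, "abharadwaj123/llama3-sql2plan")

# Fold the LoRA weights into the base weights and drop the adapter wrappers,
# yielding a plain model that no longer needs `peft` at inference time.
merged = model.merge_and_unload()
merged.save_pretrained("llama3-sql2plan-merged")  # illustrative output path
```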
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** Transformer-based decoder-only language model (Llama-3.2-3B)
- **Objective:** Causal language modeling, with prompt tokens excluded from the loss (labels set to -100)
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation) via PEFT
- **Base Model Parameters:** ~3 billion
- **Trainable Parameters:** Significantly reduced via LoRA (exact count depends on LoRA rank and target modules)
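The exact count can be read off the loaded adapter with the `peft` helper; a short sketch (note `is_trainable=True`, since adapters load frozen by default):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
model = PeftModel.from_pretrained(
    base, "abharadwaj123/llama3-sql2plan", is_trainable=True
)

# Prints trainable vs. total parameter counts and the trainable percentage.
model.print_trainable_parameters()
```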
## Model Card Authors
Anirudh Bharadwaj
## Model Card Contact
For questions or issues, please contact through the Hugging Face model repository: [abharadwaj123/llama3-sql2plan](https://huggingface.co/abharadwaj123/llama3-sql2plan)