---
license: apache-2.0
base_model:
- meta-llama/Llama-3.2-3B
library_name: peft
---

# Model Card for llama3-sql2plan

This model is a fine-tuned version of [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B), trained with LoRA (Low-Rank Adaptation) to generate PostgreSQL execution plans from SQL queries. Given a SQL query as input, the model outputs the corresponding execution plan in JSON format.

## Model Details

### Model Description

This model is specifically designed to convert SQL queries into PostgreSQL execution plans. It was fine-tuned using Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters, allowing efficient training while maintaining the base model's capabilities.

- **Developed by:** Anirudh Bharadwaj
- **Model type:** Causal Language Model (Decoder-only)
- **Language(s) (NLP):** English (SQL and JSON)
- **License:** Apache 2.0
- **Finetuned from model:** [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)

### Model Sources

- **Repository:** [abharadwaj123/llama3-sql2plan](https://huggingface.co/abharadwaj123/llama3-sql2plan)
- **Base Model:** [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)

## Uses

### Direct Use

This model can be used directly to generate PostgreSQL execution plans from SQL queries. It is intended for:

- Database query optimization analysis
- Understanding query execution strategies
- Educational purposes for learning PostgreSQL query planning
- Database performance analysis tools

See [How to Get Started with the Model](#how-to-get-started-with-the-model) below for a runnable inference example.

### Out-of-Scope Use

This model should not be used for:

- Generating actual executable SQL queries (it generates execution plans, not queries)
- Real-time database query execution
- Production database systems without proper validation
- Any use case requiring guaranteed accuracy of execution plans

## Bias, Risks, and Limitations

### Limitations

- **Accuracy**: The model generates execution plans based on training data patterns and may not always produce accurate or optimal plans for all SQL queries.
- **PostgreSQL-specific**: The model is trained specifically for PostgreSQL execution plans and may not be suitable for other database systems.
- **Training Data Scope**: The model was trained on a subset of Stack Overflow data (10,000 samples from ~16,332 available), which may not cover all SQL query patterns.
- **No Database Context**: The model does not have access to actual database schema, indexes, or statistics, which are crucial for accurate execution plan generation.

### Recommendations

Users should:

- Validate generated execution plans against actual PostgreSQL `EXPLAIN` output (see the sketch below)
- Not rely solely on this model for critical database optimization decisions
- Use this model as a tool for understanding and learning, not as a replacement for actual database query planning
- Be aware that execution plans may vary based on database configuration, schema, and data distribution
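
For validation, the reference plan can be pulled straight from PostgreSQL with `EXPLAIN (FORMAT JSON)`. A minimal sketch using `psycopg2`, assuming a reachable database containing the relevant schema (the DSN and query are placeholders):

```python
import json

import psycopg2  # assumed driver; any PostgreSQL client works


def reference_plan(dsn: str, sql_query: str):
    """Fetch the ground-truth execution plan from PostgreSQL as JSON."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # EXPLAIN (FORMAT JSON) returns the planner's output
            # without executing the query.
            cur.execute("EXPLAIN (FORMAT JSON) " + sql_query)
            plan = cur.fetchone()[0]
    # Depending on the driver version, the column arrives as a JSON
    # string or as an already-parsed Python list.
    return json.loads(plan) if isinstance(plan, str) else plan


# Example: inspect the top-level node type of the true plan
truth = reference_plan(
    "postgresql://user:pass@localhost/db",  # placeholder DSN
    "SELECT * FROM users WHERE age > 25;",
)
print(truth[0]["Plan"]["Node Type"])
```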

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and model. Because this repository contains a PEFT
# LoRA adapter, transformers resolves the base model and attaches the
# adapter automatically (requires the `peft` package to be installed).
model_name = "abharadwaj123/llama3-sql2plan"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Prepare the input using the same prompt template as training
sql_query = "SELECT * FROM users WHERE age > 25;"
prompt = (
    "Generate the PostgreSQL execution plan in JSON format for the SQL query.\n\n"
    "[QUERY]\n" + sql_query + "\n\n[PLAN]\n"
)

# Generate the plan
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
    )

# Extract everything after the [PLAN] marker
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
plan = generated_text.split("[PLAN]\n")[-1].strip()
print(plan)
```

## Training Details

### Training Data

The model was trained on a dataset derived from Stack Overflow data (`stackoverflow_n18147.csv`), containing SQL queries and their corresponding PostgreSQL execution plans in JSON format.

- **Total samples in dataset:** 18,147
- **Training samples used:** 10,000 (sampled from the first 90% of the dataset, ~16,332 samples)
- **Sampling method:** Random sampling with `random_state=42`
- **Data format:** SQL query text paired with PostgreSQL execution plan JSON

The training data was filtered to remove rows with missing `sql_text` or `plan_json` values.
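
A minimal sketch of this loading and sampling procedure, assuming `pandas` and the file and column names given above (the actual training script may differ in detail):

```python
import pandas as pd

# Load the Stack Overflow dump and drop rows missing either field
df = pd.read_csv("stackoverflow_n18147.csv")
df = df.dropna(subset=["sql_text", "plan_json"])

# The first 90% forms the training pool; the rest is held out
split = int(len(df) * 0.9)  # ~16,332 rows
train_pool, heldout = df.iloc[:split], df.iloc[split:]

# Reproducibly sample 10,000 training examples
train_df = train_pool.sample(n=10_000, random_state=42)
```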

### Training Procedure

#### Preprocessing

1. **Data Loading**: Loaded the CSV file and filtered out rows with missing SQL text or plan JSON.
2. **Data Splitting**: Used the first 90% of the dataset as the training pool, then randomly sampled 10,000 examples.
3. **Formatting**: Each example was formatted with a prompt template:

   ```
   Generate the PostgreSQL execution plan in JSON format for the SQL query.

   [QUERY]
   {sql_text}

   [PLAN]
   {plan_json}
   ```

4. **Tokenization** (sketched below):
   - Maximum sequence length: 2048 tokens
   - Input prompt tokens were masked in the labels (set to -100) so the model is trained only on plan generation
   - Padding to `max_length`
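
The label masking can be implemented along these lines; a minimal sketch assuming a Hugging Face tokenizer (`tokenize_example` and its exact padding handling are illustrative, not the original training code):

```python
PROMPT_TEMPLATE = (
    "Generate the PostgreSQL execution plan in JSON format for the SQL query.\n\n"
    "[QUERY]\n{sql_text}\n\n[PLAN]\n"
)


def tokenize_example(tokenizer, sql_text: str, plan_json: str, max_length: int = 2048):
    prompt = PROMPT_TEMPLATE.format(sql_text=sql_text)
    full = prompt + plan_json

    enc = tokenizer(full, max_length=max_length, truncation=True,
                    padding="max_length")
    labels = list(enc["input_ids"])

    # Mask the prompt so the loss is computed only on the plan tokens
    prompt_len = len(tokenizer(prompt)["input_ids"])
    labels[:prompt_len] = [-100] * prompt_len

    # Also mask padding positions
    labels = [tok if mask == 1 else -100
              for tok, mask in zip(labels, enc["attention_mask"])]
    enc["labels"] = labels
    return enc
```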

#### Training Hyperparameters

- **Training regime:** FP16 mixed precision training
- **LoRA Configuration** (see the sketch below):
  - `r`: 64
  - `lora_alpha`: 32
  - `lora_dropout`: 0.05
  - `target_modules`: ["q_proj", "k_proj", "v_proj", "o_proj"]
- **Training Arguments:**
  - `per_device_train_batch_size`: 2
  - `gradient_accumulation_steps`: 8
  - Effective batch size: 16 (2 × 8)
  - `learning_rate`: 1e-5
  - `warmup_ratio`: 0.03
  - `num_train_epochs`: 2
  - `gradient_checkpointing`: True
  - `logging_steps`: 20
  - `save_strategy`: "epoch"
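
A reconstruction of how these hyperparameters map onto the `peft` and `transformers` APIs; this is a sketch based on the values above, not the verbatim training script (`output_dir` is a placeholder):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Base model, loaded in half precision for FP16 training
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B", torch_dtype=torch.float16
)

# LoRA adapters on the four attention projections
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # reports the LoRA parameter count

args = TrainingArguments(
    output_dir="llama3-sql2plan",    # placeholder
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size 16
    learning_rate=1e-5,
    warmup_ratio=0.03,
    num_train_epochs=2,
    gradient_checkpointing=True,
    logging_steps=20,
    save_strategy="epoch",
    fp16=True,
)
```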

#### Testing Data

The remaining 10% of the original dataset (~1,815 samples) was held out and could be used for evaluation.

## Model Examination

The model uses LoRA (Low-Rank Adaptation) fine-tuning, which allows efficient training by updating only a small number of parameters (low-rank matrices) while keeping the base model weights frozen. This approach:

- Reduces memory requirements during training
- Enables faster training compared to full fine-tuning
- Maintains the base model's general capabilities
- Allows easy merging of the adapters into the base model (see the sketch below)
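
For instance, the adapter can be folded into the base weights with `peft`'s `merge_and_unload`, yielding a standalone checkpoint; a minimal sketch:

```python
from peft import AutoPeftModelForCausalLM

# Load base model + adapter, then fold the LoRA weights into the base
model = AutoPeftModelForCausalLM.from_pretrained("abharadwaj123/llama3-sql2plan")
merged = model.merge_and_unload()

# The merged model can be saved and used without the peft dependency
merged.save_pretrained("llama3-sql2plan-merged")
```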

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** Transformer-based decoder-only language model (Llama-3.2-3B)
- **Objective:** Causal language modeling, with prompt tokens masked out of the loss
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation) via PEFT
- **Base Model Parameters:** ~3 billion
- **Trainable Parameters:** Only the LoRA matrices; with `r=64` on the four attention projections of each decoder layer, this is on the order of tens of millions of parameters, roughly 1% of the base model (see the check below)
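
The exact count can be read from the published adapter with `peft`; a quick check, assuming `is_trainable=True` is passed so the adapter weights register as trainable:

```python
from peft import AutoPeftModelForCausalLM

# Load base model + adapter; is_trainable=True keeps the LoRA weights
# marked as trainable so they show up in the count.
model = AutoPeftModelForCausalLM.from_pretrained(
    "abharadwaj123/llama3-sql2plan", is_trainable=True
)
model.print_trainable_parameters()
# prints "trainable params: ... || all params: ... || trainable%: ..."
```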

## Model Card Authors

Anirudh Bharadwaj

## Model Card Contact

For questions or issues, please reach out through the Hugging Face model repository: [abharadwaj123/llama3-sql2plan](https://huggingface.co/abharadwaj123/llama3-sql2plan)