---
license: apache-2.0
base_model:
- meta-llama/Llama-3.2-3B
library_name: peft
---
# Model Card for llama3-sql2plan

This model is a fine-tuned version of [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B), adapted with LoRA (Low-Rank Adaptation) to generate PostgreSQL execution plans from SQL queries. The model takes a SQL query as input and outputs the corresponding PostgreSQL execution plan in JSON format.

## Model Details

### Model Description

This model converts SQL queries into PostgreSQL execution plans. It was fine-tuned using Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters, allowing efficient training while preserving the base model's capabilities.

- **Developed by:** Anirudh Bharadwaj
- **Model type:** Causal Language Model (Decoder-only)
- **Language(s) (NLP):** English (SQL and JSON)
- **License:** Apache 2.0
- **Finetuned from model:** [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)

### Model Sources

- **Repository:** [abharadwaj123/llama3-sql2plan](https://huggingface.co/abharadwaj123/llama3-sql2plan)
- **Base Model:** [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)

## Uses

### Direct Use

This model generates PostgreSQL execution plans from SQL queries. It is intended for:

- Database query optimization analysis
- Understanding query execution strategies
- Educational purposes for learning PostgreSQL query planning
- Database performance analysis tools

### Example Usage

This repository hosts LoRA adapter weights (`library_name: peft`), so the `peft` package must be installed alongside `transformers`; recent `transformers` versions then resolve the base model and apply the adapter automatically when loading via `AutoModelForCausalLM`.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "abharadwaj123/llama3-sql2plan"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

sql_query = "SELECT * FROM users WHERE age > 25;"
prompt = (
    "Generate the PostgreSQL execution plan in JSON format for the SQL query.\n\n"
    "[QUERY]\n" + sql_query + "\n\n[PLAN]\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
plan = generated_text.split("[PLAN]\n")[-1].strip()
print(plan)
```

### Out-of-Scope Use

This model should not be used for:

- Generating executable SQL queries (it generates execution plans, not queries)
- Real-time database query execution
- Production database systems without proper validation
- Any use case requiring guaranteed accuracy of execution plans

## Bias, Risks, and Limitations

### Limitations

- **Accuracy:** The model generates execution plans based on patterns in its training data and may not produce accurate or optimal plans for every SQL query.
- **PostgreSQL-specific:** The model is trained specifically on PostgreSQL execution plans and may not be suitable for other database systems.
- **Training Data Scope:** The model was trained on a subset of Stack Overflow data (10,000 samples from a pool of ~16,332), which may not cover all SQL query patterns.
- **No Database Context:** The model has no access to the actual database schema, indexes, or statistics, all of which are crucial for accurate execution plan generation.

### Recommendations

Users should:

- Validate generated execution plans against actual PostgreSQL `EXPLAIN` output
- Not rely solely on this model for critical database optimization decisions
- Use this model as a tool for understanding and learning, not as a replacement for actual database query planning
- Be aware that execution plans vary with database configuration, schema, and data distribution

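
Following the recommendations above, one cheap sanity check is to confirm that a generated plan parses as JSON before using it. A minimal sketch (the model output below is fabricated for illustration):

```python
import json

# Fabricated model output in the card's prompt format
generated_text = (
    "Generate the PostgreSQL execution plan in JSON format for the SQL query.\n\n"
    "[QUERY]\nSELECT * FROM users WHERE age > 25;\n\n"
    "[PLAN]\n"
    '{"Plan": {"Node Type": "Seq Scan", "Relation Name": "users"}}'
)

# Extract everything after the [PLAN] marker and validate it as JSON
plan_text = generated_text.split("[PLAN]\n")[-1].strip()
try:
    plan = json.loads(plan_text)
    print(plan["Plan"]["Node Type"])  # Seq Scan
except json.JSONDecodeError:
    print("Model output is not valid JSON; regenerate or fall back to EXPLAIN.")
```

For higher-assurance use, compare the parsed plan against the output of `EXPLAIN (FORMAT JSON)` on a real PostgreSQL instance.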
## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer (requires the `peft` package for the LoRA adapter)
model_name = "abharadwaj123/llama3-sql2plan"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Prepare input
sql_query = "SELECT * FROM users WHERE age > 25;"
prompt = (
    "Generate the PostgreSQL execution plan in JSON format for the SQL query.\n\n"
    "[QUERY]\n" + sql_query + "\n\n[PLAN]\n"
)

# Generate the plan
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
    )

# Extract the plan
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
plan = generated_text.split("[PLAN]\n")[-1].strip()
```

## Training Details

### Training Data

The model was trained on a dataset derived from Stack Overflow data (`stackoverflow_n18147.csv`), containing SQL queries paired with their PostgreSQL execution plans in JSON format.

- **Total samples in dataset:** 18,147
- **Training samples used:** 10,000 (sampled from the first 90% of the dataset, a pool of ~16,332 samples)
- **Sampling method:** Random sampling with `random_state=42`
- **Data format:** SQL query text paired with PostgreSQL execution plan JSON

The training data was filtered to remove rows with missing `sql_text` or `plan_json` values.

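
The filter/split/sample pipeline described above can be sketched as follows (illustrative: a 1,000-row synthetic frame stands in for the real 18,147-row CSV, and the sample size is scaled down accordingly):

```python
import pandas as pd

# Synthetic stand-in for stackoverflow_n18147.csv (the real file has 18,147 rows)
df = pd.DataFrame({
    "sql_text": [f"SELECT {i};" for i in range(1000)],
    "plan_json": ['{"Plan": {}}'] * 1000,
})

# Drop rows with missing SQL text or plan JSON
df = df.dropna(subset=["sql_text", "plan_json"])

# First 90% of the dataset as the training pool; remaining 10% held out
split = int(len(df) * 0.9)
train_pool, heldout = df.iloc[:split], df.iloc[split:]

# Randomly sample the training set (the card uses n=10,000 with random_state=42)
train = train_pool.sample(n=100, random_state=42)
print(len(train_pool), len(heldout), len(train))  # 900 100 100
```
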
### Training Procedure

#### Preprocessing

1. **Data Loading:** Loaded the CSV file and filtered out rows with missing SQL text or plan JSON
2. **Data Splitting:** Used the first 90% of the dataset as the training pool, then randomly sampled 10,000 examples
3. **Formatting:** Each example was rendered with the prompt template:

   ```
   Generate the PostgreSQL execution plan in JSON format for the SQL query.

   [QUERY]
   {sql_text}

   [PLAN]
   {plan_json}
   ```

4. **Tokenization:**
   - Maximum sequence length: 2048 tokens
   - Input prompt tokens were masked in the labels (set to -100) so that loss is computed only on plan generation
   - Padding to `max_length`

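
The label masking in step 4 can be illustrated with fabricated token ids (the real ids come from the Llama tokenizer):

```python
# Illustrative label masking: loss is computed only on the plan tokens.
prompt_ids = [101, 2054, 2003]  # fabricated ids for the prompt (query part)
plan_ids = [7592, 2088, 102]    # fabricated ids for the target plan

input_ids = prompt_ids + plan_ids
labels = [-100] * len(prompt_ids) + plan_ids  # -100 is ignored by the loss

print(labels)  # [-100, -100, -100, 7592, 2088, 102]
```
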
#### Training Hyperparameters

- **Training regime:** FP16 mixed precision training
- **LoRA Configuration:**
  - `r`: 64
  - `lora_alpha`: 32
  - `lora_dropout`: 0.05
  - `target_modules`: ["q_proj", "k_proj", "v_proj", "o_proj"]
- **Training Arguments:**
  - `per_device_train_batch_size`: 2
  - `gradient_accumulation_steps`: 8
  - Effective batch size: 16 (2 × 8)
  - `learning_rate`: 1e-5
  - `warmup_ratio`: 0.03
  - `num_train_epochs`: 2
  - `gradient_checkpointing`: True
  - `logging_steps`: 20
  - `save_strategy`: "epoch"

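
These settings map onto `peft` and `transformers` config objects roughly as follows. This is a sketch, not the author's training script; `output_dir` and `task_type` are assumptions not stated in the card:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",  # assumption: standard for decoder-only fine-tuning
)

training_args = TrainingArguments(
    output_dir="llama3-sql2plan",  # assumption: not stated in the card
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    warmup_ratio=0.03,
    num_train_epochs=2,
    gradient_checkpointing=True,
    fp16=True,  # FP16 mixed precision, per the card
    logging_steps=20,
    save_strategy="epoch",
)
```
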
#### Testing Data

The remaining 10% of the original dataset (~1,815 samples) was held out and could be used for evaluation.

## Model Examination

The model uses LoRA (Low-Rank Adaptation) fine-tuning, which updates only a small number of parameters (low-rank matrices) while keeping the base model weights frozen. This approach:

- Reduces memory requirements during training
- Trains faster than full fine-tuning
- Maintains the base model's general capabilities
- Allows easy merging of the adapters into the base model

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** Transformer-based decoder-only language model (Llama-3.2-3B)
- **Objective:** Causal language modeling with prompt tokens masked out of the loss
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation) via PEFT
- **Base Model Parameters:** ~3 billion
- **Trainable Parameters:** Significantly reduced via LoRA (the exact count depends on the LoRA rank and target modules)

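
For a rough sense of scale, the trainable-parameter count can be estimated from the LoRA rank and the projection shapes. The dimensions below are assumed from the published Llama-3.2-3B architecture (hidden size 3072, grouped-query KV projections of width 1024, 28 layers) and are not stated in the card:

```python
# Rough LoRA trainable-parameter estimate for r=64 on q/k/v/o projections.
# Assumed dims (not from the card): hidden=3072, kv_out=1024, 28 layers.
r, hidden, kv_out, layers = 64, 3072, 1024, 28

# Each LoRA adapter on a d_in x d_out projection adds r * (d_in + d_out) params.
per_layer = (
    r * (hidden + hidden)    # q_proj: 3072 -> 3072
    + r * (hidden + kv_out)  # k_proj: 3072 -> 1024
    + r * (hidden + kv_out)  # v_proj: 3072 -> 1024
    + r * (hidden + hidden)  # o_proj: 3072 -> 3072
)
total = per_layer * layers
print(f"{total:,}")  # 36,700,160
```

Under those assumptions LoRA trains roughly 36.7M parameters, on the order of 1% of the ~3B base model.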
## Model Card Authors

Anirudh Bharadwaj

## Model Card Contact

For questions or issues, please contact through the Hugging Face model repository: [abharadwaj123/llama3-sql2plan](https://huggingface.co/abharadwaj123/llama3-sql2plan)