---
license: mit
tags:
- text-to-sql
- education
- socratic-learning
- instruction-tuning
- sql
- STEM
- pedagogy
datasets:
- SQL-Instruct
---
# SQL Socratic Models
## Model Description
SQL Socratic Models are a collection of fine-tuned large language models designed for **Socratic SQL instruction in higher education**. Unlike standard Text-to-SQL systems, these models are trained to **guide learners through reasoning steps without producing final SQL solutions**, supporting conceptual understanding and active learning in STEM contexts.
Supported architectures:
- Phi-3
- Qwen2.5
- Gemma2
---
## Intended Use
These models are designed for:
- Teaching SQL concepts in higher education
- Supporting STEM learners through guided reasoning
- Providing step-by-step Socratic hints for SQL problems
- Assisting with debugging and conceptual clarification
### Important Constraint
The models are intentionally trained to:
- ✅ Provide reasoning steps and conceptual hints
- ❌ Avoid generating complete SQL solutions
This ensures alignment with pedagogical goals such as scaffolding and learner engagement.
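One way to picture this constraint is as a tutoring system prompt wrapped around each learner query. The prompt text and helper below are an illustrative sketch, not the actual training template:

```python
# Illustrative system prompt capturing the pedagogical constraint.
# This is a sketch for documentation purposes, not the exact template
# used during fine-tuning.
SOCRATIC_SYSTEM_PROMPT = (
    "You are a Socratic SQL tutor. Guide the learner with questions, "
    "conceptual hints, and step-by-step reasoning. Never output a "
    "complete, runnable SQL solution."
)

def build_chat(question):
    """Wrap a learner's SQL question in the tutoring context."""
    return [
        {"role": "system", "content": SOCRATIC_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
```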
---
## Training Data: SQL-Instruct Corpus
We construct **SQL-Instruct**, a domain-specific Socratic instruction corpus, by mining high-quality interactions from Stack Overflow. This platform captures real-world misconceptions, debugging challenges, and conceptual gaps encountered by learners and practitioners.
### Data Collection
To ensure high-quality instructional signals, we filter SQL-tagged questions based on community impact. The resulting dataset has:
- **1.27 billion total views**
- **128,535 average views per question**
For each selected entry, we extract:
- Problem descriptions
- User-submitted SQL attempts
- Executable SQL from accepted solutions
This yields **9,916 unique questions**.
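The community-impact filter can be sketched as below. Field names (`tags`, `view_count`, `score`, `accepted_answer`) and thresholds are illustrative, not the exact pipeline; the released corpus reports a median score of 27 and roughly 128K average views per question:

```python
def filter_questions(posts, min_views=10_000, min_score=5):
    """Keep SQL-tagged posts with high community impact and an accepted
    answer. Thresholds and field names here are hypothetical."""
    kept = []
    for post in posts:
        if "sql" not in post["tags"]:
            continue  # only SQL-tagged questions
        if post["view_count"] < min_views or post["score"] < min_score:
            continue  # drop low-impact posts
        if post.get("accepted_answer") is None:
            continue  # an executable accepted solution is required
        kept.append(post)
    return kept
```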
---
### Socratic Augmentation
Each example is transformed into a Socratic instructional format using GPT-4o, which generates:
- Guided reasoning steps
- Conceptual hints
- Question decomposition
This ensures the dataset emphasizes **instructional scaffolding rather than answer generation**.
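The augmentation step amounts to prompting GPT-4o with the mined problem, the learner's attempt, and the accepted solution, while asking it to withhold the SQL itself. The template below is a hypothetical reconstruction, not the exact prompt used:

```python
# Hypothetical reconstruction of the Socratic augmentation prompt.
AUGMENT_TEMPLATE = """You are an SQL instructor. Given a learner's problem and a
reference solution, produce (1) guided reasoning steps, (2) conceptual hints,
and (3) a decomposition of the question into sub-questions.
Do NOT reveal the reference SQL.

Problem: {problem}
Learner attempt: {attempt}
Reference solution (hidden from the learner): {solution}
"""

def build_augmentation_prompt(problem, attempt, solution):
    """Fill the template for one mined Stack Overflow entry."""
    return AUGMENT_TEMPLATE.format(
        problem=problem, attempt=attempt, solution=solution
    )
```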
---
### Dataset Composition
- **Intermediate questions:** 8,604
- **Advanced questions:** 629
- **Debugging tasks:** 531
The dataset emphasizes challenging reasoning scenarios, particularly:
- JOIN operations
- Aggregations and grouping
- Query optimization
We further ensure reliability by selecting entries with a **median Stack Overflow score of 27**.
---
## Training Procedure
### Phase 2: Fine-Tuning
*(Phase 1 corresponds to the SQL-Instruct corpus construction described above.)*
We apply **Full Fine-Tuning (FFT)** on small, open-source LLMs under pedagogical constraints designed to:
- Encourage conceptual scaffolding
- Promote step-by-step reasoning
- Discourage direct SQL answer generation
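A configuration for such a run might look like the fragment below. All hyperparameters are hypothetical, since this card does not report the actual training setup:

```python
from transformers import TrainingArguments

# Hypothetical FFT configuration; the actual hyperparameters are not
# reported in this card.
training_args = TrainingArguments(
    output_dir="sql-socratic-phi3",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,   # small LR, typical for full fine-tuning
    bf16=True,
    logging_steps=50,
    save_strategy="epoch",
)
```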
---
## Evaluation
### Phase 3 Metrics
Models are evaluated using:
- **BERTScore** → semantic alignment with expected reasoning
- **ROUGE-L** → detection of answer leakage (i.e., unintended full SQL generation)
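The leakage check can be sketched with a plain longest-common-subsequence version of ROUGE-L recall: if a large fraction of the reference SQL can be recovered from the model's output, the model has leaked the answer. This is a minimal, whitespace-tokenized sketch, not the evaluation code used in the paper:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
            )
    return dp[-1][-1]

def rouge_l_recall(model_output, gold_sql):
    """Fraction of the gold SQL tokens recoverable (in order) from the
    model output. Values near 1.0 signal answer leakage."""
    ref = gold_sql.lower().split()
    hyp = model_output.lower().split()
    if not ref:
        return 0.0
    return lcs_len(hyp, ref) / len(ref)
```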
---
## Key Contributions
- Socratic SQL instruction tuning for higher education
- SQL-Instruct dataset derived from real-world misconceptions
- Multi-model fine-tuning across Phi-3, Qwen2.5, and Gemma2
- Evaluation framework balancing reasoning quality and answer leakage
- Ablation study identifying factors enabling:
- Misconception-based feedback
- Iterative guidance
- Instructor-like reasoning behavior
---
## Limitations
- Models may still occasionally generate partial SQL fragments
- Evaluation focuses on semantic similarity rather than full pedagogical outcomes
- Dataset is derived from Stack Overflow and may reflect community biases
---
## Ethical Considerations
These models are designed to support learning, not replace it. By avoiding full solution generation, they aim to:
- Encourage critical thinking
- Reduce over-reliance on AI-generated answers
- Support equitable access to SQL learning resources
---
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The path above suggests each architecture (phi3, qwen2.5, gemma2) sits in
# its own subfolder of the repository, loaded via the `subfolder` argument.
model = AutoModelForCausalLM.from_pretrained("sriram882004/SQL-Socratic-Models", subfolder="phi3")
tokenizer = AutoTokenizer.from_pretrained("sriram882004/SQL-Socratic-Models", subfolder="phi3")
```