---
license: mit
tags:
- text-to-sql
- education
- socratic-learning
- instruction-tuning
- sql
- STEM
- pedagogy
datasets:
- SQL-Instruct
---

# SQL Socratic Models

## Model Description

SQL Socratic Models are a collection of fine-tuned large language models designed for **Socratic SQL instruction in higher education**. Unlike standard Text-to-SQL systems, these models are trained to **guide learners through reasoning steps without producing final SQL solutions**, supporting conceptual understanding and active learning in STEM contexts.

Supported architectures:
- Phi-3
- Qwen2.5
- Gemma2

---

## Intended Use

These models are designed for:

- Teaching SQL concepts in higher education  
- Supporting STEM learners through guided reasoning  
- Providing step-by-step Socratic hints for SQL problems  
- Assisting debugging and conceptual clarification  

### Important Constraint

The models are intentionally trained to:
- ✅ Provide reasoning steps and conceptual hints  
- ❌ Never produce complete SQL solutions  

This constraint aligns the models with pedagogical goals such as scaffolding and learner engagement. For example, asked for a query that returns each department's average salary, a model should prompt the learner to consider which aggregate function applies and which column to group by, rather than writing the SELECT statement itself.

---

## Training Data: SQL-Instruct Corpus

We construct **SQL-Instruct**, a domain-specific Socratic instruction corpus, by mining high-quality interactions from Stack Overflow. This platform captures real-world misconceptions, debugging challenges, and conceptual gaps encountered by learners and practitioners.

### Data Collection

To ensure high-quality instructional signals, we filter SQL-tagged questions based on community impact. The resulting dataset has:

- **1.27 billion total views**  
- **128,535 average views per question**  

For each selected entry, we extract:
- Problem descriptions  
- User-submitted SQL attempts  
- Executable SQL from accepted solutions  

This yields **9,916 unique questions**.
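
A minimal sketch of the filtering step is below, assuming the Stack Overflow dump has been loaded into a pandas DataFrame; the file name, column names, and view-count threshold are illustrative, not the exact pipeline:

```python
import pandas as pd

# Hypothetical: posts exported from the Stack Exchange data dump.
posts = pd.read_parquet("stackoverflow_posts.parquet")

# Keep SQL-tagged questions that have an accepted, executable answer.
sql_questions = posts[
    posts["tags"].str.contains("sql", case=False, na=False)
    & posts["accepted_answer_id"].notna()
]

# Illustrative view-count cutoff as a proxy for community impact.
sql_questions = sql_questions[sql_questions["view_count"] >= 50_000]

print(len(sql_questions), "questions retained")
print("mean views per question:", sql_questions["view_count"].mean())
```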

---

### Socratic Augmentation

Each example is transformed into a Socratic instructional format using GPT-4o, which generates:

- Guided reasoning steps  
- Conceptual hints  
- Question decomposition  

This ensures the dataset emphasizes **instructional scaffolding rather than answer generation**.
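
A sketch of the augmentation call using the OpenAI Python SDK follows; the system prompt is illustrative, not the exact prompt used to build SQL-Instruct:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative system prompt capturing the Socratic constraints.
SYSTEM_PROMPT = (
    "You are a Socratic SQL tutor. Given a problem, a learner's SQL "
    "attempt, and the accepted solution, produce guided reasoning steps, "
    "conceptual hints, and a decomposition of the question. "
    "Never reveal the final SQL query."
)

def socratic_augment(problem: str, attempt: str, solution: str) -> str:
    """Transform one Stack Overflow entry into Socratic instruction data."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Problem:\n{problem}\n\n"
                f"Learner attempt:\n{attempt}\n\n"
                f"Accepted solution:\n{solution}"
            )},
        ],
    )
    return response.choices[0].message.content
```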

---

### Dataset Composition

- **Intermediate questions:** 8,604  
- **Advanced questions:** 629  
- **Debugging tasks:** 531  

The dataset emphasizes challenging reasoning scenarios, particularly:

- JOIN operations  
- Aggregations and grouping  
- Query optimization  

As a further reliability signal, the selected entries have a **median Stack Overflow score of 27**.

---

## Training Procedure

### Phase 2: Fine-Tuning

Building on the SQL-Instruct corpus constructed above (Phase 1), we apply **Full Fine-Tuning (FFT)** to small, open-source LLMs under pedagogical constraints designed to:

- Encourage conceptual scaffolding  
- Promote step-by-step reasoning  
- Discourage direct SQL answer generation  
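
A minimal FFT sketch with the Hugging Face `Trainer`, assuming SQL-Instruct has been serialized into JSONL records with a single `text` field containing the problem, Socratic reasoning steps, and hints (no final SQL); the file path and hyperparameters are illustrative:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

base = "microsoft/Phi-3-mini-4k-instruct"  # one of the three base models
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical serialization of SQL-Instruct.
dataset = load_dataset("json", data_files="sql_instruct.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sql-socratic-phi3",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        learning_rate=2e-5,
    ),
    train_dataset=dataset,
    # Causal-LM collator: labels are the input ids, no masking.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```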

---

## Evaluation

### Phase 3 Metrics

Models are evaluated using:

- **BERTScore** → semantic alignment with expected reasoning  
- **ROUGE-L** → detection of answer leakage (i.e., unintended full SQL generation)  
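
A sketch of both metrics with the `evaluate` library; the prediction and reference strings are placeholders, and comparing generations against the gold SQL to flag leakage is one plausible reading of the setup:

```python
import evaluate

bertscore = evaluate.load("bertscore")
rouge = evaluate.load("rouge")

predictions = ["First, ask yourself which tables hold the salary data..."]
references = ["Start by identifying the tables that must be joined..."]

# Semantic alignment of generated guidance with reference reasoning.
bs = bertscore.compute(predictions=predictions, references=references, lang="en")
print("BERTScore F1:", sum(bs["f1"]) / len(bs["f1"]))

# High ROUGE-L overlap with the gold query suggests the model leaked
# the full solution instead of giving hints.
gold_sql = ["SELECT department, AVG(salary) FROM employees GROUP BY department"]
rl = rouge.compute(predictions=predictions, references=gold_sql)
print("ROUGE-L vs. gold SQL:", rl["rougeL"])
```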

---

## Key Contributions

- Socratic SQL instruction tuning for higher education  
- SQL-Instruct dataset derived from real-world misconceptions  
- Multi-model fine-tuning across Phi-3, Qwen2.5, and Gemma2  
- Evaluation framework balancing reasoning quality and answer leakage  
- Ablation study identifying factors enabling:
  - Misconception-based feedback  
  - Iterative guidance  
  - Instructor-like reasoning behavior  

---

## Limitations

- Models may still occasionally generate partial SQL fragments  
- Evaluation focuses on semantic similarity rather than full pedagogical outcomes  
- Dataset is derived from Stack Overflow and may reflect community biases  

---

## Ethical Considerations

These models are designed to support learning, not replace it. By avoiding full solution generation, they aim to:

- Encourage critical thinking  
- Reduce over-reliance on AI-generated answers  
- Support equitable access to SQL learning resources  

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The per-model subfolder layout is assumed; adjust to the repo structure.
repo = "sriram882004/SQL-Socratic-Models"
model = AutoModelForCausalLM.from_pretrained(repo, subfolder="phi3")
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="phi3")
```
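
A minimal inference sketch; the prompt is illustrative, and the chat template is assumed to follow the base model's format:

```python
# Ask for guidance; the model should reply with Socratic hints,
# not a finished query.
messages = [{"role": "user", "content": (
    "My GROUP BY query for average salary per department errors out. "
    "Where should I start?"
)}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```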