---
license: mit
tags:
- text-to-sql
- education
- socratic-learning
- instruction-tuning
- sql
- STEM
- pedagogy
datasets:
- SQL-Instruct
---
# SQL Socratic Models
## Model Description
SQL Socratic Models are a collection of fine-tuned large language models designed for **Socratic SQL instruction in higher education**. Unlike standard Text-to-SQL systems, these models are trained to **guide learners through reasoning steps without producing final SQL solutions**, supporting conceptual understanding and active learning in STEM contexts.
Supported architectures:
- Phi-3
- Qwen2.5
- Gemma2
---
## Intended Use
These models are designed for:
- Teaching SQL concepts in higher education
- Supporting STEM learners through guided reasoning
- Providing step-by-step Socratic hints for SQL problems
- Assisting with debugging and conceptual clarification
### Important Constraint
The models are intentionally trained to:
- ✅ Provide reasoning steps and conceptual hints
- ❌ Avoid generating complete SQL solutions
This ensures alignment with pedagogical goals such as scaffolding and learner engagement.
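One way to operationalize this constraint at inference time is a lightweight output guard. The sketch below is an illustrative assumption, not part of the released models; the regex and function name are hypothetical.

```python
import re

# Illustrative sketch: flag responses that appear to contain a complete,
# runnable SQL query. The pattern is an assumption for demonstration
# purposes, not part of the released models.
FULL_SQL = re.compile(
    r"\bSELECT\b.+?\bFROM\b.+?(;|$)",
    re.IGNORECASE | re.DOTALL,
)

def looks_like_full_solution(response: str) -> bool:
    """Return True when the response seems to include a full query."""
    return bool(FULL_SQL.search(response))
```

A guard like this can be paired with the ROUGE-L leakage metric described below to audit model outputs at scale.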
---
## Training Data: SQL-Instruct Corpus
We construct **SQL-Instruct**, a domain-specific Socratic instruction corpus, by mining high-quality interactions from Stack Overflow. This platform captures real-world misconceptions, debugging challenges, and conceptual gaps encountered by learners and practitioners.
### Data Collection
To ensure high-quality instructional signals, we filter SQL-tagged questions based on community impact. The resulting dataset has:
- **1.27 billion total views**
- **128,535 average views per question**
For each selected entry, we extract:
- Problem descriptions
- User-submitted SQL attempts
- Executable SQL from accepted solutions
This yields **9,916 unique questions**.
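The filtering step above can be sketched as follows. The field names (`tags`, `view_count`, `accepted_answer`, `user_sql`) and the view threshold are assumptions about a Stack Overflow dump schema, not the authors' actual pipeline.

```python
# Hypothetical sketch of the community-impact filter described above.
# Field names and the min_views threshold are illustrative assumptions.
def select_questions(posts, min_views=10_000):
    selected = []
    for post in posts:
        if "sql" not in post["tags"]:
            continue  # keep only SQL-tagged questions
        if post["view_count"] < min_views:
            continue  # filter on community impact (views)
        if not post.get("accepted_answer"):
            continue  # need executable SQL from an accepted solution
        selected.append(
            {
                "problem": post["body"],
                "attempt": post.get("user_sql"),
                "solution": post["accepted_answer"],
            }
        )
    return selected
```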
---
### Socratic Augmentation
Each example is transformed into a Socratic instructional format using GPT-4o, which generates:
- Guided reasoning steps
- Conceptual hints
- Question decomposition
This ensures the dataset emphasizes **instructional scaffolding rather than answer generation**.
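An augmentation request to GPT-4o might be assembled along the following lines. The template wording is purely illustrative; the paper's actual prompt is not reproduced here.

```python
# Hypothetical prompt template for the Socratic augmentation step.
# The wording is an assumption, not the authors' actual GPT-4o prompt.
TEMPLATE = """You are a Socratic SQL tutor.
Problem: {problem}
Student attempt: {attempt}

Produce, without revealing the final query:
1. Guided reasoning steps
2. Conceptual hints
3. A decomposition of the question into sub-questions
"""

def build_augmentation_prompt(problem: str, attempt: str) -> str:
    """Fill the template for one Stack Overflow example."""
    return TEMPLATE.format(problem=problem, attempt=attempt)
```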
---
### Dataset Composition
- **Intermediate questions:** 8,604
- **Advanced questions:** 629
- **Debugging tasks:** 531
The dataset emphasizes challenging reasoning scenarios, particularly:
- JOIN operations
- Aggregations and grouping
- Query optimization
The selected entries have a **median Stack Overflow score of 27**, further supporting the reliability of the instructional signal.
---
## Training Procedure
### Phase 2: Fine-Tuning
We apply **Full Fine-Tuning (FFT)** on small, open-source LLMs under pedagogical constraints designed to:
- Encourage conceptual scaffolding
- Promote step-by-step reasoning
- Discourage direct SQL answer generation
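One plausible way to realize these constraints in the training data itself is to exclude the accepted SQL from the fine-tuning targets, so the model is only ever rewarded for scaffolding. The sketch below is an assumption about how an SQL-Instruct example could be formatted; the field names are hypothetical.

```python
# Illustrative sketch: format an SQL-Instruct example for full
# fine-tuning so the target contains Socratic guidance only, never the
# accepted SQL. Field names are assumptions about the corpus layout.
def to_training_pair(example):
    prompt = (
        "Problem:\n" + example["problem"] + "\n\n"
        "Student attempt:\n" + example["attempt"] + "\n\n"
        "Guide the student without writing the final query."
    )
    # The accepted SQL is deliberately excluded from the label so the
    # model is never rewarded for emitting a complete solution.
    target = "\n".join(example["socratic_steps"])
    return {"prompt": prompt, "target": target}
```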
---
## Evaluation
### Phase 3 Metrics
Models are evaluated using:
- **BERTScore** → semantic alignment with expected reasoning
- **ROUGE-L** → detection of answer leakage (i.e., unintended full SQL generation)
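The leakage check can be approximated with a token-level ROUGE-L recall between a model response and the gold SQL: high overlap suggests the full answer leaked. The sketch below is a simplified stand-in for the full metric, not the paper's evaluation code.

```python
# Minimal ROUGE-L sketch for leakage detection. This token-level LCS
# recall is a simplification of the full metric, for illustration only.
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
            )
    return dp[-1][-1]

def rouge_l_recall(response: str, gold_sql: str) -> float:
    """Fraction of gold-SQL tokens recovered, in order, by the response."""
    ref = gold_sql.lower().split()
    hyp = response.lower().split()
    return lcs_len(hyp, ref) / len(ref) if ref else 0.0
```

A response that reproduces the accepted query verbatim scores near 1.0, while a purely conceptual hint scores near 0.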
---
## Key Contributions
- Socratic SQL instruction tuning for higher education
- SQL-Instruct dataset derived from real-world misconceptions
- Multi-model fine-tuning across Phi-3, Qwen2.5, and Gemma2
- Evaluation framework balancing reasoning quality and answer leakage
- Ablation study identifying factors enabling:
- Misconception-based feedback
- Iterative guidance
- Instructor-like reasoning behavior
---
## Limitations
- Models may still occasionally generate partial SQL fragments
- Evaluation focuses on semantic similarity rather than full pedagogical outcomes
- Dataset is derived from Stack Overflow and may reflect community biases
---
## Ethical Considerations
These models are designed to support learning, not replace it. By avoiding full solution generation, they aim to:
- Encourage critical thinking
- Reduce over-reliance on AI-generated answers
- Support equitable access to SQL learning resources
---
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Phi-3 variant. If the checkpoints live in subfolders of a
# single repository, pass subfolder="phi3" to from_pretrained instead.
model = AutoModelForCausalLM.from_pretrained("sriram882004/SQL-Socratic-Models/phi3")
tokenizer = AutoTokenizer.from_pretrained("sriram882004/SQL-Socratic-Models/phi3")
```