---
license: mit
tags:
- text-to-sql
- education
- socratic-learning
- instruction-tuning
- sql
- STEM
- pedagogy
datasets:
- SQL-Instruct
---

# SQL Socratic Models

## Model Description

SQL Socratic Models are a collection of fine-tuned large language models designed for **Socratic SQL instruction in higher education**. Unlike standard Text-to-SQL systems, these models are trained to **guide learners through reasoning steps without producing final SQL solutions**, supporting conceptual understanding and active learning in STEM contexts.

Supported architectures:

- Phi-3
- Qwen2.5
- Gemma2

---

## Intended Use

These models are designed for:

- Teaching SQL concepts in higher education
- Supporting STEM learners through guided reasoning
- Providing step-by-step Socratic hints for SQL problems
- Assisting with debugging and conceptual clarification

### Important Constraint

The models are intentionally trained to:

- ✅ Provide reasoning steps and conceptual hints
- ❌ Avoid generating complete SQL solutions

This ensures alignment with pedagogical goals such as scaffolding and learner engagement.

---

## Training Data: SQL-Instruct Corpus

We construct **SQL-Instruct**, a domain-specific Socratic instruction corpus, by mining high-quality interactions from Stack Overflow, a platform that captures real-world misconceptions, debugging challenges, and conceptual gaps encountered by learners and practitioners.

### Data Collection

To ensure high-quality instructional signals, we filter SQL-tagged questions by community impact. The resulting dataset has:

- **1.27 billion total views**
- **128,535 average views per question**

For each selected entry, we extract:

- Problem descriptions
- User-submitted SQL attempts
- Executable SQL from accepted solutions

This yields **9,916 unique questions**.

---

### Socratic Augmentation

Each example is transformed into a Socratic instructional format using GPT-4o, which generates:

- Guided reasoning steps
- Conceptual hints
- Question decomposition

This ensures the dataset emphasizes **instructional scaffolding rather than answer generation**. A minimal sketch of this augmentation step appears below, after the dataset composition.

---

### Dataset Composition

- **Intermediate questions:** 8,604
- **Advanced questions:** 629
- **Debugging tasks:** 531

The dataset emphasizes challenging reasoning scenarios, particularly:

- JOIN operations
- Aggregations and grouping
- Query optimization

We further ensure reliability by selecting entries with a **median Stack Overflow score of 27**.
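
To make the augmentation step above concrete, here is a minimal sketch of how a Stack Overflow entry could be converted into Socratic training data with GPT-4o via the OpenAI Python client. The prompt wording and the `augment_example` helper are illustrative assumptions, not the exact pipeline used to build SQL-Instruct.

```python
# Sketch of the Socratic augmentation step, assuming the OpenAI Python client
# (openai>=1.0). Prompt text and helper name are illustrative, not the
# published pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SOCRATIC_SYSTEM_PROMPT = (
    "You are a Socratic SQL tutor. Given a learner's problem description, "
    "their SQL attempt, and the accepted solution, produce guided reasoning "
    "steps, conceptual hints, and a decomposition of the question. "
    "Never reveal the final SQL query."
)


def augment_example(problem: str, attempt: str, accepted_sql: str) -> str:
    """Turn one Stack Overflow entry into a Socratic instruction example."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SOCRATIC_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Problem:\n{problem}\n\n"
                    f"Learner attempt:\n{attempt}\n\n"
                    f"Accepted solution (for reference only, do not reveal):\n"
                    f"{accepted_sql}"
                ),
            },
        ],
    )
    return response.choices[0].message.content
```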
---

## Training Procedure

### Phase 2: Fine-Tuning

(Phase 1, the construction of the SQL-Instruct corpus, is described above.) We apply **Full Fine-Tuning (FFT)** to small, open-source LLMs under pedagogical constraints designed to:

- Encourage conceptual scaffolding
- Promote step-by-step reasoning
- Discourage direct SQL answer generation

---

## Evaluation

### Phase 3: Metrics

Models are evaluated using:

- **BERTScore** → semantic alignment with expected reasoning
- **ROUGE-L** → detection of answer leakage (i.e., unintended full SQL generation)

A short scoring sketch using these metrics appears at the end of this card.

---

## Key Contributions

- Socratic SQL instruction tuning for higher education
- SQL-Instruct dataset derived from real-world misconceptions
- Multi-model fine-tuning across Phi-3, Qwen2.5, and Gemma2
- Evaluation framework balancing reasoning quality and answer leakage
- Ablation study identifying factors enabling:
  - Misconception-based feedback
  - Iterative guidance
  - Instructor-like reasoning behavior

---

## Limitations

- Models may still occasionally generate partial SQL fragments
- Evaluation focuses on semantic similarity rather than full pedagogical outcomes
- Dataset is derived from Stack Overflow and may reflect community biases

---

## Ethical Considerations

These models are designed to support learning, not replace it. By avoiding full solution generation, they aim to:

- Encourage critical thinking
- Reduce over-reliance on AI-generated answers
- Support equitable access to SQL learning resources

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes the Phi-3 checkpoint sits in a `phi3` subfolder of the repo;
# adjust the repo id / subfolder to match the actual layout.
repo = "sriram882004/SQL-Socratic-Models"
model = AutoModelForCausalLM.from_pretrained(repo, subfolder="phi3")
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="phi3")
```
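
To illustrate the Phase 3 metrics described above, the sketch below scores a generated hint for answer leakage (ROUGE-L overlap with the reference SQL) and for reasoning alignment (BERTScore against the expected reasoning), using the Hugging Face `evaluate` library. The example strings are illustrative assumptions, not items from SQL-Instruct.

```python
# Sketch: scoring one model hint with the Phase 3 metrics.
# Requires: pip install evaluate rouge_score bert_score
# The example strings below are illustrative assumptions.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

hint = "Think about which column links the two tables before writing the JOIN."
reference_reasoning = "Guide the learner to identify the join key between the tables."
reference_sql = "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id;"

# High ROUGE-L overlap with the solution SQL would signal answer leakage.
leakage = rouge.compute(predictions=[hint], references=[reference_sql])["rougeL"]

# BERTScore measures semantic alignment with the expected reasoning steps.
alignment = bertscore.compute(
    predictions=[hint], references=[reference_reasoning], lang="en"
)["f1"][0]

print(f"ROUGE-L vs. solution SQL (leakage): {leakage:.3f}")
print(f"BERTScore F1 vs. expected reasoning: {alignment:.3f}")
```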