sriram882004 committed on
Commit b76ac41 · verified · 1 Parent(s): 7f5f033

Update README.md

Files changed (1)
  1. README.md +18 -53
README.md CHANGED
@@ -1,64 +1,29 @@
- ---
- license: mit
- ---
-
- # SQL Socratic Models
-
- This repository contains fine-tuned large language models for **Socratic SQL instruction** in higher education, focusing on guiding learners through SQL concepts using structured reasoning rather than providing direct solutions.
-
- ## Models
- - phi3_rq4
- - qwen25
- - gemma2
-
  ## Method
 
- Our approach is designed to support **conceptual learning in STEM education** through Socratic interaction:
-
- - **Phase 1 (Data Construction):**
- SQL instruction data is augmented with Socratic prompts emphasizing:
- - Question decomposition
- - Conceptual hints
- - Guided reasoning steps
-
- - **Phase 2 (Fine-Tuning):**
- We apply full fine-tuning (FFT) on small, open-source LLMs with **pedagogical constraints** that explicitly discourage direct answer generation and instead promote:
- - Conceptual scaffolding
- - Incremental reasoning
- - Learner-centered guidance
 
- - **Phase 3 (Evaluation):**
- Models are evaluated using:
- - **BERTScore** for semantic alignment with expected reasoning
- - **ROUGE-L** to measure and control **answer leakage** (i.e., avoidance of direct SQL solutions)
 
- ## Contributions
- - Fine-tuning across multiple architectures (Phi-3, Qwen2.5, Gemma2) for **instructional SQL reasoning**
- - Development of **Socratic SQL prompting framework** for higher education contexts
- - Evaluation of models on their ability to generate **guidance without revealing final answers**
- - Ablation study identifying factors that enable LLMs to mimic effective instructors through:
- - Misconception-aware feedback
- - Iterative questioning
- - Structured reasoning support
 
- ## Task
 
- Given a natural language SQL question, the model generates:
 
- 1. Socratic reasoning steps
- 2. Conceptual hints and guiding questions
- 3. Intermediate decomposition of the problem
 
- **The model does NOT produce the final SQL query**, ensuring alignment with instructional use in higher education settings.
 
- This design supports:
- - Active learning
- - Conceptual understanding of SQL
- - Integration of database concepts into broader STEM curricula
 
- ## Usage
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
 
- model = AutoModelForCausalLM.from_pretrained("sriram882004/SQL-Socratic-Models/phi3_rq4")
- tokenizer = AutoTokenizer.from_pretrained("sriram882004/SQL-Socratic-Models/phi3_rq4")
  ## Method
 
+ Our approach is structured in three phases to support Socratic SQL instruction for higher education.
 
+ ### Phase 1: SQL-Instruct Corpus Construction
 
+ We construct **SQL-Instruct**, a domain-specific Socratic instruction corpus, by mining high-quality interactions from Stack Overflow. This platform provides a rich source of real-world misconceptions, debugging challenges, and conceptual difficulties encountered by both students and practitioners, making it well-suited for training models that emphasize understanding over code replication.
 
+ To ensure data quality, we filter SQL-tagged questions based on community impact. The resulting dataset reflects substantial engagement, with a cumulative reach of approximately **1.27 billion views** and an average of **128,535 views per question**. For each selected instance, we extract:
+ - The core problem description
+ - User-provided SQL attempts (when available)
+ - Executable SQL blocks from the accepted solution
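
The three extracted fields listed above could be pulled from post HTML roughly as follows. This is a minimal sketch, not the authors' pipeline: it assumes posts arrive as Stack Overflow-style HTML with code in `<pre><code>` blocks, and the names `ExtractedInstance`, `extract_instance`, and `looks_like_sql` are hypothetical. The keyword check is only a cheap stand-in for the "executable" test, which the README does not describe.

```python
import re
from dataclasses import dataclass, field
from html import unescape

# Assumption: code in a post is wrapped as <pre><code>...</code></pre>.
CODE_BLOCK = re.compile(
    r"<pre[^>]*>\s*<code[^>]*>(.*?)</code>\s*</pre>", re.DOTALL | re.IGNORECASE
)
TAG = re.compile(r"<[^>]+>")
# Heuristic stand-in for "executable SQL": the block starts with a statement keyword.
SQL_KEYWORD = re.compile(
    r"^\s*(select|insert|update|delete|with|create|alter|drop)\b", re.IGNORECASE
)


@dataclass
class ExtractedInstance:
    problem: str
    user_attempts: list[str] = field(default_factory=list)
    solution_sql: list[str] = field(default_factory=list)


def code_blocks(html: str) -> list[str]:
    """Return the unescaped text of every <pre><code> block in the HTML."""
    return [unescape(block).strip() for block in CODE_BLOCK.findall(html)]


def looks_like_sql(block: str) -> bool:
    return bool(SQL_KEYWORD.match(block))


def extract_instance(question_html: str, accepted_answer_html: str) -> ExtractedInstance:
    # Problem description: the question text with code blocks and tags removed.
    prose_html = CODE_BLOCK.sub(" ", question_html)
    problem = " ".join(unescape(TAG.sub(" ", prose_html)).split())
    return ExtractedInstance(
        problem=problem,
        user_attempts=[b for b in code_blocks(question_html) if looks_like_sql(b)],
        solution_sql=[b for b in code_blocks(accepted_answer_html) if looks_like_sql(b)],
    )
```

A stricter variant could try to parse or run each block against a scratch database instead of matching keywords; that choice is left open here because the source does not specify it.
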
 
+ This process yields **9,916 unique questions**, which are then transformed into Socratic instructional data using GPT-4o. We use GPT-4o, chosen for its strong reasoning capabilities, to generate **pedagogical hints and guided reasoning steps**, so that the dataset emphasizes conceptual scaffolding rather than direct answers.
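
As a quick sanity check, the engagement figures quoted earlier are consistent with this question count. The sketch below only multiplies the two reported numbers; since the per-question average is presumably rounded, the product is approximate.

```python
# Figures taken from the README text above.
unique_questions = 9_916
avg_views_per_question = 128_535

total_views = unique_questions * avg_views_per_question
print(f"{total_views:,}")                    # 1,274,553,060
print(f"{total_views / 1e9:.2f} billion")    # 1.27 billion
```
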
 
+ The dataset is intentionally skewed toward higher cognitive complexity, with:
+ - **8,604 intermediate-level questions**
+ - **629 advanced-level questions**
 
+ Additionally, we identify a subset of **531 debugging tasks**, enabling models to learn how to guide students through error identification and correction in SQL queries.
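
To put these counts in proportion, the short calculation below expresses each as a share of the 9,916 questions. It assumes the intermediate and advanced groups are disjoint subsets of that total; the README does not say how the remaining questions are labeled, nor whether the debugging subset overlaps the difficulty levels.

```python
total_questions = 9_916
difficulty_counts = {"intermediate": 8_604, "advanced": 629}
debugging_tasks = 531

# Questions not itemized by difficulty in the README (label unknown).
unlisted = total_questions - sum(difficulty_counts.values())   # 683

# Share of the corpus, in percent, rounded to one decimal place.
shares = {
    level: round(100 * count / total_questions, 1)
    for level, count in difficulty_counts.items()
}
debugging_share = round(100 * debugging_tasks / total_questions, 1)

print(unlisted, shares, debugging_share)
```
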
 
+ The corpus spans a wide range of SQL topics, with particular emphasis on:
+ - JOIN operations
+ - Aggregation and grouping
+ - Query optimization and performance
 
+ Because the selected questions have a **median Stack Overflow score of 27**, the underlying solutions, and therefore the derived instructional signals, are technically reliable.
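
A community-impact filter of the kind described in this section might look like the sketch below. The README does not give the actual thresholds or record format, so `min_views`, `min_score`, the dictionary fields, and the sample records are all hypothetical and chosen only to illustrate the idea.

```python
from statistics import median


def select_by_impact(questions, min_views, min_score):
    """Keep SQL-tagged questions that meet both (hypothetical) impact thresholds."""
    return [
        q for q in questions
        if "sql" in q["tags"] and q["views"] >= min_views and q["score"] >= min_score
    ]


# Made-up sample records, not data from the corpus.
sample = [
    {"id": 1, "tags": ["sql", "join"], "views": 250_000, "score": 41},
    {"id": 2, "tags": ["python"], "views": 900_000, "score": 120},
    {"id": 3, "tags": ["sql"], "views": 1_200, "score": 2},
    {"id": 4, "tags": ["sql", "group-by"], "views": 80_000, "score": 13},
    {"id": 5, "tags": ["sql", "performance"], "views": 140_000, "score": 27},
]

selected = select_by_impact(sample, min_views=50_000, min_score=10)
median_score = median(q["score"] for q in selected)
print([q["id"] for q in selected], median_score)
```
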
 
 
 
+ This corpus serves as the foundation for training models that prioritize **Socratic reasoning, misconception-aware feedback, and conceptual understanding** over direct SQL solution generation.