| --- |
| license: apache-2.0 |
| language: |
| - zh |
| - en |
| tags: |
| - education |
| - socratic-teaching |
| - dialogue |
| - fine-tuned |
| - glm4 |
| - kele |
| - lora |
| base_model: THUDM/glm-4-9b-chat |
| --- |
| |
| # SocratTeachLLM |
|
|
| A LoRA fine-tuned [GLM4-9B-Chat](https://huggingface.co/THUDM/glm-4-9b-chat) model trained to act as a **Socratic teacher** in structured educational dialogues. It generates heuristic questions and formative feedback that guide students through a principled sequence of reasoning stages, following the [KELE framework](https://aclanthology.org/2025.findings-emnlp.888) (Peng et al., EMNLP 2025 Findings). |
|
|
| > **Original model:** [yuanpan/SocratTeachLLM](https://huggingface.co/yuanpan/SocratTeachLLM) — this repository is a copy with an expanded README. |
|
|
| --- |
|
|
| ## What It Does |
|
|
| SocratTeachLLM is designed for the **teacher role** in a dual-agent Socratic tutoring system. A separate **consultant agent** (e.g., GPT-4o or Qwen) selects a teaching strategy from a predefined set of 34 Socratic rules (SocRule); SocratTeachLLM then generates the actual dialogue turn implementing that strategy. |
|
|
| Teaching proceeds through five stages (SocRule): |
|
|
| | Stage | Name | State codes | Description | |
| |---|---|---|---| |
| | a | Initiation | a1 | Student poses the question; dialogue begins | |
| | b | Concept Probing | b2–b7 | Teacher probes prior knowledge and surfaces misconceptions | |
| | c | Inductive Reasoning | c8–c29 | Core teaching stage — guides the student toward generalizations; can repeat many turns | |
| | d | Answer Derivation | d30–d33 | Help the student arrive at the correct answer | |
| | e | Summary | e34 | Consolidate and reinforce learning | |
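
The state codes in the table pair a stage letter with a rule number from 1 to 34. As a minimal sketch (the helper name `parse_state` and the `STAGES` mapping are illustrative and not part of the KELE codebase), a consultant output can be checked against these ranges like this:

```python
# Inclusive rule-number ranges per stage, copied from the table above.
# The names STAGES and parse_state are hypothetical helpers for illustration.
STAGES = {
    "a": ("Initiation", 1, 1),
    "b": ("Concept Probing", 2, 7),
    "c": ("Inductive Reasoning", 8, 29),
    "d": ("Answer Derivation", 30, 33),
    "e": ("Summary", 34, 34),
}


def parse_state(code: str) -> tuple[str, str, int]:
    """Split a state code such as 'c12' into (stage letter, stage name, rule number)."""
    letter, number = code[0], int(code[1:])
    if letter not in STAGES:
        raise ValueError(f"unknown stage letter in {code!r}")
    name, low, high = STAGES[letter]
    if not low <= number <= high:
        raise ValueError(f"{code!r} is outside the {low}-{high} range for stage {letter}")
    return letter, name, number


# The five ranges together account for every rule exactly once.
TOTAL_RULES = sum(high - low + 1 for _, low, high in STAGES.values())
```

Rejecting out-of-range codes early is a cheap guard against a consultant model that returns a malformed state.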
|
|
| The model was fine-tuned on **SocratDataset**: 6,803 multi-turn Socratic dialogues covering 42,000+ interaction turns across elementary school science topics in Chinese. |
|
|
| --- |
|
|
| ## Published Performance |
|
|
| Results from Table 1 of the KELE paper (test set: 680 dialogues, 4,245 single-turn examples): |
|
|
| | Model | ROUGE-1 | ROUGE-2 | BLEU-4 | PRR | NDAR | SPR | IAR | Guidance | Logicality | Flexibility | |
| |---|---|---|---|---|---|---|---|---|---|---| |
| | GPT-4o | 38.25 | 22.35 | 29.93 | 72.13 | 81.19 | 85.00 | 87.74 | 4.35 | 4.50 | 4.33 | |
| | Qwen2.5-7B | 40.95 | 15.27 | 24.96 | 59.02 | 80.52 | 60.00 | 76.45 | 3.87 | 3.96 | 3.87 | |
| | Qwen2.5-14B | 43.79 | 17.06 | 26.63 | 65.21 | 78.57 | 74.00 | 80.81 | 3.99 | 4.15 | 4.03 | |
| | Qwen2.5-32B | 46.22 | 19.90 | 28.85 | 65.57 | 83.13 | 81.00 | 84.68 | 4.12 | 4.44 | 4.21 | |
| | EduChat-13B | 34.75 | 9.91 | 21.11 | 47.62 | 90.73 | 51.00 | 69.02 | 2.93 | 3.42 | 3.18 | |
| | SocraticLM-7B | 18.63 | 5.56 | 10.93 | 26.83 | 30.26 | 36.00 | 27.05 | 2.62 | 2.88 | 2.78 | |
| | **SocratTeachLLM (this model)** | **57.40** | **33.63** | **41.96** | **75.13** | **94.71** | **87.00** | **89.03** | **4.66** | **4.53** | **4.45** | |
|
|
| **Metric definitions:** |
| - **PRR** — Problem Relevance Rate: teacher question relates directly to the problem |
| - **NDAR** — No Direct Answer Rate: teacher avoids giving away the answer |
| - **SPR** — Summary Pass Rate: correct and complete final summary |
| - **IAR** — Instruction Adherence Rate: teacher follows the consultant's recommended strategy |
| - **Guidance / Logicality / Flexibility** — GPT-4o judge scores on a 1–5 scale (B.5 rubric) |
|
|
| SocratTeachLLM scores higher than every baseline in the table, including GPT-4o, on all ten metrics, with roughly 9.4B parameters. |
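
The comparison with GPT-4o can be checked directly from the two table rows. The sketch below only recomputes differences from the published numbers; the variable names are illustrative, and the first seven metrics are percentages while the last three are 1-5 judge scores, so margins are not comparable across those two groups.

```python
# Rows copied from the results table above (GPT-4o and SocratTeachLLM).
METRICS = ["ROUGE-1", "ROUGE-2", "BLEU-4", "PRR", "NDAR", "SPR", "IAR",
           "Guidance", "Logicality", "Flexibility"]
GPT_4O = [38.25, 22.35, 29.93, 72.13, 81.19, 85.00, 87.74, 4.35, 4.50, 4.33]
SOCRAT = [57.40, 33.63, 41.96, 75.13, 94.71, 87.00, 89.03, 4.66, 4.53, 4.45]

# Per-metric margin of SocratTeachLLM over GPT-4o, rounded to table precision.
margins = {
    name: round(ours - theirs, 2)
    for name, ours, theirs in zip(METRICS, SOCRAT, GPT_4O)
}

for name, margin in margins.items():
    print(f"{name:<12} +{margin:.2f}")
```

The margins are widest on the overlap metrics (ROUGE and BLEU) and narrowest on the judge-scored Logicality.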
|
|
| --- |
|
|
| ## Training Details |
|
|
| | Setting | Value | |
| |---|---| |
| | Base model | GLM4-9B-Chat | |
| | Method | LoRA | |
| | Epochs | 3 | |
| | Learning rate | 5e-5 | |
| | Batch size | 16 | |
| | Train split | 6,123 dialogues (90%) | |
| | Test split | 680 dialogues (10%) | |
| | Hardware | 2× NVIDIA A800 80GB | |
| | Dataset | SocratDataset (6,803 records, Chinese) | |
|
|
| ### Training Objective |
|
|
| ``` |
| P(teacher_response | dialogue_history, evaluation, action) |
| ``` |
|
|
| The `evaluation` (the consultant's stage/state assessment) and `action` (the recommended strategy) fields are required conditioning signals: the model was trained with both present. At inference time, a consultant agent must produce them before the teacher agent generates its response. Prompting the model without them puts it outside its training distribution, and response quality is likely to drop. |
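
To make the conditioning concrete, here is a minimal sketch of passing the consultant outputs along with the dialogue history. The message layout is an assumption made for illustration: this README does not document the prompt template used in training, so check the KELE repository for the real format before relying on it. The function name `build_teacher_messages` and the example values are hypothetical.

```python
def build_teacher_messages(history, evaluation, action):
    """Return chat messages that carry the consultant outputs plus the dialogue.

    ASSUMPTION: placing `evaluation` and `action` in a system message with
    these labels is a hypothetical layout, not the template the model was
    trained on.
    """
    conditioning = f"Evaluation: {evaluation}\nAction: {action}"
    return [{"role": "system", "content": conditioning}, *history]


# Illustrative values only; a real consultant agent would produce these.
history = [{"role": "user", "content": "Why does the moon change shape?"}]
messages = build_teacher_messages(
    history,
    evaluation="Stage b, state b3: the student has not yet stated prior knowledge.",
    action="Ask what the student already knows about the moon.",
)
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` or to an OpenAI-compatible endpoint in place of a bare user turn.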
|
|
| --- |
|
|
| ## Model Architecture |
|
|
| | Parameter | Value | |
| |---|---| |
| | Base model | GLM4-9B-Chat (`ChatGLMForConditionalGeneration`) | |
| | Total parameters | ~9.4B | |
| | Layers | 40 | |
| | Hidden size | 4,096 | |
| | Attention heads | 32 | |
| | FFN hidden size | 13,696 | |
| | KV channels | 128 | |
| | Vocabulary size | 151,552 | |
| | Max context length | 131,072 tokens (128K) | |
| | Storage dtype | bfloat16 | |
| | Attention | Multi-query (2 groups), RoPE (ratio 500) | |
| | Normalization | RMSNorm | |
| | Weight files | 4× safetensors shards (~18.8 GB total) | |
|
|
| **Generation defaults:** temperature 0.8, top-p 0.8. |
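
The storage size in the table follows from the parameter count and dtype. The sketch below covers weights only; it ignores activations, the KV cache, and any layers a quantizer leaves in higher precision, so treat the 4-bit figure as a lower bound and not as a VRAM requirement.

```python
def weight_gigabytes(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a given precision."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 9.4e9  # approximate total from the architecture table

bf16_gb = weight_gigabytes(N_PARAMS, 16)  # about 18.8 GB, matching the shard total
nf4_gb = weight_gigabytes(N_PARAMS, 4)    # about 4.7 GB before any overhead

print(f"bfloat16 weights: {bf16_gb:.1f} GB")
print(f"4-bit weights:    {nf4_gb:.1f} GB")
```

The gap between the 4.7 GB estimate and the ~6.5 GB quoted for NF4 in the Usage section presumably comes from unquantized layers, quantization metadata, and runtime buffers.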
|
|
| --- |
|
|
| ## Usage |
|
|
| ### Transformers (recommended, ~19 GB VRAM) |
|
|
| The model uses custom modeling code, so `trust_remote_code=True` is required. |
|
|
| ```python |
| from transformers import AutoTokenizer, AutoModelForCausalLM |
| import torch |
| |
| model_id = "ulises-c/SocratTeachLLM" |
| |
| tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) |
| model = AutoModelForCausalLM.from_pretrained( |
| model_id, |
| torch_dtype=torch.bfloat16, |
| device_map="auto", |
| trust_remote_code=True, |
| ) |
| |
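| # Bare user turn as a smoke test; real use also supplies the consultant's evaluation and action |
| messages = [{"role": "user", "content": "What do you think causes the seasons to change?"}] |
| # return_dict=True returns input_ids together with the attention mask |
| inputs = tokenizer.apply_chat_template( |
|     messages, add_generation_prompt=True, return_tensors="pt", return_dict=True |
| ).to(model.device) |
| |
| # do_sample=True makes sampling explicit so temperature and top_p apply |
| outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.8) |
| # Decode only the newly generated tokens, skipping the prompt |
| prompt_length = inputs["input_ids"].shape[-1] |
| print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)) |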
| ``` |
|
|
| ### 4-bit NF4 via bitsandbytes (~6.5 GB VRAM) |
|
|
| ```python |
| import torch |
| from transformers import AutoModelForCausalLM, BitsAndBytesConfig |
| |
| model_id = "ulises-c/SocratTeachLLM" |
| |
| bnb_config = BitsAndBytesConfig( |
| load_in_4bit=True, |
| bnb_4bit_compute_dtype=torch.float16, |
| bnb_4bit_use_double_quant=True, |
| bnb_4bit_quant_type="nf4", |
| ) |
| model = AutoModelForCausalLM.from_pretrained( |
| model_id, |
| quantization_config=bnb_config, |
| device_map="auto", |
| trust_remote_code=True, |
| ) |
| ``` |
|
|
| ### vLLM (OpenAI-compatible endpoint) |
|
|
| ```bash |
| vllm serve ulises-c/SocratTeachLLM \ |
| --served-model-name SocratTeachLLM \ |
| --dtype bfloat16 \ |
| --trust-remote-code |
| ``` |
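
Once the server is up, it can be queried like any OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes vLLM is listening on its default local port (8000). The helper names `build_request` and `chat` are illustrative, and the sampling values mirror the generation defaults listed above.

```python
import json
import urllib.request

# ASSUMPTION: vLLM is running locally on its default port.
BASE_URL = "http://localhost:8000/v1"


def build_request(messages, model="SocratTeachLLM"):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": messages,
        "temperature": 0.8,
        "top_p": 0.8,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages):
    """Send the messages to the server and return the assistant reply text."""
    with urllib.request.urlopen(build_request(messages)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat([{"role": "user", "content": "..."}])` returns the teacher's reply as a string.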
|
|
| ### Ollama |
|
|
| This repo includes a `Modelfile` (auto-generated by LlamaFactory) with the correct ChatGLM4 stop sequences and a 4,096-token context window. |
|
|
| ```bash |
| ollama create SocratTeachLLM -f Modelfile |
| ollama run SocratTeachLLM |
| ``` |
|
|
| > **Note:** The included `Modelfile` limits the context window to 4,096 tokens. For the full 128K context, use Transformers or vLLM. |
|
|
| --- |
|
|
| ## Built With This Model |
|
|
| **[csen-346](https://github.com/ulises-c/csen-346)** is a downstream course project (CSEN 346 NLP, Santa Clara University) that reproduces and extends the KELE framework using this model as the teacher agent. |
|
|
| Key integration details: |
| - **Teacher:** SocratTeachLLM, served via FastAPI (4-bit on RTX 3070) or vLLM (bfloat16 on RTX 5090 / SCU WAVE cluster L40S) |
| - **Consultant:** GPT-4o (baseline) or Qwen3.5-9B (local variant) |
| - **Evaluation:** 680-dialogue test split of SocratDataset, automated with ROUGE, BLEU, and GPT-4o judge (B.5 rubric) |
| - **English extension:** An English translation of the training dataset is available at [ulises-c/SocratDataset-EN](https://huggingface.co/datasets/ulises-c/SocratDataset-EN) |
|
|
| To fetch the model weights locally with the Hugging Face CLI: |
|
|
| ```bash |
| hf download ulises-c/SocratTeachLLM --local-dir ~/hf_models/SocratTeachLLM |
| ``` |
|
|
| --- |
|
|
| ## Training Data |
|
|
| | Property | Value | |
| |---|---| |
| | Dataset | [ulises-c/SocratDataset](https://huggingface.co/datasets/ulises-c/SocratDataset) | |
| | Dialogues | 6,803 | |
| | Turns | 42,000+ | |
| | Domain | Elementary school science (grades 1–6) | |
| | Language | Chinese (Simplified) | |
| | Train split | 6,123 dialogues (90%) | |
| | Test split | 680 dialogues (10%) | |
| | Strategies | 34 SocRule teaching strategies | |
|
|
| An English translation of the training data is available at [ulises-c/SocratDataset-EN](https://huggingface.co/datasets/ulises-c/SocratDataset-EN). |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this model, please cite the original KELE paper: |
|
|
| ```bibtex |
| @inproceedings{peng-etal-2025-kele, |
| title = {{KELE}: A Multi-Agent Framework for Structured {S}ocratic Teaching with Large Language Models}, |
| author = {Peng, Yuan and others}, |
| booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025}, |
| year = {2025}, |
| url = {https://aclanthology.org/2025.findings-emnlp.888/} |
| } |
| ``` |
|
|
| --- |
|
|
| ## Related Resources |
|
|
| | Resource | Link | |
| |---|---| |
| | KELE paper (EMNLP 2025 Findings) | https://aclanthology.org/2025.findings-emnlp.888/ | |
| | KELE GitHub repository | https://github.com/yuanpan1020/KELE | |
| | Original model | https://huggingface.co/yuanpan/SocratTeachLLM | |
| | Training data (Chinese) | https://huggingface.co/datasets/ulises-c/SocratDataset | |
| | Training data (English translation) | https://huggingface.co/datasets/ulises-c/SocratDataset-EN | |
| | Evaluation + inference code | https://github.com/ulises-c/csen-346 | |
|
|
| --- |
|
|
| ## License |
|
|
| [Apache 2.0](LICENSE) |
|
|