SocratTeachLLM

@@ -1,40 +1,100 @@
 ---
 license: apache-2.0
 language:
-- zh
 ---
 # SocratTeachLLM
-A fine-tuned [GLM4-9B-Chat](https://huggingface.co/THUDM/glm-4-9b-chat) model trained to act as a **Socratic teacher** in structured educational dialogues. It generates heuristic questions and formative feedback that guide students through a principled sequence of reasoning stages, following the [KELE framework](https://aclanthology.org/2025.findings-emnlp.XXX) (Peng et al., EMNLP 2025 Findings).
-> **Original model:** [yuanpan/SocratTeachLLM](https://huggingface.co/yuanpan/SocratTeachLLM)
 ---
 ## What It Does
-SocratTeachLLM is designed for the **teacher role** in a dual-agent Socratic tutoring system. A separate consultant agent (e.g., GPT-4o) selects a teaching strategy from a predefined set of 34 Socratic rules (SocRule); SocratTeachLLM then generates the actual dialogue turn implementing that strategy.
-Teaching proceeds through five stages:
-| Stage | Name | Description |
-|-------|------|-------------|
-| A | Student Questioning | Elicit prior knowledge and surface misconceptions |
-| B | Concept Probing | Probe understanding of core concepts |
-| C | Inductive Reasoning | Guide the student toward generalizations |
-| D | Rule Construction | Help the student articulate a principle or rule |
-| E | Summary | Consolidate and reinforce learning |
-The model was fine-tuned (LoRA) on **SocratDataset**: 6,803 multi-turn Socratic dialogues covering 42,000+ interaction turns across elementary school science topics, primarily in Chinese.
 ---
 ## Model Architecture
 | Parameter | Value |
-|-----------|-------|
 | Base model | GLM4-9B-Chat (`ChatGLMForConditionalGeneration`) |
 | Layers | 40 |
 | Hidden size | 4,096 |
 | Attention heads | 32 |
@@ -44,15 +104,16 @@ The model was fine-tuned (LoRA) on **SocratDataset**: 6,803 multi-turn Socratic
 | Max context length | 131,072 tokens (128K) |
 | Storage dtype | bfloat16 |
 | Attention | Multi-query (2 groups), RoPE (ratio 500) |
-| Normalization | RMSNorm, post-layer-norm |
-| Total parameters | ~9.4B |
-| Weight files | 4 × safetensors shards (~18.8 GB total) |
-**Generation defaults:** temperature 0.8, top-p 0.8, max length 128K.
 ---
-## Loading with Transformers
 The model uses custom modeling code, so `trust_remote_code=True` is required.
@@ -70,10 +131,7 @@ model = AutoModelForCausalLM.from_pretrained(
     trust_remote_code=True,
 )
-messages = [
-    {"role": "user", "content": "What do you think causes the seasons to change?"}
-]
 inputs = tokenizer.apply_chat_template(
     messages, add_generation_prompt=True, return_tensors="pt"
 ).to(model.device)
@@ -82,7 +140,7 @@ outputs = model.generate(inputs, max_new_tokens=512, temperature=0.8, top_p=0.8)
 print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
 ```
-### Low-VRAM (4-bit NF4 via bitsandbytes, ~6.5 GB)
 ```python
 from transformers import BitsAndBytesConfig
@@ -93,7 +151,6 @@ bnb_config = BitsAndBytesConfig(
     bnb_4bit_use_double_quant=True,
     bnb_4bit_quant_type="nf4",
 )
 model = AutoModelForCausalLM.from_pretrained(
     model_id,
     quantization_config=bnb_config,
@@ -102,50 +159,39 @@ model = AutoModelForCausalLM.from_pretrained(
 )
 ```
----
-## Running Locally with Ollama
-This repo includes a `Modelfile` for Ollama (auto-generated by LlamaFactory). It sets a 4,096-token context window and the correct stop sequences for the ChatGLM4 chat format.
 ```bash
-# Create the Ollama model from the local Modelfile
-ollama create SocratTeachLLM -f Modelfile
-# Run interactively
-ollama run SocratTeachLLM
 ```
-Stop sequences used: `<|user|>`, `<|endoftext|>`, `<|observation|>`.
-> **Note:** Ollama currently caps the context at 4,096 tokens. For the full 128K context, use the Transformers or vLLM path.
-### vLLM (full bfloat16, ~19 GB VRAM)
 ```bash
-vllm serve /path/to/SocratTeachLLM \
-  --served-model-name SocratTeachLLM \
-  --dtype bfloat16 \
-  --trust-remote-code
 ```
-This exposes an OpenAI-compatible endpoint at `http://localhost:8000/v1`.
 ---
 ## Built With This Model
-**[csen-346](https://github.com/ulises-c/csen-346)** is a downstream course project (CSEN 346 NLP, Santa Clara University) that reproduces and extends the KELE framework using SocratTeachLLM as the teacher agent.
 Key integration details:
-- **Teacher agent:** SocratTeachLLM, served locally via FastAPI (4-bit on RTX 3070) or vLLM (bfloat16 on RTX 5090 / SCU WAVE cluster)
-- **Consultant agent:** GPT-4o (baseline) or Qwen3.5-9B (local variant) — selects Socratic strategies from SocRule and passes them to the teacher
-- **Evaluation:** 680-dialogue test split of SocratDataset
-- **API surface:** OpenAI-compatible chat completions endpoint (`TEACHER_MODEL_NAME=SocratTeachLLM`)
 ```bash
-# Download the model for use in csen-346
 hf download ulises-c/SocratTeachLLM --local-dir ~/hf_models/SocratTeachLLM
 ```
@@ -154,16 +200,18 @@ hf download ulises-c/SocratTeachLLM --local-dir ~/hf_models/SocratTeachLLM
 ## Training Data
 | Property | Value |
-|----------|-------|
-| Dataset | SocratDataset |
 | Dialogues | 6,803 |
 | Turns | 42,000+ |
-| Domain | Elementary school science |
-| Language | Primarily Chinese |
-| Train split | 5,723 dialogues (90%) |
 | Test split | 680 dialogues (10%) |
 | Strategies | 34 SocRule teaching strategies |
 ---
 ## Citation
@@ -171,16 +219,30 @@ hf download ulises-c/SocratTeachLLM --local-dir ~/hf_models/SocratTeachLLM
 If you use this model, please cite the original KELE paper:
 ```bibtex
-@inproceedings{peng2025kele,
-  title     = {KELE: A Multi-Agent Framework for Structured Socratic Teaching with Large Language Models},
   author    = {Peng, Yuan and others},
   booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
   year      = {2025},
 }
 ```
 ---
 ## License
 [Apache 2.0](LICENSE)

 ---
 license: apache-2.0
 language:
+  - zh
+  - en
+tags:
+  - education
+  - socratic-teaching
+  - dialogue
+  - fine-tuned
+  - glm4
+  - kele
+  - lora
+base_model: THUDM/glm-4-9b-chat
 ---
 # SocratTeachLLM
+A LoRA fine-tuned [GLM4-9B-Chat](https://huggingface.co/THUDM/glm-4-9b-chat) model trained to act as a **Socratic teacher** in structured educational dialogues. It generates heuristic questions and formative feedback that guide students through a principled sequence of reasoning stages, following the [KELE framework](https://aclanthology.org/2025.findings-emnlp.888) (Peng et al., EMNLP 2025 Findings).
+> **Original model:** [yuanpan/SocratTeachLLM](https://huggingface.co/yuanpan/SocratTeachLLM) — this repository is a copy with an expanded README.
 ---
 ## What It Does
+SocratTeachLLM is designed for the **teacher role** in a dual-agent Socratic tutoring system. A separate **consultant agent** (e.g., GPT-4o or Qwen) selects a teaching strategy from a predefined set of 34 Socratic rules (SocRule); SocratTeachLLM then generates the actual dialogue turn implementing that strategy.
+Teaching proceeds through five stages (SocRule):
+| Stage | Name | State codes | Description |
+|---|---|---|---|
+| a | Initiation | a1 | Student poses the question; dialogue begins |
+| b | Concept Probing | b2–b7 | Teacher probes prior knowledge and surfaces misconceptions |
+| c | Inductive Reasoning | c8–c29 | Core teaching stage — guides the student toward generalizations; can repeat many turns |
+| d | Answer Derivation | d30–d33 | Help the student arrive at the correct answer |
+| e | Summary | e34 | Consolidate and reinforce learning |
+The model was fine-tuned on **SocratDataset**: 6,803 multi-turn Socratic dialogues covering 42,000+ interaction turns across elementary school science topics in Chinese.
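The state codes in the stage table split the 34 SocRule states into five contiguous ranges. A minimal sketch of that lookup follows, assuming each code is the stage letter followed by an integer from 1 to 34. The names `STAGE_RANGES`, `stage_for_state`, and `state_code` are hypothetical and not taken from the KELE codebase.

```python
# Stage ranges copied from the stage table; all names here are illustrative only.
STAGE_RANGES = {
    "a": range(1, 2),    # a1: initiation
    "b": range(2, 8),    # b2-b7: concept probing
    "c": range(8, 30),   # c8-c29: inductive reasoning
    "d": range(30, 34),  # d30-d33: answer derivation
    "e": range(34, 35),  # e34: summary
}


def stage_for_state(state: int) -> str:
    """Return the stage letter that owns a numeric state (1-34)."""
    for stage, states in STAGE_RANGES.items():
        if state in states:
            return stage
    raise ValueError(f"state must be between 1 and 34, got {state}")


def state_code(state: int) -> str:
    """Format a numeric state as a code such as 'c12'."""
    return f"{stage_for_state(state)}{state}"
```

The five ranges cover 34 states in total, which matches the 34 SocRule strategies mentioned above.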
+---
+## Published Performance
+Results from Table 1 of the KELE paper (test set: 680 dialogues, 4,245 single-turn examples):
+| Model | ROUGE-1 | ROUGE-2 | BLEU-4 | PRR | NDAR | SPR | IAR | Guidance | Logicality | Flexibility |
+|---|---|---|---|---|---|---|---|---|---|---|
+| GPT-4o | 38.25 | 22.35 | 29.93 | 72.13 | 81.19 | 85.00 | 87.74 | 4.35 | 4.50 | 4.33 |
+| Qwen2.5-7B | 40.95 | 15.27 | 24.96 | 59.02 | 80.52 | 60.00 | 76.45 | 3.87 | 3.96 | 3.87 |
+| Qwen2.5-14B | 43.79 | 17.06 | 26.63 | 65.21 | 78.57 | 74.00 | 80.81 | 3.99 | 4.15 | 4.03 |
+| Qwen2.5-32B | 46.22 | 19.90 | 28.85 | 65.57 | 83.13 | 81.00 | 84.68 | 4.12 | 4.44 | 4.21 |
+| EduChat-13B | 34.75 | 9.91 | 21.11 | 47.62 | 90.73 | 51.00 | 69.02 | 2.93 | 3.42 | 3.18 |
+| SocraticLM-7B | 18.63 | 5.56 | 10.93 | 26.83 | 30.26 | 36.00 | 27.05 | 2.62 | 2.88 | 2.78 |
+| **SocratTeachLLM (this model)** | **57.40** | **33.63** | **41.96** | **75.13** | **94.71** | **87.00** | **89.03** | **4.66** | **4.53** | **4.45** |
+**Metric definitions:**
+- **PRR** — Problem Relevance Rate: teacher question relates directly to the problem
+- **NDAR** — No Direct Answer Rate: teacher avoids giving away the answer
+- **SPR** — Summary Pass Rate: correct and complete final summary
+- **IAR** — Instruction Adherence Rate: teacher follows the consultant's recommended strategy
+- **Guidance / Logicality / Flexibility** — GPT-4o judge scores on a 1–5 scale (B.5 rubric)
+SocratTeachLLM scores highest on every metric in the table, ahead of GPT-4o and all other baselines, with only ~9.4B parameters.
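The per-metric comparison can be checked mechanically. The sketch below copies the scores from the table above and finds the top-scoring model in each column. `SCORES` and `best_model_per_metric` are illustrative names, and every metric is treated as higher-is-better, which the metric definitions suggest but do not state outright.

```python
METRICS = ["ROUGE-1", "ROUGE-2", "BLEU-4", "PRR", "NDAR",
           "SPR", "IAR", "Guidance", "Logicality", "Flexibility"]

# Rows transcribed from the results table, in METRICS order.
SCORES = {
    "GPT-4o":         [38.25, 22.35, 29.93, 72.13, 81.19, 85.00, 87.74, 4.35, 4.50, 4.33],
    "Qwen2.5-7B":     [40.95, 15.27, 24.96, 59.02, 80.52, 60.00, 76.45, 3.87, 3.96, 3.87],
    "Qwen2.5-14B":    [43.79, 17.06, 26.63, 65.21, 78.57, 74.00, 80.81, 3.99, 4.15, 4.03],
    "Qwen2.5-32B":    [46.22, 19.90, 28.85, 65.57, 83.13, 81.00, 84.68, 4.12, 4.44, 4.21],
    "EduChat-13B":    [34.75, 9.91, 21.11, 47.62, 90.73, 51.00, 69.02, 2.93, 3.42, 3.18],
    "SocraticLM-7B":  [18.63, 5.56, 10.93, 26.83, 30.26, 36.00, 27.05, 2.62, 2.88, 2.78],
    "SocratTeachLLM": [57.40, 33.63, 41.96, 75.13, 94.71, 87.00, 89.03, 4.66, 4.53, 4.45],
}


def best_model_per_metric(scores, metrics):
    """Map each metric name to the model with the highest score in that column."""
    return {
        metric: max(scores, key=lambda name: scores[name][i])
        for i, metric in enumerate(metrics)
    }


winners = best_model_per_metric(SCORES, METRICS)
for metric, model in winners.items():
    print(f"{metric}: {model}")
```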
+---
+## Training Details
+| Setting | Value |
+|---|---|
+| Base model | GLM4-9B-Chat |
+| Method | LoRA |
+| Epochs | 3 |
+| Learning rate | 5e-5 |
+| Batch size | 16 |
+| Train split | 6,123 dialogues (90%) |
+| Test split | 680 dialogues (10%) |
+| Hardware | 2× NVIDIA A800 80GB |
+| Dataset | SocratDataset (6,803 records, Chinese) |
+### Training Objective
+```
+P(teacher_response | dialogue_history, evaluation, action)
+```
+The `evaluation` (consultant's stage/state assessment) and `action` (recommended strategy) fields are required conditioning signals. At inference time, a consultant agent produces these before the teacher agent generates its response. Without the consultant outputs as conditioning, the model will underperform.
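To make the conditioning concrete, here is a minimal sketch of how a consultant's outputs might be folded into the teacher's chat messages. The layout is an assumption: this README does not show the prompt template used in training, so `build_teacher_messages` and the `Evaluation:` / `Action:` labels are hypothetical. Consult the KELE repository for the actual template.

```python
from typing import Dict, List


def build_teacher_messages(
    history: List[Dict[str, str]],
    evaluation: str,
    action: str,
) -> List[Dict[str, str]]:
    """Prepend the consultant's evaluation and action to the dialogue history.

    Hypothetical prompt layout; the real training template may differ.
    """
    system = (
        "You are a Socratic teacher.\n"
        f"Evaluation: {evaluation}\n"
        f"Action: {action}"
    )
    return [{"role": "system", "content": system}, *history]


messages = build_teacher_messages(
    history=[{"role": "user", "content": "Why do the seasons change?"}],
    evaluation="Stage b, state b3: student has a misconception about distance to the sun.",
    action="Ask a probing question that tests the distance explanation.",
)
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` in the same way as the plain single-turn example in the usage section.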
 ---
 ## Model Architecture
 | Parameter | Value |
+|---|---|
 | Base model | GLM4-9B-Chat (`ChatGLMForConditionalGeneration`) |
+| Total parameters | ~9.4B |
 | Layers | 40 |
 | Hidden size | 4,096 |
 | Attention heads | 32 |
 | Max context length | 131,072 tokens (128K) |
 | Storage dtype | bfloat16 |
 | Attention | Multi-query (2 groups), RoPE (ratio 500) |
+| Normalization | RMSNorm |
+| Weight files | 4× safetensors shards (~18.8 GB total) |
+**Generation defaults:** temperature 0.8, top-p 0.8.
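The ~18.8 GB shard total follows from the parameter count times the bytes per weight. A rough sketch of that arithmetic, using the ~9.4B figure from the table (`weight_footprint_gb` is an illustrative helper, and 1 GB is taken as 10^9 bytes):

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Raw weight storage in GB (10^9 bytes), ignoring any runtime overhead."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 9.4e9  # approximate total from the architecture table

bf16_gb = weight_footprint_gb(PARAMS, 16)  # bfloat16: 2 bytes per weight
nf4_gb = weight_footprint_gb(PARAMS, 4)    # 4-bit: half a byte per weight

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{nf4_gb:.1f} GB")
```

The raw 4-bit figure is lower than the ~6.5 GB VRAM quoted for the bitsandbytes path. The gap is plausibly quantization constants, layers kept in higher precision, and runtime buffers, though this README does not break it down.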
 ---
+## Usage
+### Transformers (recommended, ~19 GB VRAM)
 The model uses custom modeling code, so `trust_remote_code=True` is required.
     trust_remote_code=True,
 )
+messages = [{"role": "user", "content": "What do you think causes the seasons to change?"}]
 inputs = tokenizer.apply_chat_template(
     messages, add_generation_prompt=True, return_tensors="pt"
 ).to(model.device)
 outputs = model.generate(inputs, max_new_tokens=512, temperature=0.8, top_p=0.8)
 print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
 ```
+### 4-bit NF4 via bitsandbytes (~6.5 GB VRAM)
 ```python
 from transformers import BitsAndBytesConfig
     bnb_4bit_use_double_quant=True,
     bnb_4bit_quant_type="nf4",
 )
 model = AutoModelForCausalLM.from_pretrained(
     model_id,
     quantization_config=bnb_config,
 )
 ```
+### vLLM (OpenAI-compatible endpoint)
 ```bash
+vllm serve ulises-c/SocratTeachLLM \
+  --served-model-name SocratTeachLLM \
+  --dtype bfloat16 \
+  --trust-remote-code
 ```
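Once the server is running, it can be queried like any OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and assumes vLLM's default address `http://localhost:8000/v1`. `build_chat_request` and `ask_teacher` are hypothetical helper names, and the sampling values mirror the generation defaults listed above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed default; change if you pass --port


def build_chat_request(user_text: str, model: str = "SocratTeachLLM") -> urllib.request.Request:
    """Build a POST request for the chat completions route."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": user_text}],
        "temperature": 0.8,
        "top_p": 0.8,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_teacher(user_text: str) -> str:
    """Send one user turn and return the teacher's reply (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(user_text)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask_teacher("What do you think causes the seasons to change?")` returns the model's reply as a string when the server is reachable.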
+### Ollama
+This repo includes a `Modelfile` (auto-generated by LlamaFactory) with the correct ChatGLM4 stop sequences and a 4,096-token context window.
 ```bash
+ollama create SocratTeachLLM -f Modelfile
+ollama run SocratTeachLLM
 ```
+> **Note:** The bundled `Modelfile` sets the context window to 4,096 tokens. For the full 128K context, use Transformers or vLLM.
 ---
 ## Built With This Model
+**[csen-346](https://github.com/ulises-c/csen-346)** is a downstream course project (CSEN 346 NLP, Santa Clara University) that reproduces and extends the KELE framework using this model as the teacher agent.
 Key integration details:
+- **Teacher:** SocratTeachLLM, served via FastAPI (4-bit on RTX 3070) or vLLM (bfloat16 on RTX 5090 / SCU WAVE cluster L40S)
+- **Consultant:** GPT-4o (baseline) or Qwen3.5-9B (local variant)
+- **Evaluation:** 680-dialogue test split of SocratDataset, automated with ROUGE, BLEU, and GPT-4o judge (B.5 rubric)
+- **English extension:** An English translation of the training dataset is available at [ulises-c/SocratDataset-EN](https://huggingface.co/datasets/ulises-c/SocratDataset-EN)
 ```bash
 hf download ulises-c/SocratTeachLLM --local-dir ~/hf_models/SocratTeachLLM
 ```
 ## Training Data
 | Property | Value |
+|---|---|
+| Dataset | [ulises-c/SocratDataset](https://huggingface.co/datasets/ulises-c/SocratDataset) |
 | Dialogues | 6,803 |
 | Turns | 42,000+ |
+| Domain | Elementary school science (grades 1–6) |
+| Language | Chinese (Simplified) |
+| Train split | 6,123 dialogues (90%) |
 | Test split | 680 dialogues (10%) |
 | Strategies | 34 SocRule teaching strategies |
+An English translation of the training data is available at [ulises-c/SocratDataset-EN](https://huggingface.co/datasets/ulises-c/SocratDataset-EN).
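As a quick consistency check on the split sizes in the table, the sketch below recomputes the train count and the percentages from the dialogue totals. The variable names are illustrative, and the 4,245 single-turn test examples come from the performance section.

```python
TOTAL_DIALOGUES = 6803
TEST_DIALOGUES = 680
TEST_SINGLE_TURN_EXAMPLES = 4245  # reported for the 680-dialogue test set

# Whatever is not held out for testing is used for training.
train_dialogues = TOTAL_DIALOGUES - TEST_DIALOGUES

train_pct = 100 * train_dialogues / TOTAL_DIALOGUES
test_pct = 100 * TEST_DIALOGUES / TOTAL_DIALOGUES
turns_per_test_dialogue = TEST_SINGLE_TURN_EXAMPLES / TEST_DIALOGUES

print(f"train: {train_dialogues} dialogues ({train_pct:.1f}%)")
print(f"test:  {TEST_DIALOGUES} dialogues ({test_pct:.1f}%)")
print(f"average test examples per dialogue: {turns_per_test_dialogue:.2f}")
```

The train count comes out to 6,123, matching the table, and the test set averages a little over six single-turn examples per dialogue, in line with 42,000+ turns across 6,803 dialogues.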
 ---
 ## Citation
 If you use this model, please cite the original KELE paper:
 ```bibtex
+@inproceedings{peng-etal-2025-kele,
+  title     = {{KELE}: A Multi-Agent Framework for Structured {S}ocratic Teaching with Large Language Models},
   author    = {Peng, Yuan and others},
   booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
   year      = {2025},
+  url       = {https://aclanthology.org/2025.findings-emnlp.888/}
 }
 ```
 ---
+## Related Resources
+| Resource | Link |
+|---|---|
+| KELE paper (EMNLP 2025 Findings) | https://aclanthology.org/2025.findings-emnlp.888/ |
+| KELE GitHub repository | https://github.com/yuanpan1020/KELE |
+| Original model | https://huggingface.co/yuanpan/SocratTeachLLM |
+| Training data (Chinese) | https://huggingface.co/datasets/ulises-c/SocratDataset |
+| Training data (English translation) | https://huggingface.co/datasets/ulises-c/SocratDataset-EN |
+| Evaluation + inference code | https://github.com/ulises-c/csen-346 |
+---
 ## License
 [Apache 2.0](LICENSE)