SmolLM2-360M: GraphCoT Fine-tune
Fine-tuned version of SmolLM2-360M on the Wikipedia GraphCoT Summarization dataset.
The model learns to perform semantic graph chain-of-thought (GraphCoT) summarization: given a raw paragraph, it explicitly constructs a semantic knowledge graph before writing a summary grounded exclusively in that graph. The intent is that no fact appears in the summary unless it exists as an edge in the graph.
What it does
Given a paragraph wrapped in <input> tags, the model generates a structured <reasoning> block followed by a <summary>:
- Content words: extracts key nouns, verbs, and proper entities
- Coreference: resolves pronouns and aliases to a single primary subject
- Relations: builds (subject, relation, object) triples in snake_case
- Graph: organises triples into a rooted ASCII tree
- Summary: writes 1 to 3 sentences where every clause maps to a graph edge
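The Relations step emits one triple per line, so a consumer of the output may want to parse and validate those lines. The following is a minimal sketch, assuming the `(subject, relation, object)` layout used in this card; `parse_triple` and the snake_case pattern are illustrative names and assumptions, not part of the model or the dataset tooling.

```python
import re

# Assumption: "snake_case" means lowercase words joined by underscores.
SNAKE_CASE = re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+)*$")
TRIPLE_LINE = re.compile(r"^\((.+?),\s*([^,]+?),\s*(.+)\)$")


def parse_triple(line):
    """Parse '(subject, relation, object)' into a 3-tuple, or return None."""
    match = TRIPLE_LINE.match(line.strip())
    if match is None:
        return None
    subject, relation, obj = (part.strip() for part in match.groups())
    # Reject lines whose relation label is not snake_case.
    if not SNAKE_CASE.match(relation):
        return None
    return subject, relation, obj
```

Returning `None` instead of raising keeps the helper usable on placeholder lines such as `(none)`.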
Example output
Input:
Alan Turing was a British mathematician and computer scientist who played a
crucial role in breaking German ciphers during World War II. He is widely
regarded as the father of theoretical computer science and artificial intelligence.
Model output:
<reasoning>
<content_words>
nouns: mathematician, computer, scientist, role, cipher, father, theory, intelligence
verbs: break, regard
proper_entities: Alan Turing, British, World War II, German
</content_words>
<coref>
(none)
</coref>
<relations>
(Alan Turing, is_a, British mathematician)
(Alan Turing, is_a, computer scientist)
(Alan Turing, broke, German ciphers)
(Alan Turing, broke_during, World War II)
(Alan Turing, is_the_father_of, theoretical computer science)
(Alan Turing, is_the_father_of, artificial intelligence)
</relations>
<graph>
ROOT: Alan Turing
├── is_a -> British mathematician
├── is_a -> computer scientist
├── broke -> German ciphers
│   └── broke_during -> World War II
├── is_the_father_of -> theoretical computer science
└── is_the_father_of -> artificial intelligence
</graph>
</reasoning>
<summary>
Alan Turing was a British mathematician and computer scientist who broke German
ciphers during World War II. He is regarded as the father of theoretical computer
science and artificial intelligence.
</summary>
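Because the output is tagged, it can be split into its sections with a few regular expressions. This is a rough sketch rather than an official parser: the tag names are taken from the example above, while `split_sections` is a hypothetical helper name.

```python
import re

# Tag names as they appear in the example output above.
SECTION_TAGS = ("content_words", "coref", "relations", "graph", "summary")


def split_sections(output):
    """Return {tag: inner text} for every section tag found in the output."""
    sections = {}
    for tag in SECTION_TAGS:
        # DOTALL lets '.' span the multiple lines inside a section.
        match = re.search(rf"<{tag}>\n?(.*?)\n?</{tag}>", output, flags=re.DOTALL)
        if match:
            sections[tag] = match.group(1).strip("\n")
    return sections
```

Sections that are missing from the output are simply absent from the returned dictionary, so callers can detect incomplete generations.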
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ambrosfitz/smollm2-graphcot"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

paragraph = "Your paragraph here."
# Wrap the paragraph in <input> tags and open the <reasoning> block for the model to continue.
prompt = f"<input>\n{paragraph}\n</input>\n\n<reasoning>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=400,
        do_sample=False,  # greedy decoding
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
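Since the prompt already ends with `<reasoning>`, the decoded continuation starts inside the reasoning block and the summary comes last. A small post-processing sketch follows; `extract_summary` is a hypothetical helper, and the fallback for a missing closing tag is an assumption about how a generation cut off at `max_new_tokens` might look.

```python
def extract_summary(generated):
    """Return the text inside <summary>...</summary>, or None if absent.

    If the closing tag is missing (for example because generation stopped
    at the token limit), the partial text after <summary> is returned.
    """
    start = generated.find("<summary>")
    if start == -1:
        return None
    body = generated[start + len("<summary>"):]
    end = body.find("</summary>")
    if end != -1:
        body = body[:end]
    return body.strip()
```

A `None` result usually means the reasoning block consumed the whole token budget, in which case a larger `max_new_tokens` may help.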
Training
Dataset
ambrosfitz/Wikipedia_GraphCoT_Summarization: 6,856 Wikipedia paragraphs processed through a two-stage pipeline:
- Stage 1 (local): spaCy scaffold, covering content word extraction, dependency triples, and coreference clustering
- Stage 2 (LLM): Gemini 2.5 Flash normalization, covering semantic edge labelling, tree assembly, and grounded summary generation
| Split | Records |
|---|---|
| Train | 6,172 |
| Validation | 342 |
| Test | 342 |
Loss masking
Only <reasoning> and <summary> tokens contribute to the loss. The <input> paragraph is masked (label = -100) so the model learns to generate the graph and summary, not memorise the input.
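The masking described above can be sketched on plain token-id lists. This is an illustration, not the actual training code: `build_labels` is a hypothetical helper, and exactly where the prompt ends (before or after the opening `<reasoning>` tag) is an assumption left to the caller.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def build_labels(prompt_ids, target_ids):
    """Concatenate prompt and target ids, masking the prompt in the labels."""
    input_ids = list(prompt_ids) + list(target_ids)
    # Prompt positions get IGNORE_INDEX so they contribute nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels
```

For example, a 3-token prompt followed by a 2-token target yields labels `[-100, -100, -100, t1, t2]`, so only the target tokens are scored.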
Hyperparameters
| Parameter | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM2-360M |
| Epochs | 3 |
| Effective batch size | 16 (8 × 2 gradient accumulation steps) |
| Learning rate | 2e-5 |
| LR schedule | Cosine with 100 warmup steps |
| Max sequence length | 1024 tokens |
| Precision | fp16 (AMP) |
| Gradient checkpointing | Yes |
| Hardware | NVIDIA T4 (Google Colab) |
| Training time | ~2h 18m |
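The step count and learning-rate shape implied by these settings can be worked out with a few lines of arithmetic. The sketch below assumes that "Cosine with 100 warmup steps" means the common linear-warmup, cosine-decay-to-zero schedule; the constant and function names are illustrative only.

```python
import math

TRAIN_RECORDS = 6172   # train split size
EFFECTIVE_BATCH = 16   # 8 per step x 2 gradient accumulation steps
EPOCHS = 3
WARMUP_STEPS = 100
PEAK_LR = 2e-5

steps_per_epoch = math.ceil(TRAIN_RECORDS / EFFECTIVE_BATCH)  # 386
total_steps = steps_per_epoch * EPOCHS                        # 1158


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the run lasts 1158 optimizer steps, with the learning rate peaking at step 100 and decaying to roughly zero at the end.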
Training curves
| Step | Train Loss | Eval Loss |
|---|---|---|
| 100 | 0.520 | 0.497 |
| 300 | 0.369 | 0.367 |
| 500 | 0.315 | 0.335 |
| 700 | 0.310 | 0.320 |
| 900 | 0.260 | 0.314 |
| 1100 | 0.278 | 0.312 |
| 1158 | 0.282 | 0.312 |
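Train and validation loss stayed close throughout (the largest gap is about 0.054, at step 900), and validation loss never increased between checkpoints, which suggests no significant overfitting.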
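The gap between the two loss columns can be checked directly from the table. This is a small sketch with the rows copied by hand; the helper names are illustrative.

```python
# (step, train_loss, eval_loss) rows copied from the table above.
CURVE = [
    (100, 0.520, 0.497),
    (300, 0.369, 0.367),
    (500, 0.315, 0.335),
    (700, 0.310, 0.320),
    (900, 0.260, 0.314),
    (1100, 0.278, 0.312),
    (1158, 0.282, 0.312),
]


def largest_gap(rows):
    """Return (step, gap) for the row with the largest eval - train difference."""
    step, train, evaluation = max(rows, key=lambda row: row[2] - row[1])
    return step, round(evaluation - train, 3)


def eval_never_increases(rows):
    """True if eval loss is non-increasing from one logged step to the next."""
    return all(later[2] <= earlier[2] for earlier, later in zip(rows, rows[1:]))
```

A widening gap combined with a rising eval loss would be the usual sign of overfitting; here the eval loss only flattens out.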
Limitations
- Trained on Wikipedia-style encyclopaedic paragraphs; may produce lower-quality graphs on conversational or highly technical text
- 360M parameters: graph structure may be incomplete or inconsistent on long or complex inputs
- Max context 1024 tokens; paragraphs longer than ~700 words will be truncated