SmolLM2-360M — GraphCoT Fine-tune

A fine-tuned version of SmolLM2-360M, trained on the Wikipedia GraphCoT Summarization dataset.

The model learns to perform semantic graph chain-of-thought (GraphCoT) summarization: given a raw paragraph, it explicitly constructs a semantic knowledge graph before writing a summary grounded exclusively in that graph. The training target is strict: a fact may appear in the summary only if it exists as an edge in the graph.

What it does

Given a paragraph wrapped in <input> tags, the model generates a structured <reasoning> block followed by a <summary>:

Content words — extracts key nouns, verbs, and proper entities
Coreference — resolves pronouns and aliases to a single primary subject
Relations — builds (subject, relation, object) triples in snake_case
Graph — organises triples into a rooted ASCII tree
Summary — writes 1–3 sentences where every clause maps to a graph edge
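
The triple format used in the Relations step can be pictured as a small data structure. The sketch below is illustrative only: `Triple` and `to_snake_case` are hypothetical names, not part of the model or dataset code, and the normalisation rule (lowercase, with runs of non-alphanumeric characters turned into underscores) is an assumption about what snake_case means here.

```python
import re
from dataclasses import dataclass


def to_snake_case(label: str) -> str:
    """Lowercase a relation label and join its words with underscores."""
    # Replace every run of non-alphanumeric characters with one underscore.
    return re.sub(r"[^0-9a-z]+", "_", label.lower()).strip("_")


@dataclass(frozen=True)
class Triple:
    """One (subject, relation, object) edge; hypothetical helper type."""
    subject: str
    relation: str
    obj: str

    def __post_init__(self):
        # Reject relation labels that are not already in snake_case.
        if self.relation != to_snake_case(self.relation):
            raise ValueError(f"relation not in snake_case: {self.relation!r}")


triple = Triple("Alan Turing", to_snake_case("is the father of"), "artificial intelligence")
print(triple.relation)  # is_the_father_of
```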

Example output

Input:

Alan Turing was a British mathematician and computer scientist who played a
crucial role in breaking German ciphers during World War II. He is widely
regarded as the father of theoretical computer science and artificial intelligence.

Model output:

<reasoning>
<content_words>
  nouns: mathematician, computer, scientist, role, cipher, father, theory, intelligence
  verbs: break, regard
  proper_entities: Alan Turing, British, World War II, German
</content_words>

<coref>
  (none)
</coref>

<relations>
  (Alan Turing, is_a, British mathematician)
  (Alan Turing, is_a, computer scientist)
  (Alan Turing, broke, German ciphers)
  (Alan Turing, broke_during, World War II)
  (Alan Turing, is_the_father_of, theoretical computer science)
  (Alan Turing, is_the_father_of, artificial intelligence)
</relations>

<graph>
ROOT: Alan Turing
├─ is_a -> British mathematician
├─ is_a -> computer scientist
├─ broke -> German ciphers
│  └─ broke_during -> World War II
├─ is_the_father_of -> theoretical computer science
└─ is_the_father_of -> artificial intelligence
</graph>
</reasoning>

<summary>
Alan Turing was a British mathematician and computer scientist who broke German
ciphers during World War II. He is regarded as the father of theoretical computer
science and artificial intelligence.
</summary>
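
Output in this format is straightforward to check mechanically. The following is a minimal sketch, assuming the tag names and line shapes shown in the example above; the function names and regular expressions are hypothetical and not part of the released code. It compares only (relation, object) pairs, because nested tree lines have an implied parent rather than an explicit subject.

```python
import re

# "(subject, relation, object)" lines inside <relations>
TRIPLE_RE = re.compile(r"^\s*\((.+?),\s*([a-z0-9_]+),\s*(.+)\)\s*$")
# "├─ relation -> object" lines inside <graph>, at any nesting depth
EDGE_RE = re.compile(r"^[\s│├└─]*([a-z0-9_]+)\s*->\s*(.+?)\s*$")


def tag_body(text, tag):
    """Return the text between <tag> and </tag>, or "" if the tag is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1) if match else ""


def parse_relations(text):
    """Parse the <relations> block into a set of (subject, relation, object)."""
    triples = set()
    for line in tag_body(text, "relations").splitlines():
        m = TRIPLE_RE.match(line)
        if m:
            triples.add(tuple(part.strip() for part in m.groups()))
    return triples


def parse_graph_edges(text):
    """Parse the <graph> block into a set of (relation, object) pairs."""
    edges = set()
    for line in tag_body(text, "graph").splitlines():
        m = EDGE_RE.match(line)  # the "ROOT: ..." line does not match
        if m:
            edges.add((m.group(1), m.group(2)))
    return edges


def edges_missing_from_relations(text):
    """Graph edges whose (relation, object) pair has no matching triple."""
    known = {(rel, obj) for _, rel, obj in parse_relations(text)}
    return parse_graph_edges(text) - known


sample = """<relations>
  (Alan Turing, is_a, computer scientist)
  (Alan Turing, broke, German ciphers)
</relations>
<graph>
ROOT: Alan Turing
├─ is_a -> computer scientist
└─ broke -> German ciphers
</graph>"""
print(edges_missing_from_relations(sample))  # set()
```

An empty result means every edge drawn in the tree is backed by a triple; it does not verify that the summary itself is grounded.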

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ambrosfitz/smollm2-graphcot"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the `accelerate` package to be installed.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

paragraph = "Your paragraph here."
# The prompt ends with an opening <reasoning> tag so generation starts inside the reasoning block.
prompt = f"<input>\n{paragraph}\n</input>\n\n<reasoning>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=400,
        do_sample=False,  # greedy decoding
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens (everything after the prompt).
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
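
Because the prompt already ends with the opening `<reasoning>` tag, the decoded continuation begins inside the reasoning block. A minimal sketch for building the prompt and pulling out the summary follows. `build_prompt` and `extract_summary` are hypothetical helper names, and the fallback for a missing `</summary>` tag (for example when generation stops at `max_new_tokens`) is an assumption about how cut-off output might be handled.

```python
import re


def build_prompt(paragraph: str) -> str:
    """Wrap a paragraph in <input> tags and open the <reasoning> block."""
    return f"<input>\n{paragraph}\n</input>\n\n<reasoning>\n"


def extract_summary(continuation: str):
    """Return the text inside <summary>, or None if no summary was started."""
    # Accept either a closing tag or the end of the string (cut-off output).
    match = re.search(r"<summary>(.*?)(?:</summary>|\Z)", continuation, flags=re.DOTALL)
    if match is None:
        return None
    # Collapse line breaks so a wrapped summary comes back as one line.
    return " ".join(match.group(1).split())


continuation = (
    "<content_words>...</content_words>\n</reasoning>\n\n"
    "<summary>\nAlan Turing broke German\nciphers.\n</summary>"
)
print(extract_summary(continuation))  # Alan Turing broke German ciphers.
```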

Training

Dataset

ambrosfitz/Wikipedia_GraphCoT_Summarization — 6,856 Wikipedia paragraphs processed through a two-stage pipeline:

Stage 1 (local): spaCy scaffold — content word extraction, dependency triples, coreference clustering
Stage 2 (LLM): Gemini 2.5 Flash normalization — semantic edge labelling, tree assembly, grounded summary generation

Split	Records
Train	6,172
Validation	342
Test	342

Loss masking

Only <reasoning> and <summary> tokens contribute to the loss. The <input> paragraph is masked (label = -100) so the model learns to generate the graph and summary, not memorise the input.
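
The masking can be sketched in a few lines. This is an illustration, not the training script: `build_labels` is a hypothetical helper, the token IDs are made up, and it relies on the PyTorch convention that label positions set to -100 are ignored by the cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_labels(input_ids, prompt_len):
    """Copy input_ids as labels, masking the first prompt_len positions."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must lie within the sequence")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Made-up IDs: the first 4 tokens stand for the <input>...</input> part,
# the remaining 3 for the <reasoning>/<summary> target.
input_ids = [11, 12, 13, 14, 21, 22, 23]
labels = build_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```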

Hyperparameters

Parameter	Value
Base model	HuggingFaceTB/SmolLM2-360M
Epochs	3
Effective batch size	16 (8 × 2 grad accum)
Learning rate	2e-5
LR schedule	Cosine with 100 warmup steps
Max sequence length	1024 tokens
Precision	fp16 (AMP)
Gradient checkpointing	Yes
Hardware	NVIDIA T4 (Google Colab)
Training time	~2h 18m
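
As a sanity check on these settings, the step count and learning-rate schedule can be worked out by hand. The sketch below assumes the last partial batch of each epoch is kept (rounding up) and that the schedule is the common linear-warmup-then-cosine-decay-to-zero form; neither detail is confirmed by the training script. Under those assumptions the total comes to 1158 optimizer steps.

```python
import math

TRAIN_RECORDS = 6172
EFFECTIVE_BATCH = 8 * 2  # per-device batch size x gradient accumulation steps
EPOCHS = 3
BASE_LR = 2e-5
WARMUP_STEPS = 100

steps_per_epoch = math.ceil(TRAIN_RECORDS / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step):
    """Linear warmup to BASE_LR, then cosine decay to zero at total_steps."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps)  # 386 1158
```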

Training curves

Step	Train Loss	Eval Loss
100	0.520	0.497
300	0.369	0.367
500	0.315	0.335
700	0.310	0.320
900	0.260	0.314
1100	0.278	0.312
1158	0.282	0.312

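Train and eval loss stayed within about 0.05 of each other throughout (the widest gap, 0.054, is at step 900), and eval loss never rose between logged steps, so there is no sign of overfitting.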
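
The gap between the two loss columns can be recomputed directly from the table rows. Nothing here comes from the training script; the values are copied from the table and the variable names are illustrative.

```python
# (step, train_loss, eval_loss) rows copied from the training-curve table
curve = [
    (100, 0.520, 0.497),
    (300, 0.369, 0.367),
    (500, 0.315, 0.335),
    (700, 0.310, 0.320),
    (900, 0.260, 0.314),
    (1100, 0.278, 0.312),
    (1158, 0.282, 0.312),
]

# Positive gap: eval loss is higher than train loss at that step.
gaps = {step: round(eval_loss - train_loss, 3) for step, train_loss, eval_loss in curve}
widest_step = max(gaps, key=lambda step: abs(gaps[step]))

for step, gap in gaps.items():
    print(f"step {step:>4}: eval - train = {gap:+.3f}")
print("widest gap at step", widest_step)
```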

Limitations

Trained on Wikipedia-style encyclopaedic paragraphs; may produce lower-quality graphs on conversational or highly technical text
360M parameters — graph structure may be incomplete or inconsistent on long or complex inputs
Trained with a maximum sequence length of 1024 tokens, shared by the input paragraph and the generated output; paragraphs longer than ~700 words will be truncated
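
A quick way to check whether a prompt leaves room for the output is to compare token counts against the 1024-token limit. This is a sketch under the assumption that prompt and generated tokens share the same 1024-token window, as they did in training; `remaining_budget` and `fits` are hypothetical helpers, and the default of 400 new tokens mirrors the Usage snippet.

```python
MAX_SEQ_LEN = 1024  # max sequence length from the hyperparameter table


def remaining_budget(prompt_tokens: int, max_seq_len: int = MAX_SEQ_LEN) -> int:
    """Tokens left for <reasoning> and <summary> after the prompt."""
    return max(0, max_seq_len - prompt_tokens)


def fits(prompt_tokens: int, max_new_tokens: int = 400) -> bool:
    """True if the prompt plus the requested new tokens stay within the limit."""
    return prompt_tokens + max_new_tokens <= MAX_SEQ_LEN


print(remaining_budget(300), fits(300))  # 724 True
print(remaining_budget(900), fits(900))  # 124 False
```

With the Usage snippet, `prompt_tokens` would be `inputs["input_ids"].shape[1]`.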
