SmolLM2-360M: GraphCoT Fine-tune

Fine-tuned version of SmolLM2-360M on the Wikipedia GraphCoT Summarization dataset.

The model learns to perform semantic graph chain-of-thought (GraphCoT) summarization: given a raw paragraph, it explicitly constructs a semantic knowledge graph before writing a summary grounded exclusively in that graph. It is trained so that no fact appears in the summary unless it exists as an edge in the graph.


What it does

Given a paragraph wrapped in <input> tags, the model generates a structured <reasoning> block followed by a <summary>:

  1. Content words: extracts key nouns, verbs, and proper entities
  2. Coreference: resolves pronouns and aliases to a single primary subject
  3. Relations: builds (subject, relation, object) triples in snake_case
  4. Graph: organises triples into a rooted ASCII tree
  5. Summary: writes 1–3 sentences where every clause maps to a graph edge

Example output

Input:

Alan Turing was a British mathematician and computer scientist who played a
crucial role in breaking German ciphers during World War II. He is widely
regarded as the father of theoretical computer science and artificial intelligence.

Model output:

<reasoning>
<content_words>
  nouns: mathematician, computer, scientist, role, cipher, father, theory, intelligence
  verbs: break, regard
  proper_entities: Alan Turing, British, World War II, German
</content_words>

<coref>
  (none)
</coref>

<relations>
  (Alan Turing, is_a, British mathematician)
  (Alan Turing, is_a, computer scientist)
  (Alan Turing, broke, German ciphers)
  (Alan Turing, broke_during, World War II)
  (Alan Turing, is_the_father_of, theoretical computer science)
  (Alan Turing, is_the_father_of, artificial intelligence)
</relations>

<graph>
ROOT: Alan Turing
├─ is_a -> British mathematician
├─ is_a -> computer scientist
├─ broke -> German ciphers
│  └─ broke_during -> World War II
├─ is_the_father_of -> theoretical computer science
└─ is_the_father_of -> artificial intelligence
</graph>
</reasoning>

<summary>
Alan Turing was a British mathematician and computer scientist who broke German
ciphers during World War II. He is regarded as the father of theoretical computer
science and artificial intelligence.
</summary>
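Because the output follows a fixed tag layout, the triples can be recovered with a little parsing. The sketch below is not part of the model repository: `parse_relations` is a hypothetical helper, and it assumes each triple sits on its own line as `(subject, relation, object)` with no commas inside the subject or the relation.

```python
import re

# Hypothetical helper, not part of the model repository.
# Assumes one "(subject, relation, object)" triple per line inside <relations>.
TRIPLE_RE = re.compile(r"^\s*\(([^,]+),\s*([^,]+),\s*(.+)\)\s*$")


def parse_relations(output: str) -> list[tuple[str, str, str]]:
    """Return the (subject, relation, object) triples found in a <relations> block."""
    block = re.search(r"<relations>(.*?)</relations>", output, flags=re.DOTALL)
    if block is None:
        return []
    triples = []
    for line in block.group(1).splitlines():
        match = TRIPLE_RE.match(line)
        if match:
            subject, relation, obj = (part.strip() for part in match.groups())
            triples.append((subject, relation, obj))
    return triples


sample = """<relations>
  (Alan Turing, is_a, British mathematician)
  (Alan Turing, broke, German ciphers)
</relations>"""

for triple in parse_relations(sample):
    print(triple)
```

The parsed triples could then be compared against the summary to spot clauses that have no supporting edge, although that check is left out here.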

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# device_map="auto" requires the accelerate package to be installed.
model_id = "ambrosfitz/smollm2-graphcot"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Wrap the paragraph in <input> tags and open the <reasoning> block for the model to continue.
paragraph = "Your paragraph here."
prompt = f"<input>\n{paragraph}\n</input>\n\n<reasoning>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=400,
        do_sample=False,  # greedy decoding
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
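Since the prompt already ends with the opening reasoning tag, the decoded continuation starts inside the reasoning block. A minimal sketch for separating the two parts follows; `split_output` is a hypothetical helper, and it assumes the model closes its tags (the summary comes back as `None` otherwise, for example when generation stops at `max_new_tokens`).

```python
import re
from typing import Optional


def split_output(generated: str) -> tuple[str, Optional[str]]:
    """Split the decoded continuation into (reasoning, summary).

    `generated` is expected to start inside the reasoning block, because the
    opening tag is part of the prompt. The summary is None when no complete
    <summary>...</summary> pair is present.
    """
    reasoning = generated.split("</reasoning>", 1)[0].strip()
    match = re.search(r"<summary>(.*?)</summary>", generated, flags=re.DOTALL)
    # Collapse line breaks and repeated spaces inside the summary.
    summary = " ".join(match.group(1).split()) if match else None
    return reasoning, summary


# Illustrative continuation, not real model output.
generated = (
    "<relations>\n  (A, is_a, B)\n</relations>\n</reasoning>\n\n"
    "<summary>\nA is a B.\n</summary>"
)
reasoning, summary = split_output(generated)
print(summary)  # A is a B.
```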

Training

Dataset

ambrosfitz/Wikipedia_GraphCoT_Summarization contains 6,856 Wikipedia paragraphs processed through a two-stage pipeline:

  • Stage 1 (local): spaCy scaffold (content word extraction, dependency triples, coreference clustering)
  • Stage 2 (LLM): Gemini 2.5 Flash normalization (semantic edge labelling, tree assembly, grounded summary generation)

Split        Records
Train        6,172
Validation   342
Test         342

Loss masking

Only <reasoning> and <summary> tokens contribute to the loss. The <input> paragraph is masked (label = -100) so the model learns to generate the graph and summary, not memorise the input.
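A minimal sketch of this kind of prompt masking is shown below. The training script is not part of this card, so `build_labels` and the token ids are illustrative, and the sketch assumes the prompt and the target are tokenised separately and then concatenated. The value -100 is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def build_labels(prompt_ids: list[int], target_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and target ids and mask the prompt positions.

    Returns (input_ids, labels). Only target positions keep their token id as
    a label, so only they contribute to the loss.
    """
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


# Illustrative token ids, not real tokenizer output.
prompt_ids = [11, 12, 13]      # stands in for the <input> ... </input> part
target_ids = [21, 22, 23, 24]  # stands in for the reasoning and summary part

input_ids, labels = build_labels(prompt_ids, target_ids)
print(labels)  # [-100, -100, -100, 21, 22, 23, 24]
```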

Hyperparameters

Parameter                Value
Base model               HuggingFaceTB/SmolLM2-360M
Epochs                   3
Effective batch size     16 (8 × 2 gradient accumulation steps)
Learning rate            2e-5
LR schedule              Cosine with 100 warmup steps
Max sequence length      1024 tokens
Precision                fp16 (AMP)
Gradient checkpointing   Yes
Hardware                 NVIDIA T4 (Google Colab)
Training time            ~2h 18m
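The step count and learning-rate curve follow from these values and the 6,172-record train split. The sketch below assumes the last partial batch of each epoch counts as a step, and that the schedule uses linear warmup followed by a single cosine decay to zero over the whole run (the common formulation). The card does not state which scheduler implementation was used, so `lr_at` is illustrative only.

```python
import math

TRAIN_RECORDS = 6172   # train split size
EFFECTIVE_BATCH = 16   # 8 per device x 2 gradient accumulation steps
EPOCHS = 3
BASE_LR = 2e-5
WARMUP_STEPS = 100

steps_per_epoch = math.ceil(TRAIN_RECORDS / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps)  # 386 1158
for step in (0, 50, 100, 600, total_steps):
    print(step, lr_at(step))
```

Under these assumptions the run ends at step 1158, which matches the last step logged under Training curves.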

Training curves

Step   Train Loss   Eval Loss
100    0.520        0.497
300    0.369        0.367
500    0.315        0.335
700    0.310        0.320
900    0.260        0.314
1100   0.278        0.312
1158   0.282        0.312

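The gap between train and eval loss never exceeded 0.054 (at step 900) and was about 0.03 at the end of training. Eval loss fell or held steady at every logged step, which suggests no significant overfitting.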
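The gap can be recomputed directly from the logged values. The snippet below is only a check on the table above; the numbers are copied from it and nothing else is assumed.

```python
# (step, train_loss, eval_loss) copied from the training curves table
curve = [
    (100, 0.520, 0.497),
    (300, 0.369, 0.367),
    (500, 0.315, 0.335),
    (700, 0.310, 0.320),
    (900, 0.260, 0.314),
    (1100, 0.278, 0.312),
    (1158, 0.282, 0.312),
]

# Positive gap means eval loss is higher than train loss.
gaps = [(step, round(eval_loss - train_loss, 3)) for step, train_loss, eval_loss in curve]
worst_step, worst_gap = max(gaps, key=lambda item: abs(item[1]))

# True when eval loss never rises between consecutive logged steps.
eval_non_increasing = all(later[2] <= earlier[2] for earlier, later in zip(curve, curve[1:]))

for step, gap in gaps:
    print(f"step {step:>4}: eval - train = {gap:+.3f}")
print(f"largest gap: {worst_gap:+.3f} at step {worst_step}")
print(f"eval loss non-increasing: {eval_non_increasing}")
```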


Limitations

  • Trained on Wikipedia-style encyclopaedic paragraphs; may produce lower-quality graphs on conversational or highly technical text
  • 360M parameters: graph structure may be incomplete or inconsistent on long or complex inputs
  • Max context 1024 tokens; paragraphs longer than ~700 words will be truncated
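A rough pre-check for the length limit might look like the sketch below. `fits_context` is a hypothetical helper. It assumes the 1024-token budget has to cover both the prompt and the generated tokens, as it did for training sequences, and it reserves 400 tokens to mirror `max_new_tokens` in the usage snippet. The `encode` argument is any callable that maps text to token ids.

```python
from typing import Callable, Sequence

MAX_SEQ_LEN = 1024     # max sequence length from the hyperparameter table
RESERVED_OUTPUT = 400  # mirrors max_new_tokens in the usage snippet


def fits_context(
    paragraph: str,
    encode: Callable[[str], Sequence],
    max_len: int = MAX_SEQ_LEN,
    reserved: int = RESERVED_OUTPUT,
) -> bool:
    """Return True if the prompt plus the reserved output budget fits in max_len tokens."""
    prompt = f"<input>\n{paragraph}\n</input>\n\n<reasoning>\n"
    return len(encode(prompt)) + reserved <= max_len


# Stand-in encoder for demonstration only: one "token" per whitespace-separated word.
def fake_encode(text: str) -> list[str]:
    return text.split()


print(fits_context("A short paragraph.", fake_encode))
print(fits_context("word " * 700, fake_encode))
```

With the real tokenizer, `encode` could be `lambda text: tokenizer(text)["input_ids"]`.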