TechTokenBERT

TechTokenBERT is a patent embedding model based on BERT. In the evaluation reported below it achieves the best scores on IPC code classification and title–abstract matching and the second-best score on citation prediction, ahead of larger models such as a fine-tuned LLaMA 3.1 8B.

The core innovation is treating International Patent Classification (IPC) codes as dedicated tokens in the model's vocabulary (technological tokens). This allows the attention mechanism to operate directly between patent text and classification codes during fine-tuning, producing embeddings that are simultaneously aware of linguistic content and technological structure.

Introduced in:

Anticipating Innovation Using Large Language Models
Enrico Maria Fenoaltea, Filippo Santoro, Giordano De Marzo, Segun Taofeek Aroyehun, Andrea Tacchella
arXiv:2605.04875 · May 2026
https://arxiv.org/abs/2605.04875


Model Description

TechTokenBERT extends the vocabulary of BERT4Patents with one dedicated token per IPC code at group level (~8000 codes in total). Fine-tuning uses masked language modelling on sequences of the form:

[CLS] patent title [SEP] patent abstract [SEP] [TT_1] [TT_2] ... [TT_N] [SEP]

During fine-tuning the attention mechanism learns to link each technological token both to the words of the patent text and to the other IPC codes in the same patent. The result is a model with two complementary uses:

  • Patent embeddings: use the [CLS] token as a single vector representation of the full patent (title + abstract + IPC codes). The [CLS] embedding encodes information from both the text and the IPC codes it is associated with, outperforming standard sentence-embedding models on patent similarity tasks.
  • IPC-code embeddings: read the hidden-state vector at each technological-token position to obtain a context-dependent embedding of that IPC code in the specific patent.
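
As an illustration of the first use, [CLS] vectors can be compared to rank patents by similarity. The sketch below assumes cosine similarity as the measure, which the model card does not specify. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the 3-dimensional vectors are toy stand-ins for real 1024-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (sequences of floats)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for [CLS] vectors; real ones come from the model's last hidden state.
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
```

With tensors from the model, `torch.nn.functional.cosine_similarity` computes the same quantity over a batch.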

Training Data

  • Source: Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
  • Fine-tuning split: EPO patents published 1980–2023
  • IPC granularity: Group level, yielding ~8000 unique codes

Evaluation Results

Patent-related downstream tasks

Model                            IPC Macro-F1 ↑   Citation MAP ↑   Title–Abstract AUC-ROC ↑
BERT4Patents                     0.354            59.46            0.920
BERT4Patents FT (Mirror-BERT)    0.262            52.78            0.832
PatentSBERTa                     0.356            75.95            0.985
Paecter                          0.420            68.11            0.944
LLaMA 3.1 8B FT (LLM2Vec)        0.343            56.78            0.973
TechTokenBERT                    0.488            68.96            0.994

For details see the full paper.


Usage

Minimal example: build a batch where only the abstract is truncated (on the right), while the title, tech tokens, and all special tokens are preserved. Then run the model and extract the [CLS] embedding for each example.

import torch
from transformers import BertTokenizer, BertForMaskedLM

# ---------------------------------------------------------------------------
# 1. Load model + tokenizer
# ---------------------------------------------------------------------------
MODEL_NAME = "AndreaTacchella/TechTokenBert"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

# ---------------------------------------------------------------------------
# 2. Toy data (3 rows)
# ---------------------------------------------------------------------------
titles = [
    "Method for cooling electronic components",
    "Wireless charging apparatus",
    "Biodegradable packaging material",
]

abstracts = [
    "A heat sink assembly that dissipates thermal energy from a processor using "
    "a network of micro-channels through which a coolant is circulated, thereby "
    "maintaining the junction temperature below a predefined threshold under load.",

    "An inductive power transfer system comprising a transmitter coil and a "
    "receiver coil aligned via a magnetic guidance structure to maximize coupling "
    "efficiency across a variable air gap.",

    "A composite film derived from plant-based polymers that decomposes under "
    "industrial composting conditions while providing an oxygen barrier suitable "
    "for food preservation.",
]

# Already preprocessed tech tokens (list of lists of IPC group-level strings)
tech_tokens_list = [
    ["h05k7", "g06f1"],
    ["h02j50", "h01f27"],
    ["c08l101", "b65d65"],
]


# ---------------------------------------------------------------------------
# 3. Build the padded batch (abstract truncated on the right only)
# ---------------------------------------------------------------------------
def build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512):
    cls_id = tokenizer.cls_token_id
    sep_id = tokenizer.sep_token_id

    all_ids = []
    for title, abstract, tech_tokens in zip(titles, abstracts, tech_tokens_list):
        title_ids = tokenizer.encode(title, add_special_tokens=False)
        abstract_ids = tokenizer.encode(abstract, add_special_tokens=False)
        tech_ids = tokenizer.encode(" ".join(tech_tokens), add_special_tokens=False)

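        # [CLS] title [SEP] abstract [SEP] tech [SEP]  -> 4 special tokens fixed
        fixed_len = 4 + len(title_ids) + len(tech_ids)
        if fixed_len > max_length:
            # Title and tech tokens are never truncated, so they must fit on their own.
            raise ValueError("title and tech tokens alone exceed max_length")
        abstract_budget = max_length - fixed_len
        abstract_ids = abstract_ids[:abstract_budget]  # right-side truncation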

        ids = (
            [cls_id]
            + title_ids
            + [sep_id]
            + abstract_ids
            + [sep_id]
            + tech_ids
            + [sep_id]
        )
        all_ids.append(ids)

    return tokenizer.pad({"input_ids": all_ids}, padding=True, return_tensors="pt")


enc = build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512)

# ---------------------------------------------------------------------------
# 4. Forward pass + extract the [CLS] embedding
# ---------------------------------------------------------------------------
with torch.no_grad():
    outputs = model(**enc, output_hidden_states=True)

# last hidden state: (batch, seq_len, hidden_dim); position 0 is [CLS]
cls_embeddings = outputs.hidden_states[-1][:, 0, :]
print(cls_embeddings.shape)  # torch.Size([3, 1024])

Extracting IPC-code embeddings

To obtain the context-dependent embedding of a specific IPC code within a patent, read the hidden-state vector at the position of the corresponding technological token (the tokens after the second [SEP]):

with torch.no_grad():
    outputs = model(**enc, output_hidden_states=True)

last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, 1024)
# Identify the position of each TT token, then index last_hidden accordingly.
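
One possible way to locate the technological-token positions is to find the second and third [SEP] in each row of input ids. The sketch below works on a plain list of ids. The helper name `tech_token_span` and the toy id values are hypothetical. It assumes the input layout described in this card and that every IPC code maps to exactly one token, so the k-th position in the span corresponds to the k-th IPC code.

```python
def tech_token_span(input_ids, sep_id):
    """Return (start, end) such that input_ids[start:end] are the tech tokens.

    Assumes the layout: [CLS] title [SEP] abstract [SEP] tech... [SEP] [PAD]...
    """
    sep_positions = [i for i, tok in enumerate(input_ids) if tok == sep_id]
    if len(sep_positions) < 3:
        raise ValueError("expected three [SEP] tokens (title, abstract, tech)")
    # Tech tokens sit strictly between the second and the third [SEP].
    return sep_positions[1] + 1, sep_positions[2]


# Toy ids for illustration only: 101 = [CLS], 102 = [SEP], 0 = [PAD].
CLS, SEP, PAD = 101, 102, 0
row = [CLS, 7, 8, SEP, 9, 10, 11, SEP, 5001, 5002, SEP, PAD, PAD]
start, end = tech_token_span(row, SEP)
tech_positions = list(range(start, end))
```

With the batch built in the Usage example, `enc["input_ids"][i].tolist()` and `tokenizer.sep_token_id` could be passed in, and `last_hidden[i, start:end]` would then hold one vector per IPC code of patent `i`. If a code is missing from the vocabulary and gets split into several word pieces, the one-to-one alignment no longer holds.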

Input Format

[CLS] <title tokens> [SEP] <abstract tokens> [SEP] <ipc_code_1> <ipc_code_2> ... [SEP]
  • IPC codes must be lower-cased and at group level (e.g., h05k7, g06f1).
  • The abstract is the only segment that should be truncated when the total length exceeds 512 tokens; title and IPC codes are always kept in full.
  • If IPC codes are not available at inference time, the model can still be used with only title and abstract (omit the third segment); performance on IPC-aware tasks will be reduced.
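
Raw IPC codes usually carry a subgroup (for example "H05K 7/20"), so they need to be reduced to the lower-cased group-level form before use. The sketch below is one way to do this, assuming codes are written as "H05K 7/20" and that "group level" means the main group, as the examples h05k7 and g06f1 suggest. The helper names `to_group_level` and `to_tech_tokens` are hypothetical, and dropping duplicate groups is an assumption rather than something the card states.

```python
import re

# Section (A-H), two-digit class, subclass letter, main group, optional "/subgroup".
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})(?:\s*/\s*\d+)?$")


def to_group_level(code):
    """Map a code such as 'H05K 7/20' to its lower-cased group-level form 'h05k7'."""
    match = _IPC_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised IPC code: {code!r}")
    section, ipc_class, subclass, group = match.groups()
    # int() strips any leading zeros from the main group number.
    return f"{section}{ipc_class}{subclass}{int(group)}".lower()


def to_tech_tokens(codes):
    """Normalise codes and drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(to_group_level(c) for c in codes))


tokens = to_tech_tokens(["H05K 7/20", "G06F 1/16", "H05K 7/14"])
```

The resulting list has the same shape as one entry of `tech_tokens_list` in the Usage example. Other raw formats, such as zero-padded fixed-width codes, would need a different pattern.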


Citation

@article{fenoaltea2026anticipating,
  title   = {Anticipating Innovation Using Large Language Models},
  author  = {Fenoaltea, Enrico Maria and Santoro, Filippo and De Marzo, Giordano
             and Aroyehun, Segun Taofeek and Tacchella, Andrea},
  journal = {arXiv preprint arXiv:2605.04875},
  year    = {2026}
}