---
license: cc-by-sa-4.0
language:
- en
tags:
- bert
- patents
- ipc
- embeddings
- semantic-similarity
- masked-language-modeling
- patent-classification
base_model:
- saroyehun/bertforpatent-mirror-meanpooling
pipeline_tag: feature-extraction
arxiv: 2605.04875
---

# TechTokenBERT

**TechTokenBERT** is a BERT-based patent embedding model. The accompanying paper reports that it outperforms larger and task-specific models on IPC code classification, citation prediction, and title–abstract matching.

The core innovation is treating International Patent Classification (IPC) codes as dedicated tokens in the model's vocabulary (*technological tokens*). This allows the attention mechanism to operate directly between patent text and classification codes during fine-tuning, producing embeddings that are simultaneously aware of linguistic content and technological structure.

Introduced in:

> **Anticipating Innovation Using Large Language Models**  
> Enrico Maria Fenoaltea, Filippo Santoro, Giordano De Marzo, Segun Taofeek Aroyehun, Andrea Tacchella  
> arXiv:2605.04875 · May 2026  
> [https://arxiv.org/abs/2605.04875](https://arxiv.org/abs/2605.04875)

---

## Model Description

TechTokenBERT extends the vocabulary of BERT4Patents with one dedicated token per IPC code at group level (~8,000 codes in total). It is fine-tuned with masked language modeling on sequences of the form:

```
[CLS] patent title [SEP] patent abstract [SEP] [TT_1] [TT_2] ... [TT_N] [SEP]
```

During fine-tuning the attention mechanism learns to link each technological token both to the words of the patent text and to the other IPC codes in the same patent. The result is a model with two complementary uses:

- **Patent embeddings:** use the hidden state of the `[CLS]` token as a single vector representing the full patent (title + abstract + IPC codes). Because it encodes information from both the text and the associated IPC codes, this embedding outperforms standard sentence-embedding models on patent similarity tasks.
- **IPC-code embeddings:** read the hidden-state vector at each technological-token position to obtain a context-dependent embedding of that IPC code in the specific patent.

---

## Training Data

- **Source:** Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
- **Fine-tuning split:** EPO patents published 1980–2023
- **IPC granularity:** Group level, yielding **8000 unique codes**

---

## Evaluation Results

### Patent-related downstream tasks

| Model | IPC Macro-F1 ↑ | Citation MAP ↑ | Title–Abstract AUC-ROC ↑ |
|---|---|---|---|
| BERT4Patents | 0.354 | 59.46 | 0.920 |
| BERT4Patents FT (Mirror-BERT) | 0.262 | 52.78 | 0.832 |
| PatentSBERTa | 0.356 | 75.95 | 0.985 |
| Paecter | 0.420 | 68.11 | 0.944 |
| LLaMA 3.1 8B FT (LLM2Vec) | 0.343 | 56.78 | 0.973 |
| **TechTokenBERT** | **0.488** | **68.96** | **0.994** |

For details see the full paper.

---

## Usage

> **Minimal example:** build a batch where only the abstract is truncated (on the right), while the title, tech tokens, and all special tokens are preserved. Then run the model and extract the `[CLS]` embedding for each example.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# ---------------------------------------------------------------------------
# 1. Load model + tokenizer
# ---------------------------------------------------------------------------
MODEL_NAME = "AndreaTacchella/TechTokenBert"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

# ---------------------------------------------------------------------------
# 2. Toy data (3 rows)
# ---------------------------------------------------------------------------
titles = [
    "Method for cooling electronic components",
    "Wireless charging apparatus",
    "Biodegradable packaging material",
]

abstracts = [
    "A heat sink assembly that dissipates thermal energy from a processor using "
    "a network of micro-channels through which a coolant is circulated, thereby "
    "maintaining the junction temperature below a predefined threshold under load.",

    "An inductive power transfer system comprising a transmitter coil and a "
    "receiver coil aligned via a magnetic guidance structure to maximize coupling "
    "efficiency across a variable air gap.",

    "A composite film derived from plant-based polymers that decomposes under "
    "industrial composting conditions while providing an oxygen barrier suitable "
    "for food preservation.",
]

# Already preprocessed tech tokens (list of lists of IPC group-level strings)
tech_tokens_list = [
    ["h05k7", "g06f1"],
    ["h02j50", "h01f27"],
    ["c08l101", "b65d65"],
]


# ---------------------------------------------------------------------------
# 3. Build the padded batch (abstract truncated on the right only)
# ---------------------------------------------------------------------------
def build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512):
    cls_id = tokenizer.cls_token_id
    sep_id = tokenizer.sep_token_id

    all_ids = []
    for title, abstract, tech_tokens in zip(titles, abstracts, tech_tokens_list):
        title_ids = tokenizer.encode(title, add_special_tokens=False)
        abstract_ids = tokenizer.encode(abstract, add_special_tokens=False)
        tech_ids = tokenizer.encode(" ".join(tech_tokens), add_special_tokens=False)

        # [CLS] title [SEP] abstract [SEP] tech [SEP]  -> 4 special tokens fixed
        fixed_len = 4 + len(title_ids) + len(tech_ids)
        if fixed_len > max_length:
            # Title and tech tokens are never truncated, so they must fit alone.
            raise ValueError("title + tech tokens exceed max_length")
        abstract_ids = abstract_ids[: max_length - fixed_len]  # right-side truncation

        ids = (
            [cls_id]
            + title_ids
            + [sep_id]
            + abstract_ids
            + [sep_id]
            + tech_ids
            + [sep_id]
        )
        all_ids.append(ids)

    return tokenizer.pad({"input_ids": all_ids}, padding=True, return_tensors="pt")


enc = build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512)

# ---------------------------------------------------------------------------
# 4. Forward pass + extract the [CLS] embedding
# ---------------------------------------------------------------------------
with torch.no_grad():
    outputs = model(**enc, output_hidden_states=True)

# last hidden state: (batch, seq_len, hidden_dim); position 0 is [CLS]
cls_embeddings = outputs.hidden_states[-1][:, 0, :]
print(cls_embeddings.shape)  # torch.Size([3, 1024])
```
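
Once the `[CLS]` embeddings are extracted, patents can be compared pairwise. The sketch below uses cosine similarity, which is a common choice for comparing embeddings but is an assumption here: this card does not state which similarity measure the paper uses. The helper names `cosine` and `similarity_matrix` are illustrative, and the toy vectors stand in for `cls_embeddings.tolist()` so that the snippet runs without downloading the model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities for a list of embedding vectors."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


# Toy 3-dimensional stand-ins for the 1024-dimensional [CLS] vectors;
# with the real model, pass cls_embeddings.tolist() instead.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]
sim = similarity_matrix(toy_embeddings)
```

For large batches, calling `torch.nn.functional.cosine_similarity` on the tensors directly avoids the round trip through Python lists.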

### Extracting IPC-code embeddings

To obtain the context-dependent embedding of a specific IPC code within a patent, read the hidden-state vector at the position of the corresponding technological token (the tokens between the second and third `[SEP]`):

```python
with torch.no_grad():
    outputs = model(**enc, output_hidden_states=True)

last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, 1024)
# Identify the position of each TT token, then index last_hidden accordingly.
```
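
One way to locate the technological tokens, assuming the input layout described in this card, is to take the positions strictly between the second and third `[SEP]`. The helper below is a sketch: `tech_token_positions` is a hypothetical name, not part of the released code. It works on plain lists of token ids, and the ids in the toy batch are placeholders (101, 102 and 0 stand in for `[CLS]`, `[SEP]` and `[PAD]`; they are not claimed to match this tokenizer's vocabulary).

```python
def tech_token_positions(batch_ids, sep_id):
    """For each sequence, return the positions between the 2nd and 3rd [SEP]."""
    all_positions = []
    for ids in batch_ids:
        sep_positions = [i for i, tok in enumerate(ids) if tok == sep_id]
        if len(sep_positions) < 3:
            # No third segment (e.g. the patent was encoded without IPC codes).
            all_positions.append([])
        else:
            all_positions.append(list(range(sep_positions[1] + 1, sep_positions[2])))
    return all_positions


# Placeholder ids: 101 = [CLS], 102 = [SEP], 0 = [PAD], 9xxx = tech tokens.
toy_batch = [
    [101, 11, 12, 102, 21, 22, 23, 102, 9001, 9002, 102],
    [101, 11, 102, 21, 102, 9003, 102, 0, 0, 0, 0],
    [101, 11, 102, 21, 22, 102, 0, 0, 0, 0, 0],
]
positions = tech_token_positions(toy_batch, sep_id=102)
```

With the real batch, `tech_token_positions(enc["input_ids"].tolist(), tokenizer.sep_token_id)` gives the positions, and `last_hidden[b, positions[b]]` then selects the vectors for patent `b`. This yields one vector per IPC code only if every code maps to a single vocabulary token; a code missing from the vocabulary would be split into several word pieces.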

---

## Input Format

```
[CLS] <title tokens> [SEP] <abstract tokens> [SEP] <ipc_code_1> <ipc_code_2> ... [SEP]
```

- IPC codes must be **lower-cased** and at **group level** (e.g., `h05k7`, `g06f1`).
- The abstract is the only segment that should be truncated when the total length exceeds 512 tokens; title and IPC codes are always kept in full.
- If IPC codes are not available at inference time, the model can still be used with only title and abstract (omit the third segment); performance on IPC-aware tasks will be reduced.
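
Raw IPC symbols usually carry a subgroup and upper-case letters, so they need normalizing before they match the format above. The following is a minimal sketch, assuming codes arrive in the common `H05K 7/20` style (section, class, subclass, main group, optional `/subgroup`). The function name `ipc_to_group_token` is hypothetical, and the authors' actual preprocessing is not described in this card.

```python
import re

# Section letter, two-digit class, subclass letter, main group, optional "/subgroup".
_IPC_PATTERN = re.compile(
    r"^\s*([A-Ha-h])\s*(\d{2})\s*([A-Za-z])\s*(\d{1,4})\s*(?:/\s*\d+)?\s*$"
)


def ipc_to_group_token(code):
    """Convert e.g. 'H05K 7/20' to the lower-cased group-level string 'h05k7'."""
    match = _IPC_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognised IPC code: {code!r}")
    section, ipc_class, subclass, main_group = match.groups()
    # int() drops zero padding in the main group, e.g. '0007' -> 7.
    return f"{section}{ipc_class}{subclass}{int(main_group)}".lower()


raw_codes = ["H05K 7/20", "G06F1/16", "C08L 101/00", "h02j 50"]
tokens = [ipc_to_group_token(c) for c in raw_codes]
```

Several subgroups of the same main group collapse to one token (for example `H05K 7/20` and `H05K 7/14` both become `h05k7`), so it may be worth de-duplicating the list per patent before building the input.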

---

## Citation

```bibtex
@article{fenoaltea2026anticipating,
  title   = {Anticipating Innovation Using Large Language Models},
  author  = {Fenoaltea, Enrico Maria and Santoro, Filippo and De Marzo, Giordano
             and Aroyehun, Segun Taofeek and Tacchella, Andrea},
  journal = {arXiv preprint arXiv:2605.04875},
  year    = {2026}
}
```