---
license: cc-by-sa-4.0
language:
- en
tags:
- bert
- patents
- ipc
- embeddings
- semantic-similarity
- masked-language-modeling
- patent-classification
base_model:
- saroyehun/bertforpatent-mirror-meanpooling
pipeline_tag: feature-extraction
arxiv: 2605.04875
---

# TechTokenBERT

**TechTokenBERT** is a state-of-the-art patent embedding model based on BERT. It outperforms larger and task-specific models on IPC code classification, citation prediction, and title–abstract matching.

The core innovation is treating International Patent Classification (IPC) codes as dedicated tokens in the model's vocabulary (*technological tokens*). This allows the attention mechanism to operate directly between patent text and classification codes during fine-tuning, producing embeddings that are simultaneously aware of linguistic content and technological structure.

Introduced in:

> **Anticipating Innovation Using Large Language Models**  
> Enrico Maria Fenoaltea, Filippo Santoro, Giordano De Marzo, Segun Taofeek Aroyehun, Andrea Tacchella  
> arXiv:2605.04875 · May 2026  
> [https://arxiv.org/abs/2605.04875](https://arxiv.org/abs/2605.04875)

---

## Model Description

TechTokenBERT extends the vocabulary of BERT4Patents with one dedicated token per IPC code at group level (~8,000 codes in total). Fine-tuning uses masked language modeling on sequences of the form:

```
[CLS] patent title [SEP] patent abstract [SEP] [TT_1] [TT_2] ... [TT_N] [SEP]
```

During fine-tuning the attention mechanism learns to link each technological token both to the words of the patent text and to the other IPC codes in the same patent. The result is a model with two complementary uses:

- **Patent embeddings:** use the final-layer hidden state of the `[CLS]` token as a single vector representation of the full patent (title + abstract + IPC codes). This embedding encodes information from both the text and its associated IPC codes, and it outperforms standard sentence-embedding models on patent similarity tasks.
- **IPC-code embeddings:** read the hidden-state vector at each technological-token position to obtain a context-dependent embedding of that IPC code in the specific patent.

---

## Training Data

- **Source:** Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
- **Fine-tuning split:** EPO patents published 1980–2023
- **IPC granularity:** Group level, yielding **8000 unique codes**

---

## Evaluation Results

### Patent-related downstream tasks

| Model | IPC Macro-F1 ↑ | Citation MAP ↑ | Title–Abstract AUC-ROC ↑ |
|---|---|---|---|
| BERT4Patents | 0.354 | 59.46 | 0.920 |
| BERT4Patents FT (Mirror-BERT) | 0.262 | 52.78 | 0.832 |
| PatentSBERTa | 0.356 | 75.95 | 0.985 |
| Paecter | 0.420 | 68.11 | 0.944 |
| LLaMA 3.1 8B FT (LLM2Vec) | 0.343 | 56.78 | 0.973 |
| **TechTokenBERT** | **0.488** | **68.96** | **0.994** |

For details see the full paper.

---

## Usage

> **Minimal example:** build a batch where only the abstract is truncated (on the right), while the title, tech tokens, and all special tokens are preserved. Then run the model and extract the `[CLS]` embedding for each example.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# ---------------------------------------------------------------------------
# 1. Load model + tokenizer
# ---------------------------------------------------------------------------
MODEL_NAME = "AndreaTacchella/TechTokenBert"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

# ---------------------------------------------------------------------------
# 2. Toy data (3 rows)
# ---------------------------------------------------------------------------
titles = [
    "Method for cooling electronic components",
    "Wireless charging apparatus",
    "Biodegradable packaging material",
]

abstracts = [
    "A heat sink assembly that dissipates thermal energy from a processor using "
    "a network of micro-channels through which a coolant is circulated, thereby "
    "maintaining the junction temperature below a predefined threshold under load.",

    "An inductive power transfer system comprising a transmitter coil and a "
    "receiver coil aligned via a magnetic guidance structure to maximize coupling "
    "efficiency across a variable air gap.",

    "A composite film derived from plant-based polymers that decomposes under "
    "industrial composting conditions while providing an oxygen barrier suitable "
    "for food preservation.",
]

# Already preprocessed tech tokens (list of lists of IPC group-level strings)
tech_tokens_list = [
    ["h05k7", "g06f1"],
    ["h02j50", "h01f27"],
    ["c08l101", "b65d65"],
]


# ---------------------------------------------------------------------------
# 3. Build the padded batch (abstract truncated on the right only)
# ---------------------------------------------------------------------------
def build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512):
    cls_id = tokenizer.cls_token_id
    sep_id = tokenizer.sep_token_id

    all_ids = []
    for title, abstract, tech_tokens in zip(titles, abstracts, tech_tokens_list):
        title_ids = tokenizer.encode(title, add_special_tokens=False)
        abstract_ids = tokenizer.encode(abstract, add_special_tokens=False)
        tech_ids = tokenizer.encode(" ".join(tech_tokens), add_special_tokens=False)

        # [CLS] title [SEP] abstract [SEP] tech [SEP]  -> 4 special tokens fixed
        fixed_len = 4 + len(title_ids) + len(tech_ids)
        abstract_budget = max(max_length - fixed_len, 0)
        abstract_ids = abstract_ids[:abstract_budget]  # right-side truncation

        ids = (
            [cls_id]
            + title_ids
            + [sep_id]
            + abstract_ids
            + [sep_id]
            + tech_ids
            + [sep_id]
        )
        all_ids.append(ids)

    return tokenizer.pad({"input_ids": all_ids}, padding=True, return_tensors="pt")


enc = build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512)

# ---------------------------------------------------------------------------
# 4. Forward pass + extract the [CLS] embedding
# ---------------------------------------------------------------------------
with torch.no_grad():
    outputs = model(**enc, output_hidden_states=True)

# last hidden state: (batch, seq_len, hidden_dim); position 0 is [CLS]
cls_embeddings = outputs.hidden_states[-1][:, 0, :]
print(cls_embeddings.shape)  # torch.Size([3, 1024])
```
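
Once the `[CLS]` vectors are available, patents can be compared pairwise. The sketch below assumes cosine similarity as the comparison measure (the model card does not prescribe one) and uses toy 3-dimensional vectors in place of the 1024-dimensional embeddings. `cosine_similarity_matrix` is a hypothetical helper written in plain Python so that it runs without downloading the model.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine-similarity matrix for equal-length vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    sims = []
    for i, u in enumerate(vectors):
        row = []
        for j, v in enumerate(vectors):
            dot = sum(a * b for a, b in zip(u, v))
            denom = norms[i] * norms[j]
            # Guard against all-zero vectors instead of dividing by zero.
            row.append(dot / denom if denom else 0.0)
        sims.append(row)
    return sims


# Toy stand-ins for [CLS] embeddings (assumption: real vectors have 1024 dims).
toy = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
sims = cosine_similarity_matrix(toy)
```

With the batch from the example above, the same helper could be called as `cosine_similarity_matrix(cls_embeddings.tolist())`. For large collections a vectorized implementation would be preferable.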

### Extracting IPC-code embeddings

To obtain the context-dependent embedding of a specific IPC code within a patent, read the hidden-state vector at the position of the corresponding technological token. With the input format used here, the technological tokens are those between the second and third `[SEP]`:

```python
with torch.no_grad():
    outputs = model(**enc, output_hidden_states=True)

last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, 1024)
# Identify the position of each TT token, then index last_hidden accordingly.
```
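
One way to locate the technological tokens is to look for the `[SEP]` positions in each sequence. The sketch below is a hypothetical helper, not part of the released code. It assumes the sequence layout described in this card and that the title and abstract contain no `[SEP]` ids of their own. It works on a plain list of token ids, so it can be tried without loading the model.

```python
def tech_token_positions(input_ids, sep_id):
    """Return the indices strictly between the second and third [SEP]."""
    sep_positions = [i for i, tok in enumerate(input_ids) if tok == sep_id]
    if len(sep_positions) < 3:
        # No tech-token segment, e.g. a title + abstract only input.
        return []
    return list(range(sep_positions[1] + 1, sep_positions[2]))


# Toy ids (assumption): 101 = [CLS], 102 = [SEP], 0 = [PAD]; others arbitrary.
toy_ids = [101, 7, 8, 102, 9, 10, 11, 102, 30001, 30002, 102, 0, 0]
positions = tech_token_positions(toy_ids, sep_id=102)
```

With the batch from the usage example, `pos = tech_token_positions(enc["input_ids"][0].tolist(), tokenizer.sep_token_id)` followed by `last_hidden[0, pos, :]` would give one vector per IPC code of the first patent, in the order the codes were supplied.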

---

## Input Format

```
[CLS] <title tokens> [SEP] <abstract tokens> [SEP] <ipc_code_1> <ipc_code_2> ... [SEP]
```

- IPC codes must be **lower-cased** and at **group level** (e.g., `h05k7`, `g06f1`).
- The abstract is the only segment that should be truncated when the total length exceeds 512 tokens; title and IPC codes are always kept in full.
- If IPC codes are not available at inference time, the model can still be used with only title and abstract (omit the third segment); performance on IPC-aware tasks will be reduced.
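
IPC codes often arrive as full symbols such as `H05K 7/20` rather than in the lower-cased group-level form shown above. The sketch below is a hypothetical normalizer. It assumes the input layout "section letter, two-digit class, subclass letter, main group, optional `/subgroup`" and has not been checked against the preprocessing used for training.

```python
import re

# Assumed layout: section, two-digit class, subclass letter, main group,
# then an optional "/subgroup" part, e.g. "H05K 7/20".
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})(?:/\d+)?$")


def to_tech_token(ipc_code):
    """Map a full IPC symbol to a lower-cased group-level string."""
    match = _IPC_PATTERN.match(ipc_code.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized IPC symbol: {ipc_code!r}")
    section, ipc_class, subclass, main_group = match.groups()
    # int() drops zero padding such as "0007".
    return f"{section}{ipc_class}{subclass}{int(main_group)}".lower()


codes = ["H05K 7/20", "g06f1/16", "H02J 50/10", "C08L 101/00"]
tokens = [to_tech_token(code) for code in codes]
```

Several subgroups of one patent can map to the same group-level string, so deduplicating the result before building the input sequence may be worthwhile.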

---

## Citation

```bibtex
@article{fenoaltea2026anticipating,
  title   = {Anticipating Innovation Using Large Language Models},
  author  = {Fenoaltea, Enrico Maria and Santoro, Filippo and De Marzo, Giordano
             and Aroyehun, Segun Taofeek and Tacchella, Andrea},
  journal = {arXiv preprint arXiv:2605.04875},
  year    = {2026}
}
```