---
license: cc-by-sa-4.0
language:
- en
tags:
- bert
- patents
- ipc
- embeddings
- semantic-similarity
- masked-language-modeling
- patent-classification
base_model:
- saroyehun/bertforpatent-mirror-meanpooling
pipeline_tag: feature-extraction
arxiv: 2605.04875
---
# TechTokenBERT
**TechTokenBERT** is a state-of-the-art patent embedding model based on BERT. It outperforms larger and task-specific models on IPC code classification, citation prediction, and title–abstract matching.
The core innovation is treating International Patent Classification (IPC) codes as dedicated tokens in the model's vocabulary (*technological tokens*). This allows the attention mechanism to operate directly between patent text and classification codes during fine-tuning, producing embeddings that are simultaneously aware of linguistic content and technological structure.
Introduced in:
> **Anticipating Innovation Using Large Language Models**
> Enrico Maria Fenoaltea, Filippo Santoro, Giordano De Marzo, Segun Taofeek Aroyehun, Andrea Tacchella
> arXiv:2605.04875 · May 2026
> [https://arxiv.org/abs/2605.04875](https://arxiv.org/abs/2605.04875)
---
## Model Description
TechTokenBERT extends the vocabulary of BERT4Patents with one dedicated token per group-level IPC code (~8000 codes in total). Fine-tuning uses masked language modelling on sequences of the form:
```
[CLS] patent title [SEP] patent abstract [SEP] [TT_1] [TT_2] ... [TT_N] [SEP]
```
During fine-tuning the attention mechanism learns to link each technological token both to the words of the patent text and to the other IPC codes in the same patent. The result is a model with two complementary uses:
- **Patent embeddings:** use the `[CLS]` token as a single vector representation of the full patent (title + abstract + IPC codes). The `[CLS]` embedding encodes information from both the text and the IPC codes it is associated with, outperforming standard sentence-embedding models on patent similarity tasks.
- **IPC-code embeddings:** read the hidden-state vector at each technological-token position to obtain a context-dependent embedding of that IPC code in the specific patent.
---
## Training Data
- **Source:** Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
- **Fine-tuning split:** EPO patents published 1980–2023
- **IPC granularity:** Group level, yielding **8000 unique codes**
---
## Evaluation Results
### Patent-related downstream tasks
| Model | IPC Macro-F1 ↑ | Citation MAP ↑ | Title–Abstract AUC-ROC ↑ |
|---|---|---|---|
| BERT4Patents | 0.354 | 59.46 | 0.920 |
| BERT4Patents FT (Mirror-BERT) | 0.262 | 52.78 | 0.832 |
| PatentSBERTa | 0.356 | 75.95 | 0.985 |
| PaECTER | 0.420 | 68.11 | 0.944 |
| LLaMA 3.1 8B FT (LLM2Vec) | 0.343 | 56.78 | 0.973 |
| **TechTokenBERT** | **0.488** | **68.96** | **0.994** |
For details see the full paper.
---
## Usage
> **Minimal example:** build a batch where only the abstract is truncated (on the right), while the title, tech tokens, and all special tokens are preserved. Then run the model and extract the `[CLS]` embedding for each example.
```python
import torch
from transformers import BertTokenizer, BertForMaskedLM
# ---------------------------------------------------------------------------
# 1. Load model + tokenizer
# ---------------------------------------------------------------------------
MODEL_NAME = "AndreaTacchella/TechTokenBert"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()
# ---------------------------------------------------------------------------
# 2. Toy data (3 rows)
# ---------------------------------------------------------------------------
titles = [
    "Method for cooling electronic components",
    "Wireless charging apparatus",
    "Biodegradable packaging material",
]
abstracts = [
    "A heat sink assembly that dissipates thermal energy from a processor using "
    "a network of micro-channels through which a coolant is circulated, thereby "
    "maintaining the junction temperature below a predefined threshold under load.",
    "An inductive power transfer system comprising a transmitter coil and a "
    "receiver coil aligned via a magnetic guidance structure to maximize coupling "
    "efficiency across a variable air gap.",
    "A composite film derived from plant-based polymers that decomposes under "
    "industrial composting conditions while providing an oxygen barrier suitable "
    "for food preservation.",
]
# Already preprocessed tech tokens (list of lists of IPC group-level strings)
tech_tokens_list = [
    ["h05k7", "g06f1"],
    ["h02j50", "h01f27"],
    ["c08l101", "b65d65"],
]
# ---------------------------------------------------------------------------
# 3. Build the padded batch (abstract truncated on the right only)
# ---------------------------------------------------------------------------
def build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512):
    cls_id = tokenizer.cls_token_id
    sep_id = tokenizer.sep_token_id
    all_ids = []
    for title, abstract, tech_tokens in zip(titles, abstracts, tech_tokens_list):
        title_ids = tokenizer.encode(title, add_special_tokens=False)
        abstract_ids = tokenizer.encode(abstract, add_special_tokens=False)
        tech_ids = tokenizer.encode(" ".join(tech_tokens), add_special_tokens=False)
        # [CLS] title [SEP] abstract [SEP] tech [SEP] -> 4 special tokens fixed
        fixed_len = 4 + len(title_ids) + len(tech_ids)
        abstract_budget = max(max_length - fixed_len, 0)
        abstract_ids = abstract_ids[:abstract_budget]  # right-side truncation
        ids = (
            [cls_id]
            + title_ids
            + [sep_id]
            + abstract_ids
            + [sep_id]
            + tech_ids
            + [sep_id]
        )
        all_ids.append(ids)
    return tokenizer.pad({"input_ids": all_ids}, padding=True, return_tensors="pt")
enc = build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512)
# ---------------------------------------------------------------------------
# 4. Forward pass + extract the [CLS] embedding
# ---------------------------------------------------------------------------
with torch.no_grad():
    outputs = model(**enc, output_hidden_states=True)
# last hidden state: (batch, seq_len, hidden_dim); position 0 is [CLS]
cls_embeddings = outputs.hidden_states[-1][:, 0, :]
print(cls_embeddings.shape)  # torch.Size([3, 1024])
```
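Once the `[CLS]` embeddings are available, patents can be compared pairwise. The sketch below assumes cosine similarity is the measure of interest (this card does not state which metric was used in the evaluation). The helper name `cosine_similarity_matrix` is hypothetical, and a random tensor stands in for `cls_embeddings` so the snippet runs without downloading the model.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """Return the (n, n) matrix of pairwise cosine similarities between rows."""
    unit = F.normalize(embeddings, p=2, dim=1)  # scale each row to unit length
    return unit @ unit.T


# Stand-in for `cls_embeddings` (random values, for illustration only).
torch.manual_seed(0)
fake_cls = torch.randn(3, 1024)

sims = cosine_similarity_matrix(fake_cls)
print(sims.shape)  # torch.Size([3, 3])
```

In practice, passing the real `cls_embeddings` tensor in place of `fake_cls` should give a matrix whose entry `(i, j)` scores how similar patents `i` and `j` are.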
### Extracting IPC-code embeddings
To obtain the context-dependent embedding of a specific IPC code within a patent, read the hidden-state vector at the position of the corresponding technological token (the tokens after the second `[SEP]`):
```python
with torch.no_grad():
    outputs = model(**enc, output_hidden_states=True)
last_hidden = outputs.hidden_states[-1] # (batch, seq_len, 1024)
# Identify the position of each TT token, then index last_hidden accordingly.
```
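One possible way to locate the technological tokens is to search each row for its `[SEP]` positions and take the span between the second and third. The sketch below works on a plain list of token ids; the helper name `tech_token_span` and the toy id values are hypothetical, and it assumes that every IPC code maps to exactly one token and that the IPC segment is present.

```python
from typing import List, Tuple


def tech_token_span(input_ids: List[int], sep_id: int) -> Tuple[int, int]:
    """Return (start, end) so that input_ids[start:end] are the technological tokens.

    Assumes the layout [CLS] title [SEP] abstract [SEP] tech [SEP] (+ padding).
    """
    sep_positions = [i for i, tok in enumerate(input_ids) if tok == sep_id]
    if len(sep_positions) < 3:
        raise ValueError("expected three [SEP] tokens; was the IPC segment omitted?")
    return sep_positions[1] + 1, sep_positions[2]


# Toy ids (hypothetical values): 101 = [CLS], 102 = [SEP], 0 = [PAD].
toy_ids = [101, 7, 8, 102, 9, 10, 11, 102, 30001, 30002, 102, 0, 0]
start, end = tech_token_span(toy_ids, sep_id=102)
print(toy_ids[start:end])  # [30001, 30002]
```

With a real batch, calling `tech_token_span(enc["input_ids"][i].tolist(), tokenizer.sep_token_id)` should give the slice to apply to `last_hidden[i]`, yielding one vector per IPC code of patent `i`.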
---
## Input Format
```
[CLS] <title tokens> [SEP] <abstract tokens> [SEP] <ipc_code_1> <ipc_code_2> ... [SEP]
```
- IPC codes must be **lower-cased** and at **group level** (e.g., `h05k7`, `g06f1`).
- The abstract is the only segment that should be truncated when the total length exceeds 512 tokens; title and IPC codes are always kept in full.
- If IPC codes are not available at inference time, the model can still be used with only title and abstract (omit the third segment); performance on IPC-aware tasks will be reduced.
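Raw IPC symbols usually need to be converted to this format before tokenization. The following is a minimal sketch, assuming symbols arrive in the common `H05K 7/20` form and that "group level" means the main group, as the examples above suggest. The helper name `to_tech_token` and the regular expression are illustrative and not part of the released code.

```python
import re

# Section letter, two-digit class, subclass letter, main group, optional "/subgroup".
_IPC_RE = re.compile(
    r"^\s*([A-H])(\d{2})([A-Z])\s*(\d{1,4})(?:\s*/\s*\d+)?\s*$",
    re.IGNORECASE,
)


def to_tech_token(ipc_symbol: str) -> str:
    """Convert a raw IPC symbol such as 'H05K 7/20' to a lower-cased group-level code."""
    match = _IPC_RE.match(ipc_symbol)
    if match is None:
        raise ValueError(f"unrecognised IPC symbol: {ipc_symbol!r}")
    section, ipc_class, subclass, group = match.groups()
    # int() drops any zero padding in the main group (e.g. '0007' -> 7).
    return f"{section}{ipc_class}{subclass}{int(group)}".lower()


print([to_tech_token(s) for s in ["H05K 7/20", "G06F 1/16", "h02j50/10"]])
# ['h05k7', 'g06f1', 'h02j50']
```

Several subgroups of one patent can collapse to the same group-level code; whether such duplicates should be removed before building the input is not specified here.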
---
## Citation
```bibtex
@article{fenoaltea2026anticipating,
title = {Anticipating Innovation Using Large Language Models},
author = {Fenoaltea, Enrico Maria and Santoro, Filippo and De Marzo, Giordano
and Aroyehun, Segun Taofeek and Tacchella, Andrea},
journal = {arXiv preprint arXiv:2605.04875},
year = {2026}
}
```