 ---
 license: cc-by-sa-4.0
+language:
+  - en
+tags:
+  - bert
+  - patents
+  - ipc
+  - innovation
+  - embeddings
+  - technology-forecasting
+  - masked-language-modeling
+base_model: anferico/bert-for-patents
+pipeline_tag: feature-extraction
+arxiv: 2605.04875
 ---
+# TechTokenBERT
+**TechTokenBERT** is a BERT-based language model fine-tuned on patent text that treats International Patent Classification (IPC) codes as first-class tokens in the model's vocabulary. It is the model introduced in:
+> **Anticipating Innovation Using Large Language Models**
+> Enrico Maria Fenoaltea, Filippo Santoro, Giordano De Marzo, Segun Taofeek Aroyehun, Andrea Tacchella
+> arXiv:2605.04875 · May 2026
+> [https://arxiv.org/abs/2605.04875](https://arxiv.org/abs/2605.04875)
+---
+## Model Description
+Predicting technological innovation—understood as the emergence of novel combinations of existing technologies—is a fundamental challenge for science and policy. TechTokenBERT addresses this by learning rich, context-dependent representations of IPC codes directly within the language model's embedding space.
+The key idea is to extend the vocabulary of a pre-trained BERT model (BERT4Patents, `anferico/bert-for-patents`) with one dedicated token per IPC code (*technological tokens*, TTs). The model is then fine-tuned with masked language modeling on patent sequences of the form:
+```
+[CLS] patent title [SEP] patent abstract [SEP] [TT_1] [TT_2] ... [TT_N] [SEP]
+```
+The attention mechanism learns to link each technological token to the natural-language words of the abstract *and* to the other technological tokens in the same patent. This gives each IPC code a distinct, context-dependent embedding for every patent in which it appears, naturally capturing the polysemy of technologies across heterogeneous domains.
+**Context Similarity (CS)** — defined as the average cosine similarity of the top-1% closest embedding pairs between two IPC codes across a corpus — serves as the innovation-forecasting signal. An increase in CS between two codes reliably precedes their first observed co-occurrence in a patent, often by more than a decade.
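The Context Similarity definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The function name `context_similarity`, the `top_fraction` parameter, and the reading of "top-1% closest pairs" as the highest-cosine 1% of all cross pairs (one embedding from each code) are assumptions.

```python
import numpy as np


def context_similarity(emb_a, emb_b, top_fraction=0.01):
    """Mean cosine similarity over the closest cross pairs of two embedding sets.

    emb_a, emb_b: arrays of shape (n_a, dim) and (n_b, dim), one row per
    occurrence of the IPC code in a patent. Rows are assumed to be non-zero.
    """
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    # Normalize rows so that a dot product equals the cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = (a @ b.T).ravel()  # all n_a * n_b cross-pair similarities
    # Keep at least one pair, otherwise the top `top_fraction` of pairs.
    k = max(1, int(np.ceil(top_fraction * sims.size)))
    top = np.partition(sims, sims.size - k)[sims.size - k:]
    return float(top.mean())


# Toy usage with random vectors standing in for real token embeddings.
rng = np.random.default_rng(0)
cs = context_similarity(rng.normal(size=(200, 16)), rng.normal(size=(300, 16)))
print(f"CS = {cs:.3f}")
```

In practice `emb_a` and `emb_b` would hold the last-layer vectors of two IPC codes, collected across the patents of a given time window, so that CS can be tracked over time.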
+---
+## Training Data
+- **Source:** Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
+- **Fine-tuning split:** Patents published 1980–2005
+- **IPC granularity:** Group level (subclass plus main-group number, e.g., `h05k7`), yielding **7,200 unique codes**
+- Patents missing either abstract or claims are excluded.
+---
+## Evaluation Results
+### Innovation forecasting (AUC-ROC, class imbalance 0.005%)
+| Model | AUC-ROC |
+|---|---|
+| BERT4Patents | 0.725 |
+| BERT4Patents FT (Mirror-BERT) | 0.765 |
+| LLaMA 3.1 8B (LLM2Vec FT) | 0.856 |
+| **TechTokenBERT (IPC embeddings)** | **0.936** |
+| TechTokenBERT (CLS embeddings) | 0.908 |
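For readers unfamiliar with the metric, AUC-ROC is the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, so it depends only on the ranking of the scores. The sketch below shows a rank-based computation in plain Python. The function name `auc_roc` and the toy scores are illustrative assumptions and do not reproduce the evaluation pipeline behind the table.

```python
def auc_roc(scores, labels):
    """Rank-based AUC-ROC: P(score_pos > score_neg), with ties counted as 1/2."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the group of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    # Mann-Whitney U statistic of the positives, normalized to [0, 1].
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: scores could be Context Similarity values for code pairs,
# labels mark pairs that later co-occur in a patent (1) or not (0).
print(auc_roc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

Because only the ordering matters, the value stays well defined when positives are very rare.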
+### Patent-related downstream tasks (best per model)
+| Model | IPC Macro-F1 ↑ | Citation MAP (%) ↑ | Title–Abstract AUC-ROC ↑ |
+|---|---|---|---|
+| BERT4Patents | 0.354 | 59.46 | 0.920 |
+| PatentSBERTa | 0.356 | **75.95** | 0.985 |
+| PaECTER | 0.420 | 68.11 | 0.944 |
+| LLaMA 3.1 8B FT | 0.343 | 56.78 | 0.973 |
+| **TechTokenBERT** | **0.488** | 68.96 | **0.994** |
+On these figures, TechTokenBERT scores highest on IPC classification and title–abstract matching and second on citation MAP (68.96 vs. 75.95 for PatentSBERTa), while being roughly 25× smaller than LLaMA 3.1 8B.
+---
+## Usage
+The model is a `BertForMaskedLM` with an expanded vocabulary. At inference time, the IPC-code embeddings are read from the last hidden layer at the positions of the technological tokens; the `[CLS]` token embedding can also be used as a general-purpose patent representation.
+> **Minimal example:** build a batch where only the abstract is truncated (on the right), while the title, tech tokens, and all special tokens are preserved. Then run the model and extract the `[CLS]` embedding for each example.
+```python
+import torch
+from transformers import BertTokenizer, BertForMaskedLM
+# ---------------------------------------------------------------------------
+# 1. Load model + tokenizer
+# ---------------------------------------------------------------------------
+MODEL_NAME = "AndreaTacchella/TechTokenBert"
+tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
+model = BertForMaskedLM.from_pretrained(MODEL_NAME)
+model.eval()
+# ---------------------------------------------------------------------------
+# 2. Toy data (3 rows)
+# ---------------------------------------------------------------------------
+titles = [
+    "Method for cooling electronic components",
+    "Wireless charging apparatus",
+    "Biodegradable packaging material",
+]
+abstracts = [
+    "A heat sink assembly that dissipates thermal energy from a processor using "
+    "a network of micro-channels through which a coolant is circulated, thereby "
+    "maintaining the junction temperature below a predefined threshold under load.",
+    "An inductive power transfer system comprising a transmitter coil and a "
+    "receiver coil aligned via a magnetic guidance structure to maximize coupling "
+    "efficiency across a variable air gap.",
+    "A composite film derived from plant-based polymers that decomposes under "
+    "industrial composting conditions while providing an oxygen barrier suitable "
+    "for food preservation.",
+]
+# Already preprocessed tech tokens (list of lists of IPC group-level strings)
+tech_tokens_list = [
+    ["h05k7", "g06f1"],
+    ["h02j50", "h01f27"],
+    ["c08l101", "b65d65"],
+]
+# ---------------------------------------------------------------------------
+# 3. Build the padded batch (abstract truncated on the right only)
+# ---------------------------------------------------------------------------
+def build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512):
+    cls_id = tokenizer.cls_token_id
+    sep_id = tokenizer.sep_token_id
+    all_ids = []
+    for title, abstract, tech_tokens in zip(titles, abstracts, tech_tokens_list):
+        title_ids = tokenizer.encode(title, add_special_tokens=False)
+        abstract_ids = tokenizer.encode(abstract, add_special_tokens=False)
+        tech_ids = tokenizer.encode(" ".join(tech_tokens), add_special_tokens=False)
+        # [CLS] title [SEP] abstract [SEP] tech [SEP]  -> 4 special tokens fixed
+        fixed_len = 4 + len(title_ids) + len(tech_ids)
+        abstract_budget = max(max_length - fixed_len, 0)
+        abstract_ids = abstract_ids[:abstract_budget]  # right-side truncation
+        ids = (
+            [cls_id]
+            + title_ids
+            + [sep_id]
+            + abstract_ids
+            + [sep_id]
+            + tech_ids
+            + [sep_id]
+        )
+        all_ids.append(ids)
+    return tokenizer.pad({"input_ids": all_ids}, padding=True, return_tensors="pt")
+enc = build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512)
+# ---------------------------------------------------------------------------
+# 4. Forward pass + extract the [CLS] embedding
+# ---------------------------------------------------------------------------
+with torch.no_grad():
+    outputs = model(**enc, output_hidden_states=True)
+# last hidden state: (batch, seq_len, hidden_size); position 0 is [CLS]
+cls_embeddings = outputs.hidden_states[-1][:, 0, :]
+print(cls_embeddings.shape)  # torch.Size([3, model.config.hidden_size])
+```
+### Extracting IPC-code embeddings (TechToken method)
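+To obtain the context-dependent embedding of an IPC code from a specific patent, read the hidden-state vector at the position of the corresponding technological token. The technological tokens sit between the second and third `[SEP]`, and each IPC code is assumed to map to exactly one token ID:
+```python
+# Reuses `model`, `tokenizer`, and the batch `enc` from the example above.
+with torch.no_grad():
+    outputs = model(**enc, output_hidden_states=True)
+last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, hidden_size)
+# Tech tokens sit between the second and third [SEP] of each sequence.
+tt_embeddings = []
+for ids, hidden in zip(enc["input_ids"], last_hidden):
+    sep_pos = (ids == tokenizer.sep_token_id).nonzero(as_tuple=True)[0].tolist()
+    tt_embeddings.append(hidden[sep_pos[1] + 1 : sep_pos[2]])  # (n_tech_tokens, hidden_size)
+```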
+---
+## Input Format
+```
+[CLS] <title tokens> [SEP] <abstract tokens> [SEP] <ipc_code_1> <ipc_code_2> ... [SEP]
+```
+- IPC codes must be **lower-cased** and at **group level** (e.g., `h05k7`, `g06f1`).
+- The abstract is the only segment that should be truncated if the total length exceeds 512 tokens; title and IPC codes are always kept in full.
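The lower-casing and group-level requirement can be met with a small normalization step before building the input. The sketch below is a hypothetical helper, not part of the released model: the names `to_group_token` and `to_group_tokens` are illustrative, and the input style (`"H05K 7/20"`, i.e. subclass, main group, optional `/subgroup`) is an assumption about how raw IPC symbols are stored. Only the output style (`h05k7`) is taken from the examples above.

```python
import re

# Assumed input style: "H05K 7/20" (subclass, main group, optional "/subgroup").
_IPC_PATTERN = re.compile(
    r"^\s*([A-Ha-h]\d{2}[A-Za-z])\s*0*(\d+)(?:\s*/\s*\d+)?\s*$"
)


def to_group_token(ipc_symbol: str) -> str:
    """Map one IPC symbol to a lower-cased group-level token, e.g. 'h05k7'."""
    match = _IPC_PATTERN.match(ipc_symbol)
    if match is None:
        raise ValueError(f"Unrecognized IPC symbol: {ipc_symbol!r}")
    subclass, main_group = match.groups()
    return f"{subclass.lower()}{main_group}"


def to_group_tokens(ipc_symbols):
    """Normalize a patent's IPC symbols and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(to_group_token(s) for s in ipc_symbols))


print(to_group_tokens(["H05K 7/20", "H05K 7/14", "G06F 1/16"]))  # ['h05k7', 'g06f1']
```

Whether a normalized code actually exists in the model's vocabulary should still be checked against the tokenizer, since codes outside the 7,200 seen during fine-tuning have no dedicated token.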
+---
+## Limitations
+- Operates at IPC *group* level (e.g., `h05k7`); innovation that takes place within a single group is invisible to the framework.
+- Analysis is restricted to pairwise code combinations; higher-order assemblies of three or more technologies are not directly modeled.
+- Trained and evaluated on European Patent Office (EPO) data in English; performance on other patent offices or languages has not been assessed.
+---
+## Citation
+```bibtex
+@article{fenoaltea2026anticipating,
+  title   = {Anticipating Innovation Using Large Language Models},
+  author  = {Fenoaltea, Enrico Maria and Santoro, Filippo and De Marzo, Giordano
+             and Aroyehun, Segun Taofeek and Tacchella, Andrea},
+  journal = {arXiv preprint arXiv:2605.04875},
+  year    = {2026}
+}
+```