AndreaTacchella committed
Commit 77432d7 · verified · 1 Parent(s): 5121458

Add model card

Files changed (1)
  1. README.md +215 -0
README.md CHANGED
@@ -1,3 +1,218 @@
---
license: cc-by-sa-4.0
language:
- en
tags:
- bert
- patents
- ipc
- innovation
- embeddings
- technology-forecasting
- masked-language-modeling
base_model: anferico/bert-for-patents
pipeline_tag: feature-extraction
arxiv: 2605.04875
---

# TechTokenBERT

**TechTokenBERT** is a BERT-based language model fine-tuned on patent text that treats International Patent Classification (IPC) codes as first-class tokens in the model's vocabulary. It is the model introduced in:

> **Anticipating Innovation Using Large Language Models**
> Enrico Maria Fenoaltea, Filippo Santoro, Giordano De Marzo, Segun Taofeek Aroyehun, Andrea Tacchella
> arXiv:2605.04875 · May 2026
> [https://arxiv.org/abs/2605.04875](https://arxiv.org/abs/2605.04875)

---

## Model Description

Predicting technological innovation, understood as the emergence of novel combinations of existing technologies, is a fundamental challenge for science and policy. TechTokenBERT addresses it by learning context-dependent representations of IPC codes directly within the language model's embedding space.

The key idea is to extend the vocabulary of a pre-trained BERT model (BERT4Patents) with one dedicated token per IPC code (*technological tokens*, TTs). Fine-tuning uses masked language modelling on patent sequences of the form:

```
[CLS] patent title [SEP] patent abstract [SEP] [TT_1] [TT_2] ... [TT_N] [SEP]
```

The attention mechanism learns to link each technological token to the natural-language words of the abstract *and* to the other technological tokens in the same patent. Each IPC code therefore gets a distinct, context-dependent embedding for every patent in which it appears, which captures the polysemy of technologies across heterogeneous domains.

**Context Similarity (CS)** is the innovation-forecasting signal. It is defined as the average cosine similarity of the top-1% closest embedding pairs between two IPC codes across a corpus. An increase in CS between two codes reliably precedes their first observed co-occurrence in a patent, often by more than a decade.

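The Context Similarity definition above can be illustrated with a small, dependency-free sketch. It assumes that each IPC code comes with a list of contextual embeddings (one per occurrence in the corpus) and that "closest pairs" means the cross pairs with the highest cosine similarity. The function names and the `top_fraction` parameter are illustrative choices, not part of the released code, and any pairing or filtering rules used in the paper beyond this definition are not reproduced here. On a real corpus a vectorised implementation would be preferable.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def context_similarity(emb_a, emb_b, top_fraction=0.01):
    """Mean cosine similarity of the closest `top_fraction` of cross pairs.

    emb_a / emb_b: lists of contextual embeddings, one per occurrence of
    code A / code B in the corpus (hypothetical input layout).
    """
    sims = [cosine(u, v) for u in emb_a for v in emb_b]
    k = max(1, math.ceil(top_fraction * len(sims)))  # keep at least one pair
    return sum(heapq.nlargest(k, sims)) / k


# Toy example: two occurrences of code A, three of code B.
emb_a = [[1.0, 0.0], [0.0, 1.0]]
emb_b = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
cs_top = context_similarity(emb_a, emb_b, top_fraction=0.01)  # single closest pair
cs_all = context_similarity(emb_a, emb_b, top_fraction=1.0)   # all six pairs
```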
---

## Training Data

- **Source:** Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
- **Fine-tuning split:** Patents published 1980–2005
- **IPC granularity:** Group level (e.g., `h05k7`), yielding **7,200 unique codes**
- **Filtering:** Patents missing either abstract or claims are excluded.

---

## Evaluation Results

### Innovation forecasting (AUC-ROC, class imbalance 0.005%)

| Model | AUC-ROC |
|---|---|
| BERT4Patents | 0.725 |
| BERT4Patents FT (Mirror-BERT) | 0.765 |
| LLaMA 3.1 8B (LLM2Vec FT) | 0.856 |
| **TechTokenBERT (IPC embeddings)** | **0.936** |
| TechTokenBERT (CLS embeddings) | 0.908 |

### Patent-related downstream tasks (best per model)

| Model | IPC Macro-F1 ↑ | Citation MAP ↑ | Title–Abstract AUC-ROC ↑ |
|---|---|---|---|
| BERT4Patents | 0.354 | 59.46 | 0.920 |
| PatentSBERTa | 0.356 | **75.95** | 0.985 |
| Paecter | 0.420 | 68.11 | 0.944 |
| LLaMA 3.1 8B FT | 0.343 | 56.78 | 0.973 |
| **TechTokenBERT** | **0.488** | 68.96 | **0.994** |

In this comparison TechTokenBERT has the best IPC Macro-F1 and title–abstract AUC-ROC and the second-best citation MAP (behind PatentSBERTa), while being roughly 25× smaller than LLaMA 3.1 8B.

---

## Usage

The model is a `BertForMaskedLM` with an expanded vocabulary. At inference time, the IPC-code embeddings are read from the last hidden layer at the positions of the technological tokens; the `[CLS]` token embedding can also be used as a general-purpose patent representation.

> **Minimal example:** build a batch where only the abstract is truncated (on the right), while the title, tech tokens, and all special tokens are preserved. Then run the model and extract the `[CLS]` embedding for each example.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# ---------------------------------------------------------------------------
# 1. Load model + tokenizer
# ---------------------------------------------------------------------------
MODEL_NAME = "AndreaTacchella/TechTokenBert"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

# ---------------------------------------------------------------------------
# 2. Toy data (3 rows)
# ---------------------------------------------------------------------------
titles = [
    "Method for cooling electronic components",
    "Wireless charging apparatus",
    "Biodegradable packaging material",
]

abstracts = [
    "A heat sink assembly that dissipates thermal energy from a processor using "
    "a network of micro-channels through which a coolant is circulated, thereby "
    "maintaining the junction temperature below a predefined threshold under load.",

    "An inductive power transfer system comprising a transmitter coil and a "
    "receiver coil aligned via a magnetic guidance structure to maximize coupling "
    "efficiency across a variable air gap.",

    "A composite film derived from plant-based polymers that decomposes under "
    "industrial composting conditions while providing an oxygen barrier suitable "
    "for food preservation.",
]

# Already preprocessed tech tokens (list of lists of IPC group-level strings)
tech_tokens_list = [
    ["h05k7", "g06f1"],
    ["h02j50", "h01f27"],
    ["c08l101", "b65d65"],
]


# ---------------------------------------------------------------------------
# 3. Build the padded batch (abstract truncated on the right only)
# ---------------------------------------------------------------------------
def build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512):
    cls_id = tokenizer.cls_token_id
    sep_id = tokenizer.sep_token_id

    all_ids = []
    for title, abstract, tech_tokens in zip(titles, abstracts, tech_tokens_list):
        title_ids = tokenizer.encode(title, add_special_tokens=False)
        abstract_ids = tokenizer.encode(abstract, add_special_tokens=False)
        tech_ids = tokenizer.encode(" ".join(tech_tokens), add_special_tokens=False)

        # [CLS] title [SEP] abstract [SEP] tech [SEP] -> 4 special tokens fixed
        fixed_len = 4 + len(title_ids) + len(tech_ids)
        abstract_budget = max(max_length - fixed_len, 0)
        abstract_ids = abstract_ids[:abstract_budget]  # right-side truncation

        ids = (
            [cls_id]
            + title_ids
            + [sep_id]
            + abstract_ids
            + [sep_id]
            + tech_ids
            + [sep_id]
        )
        all_ids.append(ids)

    # Pads to the longest sequence and adds the matching attention mask
    return tokenizer.pad({"input_ids": all_ids}, padding=True, return_tensors="pt")


enc = build_batch(titles, abstracts, tech_tokens_list, tokenizer, max_length=512)

# ---------------------------------------------------------------------------
# 4. Forward pass + extract the [CLS] embedding
# ---------------------------------------------------------------------------
with torch.no_grad():
    outputs = model(**enc, output_hidden_states=True)

# last hidden state: (batch, seq_len, hidden_dim); position 0 is [CLS]
cls_embeddings = outputs.hidden_states[-1][:, 0, :]
print(cls_embeddings.shape)  # (3, hidden_dim), hidden_dim = model.config.hidden_size
```

### Extracting IPC-code embeddings (TechToken method)

To obtain the context-dependent embedding of an IPC code from a specific patent, read the hidden-state vector at the position of the corresponding technological token. The technological tokens sit between the second and the third `[SEP]`:

```python
# Assumes `model` and `enc` from the example above (tech tokens at known positions)
with torch.no_grad():
    outputs = model(**enc, output_hidden_states=True)

last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, hidden_dim)
# Identify the position of each TT token, then index last_hidden accordingly.
```
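
One way to identify the technological-token positions is sketched below. It relies only on the documented layout `[CLS] title [SEP] abstract [SEP] tech [SEP]` and assumes that each IPC code maps to exactly one token id, so the positions between the second and third `[SEP]` line up one-to-one with the codes. The helper name `tech_token_positions` and the toy ids are hypothetical; the commented lines at the end show how the result could be combined with `enc`, `tokenizer`, and `last_hidden` from the examples above.

```python
def tech_token_positions(input_ids, sep_id):
    """Return the indices strictly between the second and third `sep_id`."""
    sep_positions = [i for i, tok in enumerate(input_ids) if tok == sep_id]
    if len(sep_positions) < 3:
        raise ValueError("expected [CLS] title [SEP] abstract [SEP] tech [SEP]")
    return list(range(sep_positions[1] + 1, sep_positions[2]))


# Toy ids: 101 = [CLS], 102 = [SEP], 0 = [PAD];
# 9001 and 9002 stand in for two technological tokens.
toy_ids = [101, 7, 8, 102, 11, 12, 13, 102, 9001, 9002, 102, 0, 0]
positions = tech_token_positions(toy_ids, sep_id=102)  # [8, 9]

# With the objects from the Usage example (not executed here):
#   row = enc["input_ids"][0].tolist()
#   pos = tech_token_positions(row, tokenizer.sep_token_id)
#   tt_embeddings = last_hidden[0, pos, :]  # (num_tech_tokens, hidden_dim)
```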
---

## Input Format

```
[CLS] <title tokens> [SEP] <abstract tokens> [SEP] <ipc_code_1> <ipc_code_2> ... [SEP]
```

- IPC codes must be **lower-cased** and at **group level** (e.g., `h05k7`, `g06f1`).
- The abstract is the only segment that should be truncated if the total length exceeds 512 tokens; title and IPC codes are always kept in full.
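
The card does not describe how raw IPC symbols were turned into token strings. As a rough illustration only, the sketch below assumes the standard IPC notation (for example `H05K 7/20`: section, class, subclass, main group, subgroup) and that the token is the lower-cased subclass plus main group with separators removed, which matches the examples `h05k7` and `g06f1`. The helper `ipc_to_tech_token` and its regular expression are hypothetical and may differ from the authors' preprocessing.

```python
import re

# Section letter, two-digit class, subclass letter, main-group digits;
# an optional "/subgroup" suffix is matched but discarded.
_IPC_PATTERN = re.compile(
    r"^\s*([A-H])\s*(\d{2})\s*([A-Z])\s*(\d{1,4})(?:\s*/\s*\d+)?\s*$",
    re.IGNORECASE,
)


def ipc_to_tech_token(code):
    """Map an IPC symbol such as 'H05K 7/20' to a group-level token string."""
    match = _IPC_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognised IPC symbol: {code!r}")
    section, ipc_class, subclass, main_group = match.groups()
    return f"{section}{ipc_class}{subclass}{main_group}".lower()


tokens = [ipc_to_tech_token(c) for c in ["H05K 7/20", "G06F 1/16", "C08L 101/00"]]
```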

---

## Limitations

- Operates at IPC *group* level; innovation that recombines technologies within a single group is invisible to the framework.
- Analysis is restricted to pairwise code combinations; higher-order assemblies of three or more technologies are not directly modeled.
- Trained and evaluated on European Patent Office (EPO) data in English; performance on other patent offices or languages has not been assessed.

---

## Citation

```bibtex
@article{fenoaltea2026anticipating,
  title   = {Anticipating Innovation Using Large Language Models},
  author  = {Fenoaltea, Enrico Maria and Santoro, Filippo and De Marzo, Giordano
             and Aroyehun, Segun Taofeek and Tacchella, Andrea},
  journal = {arXiv preprint arXiv:2605.04875},
  year    = {2026}
}
```