AndreaTacchella committed on
Commit
9af7485
·
verified ·
1 Parent(s): b916c70

Update README.md

Files changed (1)
  1. README.md +6 -13
README.md CHANGED
@@ -41,16 +41,15 @@ TechTokenBERT extends the vocabulary of BERT4Patent with one dedicated token per
41
  During fine-tuning the attention mechanism learns to link each technological token both to the words of the patent text and to the other IPC codes in the same patent. The result is a model with two complementary uses:
42
 
43
  - **Patent embeddings:** use the `[CLS]` token as a single vector representation of the full patent (title + abstract + IPC codes). The `[CLS]` embedding encodes information from both the text and the IPC codes it is associated with, outperforming standard sentence-embedding models on patent similarity tasks.
44
- - **IPC-code embeddings:** read the hidden-state vector at each technological-token position to obtain a context-dependent embedding of that IPC code in the specific patent. Averaging these across a corpus yields rich, polysemy-aware code representations.
45
 
46
  ---
47
 
48
  ## Training Data
49
 
50
  - **Source:** Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
51
- - **Fine-tuning split:** Patents published 1980–2005
52
- - **IPC granularity:** Group level, yielding **7,200 unique codes**
53
- - Patents missing either abstract or claims are excluded.
54
 
55
  ---
56
 
@@ -67,7 +66,7 @@ During fine-tuning the attention mechanism learns to link each technological tok
67
  | LLaMA 3.1 8B FT (LLM2Vec) | 0.343 | 56.78 | 0.973 |
68
  | **TechTokenBERT** | **0.488** | **68.96** | **0.994** |
69
 
70
- TechTokenBERT achieves the best score in all three tasks, surpassing PatentSBERTa and Paecter — models specifically designed for IPC classification and citation prediction respectively — while being roughly 25× smaller than LLaMA 3.1 8B.
71
 
72
  ---
73
 
@@ -160,7 +159,7 @@ with torch.no_grad():
160
 
161
  # last hidden state: (batch, seq_len, hidden_dim); position 0 is [CLS]
162
  cls_embeddings = outputs.hidden_states[-1][:, 0, :]
163
- print(cls_embeddings.shape) # (3, 768)
164
  ```
165
 
166
  ### Extracting IPC-code embeddings
@@ -171,7 +170,7 @@ To obtain the context-dependent embedding of a specific IPC code within a patent
171
  with torch.no_grad():
172
      outputs = model(**enc, output_hidden_states=True)
173
 
174
- last_hidden = outputs.hidden_states[-1] # (batch, seq_len, 768)
175
  # Identify the position of each TT token, then index last_hidden accordingly.
176
  ```
177
 
@@ -189,12 +188,6 @@ last_hidden = outputs.hidden_states[-1] # (batch, seq_len, 768)
189
 
190
  ---
191
 
192
- ## Limitations
193
-
194
- - Trained and evaluated on European Patent Office (EPO) data in English; performance on other patent offices or languages has not been assessed.
195
- - Operates at IPC group level (4-character codes); finer-grained subgroup distinctions are not represented.
196
- - The expanded vocabulary (7,200 IPC tokens) is specific to EPO group-level codes; re-use with other classification systems requires re-training.
197
-
198
  ---
199
 
200
  ## Citation
 
41
  During fine-tuning the attention mechanism learns to link each technological token both to the words of the patent text and to the other IPC codes in the same patent. The result is a model with two complementary uses:
42
 
43
  - **Patent embeddings:** use the `[CLS]` token as a single vector representation of the full patent (title + abstract + IPC codes). The `[CLS]` embedding encodes information from both the text and the IPC codes it is associated with, outperforming standard sentence-embedding models on patent similarity tasks.
44
+ - **IPC-code embeddings:** read the hidden-state vector at each technological-token position to obtain a context-dependent embedding of that IPC code in the specific patent.
45
 
46
  ---
47
 
48
  ## Training Data
49
 
50
  - **Source:** Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
51
+ - **Fine-tuning split:** EPO Patents published 1980–2023
52
+ - **IPC granularity:** Group level, yielding **8000 unique codes**
 
53
 
54
  ---
55
 
 
66
  | LLaMA 3.1 8B FT (LLM2Vec) | 0.343 | 56.78 | 0.973 |
67
  | **TechTokenBERT** | **0.488** | **68.96** | **0.994** |
68
 
69
+ For details see the full paper.
70
 
71
  ---
72
 
 
159
 
160
  # last hidden state: (batch, seq_len, hidden_dim); position 0 is [CLS]
161
  cls_embeddings = outputs.hidden_states[-1][:, 0, :]
162
+ print(cls_embeddings.shape) # (3, 1024)
163
  ```
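The README claims the `[CLS]` vectors are useful for patent similarity but does not show a comparison step. The following is a minimal sketch of one common way to compare them, assuming cosine similarity is an appropriate metric (the source does not state which metric was used). The small `cls_embeddings` tensor is a stand-in for the real `(batch, hidden_dim)` output above.

```python
import torch
import torch.nn.functional as F

# Stand-in for cls_embeddings = outputs.hidden_states[-1][:, 0, :]
# (three toy "patents" with a 3-dimensional embedding each).
cls_embeddings = torch.tensor([[1.0, 0.0, 0.0],
                               [1.0, 1.0, 0.0],
                               [0.0, 0.0, 2.0]])

# L2-normalise each row so that a plain dot product equals cosine similarity.
unit = F.normalize(cls_embeddings, p=2, dim=1)

# Pairwise similarity matrix of shape (n_patents, n_patents).
similarity = unit @ unit.T
print(similarity)
```

Entry `similarity[i, j]` is the cosine similarity between patents `i` and `j`; the diagonal is 1 by construction.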
164
 
165
  ### Extracting IPC-code embeddings
 
170
  with torch.no_grad():
171
      outputs = model(**enc, output_hidden_states=True)
172
 
173
+ last_hidden = outputs.hidden_states[-1] # (batch, seq_len, 1024)
174
  # Identify the position of each TT token, then index last_hidden accordingly.
175
  ```
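The final comment in the snippet above leaves the indexing step unstated. The following is a minimal sketch of one way to do it, not the authors' implementation. The helper `gather_tt_embeddings` and the token ids `30000` and `30001` are hypothetical; in practice the ids would come from the tokenizer (for example via `tokenizer.convert_tokens_to_ids`), and the exact string form of the technological tokens is not given in this diff.

```python
import torch

def gather_tt_embeddings(input_ids, last_hidden, tt_token_ids):
    """Collect the hidden state at every technological-token (TT) position."""
    # Boolean mask over (batch, seq_len): True where the token id is a TT id.
    mask = torch.isin(input_ids, tt_token_ids)
    # Coordinates of the True entries, in row-major order.
    batch_idx, seq_idx = mask.nonzero(as_tuple=True)
    # Advanced indexing picks one hidden vector per TT occurrence.
    vectors = last_hidden[batch_idx, seq_idx]  # (n_tt_tokens, hidden_dim)
    return batch_idx, seq_idx, input_ids[batch_idx, seq_idx], vectors

# Stand-in tensors; with the real model use enc["input_ids"] and
# outputs.hidden_states[-1]. Ids 30000 and 30001 are hypothetical TT ids.
input_ids = torch.tensor([[101, 7, 8, 30000, 30001, 102],
                          [101, 9, 30001, 102, 0, 0]])
last_hidden = torch.arange(2 * 6 * 4, dtype=torch.float32).reshape(2, 6, 4)
tt_token_ids = torch.tensor([30000, 30001])

batch_idx, seq_idx, ids, vectors = gather_tt_embeddings(
    input_ids, last_hidden, tt_token_ids
)
print(vectors.shape)
```

Each row of `vectors` pairs with the patent index in `batch_idx` and the token id in `ids`, so rows that share an id can be grouped afterwards if a per-code summary is wanted.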
176
 
 
188
 
189
  ---
190
 
 
 
 
 
 
 
191
  ---
192
 
193
  ## Citation