Update README.md
README.md
CHANGED
|
@@ -41,16 +41,15 @@ TechTokenBERT extends the vocabulary of BERT4Patent with one dedicated token per
|
|
| 41 |
During fine-tuning the attention mechanism learns to link each technological token both to the words of the patent text and to the other IPC codes in the same patent. The result is a model with two complementary uses:
|
| 42 |
|
| 43 |
- **Patent embeddings:** use the `[CLS]` token as a single vector representation of the full patent (title + abstract + IPC codes). The `[CLS]` embedding encodes information from both the text and the IPC codes it is associated with, outperforming standard sentence-embedding models on patent similarity tasks.
|
| 44 |
-
- **IPC-code embeddings:** read the hidden-state vector at each technological-token position to obtain a context-dependent embedding of that IPC code in the specific patent.
|
| 45 |
|
| 46 |
---
|
| 47 |
|
| 48 |
## Training Data
|
| 49 |
|
| 50 |
- **Source:** Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
|
| 51 |
-
- **Fine-tuning split:** Patents published 1980–
|
| 52 |
-
- **IPC granularity:** Group level, yielding **
|
| 53 |
-
- Patents missing either abstract or claims are excluded.
|
| 54 |
|
| 55 |
---
|
| 56 |
|
|
@@ -67,7 +66,7 @@ During fine-tuning the attention mechanism learns to link each technological tok
|
|
| 67 |
| LLaMA 3.1 8B FT (LLM2Vec) | 0.343 | 56.78 | 0.973 |
|
| 68 |
| **TechTokenBERT** | **0.488** | **68.96** | **0.994** |
|
| 69 |
|
| 70 |
-
|
| 71 |
|
| 72 |
---
|
| 73 |
|
|
@@ -160,7 +159,7 @@ with torch.no_grad():
|
|
| 160 |
|
| 161 |
# last hidden state: (batch, seq_len, hidden_dim); position 0 is [CLS]
|
| 162 |
cls_embeddings = outputs.hidden_states[-1][:, 0, :]
|
| 163 |
-
print(cls_embeddings.shape) # (3, 768)
|
| 164 |
```
|
| 165 |
|
| 166 |
### Extracting IPC-code embeddings
|
|
@@ -171,7 +170,7 @@ To obtain the context-dependent embedding of a specific IPC code within a patent
|
|
| 171 |
with torch.no_grad():
|
| 172 |
outputs = model(**enc, output_hidden_states=True)
|
| 173 |
|
| 174 |
-
last_hidden = outputs.hidden_states[-1] # (batch, seq_len, 768)
|
| 175 |
# Identify the position of each TT token, then index last_hidden accordingly.
|
| 176 |
```
|
| 177 |
|
|
@@ -189,12 +188,6 @@ last_hidden = outputs.hidden_states[-1] # (batch, seq_len, 768)
|
|
| 189 |
|
| 190 |
---
|
| 191 |
|
| 192 |
-
## Limitations
|
| 193 |
-
|
| 194 |
-
- Trained and evaluated on European Patent Office (EPO) data in English; performance on other patent offices or languages has not been assessed.
|
| 195 |
-
- Operates at IPC group level (4-character codes); finer-grained subgroup distinctions are not represented.
|
| 196 |
-
- The expanded vocabulary (7,200 IPC tokens) is specific to EPO group-level codes; re-use with other classification systems requires re-training.
|
| 197 |
-
|
| 198 |
---
|
| 199 |
|
| 200 |
## Citation
|
|
|
|
| 41 |
During fine-tuning the attention mechanism learns to link each technological token both to the words of the patent text and to the other IPC codes in the same patent. The result is a model with two complementary uses:
|
| 42 |
|
| 43 |
- **Patent embeddings:** use the `[CLS]` token as a single vector representation of the full patent (title + abstract + IPC codes). The `[CLS]` embedding encodes information from both the text and the IPC codes it is associated with, outperforming standard sentence-embedding models on patent similarity tasks.
|
| 44 |
+
- **IPC-code embeddings:** read the hidden-state vector at each technological-token position to obtain a context-dependent embedding of that IPC code in the specific patent.
|
| 45 |
|
| 46 |
---
|
| 47 |
|
| 48 |
## Training Data
|
| 49 |
|
| 50 |
- **Source:** Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
|
| 51 |
+
- **Fine-tuning split:** EPO Patents published 1980–2023
|
| 52 |
+
- **IPC granularity:** Group level, yielding **8000 unique codes**
|
|
|
|
| 53 |
|
| 54 |
---
|
| 55 |
|
|
|
|
| 66 |
| LLaMA 3.1 8B FT (LLM2Vec) | 0.343 | 56.78 | 0.973 |
|
| 67 |
| **TechTokenBERT** | **0.488** | **68.96** | **0.994** |
|
| 68 |
|
| 69 |
+
For details see the full paper.
|
| 70 |
|
| 71 |
---
|
| 72 |
|
|
|
|
| 159 |
|
| 160 |
# last hidden state: (batch, seq_len, hidden_dim); position 0 is [CLS]
|
| 161 |
cls_embeddings = outputs.hidden_states[-1][:, 0, :]
|
| 162 |
+
print(cls_embeddings.shape) # (3, 1024)
|
| 163 |
```
|
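Since the README positions the `[CLS]` vector as a patent-level embedding for similarity tasks, a minimal sketch of comparing patents by cosine similarity may help. The tensor below is a random stand-in for the `cls_embeddings` computed above (an assumption made so the snippet runs without downloading the model), and cosine similarity is one common choice rather than a metric confirmed by the source.

```python
import torch
import torch.nn.functional as F

# Stand-in for the (3, 1024) cls_embeddings produced by the model above.
# Random values are an assumption so the sketch runs without model weights.
torch.manual_seed(0)
cls_embeddings = torch.randn(3, 1024)

# L2-normalise each row so that a dot product equals cosine similarity.
normed = F.normalize(cls_embeddings, dim=-1)

# Pairwise cosine similarity between all patents in the batch: (3, 3).
similarity = normed @ normed.T

print(similarity.shape)  # torch.Size([3, 3])
```

Row `i` of `similarity` ranks every patent in the batch against patent `i`; the diagonal is each patent compared with itself.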
| 164 |
|
| 165 |
### Extracting IPC-code embeddings
|
|
|
|
| 170 |
with torch.no_grad():
|
| 171 |
outputs = model(**enc, output_hidden_states=True)
|
| 172 |
|
| 173 |
+
last_hidden = outputs.hidden_states[-1] # (batch, seq_len, 1024)
|
| 174 |
# Identify the position of each TT token, then index last_hidden accordingly.
|
| 175 |
```
|
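The last comment in the snippet above leaves the indexing step open. One possible way to do it is sketched here, assuming the vocabulary ids of the technological tokens are already known. The ids `30522` and `30530`, the toy `input_ids`, and the helper name `gather_tt_embeddings` are all hypothetical; with the real model the ids would presumably come from the tokenizer (for example via `tokenizer.convert_tokens_to_ids`), and `input_ids` would be `enc["input_ids"]`.

```python
import torch

def gather_tt_embeddings(last_hidden, input_ids, tt_token_ids):
    """Collect the hidden state at every technological-token position.

    Returns the batch index, sequence position, and embedding of each match.
    """
    tt_ids = torch.as_tensor(sorted(tt_token_ids), device=input_ids.device)
    mask = torch.isin(input_ids, tt_ids)                 # (batch, seq_len) bool
    batch_idx, positions = mask.nonzero(as_tuple=True)   # coordinates of matches
    return batch_idx, positions, last_hidden[batch_idx, positions]

# Toy stand-ins so the sketch runs without the model (all values hypothetical).
torch.manual_seed(0)
input_ids = torch.tensor([[101, 2023, 30522, 30530, 102],
                          [101, 30522, 2003, 102, 0]])
last_hidden = torch.randn(2, 5, 1024)   # (batch, seq_len, hidden_dim)
tt_token_ids = {30522, 30530}           # hypothetical technological-token ids

batch_idx, positions, tt_embeddings = gather_tt_embeddings(
    last_hidden, input_ids, tt_token_ids
)
print(tt_embeddings.shape)  # torch.Size([3, 1024])
```

Each row of `tt_embeddings` belongs to the patent `batch_idx[k]` at sequence position `positions[k]`, so the same IPC code appearing in two patents yields two different, context-dependent vectors.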
| 176 |
|
|
|
|
| 188 |
|
| 189 |
---
|
| 190 |
|
| 191 |
---
|
| 192 |
|
| 193 |
## Citation
|