AndreaTacchella/TechTokenBert

@@ -41,16 +41,15 @@ TechTokenBERT extends the vocabulary of BERT4Patent with one dedicated token per
 During fine-tuning the attention mechanism learns to link each technological token both to the words of the patent text and to the other IPC codes in the same patent. The result is a model with two complementary uses:
 - **Patent embeddings:** use the `[CLS]` token as a single vector representation of the full patent (title + abstract + IPC codes). The `[CLS]` embedding encodes information from both the text and the IPC codes it is associated with, outperforming standard sentence-embedding models on patent similarity tasks.
-- **IPC-code embeddings:** read the hidden-state vector at each technological-token position to obtain a context-dependent embedding of that IPC code in the specific patent. Averaging these across a corpus yields rich, polysemy-aware code representations.
 ---
 ## Training Data
 - **Source:** Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
-- **Fine-tuning split:** Patents published 1980–2005
-- **IPC granularity:** Group level, yielding **7,200 unique codes**
-- Patents missing either abstract or claims are excluded.
 ---
@@ -67,7 +66,7 @@ During fine-tuning the attention mechanism learns to link each technological tok
 | LLaMA 3.1 8B FT (LLM2Vec) | 0.343 | 56.78 | 0.973 |
 | **TechTokenBERT** | **0.488** | **68.96** | **0.994** |
-TechTokenBERT achieves the best score in all three tasks, surpassing PatentSBERTa and Paecter — models specifically designed for IPC classification and citation prediction respectively — while being roughly 25× smaller than LLaMA 3.1 8B.
 ---
@@ -160,7 +159,7 @@ with torch.no_grad():
 # last hidden state: (batch, seq_len, hidden_dim); position 0 is [CLS]
 cls_embeddings = outputs.hidden_states[-1][:, 0, :]
-print(cls_embeddings.shape)  # (3, 768)
 ```
 ### Extracting IPC-code embeddings
@@ -171,7 +170,7 @@ To obtain the context-dependent embedding of a specific IPC code within a patent
 with torch.no_grad():
     outputs = model(**enc, output_hidden_states=True)
-last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, 768)
 # Identify the position of each TT token, then index last_hidden accordingly.
 ```
@@ -189,12 +188,6 @@ last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, 768)
 ---
-## Limitations
-- Trained and evaluated on European Patent Office (EPO) data in English; performance on other patent offices or languages has not been assessed.
-- Operates at IPC group level (4-character codes); finer-grained subgroup distinctions are not represented.
-- The expanded vocabulary (7,200 IPC tokens) is specific to EPO group-level codes; re-use with other classification systems requires re-training.
 ---
 ## Citation

 During fine-tuning the attention mechanism learns to link each technological token both to the words of the patent text and to the other IPC codes in the same patent. The result is a model with two complementary uses:
 - **Patent embeddings:** use the `[CLS]` token as a single vector representation of the full patent (title + abstract + IPC codes). The `[CLS]` embedding encodes information from both the text and the IPC codes it is associated with, outperforming standard sentence-embedding models on patent similarity tasks.
+- **IPC-code embeddings:** read the hidden-state vector at each technological-token position to obtain a context-dependent embedding of that IPC code in the specific patent.
 ---
 ## Training Data
 - **Source:** Full European Patent Bulletin AB (~1.3 M English-language patents, 1980–2024)
+- **Fine-tuning split:** EPO patents published 1980–2023
+- **IPC granularity:** Group level, yielding **8,000 unique codes**
 ---
 | LLaMA 3.1 8B FT (LLM2Vec) | 0.343 | 56.78 | 0.973 |
 | **TechTokenBERT** | **0.488** | **68.96** | **0.994** |
+For details, see the full paper.
 ---
 # last hidden state: (batch, seq_len, hidden_dim); position 0 is [CLS]
 cls_embeddings = outputs.hidden_states[-1][:, 0, :]
+print(cls_embeddings.shape)  # (3, 1024)
 ```
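Once the `[CLS]` vectors are extracted, patents can be compared by cosine similarity. The snippet below is a minimal, dependency-free sketch of that step; it assumes the embeddings have been converted to plain Python lists (for example with `cls_embeddings.tolist()`), and the helper name `cosine_similarity_matrix` is hypothetical, not part of the model's API.

```python
import math

def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity of a list of equal-length vectors.

    `vectors` is assumed to be a list of lists of floats, e.g. the result of
    cls_embeddings.tolist(). Zero vectors get similarity 0.0 by convention.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # leave 0.0 for degenerate vectors
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim

# Toy 2-dimensional vectors standing in for real [CLS] embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim = cosine_similarity_matrix(toy_embeddings)
```

For large batches, a vectorised equivalent in PyTorch (normalising the rows and multiplying the matrix by its transpose) would likely be preferable; the loop form is used here only to keep the example self-contained.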
 ### Extracting IPC-code embeddings
 with torch.no_grad():
     outputs = model(**enc, output_hidden_states=True)
+last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, 1024)
 # Identify the position of each TT token, then index last_hidden accordingly.
 ```
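The position lookup mentioned in the last comment can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the ids of the technological tokens are available as a set (`tt_token_ids`, hypothetical; in practice they might be obtained from the tokenizer, e.g. via `convert_tokens_to_ids`), and it works on plain nested lists such as `enc["input_ids"].tolist()`. The helper names are hypothetical as well.

```python
def find_tt_positions(input_ids, tt_token_ids):
    """For each sequence in the batch, list the positions holding a TT token.

    input_ids: list of lists of ints (batch, seq_len).
    tt_token_ids: set of vocabulary ids assumed to be technological tokens.
    """
    return [
        [pos for pos, tok in enumerate(row) if tok in tt_token_ids]
        for row in input_ids
    ]

def gather_tt_vectors(last_hidden, positions):
    """Pick the hidden-state vector at each TT position, per sequence.

    last_hidden only needs to support last_hidden[b][p] indexing, so a
    nested list or a (batch, seq_len, hidden_dim) tensor both work.
    """
    return [
        [last_hidden[b][p] for p in pos_list]
        for b, pos_list in enumerate(positions)
    ]

# Toy batch: the ids 30001 and 30002 are made-up stand-ins for TT tokens.
toy_input_ids = [[101, 2000, 30001, 30002, 102],
                 [101, 30001, 102, 0, 0]]
toy_tt_ids = {30001, 30002}
# Toy hidden states: the "vector" at (b, p) is simply [b, p].
toy_hidden = [[[b, p] for p in range(5)] for b in range(2)]

positions = find_tt_positions(toy_input_ids, toy_tt_ids)
tt_vectors = gather_tt_vectors(toy_hidden, positions)
```

Each inner list of `tt_vectors` holds one vector per IPC code attached to that patent, in the order the codes appear in the input sequence.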
 ---
 ---
 ## Citation