---
license: cc-by-nc-4.0
language:
- en
base_model:
- anferico/bert-for-patents
tags:
- patent
- embeddings
- contrastive-learning
- information-retrieval
pipeline_tag: feature-extraction
---

# PatentMap-V0-SecPair-Drawing

**PatentMap-V0-SecPair-Drawing** is a patent embedding model trained on abstract + drawing sections with section-pair augmentation. It is part of the PatentMap V0 model collection.

## Model Details

- **Base Model:** [anferico/bert-for-patents](https://huggingface.co/anferico/bert-for-patents)
- **Training Objective:** Contrastive learning (InfoNCE loss)
- **Architecture:** BERT-large (340M parameters)
- **Embedding Dimension:** 1024
- **Max Sequence Length:** 512 tokens
- **Vocabulary Size:** 39860
- **Training Data:** USPTO patent applications (2010-2018) from [HUPD corpus](https://huggingface.co/datasets/HUPD/hupd)

### Training Configuration

- **Patent Sections Used:** abstract + drawing
- **Data Augmentation:** dropout + section_pair
- **Batch Size:** 512
- **Learning Rate:** 1e-5

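
For orientation, the InfoNCE objective named under Model Details is commonly written in the in-batch form below. This is the generic textbook formulation, not necessarily the exact loss used for this model: the temperature $\tau$ and the use of cosine similarity for $\mathrm{sim}$ are assumptions that this card does not state, and the paper remains the authoritative reference. Here $z_i$ is the embedding of an anchor text, $z_i^{+}$ the embedding of its positive view, and $N$ the batch size.

```latex
\mathcal{L}_i = -\log
  \frac{\exp\left(\mathrm{sim}(z_i, z_i^{+}) / \tau\right)}
       {\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i, z_j^{+}) / \tau\right)}
```

Under the augmentations listed above, a positive would plausibly be either the same text re-encoded under a different dropout mask or another section of the same patent; that reading is an assumption, not a description taken from the training code.
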
### Special Tokens

This model includes additional patent-specific special tokens:
- `[drawing]`

## Usage

### Input Format

This model expects patent text formatted with special tokens:

- **For abstract**: `Title [SEP] [abstract] Abstract text`
- **For other sections**: `[section] Section text` (no title prefix)

Example:
```python
# Abstract with title
text = "Smart thermostat system [SEP] [abstract] A thermostat system comprising..."

# Claim without title
text = "[claim] A method comprising: step 1, step 2..."
```

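
The two formatting rules above can be wrapped in a small helper. This is a minimal sketch: the function name `format_patent_section` and the sample drawing text are hypothetical, and the helper assumes the section name you pass corresponds to a tag the tokenizer actually knows (such as `[abstract]` or `[drawing]`).

```python
from typing import Optional


def format_patent_section(section: str, text: str, title: Optional[str] = None) -> str:
    """Build an input string following the format described in this card.

    Hypothetical helper, not part of the released model code.
    """
    section = section.strip().lower()
    text = text.strip()
    if section == "abstract":
        # Abstracts are prefixed with the patent title and a [SEP] token.
        if not title or not title.strip():
            raise ValueError("abstract inputs need a title")
        return f"{title.strip()} [SEP] [abstract] {text}"
    # Every other section gets only its section tag, with no title prefix.
    return f"[{section}] {text}"


abstract_input = format_patent_section(
    "abstract",
    "A thermostat system comprising...",
    title="Smart thermostat system",
)
drawing_input = format_patent_section(
    "drawing", "FIG. 1 is a block diagram of the thermostat system."
)
```

Keeping the formatting in one place makes it harder to forget the title prefix for abstracts or to add it by mistake for other sections.
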
### Code Example

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
model_name = "ZoeYou/PatentMap-V0-SecPair-Drawing"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Format patent text
title = "Smart thermostat system"
abstract = "A thermostat system comprising a temperature sensor..."
patent_text = f"{title} [SEP] [abstract] {abstract}"

# Encode and get embeddings
inputs = tokenizer(patent_text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0, :]  # CLS token

print(embeddings.shape)  # torch.Size([1, 1024])
```

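
Once embeddings have been extracted as above, a typical next step is to compare them. The sketch below ranks candidate embeddings against a query by cosine similarity. Cosine similarity is an assumption here (it is a common choice for contrastively trained embeddings, but this card does not name a metric), the function names are hypothetical, and vectors are taken as plain lists of floats, for example from `embeddings[0].tolist()`, to keep the example dependency-free.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors standing in for 1024-dimensional embeddings.
ranked = rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
```

For large collections, a vectorized implementation or an approximate nearest-neighbor index would be more practical than this pure-Python loop.
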
## Evaluation

This model has been evaluated on multiple patent-specific tasks:

- **IPC Classification** (linear probe and KNN)
- **Prior Art Search** (recall@k, nDCG@k)
- **Embedding Quality Metrics** (uniformity, alignment, topology)

For detailed evaluation results, see the [PatentMap paper](https://arxiv.org/abs/2511.10657).

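
As a reminder of what the prior art search metrics measure, here is a minimal sketch of recall@k and nDCG@k with binary relevance. It is not the evaluation code from the paper: the function names and the patent IDs are hypothetical, and it assumes each ranked list contains no duplicate IDs.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    # A hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # The ideal ranking places every relevant document at the top.
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical retrieval result for one query patent.
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2"}
recall_2 = recall_at_k(ranked, relevant, k=2)
ndcg_4 = ndcg_at_k(ranked, relevant, k=4)
```

In this toy case only one of the two relevant patents is in the top 2, so recall@2 is 0.5, while nDCG@4 additionally penalizes the relevant patents for appearing at ranks 2 and 4 instead of 1 and 2.
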
## Intended Use

This model is designed for:
- Patent document retrieval
- Patent similarity search
- Prior art discovery
- IPC classification
- Patent landscape analysis

## Citation

If you use this model, please cite:

```bibtex
@article{zuo2025patent,
  title={Patent Representation Learning via Self-supervision},
  author={Zuo, You and Gerdes, Kim and de La Clergerie, Eric Villemonte and Sagot, Beno{\^i}t},
  journal={arXiv preprint arXiv:2511.10657},
  year={2025}
}
```

## Model Collection

This model is part of the PatentMap V0 collection. For an overview of all models, see [PatentMap-V0](https://huggingface.co/ZoeYou/patentmapv0-models).

## License

This model is released under the CC BY-NC 4.0 license (non-commercial use only).

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/ZoeYou/patentmapv0) or contact the authors.