---
license: cc-by-nc-4.0
language:
- en
base_model:
- anferico/bert-for-patents
tags:
- patent
- embeddings
- contrastive-learning
- information-retrieval
pipeline_tag: feature-extraction
---
# PatentMap-V0-SecPair-BackgroundDrawing
**PatentMap-V0-SecPair-BackgroundDrawing** is a patent embedding model trained on abstract + background + drawing sections with section-pair augmentation. It is part of the PatentMap V0 model collection.
## Model Details
- **Base Model:** [anferico/bert-for-patents](https://huggingface.co/anferico/bert-for-patents)
- **Training Objective:** Contrastive learning (InfoNCE loss)
- **Architecture:** BERT-large (340M parameters)
- **Embedding Dimension:** 1024
- **Max Sequence Length:** 512 tokens
- **Vocabulary Size:** 39860
- **Training Data:** USPTO patent applications (2010-2018) from [HUPD corpus](https://huggingface.co/datasets/HUPD/hupd)
### Training Configuration
- **Patent Sections Used:** abstract + background + drawing
- **Data Augmentation:** dropout + section_pair
- **Batch Size:** 512
- **Learning Rate:** 1e-5
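For readers unfamiliar with the training objective, the following is a minimal pure-Python sketch of an InfoNCE loss with in-batch negatives. It is not the training code for this model: the temperature value, the use of cosine similarity, and the function names are assumptions made for illustration, and how positive pairs are actually constructed (dropout, section pairs) is left to the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; every other row of
    `positives` acts as an in-batch negative. The temperature of 0.05
    is an assumed value, not one taken from the model card.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two views each of two documents.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives match their anchors
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives are mismatched
loss_aligned = info_nce(anchors, aligned)
loss_swapped = info_nce(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each anchor is closest to its own positive and grows when another item in the batch is closer, which is the property the contrastive objective relies on.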
### Special Tokens
This model includes additional patent-specific special tokens:
- `[drawing]`
## Usage
### Input Format
This model expects patent text formatted with special tokens:
- **For abstract**: `Title [SEP] [abstract] Abstract text`
- **For other sections**: `[section] Section text` (no title prefix)
Example:
```python
# Abstract with title
text = "Smart thermostat system [SEP] [abstract] A thermostat system comprising..."
# Claim without title
text = "[claim] A method comprising: step 1, step 2..."
```
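The formatting rules above can be wrapped in a small helper. This is a sketch only: `format_patent_text` is a hypothetical function name, and raising an error when an abstract has no title is a design choice made here, not behavior documented for the model.

```python
def format_patent_text(section, text, title=None):
    """Build an input string following the format described above.

    `section` is the tag name without brackets, e.g. "abstract".
    The title prefix is only added for the abstract.
    """
    tagged = f"[{section}] {text.strip()}"
    if section == "abstract":
        if not title:
            # Assumption: treat a missing title as an error for abstracts.
            raise ValueError("an abstract input needs a title")
        return f"{title.strip()} [SEP] {tagged}"
    return tagged


abstract_input = format_patent_text(
    "abstract", "A thermostat system comprising...", title="Smart thermostat system"
)
drawing_input = format_patent_text("drawing", "FIG. 1 is a block diagram of...")
print(abstract_input)
print(drawing_input)
```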
### Code Example
```python
from transformers import AutoTokenizer, AutoModel
import torch
# Load model and tokenizer
model_name = "ZoeYou/PatentMap-V0-SecPair-BackgroundDrawing"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Format patent text
title = "Smart thermostat system"
abstract = "A thermostat system comprising a temperature sensor..."
patent_text = f"{title} [SEP] [abstract] {abstract}"
# Encode and get embeddings
inputs = tokenizer(patent_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0, :]  # CLS token
print(embeddings.shape)  # torch.Size([1, 1024])
```
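Once embeddings have been extracted, a similarity search reduces to ranking documents by a vector similarity score. The sketch below uses cosine similarity on toy 3-dimensional vectors in place of real 1024-dimensional embeddings. Cosine similarity, the function names, and the document IDs are assumptions for illustration; the model card does not prescribe a scoring function.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_cosine(query, corpus, top_k=3):
    """Return (doc_id, score) pairs sorted by cosine similarity to the query."""
    q = l2_normalize(query)
    scored = []
    for doc_id, vec in corpus.items():
        d = l2_normalize(vec)
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for the model's embeddings.
corpus = {
    "patent_a": [0.9, 0.1, 0.0],
    "patent_b": [0.0, 1.0, 0.2],
    "patent_c": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]
results = rank_by_cosine(query, corpus, top_k=2)
print(results)
```

For a large corpus, the same ranking would normally be done with batched matrix operations or an approximate nearest-neighbor index rather than a Python loop.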
## Evaluation
This model has been evaluated on multiple patent-specific tasks:
- **IPC Classification** (linear probe and KNN)
- **Prior Art Search** (recall@k, nDCG@k)
- **Embedding Quality Metrics** (uniformity, alignment, topology)
For detailed evaluation results, see the [PatentMap paper](https://arxiv.org/abs/2511.10657).
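As a reference for the retrieval metrics named above, here is a minimal sketch of recall@k and nDCG@k. It assumes binary relevance and a non-empty set of relevant documents; the exact protocol used in the paper (relevance grading, values of k) may differ, and the document IDs are made up.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top-k results.

    Assumes `relevant_ids` is non-empty.
    """
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary relevance (1 if relevant, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal ranking places all relevant documents first.
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


ranked = ["p3", "p1", "p7", "p2", "p9"]  # system output, best first
relevant = {"p1", "p2"}                  # known prior art for the query
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
print(recall, ndcg)
```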
## Intended Use
This model is designed for:
- Patent document retrieval
- Patent similarity search
- Prior art discovery
- IPC classification
- Patent landscape analysis
## Citation
If you use this model, please cite:
```bibtex
@article{zuo2025patent,
title={Patent Representation Learning via Self-supervision},
author={Zuo, You and Gerdes, Kim and de La Clergerie, Eric Villemonte and Sagot, Beno{\^i}t},
journal={arXiv preprint arXiv:2511.10657},
year={2025}
}
```
## Model Collection
This model is part of the PatentMap V0 collection. For an overview of all models, see [PatentMap-V0](https://huggingface.co/ZoeYou/patentmapv0-models).
## License
This model is released under the CC BY-NC 4.0 license (non-commercial use only).
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/ZoeYou/patentmapv0) or contact the authors.