---
license: cc-by-nc-4.0
language:
- en
base_model:
- anferico/bert-for-patents
tags:
- patent
- embeddings
- contrastive-learning
- information-retrieval
pipeline_tag: feature-extraction
---
# PatentMap-V0-SecPair-Drawing
**PatentMap-V0-SecPair-Drawing** is a patent embedding model trained on abstract + drawing sections with section-pair augmentation. It is part of the PatentMap V0 model collection.
## Model Details
- **Base Model:** [anferico/bert-for-patents](https://huggingface.co/anferico/bert-for-patents)
- **Training Objective:** Contrastive learning (InfoNCE loss)
- **Architecture:** BERT-large (340M parameters)
- **Embedding Dimension:** 1024
- **Max Sequence Length:** 512 tokens
- **Vocabulary Size:** 39860
- **Training Data:** USPTO patent applications (2010-2018) from [HUPD corpus](https://huggingface.co/datasets/HUPD/hupd)
### Training Configuration
- **Patent Sections Used:** abstract + drawing
- **Data Augmentation:** dropout + section_pair
- **Batch Size:** 512
- **Learning Rate:** 1e-5
### Special Tokens
This model includes additional patent-specific special tokens:
- `[drawing]`
## Usage
### Input Format
This model expects patent text formatted with special tokens:
- **For abstract**: `Title [SEP] [abstract] Abstract text`
- **For other sections**: `[section] Section text` (no title prefix)
Example:
```python
# Abstract with title
text = "Smart thermostat system [SEP] [abstract] A thermostat system comprising..."

# Claim without title
text = "[claim] A method comprising: step 1, step 2..."
```
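The formatting rule above can be wrapped in a small helper. This is a sketch, not part of the released package; the function name `format_patent_input` is hypothetical:

```python
def format_patent_input(section, text, title=None):
    """Format patent text per the scheme above: title prefix only for abstracts."""
    if section == "abstract" and title:
        return f"{title} [SEP] [{section}] {text}"
    return f"[{section}] {text}"

print(format_patent_input("abstract", "A thermostat system comprising...",
                          title="Smart thermostat system"))
# Smart thermostat system [SEP] [abstract] A thermostat system comprising...
print(format_patent_input("claim", "A method comprising: step 1, step 2..."))
# [claim] A method comprising: step 1, step 2...
```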
### Code Example
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
model_name = "ZoeYou/PatentMap-V0-SecPair-Drawing"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Format patent text
title = "Smart thermostat system"
abstract = "A thermostat system comprising a temperature sensor..."
patent_text = f"{title} [SEP] [abstract] {abstract}"

# Encode and get embeddings
inputs = tokenizer(patent_text, return_tensors="pt", padding=True,
                   truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state[:, 0, :]  # [CLS] token
print(embeddings.shape)  # torch.Size([1, 1024])
```
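For retrieval or similarity search, embeddings produced as above can be compared with cosine similarity. A minimal sketch in plain PyTorch; the random tensors below are placeholders standing in for real [CLS] embeddings of the model's 1024-dimensional output:

```python
import torch
import torch.nn.functional as F

def cosine_similarity(a, b):
    """Pairwise cosine similarity between two batches of embeddings."""
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T

# Placeholder embeddings with the model's 1024-dim output size;
# in practice these would come from the encoding snippet above.
query = torch.randn(1, 1024)
corpus = torch.randn(5, 1024)

scores = cosine_similarity(query, corpus)   # shape (1, 5)
best_match = scores.argmax(dim=-1).item()   # index of the most similar patent
```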
## Evaluation
This model has been evaluated on multiple patent-specific tasks:
- **IPC Classification** (linear probe and KNN)
- **Prior Art Search** (recall@k, nDCG@k)
- **Embedding Quality Metrics** (uniformity, alignment, topology)
For detailed evaluation results, see the [PatentMap paper](https://arxiv.org/abs/2511.10657).
## Intended Use
This model is designed for:
- Patent document retrieval
- Patent similarity search
- Prior art discovery
- IPC classification
- Patent landscape analysis
## Citation
If you use this model, please cite:
```bibtex
@article{zuo2025patent,
  title={Patent Representation Learning via Self-supervision},
  author={Zuo, You and Gerdes, Kim and de La Clergerie, Eric Villemonte and Sagot, Beno{\^i}t},
  journal={arXiv preprint arXiv:2511.10657},
  year={2025}
}
```
## Model Collection
This model is part of the PatentMap V0 collection. For an overview of all models, see [PatentMap-V0](https://huggingface.co/ZoeYou/patentmapv0-models).
## License
This model is released under the CC BY-NC 4.0 license (non-commercial use only).
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/ZoeYou/patentmapv0) or contact the authors.