---
license: cc-by-nc-4.0
language:
- en
base_model:
- anferico/bert-for-patents
tags:
- patent
- embeddings
- contrastive-learning
- information-retrieval
pipeline_tag: feature-extraction
---

# PatentMap-V0-SecPair-Drawing

**PatentMap-V0-SecPair-Drawing** is a patent embedding model trained on abstract + drawing sections with section-pair augmentation. It is part of the PatentMap V0 model collection.

## Model Details

- **Base Model:** [anferico/bert-for-patents](https://huggingface.co/anferico/bert-for-patents)
- **Training Objective:** Contrastive learning (InfoNCE loss)
- **Architecture:** BERT-large (340M parameters)
- **Embedding Dimension:** 1024
- **Max Sequence Length:** 512 tokens
- **Vocabulary Size:** 39860
- **Training Data:** USPTO patent applications (2010-2018) from [HUPD corpus](https://huggingface.co/datasets/HUPD/hupd)

### Training Configuration

- **Patent Sections Used:** abstract + drawing
- **Data Augmentation:** dropout + section_pair
- **Batch Size:** 512
- **Learning Rate:** 1e-5

### Special Tokens

This model includes additional patent-specific special tokens:

- `[drawing]`

## Usage

### Input Format

This model expects patent text formatted with special tokens:

- **For abstract**: `Title [SEP] [abstract] Abstract text`
- **For other sections**: `[section] Section text` (no title prefix)

Example:

```python
# Abstract with title
text = "Smart thermostat system [SEP] [abstract] A thermostat system comprising..."

# Claim without title
text = "[claim] A method comprising: step 1, step 2..."
```

### Code Example

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
model_name = "ZoeYou/PatentMap-V0-SecPair-Drawing"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Format patent text
title = "Smart thermostat system"
abstract = "A thermostat system comprising a temperature sensor..."
patent_text = f"{title} [SEP] [abstract] {abstract}"

# Encode and get embeddings
inputs = tokenizer(patent_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state[:, 0, :]  # CLS token
print(embeddings.shape)  # torch.Size([1, 1024])
```

## Evaluation

This model has been evaluated on multiple patent-specific tasks:

- **IPC Classification** (linear probe and KNN)
- **Prior Art Search** (recall@k, nDCG@k)
- **Embedding Quality Metrics** (uniformity, alignment, topology)

For detailed evaluation results, see the [PatentMap paper](https://arxiv.org/abs/2511.10657).

## Intended Use

This model is designed for:

- Patent document retrieval
- Patent similarity search
- Prior art discovery
- IPC classification
- Patent landscape analysis

## Citation

If you use this model, please cite:

```bibtex
@article{zuo2025patent,
  title={Patent Representation Learning via Self-supervision},
  author={Zuo, You and Gerdes, Kim and de La Clergerie, Eric Villemonte and Sagot, Beno{\^i}t},
  journal={arXiv preprint arXiv:2511.10657},
  year={2025}
}
```

## Model Collection

This model is part of the PatentMap V0 collection. For an overview of all models, see [PatentMap-V0](https://huggingface.co/ZoeYou/patentmapv0-models).

## License

This model is released under the CC BY-NC 4.0 license (non-commercial use only).

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/ZoeYou/patentmapv0) or contact the authors.
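
## Input Formatting Sketch

The input format described under Usage (title prefix for abstracts only, a bracketed section tag for everything else) can be wrapped in a small helper. The following is a minimal sketch, not part of the released code: the function name `format_patent_input` and its parameters are hypothetical, and it assumes that every section tag follows the `[section]` pattern shown in the examples, such as `[abstract]` or `[drawing]`.

```python
from typing import Optional


def format_patent_input(section: str, text: str, title: Optional[str] = None) -> str:
    """Build an input string in the format described in this model card.

    Hypothetical helper for illustration; not part of the released code.
    """
    # Every section is prefixed with its bracketed tag, e.g. "[drawing] ...".
    tagged = f"[{section}] {text.strip()}"
    # Per the model card, only the abstract carries the "Title [SEP]" prefix.
    if section == "abstract" and title:
        return f"{title.strip()} [SEP] {tagged}"
    return tagged


abstract_input = format_patent_input(
    "abstract",
    "A thermostat system comprising...",
    title="Smart thermostat system",
)
drawing_input = format_patent_input(
    "drawing",
    "FIG. 1 shows a block diagram of the system.",
)

print(abstract_input)  # Smart thermostat system [SEP] [abstract] A thermostat system comprising...
print(drawing_input)   # [drawing] FIG. 1 shows a block diagram of the system.
```

The resulting strings can be passed to the tokenizer in the same way as `patent_text` in the Code Example section. A title supplied for a non-abstract section is ignored, which mirrors the "no title prefix" rule stated under Input Format.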