Feature Extraction
Transformers
Safetensors
English
bert
contrastive-learning
embeddings
political-science
social-groups
clustering
text-embeddings-inference
Instructions to use maxwlnd/cl_mention_embedding with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use maxwlnd/cl_mention_embedding with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("feature-extraction", model="maxwlnd/cl_mention_embedding")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("maxwlnd/cl_mention_embedding", dtype="auto") - Notebooks
- Google Colab
- Kaggle
File size: 4,852 Bytes
d1ed19e | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 | ---
language:
- en
license: mit
tags:
- bert
- feature-extraction
- contrastive-learning
- embeddings
- political-science
- social-groups
- clustering
base_model: google-bert/bert-base-uncased
pipeline_tag: feature-extraction
library_name: transformers
---
# Contrastive Learning Mention Embedding
A BERT-base model with a linear projection head fine-tuned via contrastive learning to produce embeddings that maximize separability between mentions of different social groups. Designed for clustering social group mentions into qualitative categories.
This model is part of the [`group-appeal-detector`](https://github.com/MaximilianWeiland/group_appeal_detector) package, which also provides group mention detection and stance classification.
## Model Details
- **Base model:** `bert-base-uncased`
- **Architecture:** BERT-base + linear projection head (768 → 128 dimensions)
- **Training objective:** Triplet loss with hard negative mining
- **Training data:** Social group dictionary provided by [Will Horne, Alona O. Dolinsky and Lena Maria Huber](https://osf.io/preprints/osf/fp2h3_v3)
## How It Works
Each mention is fed into the model using the following prompt template:
```
Social group of {mention} is: [MASK].
```
The hidden state at the `[MASK]` position is extracted, passed through the projection layer, and L2-normalized. Mentions of the same social group category are pulled together in embedding space; mentions of different categories are pushed apart.
The model was trained using the triplet loss. Each anchor is a term from a category in the social group dictionary, paired with a randomly sampled positive from the same category and a hard negative mined from a different category.
## Usage
### Via `group-appeal-detector` package (recommended)
```bash
pip install group-appeal-detector
```
```python
from group_appeal_detector import GroupAppealDetector, GroupMentionClusterer
detector = GroupAppealDetector(device="cpu")
# collect mentions from a corpus
texts = [...]
all_mentions = detector.detect_mentions_batch(texts, batch_size=16, as_df=False)
mentions = [m["span"] for mentions in all_mentions for m in mentions]
# cluster into categories
clusterer = GroupMentionClusterer(mentions, device="cpu")
results_df = clusterer.cluster(n_clusters=5, as_df=True)
results_df.head()
```
To find the optimal number of clusters automatically:
```python
best_k, all_scores = clusterer.find_optimal_k(k_range=(2, 20), metric="silhouette", visualize=True)
results_df = clusterer.cluster(n_clusters=best_k, as_df=True)
```
### Direct usage
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoConfig, AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
REPO_ID = "maxwlnd/cl_mention_embedding"
class ModelMask(nn.Module):
def __init__(self, tokenizer, pretrained_model_name="bert-base-uncased", proj_dim=128):
super().__init__()
config = AutoConfig.from_pretrained(pretrained_model_name)
self.encoder = AutoModel.from_config(config)
self.mask_id = tokenizer.mask_token_id
self.projector = nn.Sequential(nn.Linear(config.hidden_size, proj_dim))
def encode(self, input_ids, attention_mask):
outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
mask_positions = (input_ids == self.mask_id)
h = torch.stack([
outputs.last_hidden_state[i][mask_positions[i]].mean(dim=0)
for i in range(input_ids.size(0))
])
z = self.projector(h)
return F.normalize(z, p=2, dim=1)
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = ModelMask(tokenizer)
model.load_state_dict(load_file(hf_hub_download(REPO_ID, "model.safetensors")))
model.eval()
def embed(mention: str) -> torch.Tensor:
prompt = f"Social group of {mention} is: {tokenizer.mask_token}."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
return model.encode(inputs["input_ids"], inputs["attention_mask"])
emb_a = embed("farmers")
emb_b = embed("agricultural workers")
print(F.cosine_similarity(emb_a, emb_b).item())
```
## Related Models
This model is one of three models in the group appeal detection pipeline:
| Model | Task |
|---|---|
| [`maxwlnd/roberta_group_mention_detector`](https://huggingface.co/maxwlnd/roberta_group_mention_detector) | Detect social group mentions |
| [`maxwlnd/socialgroup_stance_classification_nli`](https://huggingface.co/maxwlnd/socialgroup_stance_classification_nli) | Classify stance toward a group as positive, negative, or neutral |
| [`maxwlnd/cl_mention_embedding`](https://huggingface.co/maxwlnd/cl_mention_embedding) | Embed mentions for clustering into qualitative categories (this model) |
## License
MIT
|