---
language:
- en
license: mit
tags:
- bert
- feature-extraction
- contrastive-learning
- embeddings
- political-science
- social-groups
- clustering
base_model: google-bert/bert-base-uncased
pipeline_tag: feature-extraction
library_name: transformers
---
# Contrastive Learning Mention Embedding

A BERT-base model with a linear projection head, fine-tuned via contrastive learning to produce embeddings that maximize separability between mentions of different social groups. Designed for clustering social group mentions into qualitative categories.

This model is part of the [`group-appeal-detector`](https://github.com/MaximilianWeiland/group_appeal_detector) package, which also provides group mention detection and stance classification.
## Model Details

- **Base model:** `bert-base-uncased`
- **Architecture:** BERT-base + linear projection head (768 → 128 dimensions)
- **Training objective:** Triplet loss with hard negative mining
- **Training data:** Social group dictionary provided by [Will Horne, Alona O. Dolinsky and Lena Maria Huber](https://osf.io/preprints/osf/fp2h3_v3)
## How It Works

Each mention is fed into the model using the following prompt template:

```
Social group of {mention} is: [MASK].
```
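For example, the mention "farmers" yields the input `Social group of farmers is: [MASK].`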
The hidden state at the `[MASK]` position is extracted, passed through the projection layer, and L2-normalized. Mentions of the same social group category are pulled together in embedding space; mentions of different categories are pushed apart.
The model was trained with a triplet loss. Each anchor is a term from a category in the social group dictionary, paired with a randomly sampled positive from the same category and a hard negative mined from a different category, as sketched below.
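The following is a minimal PyTorch sketch of such an objective, not the package's actual training code; the margin value and the batch-level mining strategy are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    # All inputs are L2-normalized embeddings of shape (batch, dim).
    d_pos = (anchor - positive).pow(2).sum(dim=1)  # anchor-positive distance
    d_neg = (anchor - negative).pow(2).sum(dim=1)  # anchor-negative distance
    # Penalize triplets where the negative is not at least `margin`
    # farther from the anchor than the positive.
    return F.relu(d_pos - d_neg + margin).mean()

def mine_hard_negatives(anchor_emb, candidate_emb):
    # Hard negative mining: for each anchor, pick the closest candidate,
    # where candidates are drawn from categories other than the anchor's.
    dists = torch.cdist(anchor_emb, candidate_emb)  # (n_anchors, n_candidates)
    return candidate_emb[dists.argmin(dim=1)]
```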
## Usage

### Via `group-appeal-detector` package (recommended)

```bash
pip install group-appeal-detector
```

```python
from group_appeal_detector import GroupAppealDetector, GroupMentionClusterer

detector = GroupAppealDetector(device="cpu")

# Collect group mentions from a corpus
texts = [...]
all_mentions = detector.detect_mentions_batch(texts, batch_size=16, as_df=False)
mentions = [m["span"] for doc_mentions in all_mentions for m in doc_mentions]

# Cluster the mentions into qualitative categories
clusterer = GroupMentionClusterer(mentions, device="cpu")
results_df = clusterer.cluster(n_clusters=5, as_df=True)
results_df.head()
```
To find the optimal number of clusters automatically:

```python
best_k, all_scores = clusterer.find_optimal_k(k_range=(2, 20), metric="silhouette", visualize=True)
results_df = clusterer.cluster(n_clusters=best_k, as_df=True)
```
### Direct usage

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoConfig, AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO_ID = "maxwlnd/cl_mention_embedding"

class ModelMask(nn.Module):
    def __init__(self, tokenizer, pretrained_model_name="bert-base-uncased", proj_dim=128):
        super().__init__()
        config = AutoConfig.from_pretrained(pretrained_model_name)
        self.encoder = AutoModel.from_config(config)
        self.mask_id = tokenizer.mask_token_id
        self.projector = nn.Sequential(nn.Linear(config.hidden_size, proj_dim))

    def encode(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Average the hidden states at the [MASK] position(s) of each sequence
        mask_positions = (input_ids == self.mask_id)
        h = torch.stack([
            outputs.last_hidden_state[i][mask_positions[i]].mean(dim=0)
            for i in range(input_ids.size(0))
        ])
        # Project to 128 dimensions and L2-normalize
        z = self.projector(h)
        return F.normalize(z, p=2, dim=1)

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = ModelMask(tokenizer)
model.load_state_dict(load_file(hf_hub_download(REPO_ID, "model.safetensors")))
model.eval()

def embed(mention: str) -> torch.Tensor:
    prompt = f"Social group of {mention} is: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        return model.encode(inputs["input_ids"], inputs["attention_mask"])

emb_a = embed("farmers")
emb_b = embed("agricultural workers")
print(F.cosine_similarity(emb_a, emb_b).item())
```
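Because the embeddings are L2-normalized, they can be fed to any standard clustering algorithm. Below is a minimal sketch with scikit-learn's `KMeans`, reusing the `embed` function above; scikit-learn is an extra dependency here, and the mentions and cluster count are illustrative.

```python
import torch
from sklearn.cluster import KMeans

mentions = ["farmers", "agricultural workers", "nurses", "doctors"]

# Stack the (1, 128) embeddings into a single (N, 128) matrix.
embeddings = torch.cat([embed(m) for m in mentions]).numpy()

# On unit-length vectors, Euclidean k-means behaves like clustering
# by cosine similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
for mention, label in zip(mentions, kmeans.labels_):
    print(label, mention)
```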
## Related Models

This model is one of three models in the group appeal detection pipeline:

| Model | Task |
|---|---|
| [`maxwlnd/roberta_group_mention_detector`](https://huggingface.co/maxwlnd/roberta_group_mention_detector) | Detect social group mentions |
| [`maxwlnd/socialgroup_stance_classification_nli`](https://huggingface.co/maxwlnd/socialgroup_stance_classification_nli) | Classify stance toward a group as positive, negative, or neutral |
| [`maxwlnd/cl_mention_embedding`](https://huggingface.co/maxwlnd/cl_mention_embedding) | Embed mentions for clustering into qualitative categories (this model) |
## License

MIT