rafidka committed on
Commit 6f59113 · verified · 1 Parent(s): 55dee25

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +35 -0
README.md ADDED
@@ -0,0 +1,35 @@
+ # Vectra
+
+ BERT-base sentence embeddings trained with in-batch contrastive learning (Multiple Negatives Ranking Loss) on MultiNLI entailment pairs.
+
+ ## Model
+
+ - Base: `bert-base-uncased`
+ - Pooling: mean pooling over token embeddings (masked)
+ - Normalization: L2
+ - Objective: MNRL / InfoNCE-style softmax with temperature 0.05
+ - Training data: MultiNLI entailment pairs (subset)
+
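The objective listed above can be made concrete with a small, dependency-free sketch. It assumes the common one-directional form of the loss (each anchor is scored against every positive in the batch, and the matching positive is the target); the README does not say whether training used this form or a symmetric variant. The function name `mnrl_loss` and the toy vectors are illustrative only and are not taken from the training code.

```python
import math

def mnrl_loss(anchors, positives, temperature=0.05):
    """In-batch softmax contrastive loss over L2-normalized vectors.

    anchors[i] and positives[i] form a matching pair; every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        # Cosine similarity (dot product of unit vectors) divided by the temperature.
        logits = [
            sum(a * p for a, p in zip(anchors[i], positives[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with index i (the true pair) as the target class.
        total += log_z - logits[i]
    return total / n

# Toy batch of two unit vectors (hypothetical data).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = mnrl_loss(anchors, anchors)         # pairs match: loss near 0
shuffled_loss = mnrl_loss(anchors, anchors[::-1])  # pairs swapped: loss near 20
```

With a temperature of 0.05 the similarities are scaled by 20, so a fully mismatched pair of orthogonal unit vectors costs about 20 nats while a perfectly matched pair costs almost nothing.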
+ ## Usage (embeddings)
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
+
+ def mean_pooling(last_hidden_state, attention_mask):
+     mask = attention_mask.unsqueeze(-1).to(dtype=last_hidden_state.dtype)  # (batch, seq, 1)
+     summed = (last_hidden_state * mask).sum(dim=1)  # padding tokens contribute zero
+     counts = mask.sum(dim=1).clamp(min=1e-6)  # guard against division by zero
+     return summed / counts
+
+ @torch.no_grad()
+ def embed_texts(texts, model_id="rafidka/vectra", max_length=128, device="cuda"):
+     tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
+     model = AutoModel.from_pretrained(model_id, add_pooling_layer=False).to(device).eval()
+     batch = tok(texts, padding="max_length", truncation=True, max_length=max_length, return_tensors="pt").to(device)
+     out = model(**batch)
+     emb = mean_pooling(out.last_hidden_state, batch["attention_mask"])
+     emb = F.normalize(emb, p=2, dim=-1)  # unit-length rows
+     return emb
+ ```
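To make the pooling and normalization steps in the snippet above easier to check by hand, here is a plain-Python sketch on a single toy sequence. The helper names (`masked_mean`, `l2_normalize`, `dot`) and the numbers are hypothetical and exist only for illustration; the tensor code above is what operates on real model output.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            summed = [s + v for s, v in zip(summed, vec)]
            count += 1
    # Avoid dividing by zero for an all-padding row (the tensor version clamps instead).
    count = max(count, 1)
    return [s / count for s in summed]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Three token vectors; the last one is padding and must be ignored.
tokens = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)   # mean of the first two vectors only
emb = l2_normalize(pooled)           # unit-length embedding
self_similarity = dot(emb, emb)      # cosine similarity of a unit vector with itself
```

Because the embeddings are L2-normalized, the dot product of two embeddings equals their cosine similarity, so similarity search needs no extra normalization step.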