---
license: mit
---
# Cosine-Embed

Cosine-Embed is a PyTorch sentence embedding model trained to place similar texts close together in an embedding space. The model outputs L2-normalized vectors so cosine similarity is computed as a dot product.

## What it produces
- Input: tokenized text (`input_ids`, `attention_mask`)
- Output: an embedding vector of size `hidden_dim` with L2 normalization
- Cosine similarity: `cos(a, b) = embedding(a) · embedding(b)`
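Because the outputs are L2-normalized, the dot product of two embeddings is exactly their cosine similarity. A minimal pure-Python sketch of that identity (illustrative only; the model itself works on tensors):

```python
import math

def l2_normalize(v):
    # Divide by the Euclidean norm so the vector has unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # For unit vectors, the dot product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([1.0, 2.0, 2.0])
b = l2_normalize([2.0, 1.0, 2.0])
print(cosine(a, a))  # 1.0 (identical direction)
print(cosine(a, b))  # ≈ 0.889
```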

## Model details
- Transformer blocks (custom implementation using RMSNorm, RoPE positional encoding, and SwiGLU feed-forward)
- Masked mean pooling over token embeddings
- Final L2 normalization
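Masked mean pooling averages only the token embeddings at positions where the attention mask is 1, so padding tokens do not dilute the sentence vector. A pure-Python sketch of the idea (the actual model does this on batched tensors):

```python
def masked_mean_pool(token_embeddings, attention_mask):
    # token_embeddings: one vector per token; attention_mask: 1 for real
    # tokens, 0 for padding. Average only the unmasked vectors.
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m:
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total]

tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]  # last position is padding
print(masked_mean_pool(tokens, [1, 1, 0]))  # [2.0, 4.0]
```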

## Default configuration
These parameters are used in `Notebooks/Training.ipynb`:
- `vocab_size`: 30522
- `seq_len`: 128
- `hidden_dim`: 512
- `n_heads`: 8
- `n_layer`: 3
- `ff_dim`: 2048
- `eps`: 1e-5
- `dropout`: 0.1

## Training objective
The model is trained with triplet loss on cosine similarity:

`loss = max(0, sim(anchor, negative) - sim(anchor, positive) + margin)`
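In words: the loss is zero once the positive is at least `margin` more similar to the anchor than the negative. A scalar sketch of the hinge (the `margin` value of 0.2 here is an arbitrary placeholder, not the value used in training):

```python
def triplet_loss(sim_pos, sim_neg, margin=0.2):
    # Hinge on cosine similarities: penalize only when the negative is not
    # at least `margin` less similar to the anchor than the positive.
    return max(0.0, sim_neg - sim_pos + margin)

print(triplet_loss(0.9, 0.3))   # 0.0 (well separated, no gradient)
print(triplet_loss(0.6, 0.55))  # ≈ 0.15 (violates the margin)
```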

## Checkpoints
- `checkpoints/checkpoint.pt`: training checkpoint (model, optimizer, losses, and configs)
- `checkpoints/model.safetensors`: weights-only export for inference

## Minimal inference
```python
import torch
from transformers import AutoTokenizer
from safetensors.torch import load_file

from Architecture import EmbeddingModel, ModelConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

state_dict = load_file("checkpoints/model.safetensors")

cfg = ModelConfig(
    vocab_size=30522,
    seq_len=128,
    hidden_dim=512,
    n_heads=8,
    n_layer=3,
    eps=1e-5,
    ff_dim=2048,
    dropout=0.1,
)

model = EmbeddingModel(cfg).to(device)
model.load_state_dict(state_dict)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        return model(enc["input_ids"], enc["attention_mask"])  # normalized

def cosine_similarity(a, b):
    # Embeddings are L2-normalized, so the dot product is cosine similarity.
    ea = embed([a])[0]
    eb = embed([b])[0]
    return (ea * eb).sum().item()
```

## Notes
- Use the same tokenizer (`bert-base-uncased`) and the same `max_length=128` (or keep `seq_len` and preprocessing consistent).