---
license: mit
---
# Cosine-Embed

Cosine-Embed is a PyTorch sentence embedding model trained to place similar texts close together in an embedding space. The model outputs L2-normalized vectors, so cosine similarity reduces to a plain dot product.

## What it produces
- Input: tokenized text (`input_ids`, `attention_mask`)
- Output: an L2-normalized embedding vector of size `hidden_dim`
- Cosine similarity: `cos(a, b) = embedding(a) · embedding(b)`

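Because the outputs are unit-length, the dot product of two embeddings equals their cosine similarity. A minimal sanity check with plain tensors (arbitrary toy vectors, not model outputs):

```python
import torch
import torch.nn.functional as F

# Two arbitrary vectors standing in for raw (pre-normalization) embeddings.
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([-2.0, 0.5, 1.0])

# L2-normalize, as the model does in its final step.
a_n = F.normalize(a, p=2, dim=0)
b_n = F.normalize(b, p=2, dim=0)

dot = (a_n * b_n).sum()                 # dot product of unit vectors
cos = F.cosine_similarity(a, b, dim=0)  # reference cosine similarity

assert torch.allclose(dot, cos, atol=1e-6)
```
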
## Model details
- Transformer blocks (custom implementation with RMSNorm, RoPE positional encoding, and SwiGLU feed-forward layers)
- Masked mean pooling over token embeddings
- Final L2 normalization

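The last two steps can be sketched as follows. This is an illustrative implementation with toy tensor shapes, assuming a `(batch, seq_len, hidden_dim)` layout; the repository's `Architecture` module is the authoritative version.

```python
import torch
import torch.nn.functional as F

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden_dim)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # (batch, hidden_dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)       # avoid division by zero
    return summed / counts

# Toy example: batch of 2, seq_len 4, hidden_dim 3.
tokens = torch.randn(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])

pooled = masked_mean_pool(tokens, mask)
embeddings = F.normalize(pooled, p=2, dim=1)  # final L2 normalization

print(embeddings.norm(dim=1))  # each row has unit length
```

Masking before averaging matters: without it, padding positions would dilute the sentence representation for short inputs in a padded batch.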
## Default configuration
These parameters are used in `Notebooks/Training.ipynb`:
- `vocab_size`: 30522
- `seq_len`: 128
- `hidden_dim`: 512
- `n_heads`: 8
- `n_layer`: 3
- `ff_dim`: 2048
- `eps`: 1e-5
- `dropout`: 0.1

## Training objective
The model is trained with a triplet loss over cosine similarity:

`loss = max(0, sim(anchor, negative) - sim(anchor, positive) + margin)`

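The objective above can be sketched with normalized toy tensors. The `margin` value here is illustrative, not the repository's training setting; the real training loop lives in `Notebooks/Training.ipynb`.

```python
import torch
import torch.nn.functional as F

def triplet_cosine_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss over cosine similarity of L2-normalized embeddings.

    All inputs: (batch, hidden_dim), already unit-length, so the
    row-wise dot product is the cosine similarity.
    """
    sim_pos = (anchor * positive).sum(dim=1)
    sim_neg = (anchor * negative).sum(dim=1)
    return F.relu(sim_neg - sim_pos + margin).mean()

# Toy batch: normalize random vectors to mimic model outputs.
a = F.normalize(torch.randn(4, 8), dim=1)
p = F.normalize(torch.randn(4, 8), dim=1)
n = F.normalize(torch.randn(4, 8), dim=1)

loss = triplet_cosine_loss(a, p, n)
print(loss)  # non-negative scalar tensor
```

The loss is zero whenever the anchor is at least `margin` more similar to the positive than to the negative, so training focuses on triplets that violate that ranking.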
## Checkpoints
- `checkpoints/checkpoint.pt`: full training checkpoint (model, optimizer, losses, and configs)
- `checkpoints/model.safetensors`: weights-only export for inference

## Minimal inference
```python
import torch
from transformers import AutoTokenizer
from safetensors.torch import load_file

from Architecture import EmbeddingModel, ModelConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

state_dict = load_file("checkpoints/model.safetensors")

cfg = ModelConfig(
    vocab_size=30522,
    seq_len=128,
    hidden_dim=512,
    n_heads=8,
    n_layer=3,
    eps=1e-5,
    ff_dim=2048,
    dropout=0.1,
)

model = EmbeddingModel(cfg).to(device)
model.load_state_dict(state_dict)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        return model(enc["input_ids"], enc["attention_mask"])  # already L2-normalized

def cosine_similarity(a, b):
    # Embeddings are unit-length, so the dot product is the cosine similarity.
    ea, eb = embed([a, b])
    return float((ea * eb).sum())
```

## Notes
- At inference, use the same tokenizer (`bert-base-uncased`) and the same `max_length=128` as during training, or otherwise keep `seq_len` and preprocessing consistent.