---
license: mit
---

# Cosine-Embed

Cosine-Embed is a PyTorch sentence embedding model trained to place similar texts close together in an embedding space. The model outputs L2-normalized vectors so cosine similarity is computed as a dot product.

## What it produces

- Input: tokenized text (`input_ids`, `attention_mask`)
- Output: an embedding vector of size `hidden_dim` with L2 normalization
- Cosine similarity: `cos(a, b) = embedding(a) · embedding(b)`
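Because the output vectors are unit-length, the dot product and the cosine similarity coincide. A quick standalone sketch of that equivalence, using random vectors rather than the model:

```python
import torch
import torch.nn.functional as F

a = torch.randn(512)
b = torch.randn(512)

# L2-normalize both vectors to unit length.
a_n = F.normalize(a, p=2, dim=0)
b_n = F.normalize(b, p=2, dim=0)

# Dot product of the normalized vectors...
dot = (a_n * b_n).sum()

# ...equals the cosine similarity of the original vectors.
cos = F.cosine_similarity(a, b, dim=0)
assert torch.allclose(dot, cos, atol=1e-6)
```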

## Model details

- Transformer blocks (custom implementation using RMSNorm, RoPE positional encoding, and SwiGLU feed-forward)
- Masked mean pooling over token embeddings
- Final L2 normalization
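The pooling step can be sketched as follows. This is a minimal standalone version of masked mean pooling followed by L2 normalization, not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

def masked_mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, hidden_dim)
    # attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)     # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens
    return summed / counts

tokens = torch.randn(2, 128, 512)
mask = torch.ones(2, 128, dtype=torch.long)
mask[1, 64:] = 0  # second sequence is padded after 64 tokens

pooled = masked_mean_pool(tokens, mask)          # (2, 512)
embeddings = F.normalize(pooled, p=2, dim=-1)    # unit-length sentence vectors
```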

## Default configuration

These parameters are used in `Notebooks/Training.ipynb`:

- `vocab_size`: 30522
- `seq_len`: 128
- `hidden_dim`: 512
- `n_heads`: 8
- `n_layer`: 3
- `ff_dim`: 2048
- `eps`: 1e-5
- `dropout`: 0.1

## Training objective

The model is trained with triplet loss on cosine similarity:

```
loss = max(0, sim(anchor, negative) - sim(anchor, positive) + margin)
```
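A minimal implementation sketch of this objective. The margin value of 0.2 is an illustrative assumption, not taken from the training notebook:

```python
import torch
import torch.nn.functional as F

def triplet_cosine_loss(anchor, positive, negative, margin=0.2):
    # Inputs assumed L2-normalized, so cosine similarity is a dot product.
    sim_pos = (anchor * positive).sum(dim=-1)
    sim_neg = (anchor * negative).sum(dim=-1)
    # Penalize when the negative is not at least `margin` less similar
    # than the positive; max(0, ...) is relu.
    return F.relu(sim_neg - sim_pos + margin).mean()

a = F.normalize(torch.tensor([[1.0, 0.0]]), dim=-1)
p = a.clone()                                       # identical positive
n = F.normalize(torch.tensor([[0.0, 1.0]]), dim=-1)  # orthogonal negative
loss = triplet_cosine_loss(a, p, n)  # sim_pos=1, sim_neg=0 -> loss 0
```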

## Checkpoints

- `checkpoints/checkpoint.pt`: training checkpoint (model, optimizer, losses, and configs)
- `checkpoints/model.safetensors`: weights-only export for inference

## Minimal inference

```python
import torch
from transformers import AutoTokenizer
from safetensors.torch import load_file

from Architecture import EmbeddingModel, ModelConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

state_dict = load_file("checkpoints/model.safetensors")

cfg = ModelConfig(
    vocab_size=30522,
    seq_len=128,
    hidden_dim=512,
    n_heads=8,
    n_layer=3,
    eps=1e-5,
    ff_dim=2048,
    dropout=0.1,
)

model = EmbeddingModel(cfg).to(device)
model.load_state_dict(state_dict)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        return model(enc["input_ids"], enc["attention_mask"])  # L2-normalized embeddings

def cosine_similarity(a, b):
    ea = embed([a])[0]
    eb = embed([b])[0]
    return (ea * eb).sum().item()  # dot product of unit vectors = cosine similarity
```

## Notes

- At inference, use the same tokenizer (`bert-base-uncased`) and the same `max_length=128` as in training (or otherwise keep `seq_len` and preprocessing consistent).