---
license: mit
---

# Cosine-Embed

Cosine-Embed is a PyTorch sentence embedding model trained to place similar texts close together in an embedding space. The model outputs L2-normalized vectors so cosine similarity is computed as a dot product.

## What it produces

- Input: tokenized text (`input_ids`, `attention_mask`)
- Output: an embedding vector of size `hidden_dim` with L2 normalization
- Cosine similarity: `cos(a, b) = embedding(a) · embedding(b)`
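Because the output vectors are unit-length, the dot product and the cosine similarity coincide. A quick standalone sketch of that equivalence, using random vectors rather than the model:

```python
import torch
import torch.nn.functional as F

a = torch.randn(512)
b = torch.randn(512)

# L2-normalize both vectors to unit length.
a_n = F.normalize(a, p=2, dim=0)
b_n = F.normalize(b, p=2, dim=0)

# Dot product of the normalized vectors...
dot = (a_n * b_n).sum()

# ...equals the cosine similarity of the original vectors.
cos = F.cosine_similarity(a, b, dim=0)
assert torch.allclose(dot, cos, atol=1e-6)
```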

## Model details

- Transformer blocks (custom implementation using RMSNorm, RoPE positional encoding, and SwiGLU feed-forward)
- Masked mean pooling over token embeddings
- Final L2 normalization
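The pooling step can be sketched as follows. This is a minimal standalone version of masked mean pooling followed by L2 normalization, not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

def masked_mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, hidden_dim)
    # attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)     # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens
    return summed / counts

tokens = torch.randn(2, 128, 512)
mask = torch.ones(2, 128, dtype=torch.long)
mask[1, 64:] = 0  # second sequence is padded after 64 tokens

pooled = masked_mean_pool(tokens, mask)          # (2, 512)
embeddings = F.normalize(pooled, p=2, dim=-1)    # unit-length sentence vectors
```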

## Default configuration

These parameters are used in `Notebooks/Training.ipynb`:

- `vocab_size`: 30522
- `seq_len`: 128
- `hidden_dim`: 512
- `n_heads`: 8
- `n_layer`: 3
- `ff_dim`: 2048
- `eps`: 1e-5
- `dropout`: 0.1

## Training objective

The model is trained with triplet loss on cosine similarity:

```
loss = max(0, sim(anchor, negative) - sim(anchor, positive) + margin)
```
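A minimal implementation sketch of this objective. The margin value of 0.2 is an illustrative assumption, not taken from the training notebook:

```python
import torch
import torch.nn.functional as F

def triplet_cosine_loss(anchor, positive, negative, margin=0.2):
    # Inputs assumed L2-normalized, so cosine similarity is a dot product.
    sim_pos = (anchor * positive).sum(dim=-1)
    sim_neg = (anchor * negative).sum(dim=-1)
    # Penalize when the negative is not at least `margin` less similar
    # than the positive; max(0, ...) is relu.
    return F.relu(sim_neg - sim_pos + margin).mean()

a = F.normalize(torch.tensor([[1.0, 0.0]]), dim=-1)
p = a.clone()                                       # identical positive
n = F.normalize(torch.tensor([[0.0, 1.0]]), dim=-1)  # orthogonal negative
loss = triplet_cosine_loss(a, p, n)  # sim_pos=1, sim_neg=0 -> loss 0
```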

## Checkpoints

- `checkpoints/checkpoint.pt`: training checkpoint (model, optimizer, losses, and configs)
- `checkpoints/model.safetensors`: weights-only export for inference

## Minimal inference

```python
import torch
from transformers import AutoTokenizer
from safetensors.torch import load_file

from Architecture import EmbeddingModel, ModelConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

state_dict = load_file("checkpoints/model.safetensors")

cfg = ModelConfig(
    vocab_size=30522,
    seq_len=128,
    hidden_dim=512,
    n_heads=8,
    n_layer=3,
    eps=1e-5,
    ff_dim=2048,
    dropout=0.1,
)

model = EmbeddingModel(cfg).to(device)
model.load_state_dict(state_dict)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        return model(enc["input_ids"], enc["attention_mask"])  # L2-normalized embeddings

def cosine_similarity(a, b):
    ea = embed([a])[0]
    eb = embed([b])[0]
    return (ea * eb).sum().item()  # dot product of unit vectors = cosine similarity
```

## Notes

- At inference, use the same tokenizer (`bert-base-uncased`) and the same `max_length=128` as in training (or otherwise keep `seq_len` and preprocessing consistent).