---
license: mit
---

# Cosine-Embed

Cosine-Embed is a PyTorch sentence embedding model trained to place similar texts close together in an embedding space. The model outputs L2-normalized vectors, so cosine similarity can be computed as a plain dot product.

## What it produces

- Input: tokenized text (`input_ids`, `attention_mask`)
- Output: an embedding vector of size `hidden_dim` with L2 normalization
- Cosine similarity: `cos(a, b) = embedding(a) · embedding(b)`

## Model details

- Transformer blocks (custom implementation using RMSNorm, RoPE positional encoding, and SwiGLU feed-forward)
- Masked mean pooling over token embeddings
- Final L2 normalization

## Default configuration

These parameters are used in `Notebooks/Training.ipynb`:

- `vocab_size`: 30522
- `seq_len`: 128
- `hidden_dim`: 512
- `n_heads`: 8
- `n_layer`: 3
- `ff_dim`: 2048
- `eps`: 1e-5
- `dropout`: 0.1

## Training objective

The model is trained with a triplet loss on cosine similarity:

`loss = max(0, sim(anchor, negative) - sim(anchor, positive) + margin)`

## Checkpoints

- `checkpoints/checkpoint.pt`: training checkpoint (model, optimizer, losses, and configs)
- `checkpoints/model.safetensors`: weights-only export for inference

## Minimal inference

```python
import torch
from transformers import AutoTokenizer
from safetensors.torch import load_file

from Architecture import EmbeddingModel, ModelConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

state_dict = load_file("checkpoints/model.safetensors")

cfg = ModelConfig(
    vocab_size=30522,
    seq_len=128,
    hidden_dim=512,
    n_heads=8,
    n_layer=3,
    eps=1e-5,
    ff_dim=2048,
    dropout=0.1,
)

model = EmbeddingModel(cfg).to(device)
model.load_state_dict(state_dict)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        return model(enc["input_ids"], enc["attention_mask"])  # normalized

def cosine_similarity(a, b):
    ea = embed([a])[0]
    eb = embed([b])[0]
    return float((ea * eb).sum().item())
```

## Notes

- Use the same tokenizer (`bert-base-uncased`) and the same `max_length=128` (or keep `seq_len` and preprocessing consistent).
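The masked mean pooling and final L2 normalization described in the model details can be sketched as follows. This is a minimal illustration, not the repository's exact `Architecture` code; the function name and tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def masked_mean_pool(token_embeddings: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings while ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden_dim)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # sum of real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    pooled = summed / counts                         # masked mean
    return F.normalize(pooled, p=2, dim=-1)          # unit-length vectors
```

Because the output is unit-length, downstream code can compare embeddings with a dot product instead of an explicit cosine.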
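The triplet objective above translates directly into PyTorch. The following is a minimal sketch assuming batched, already-normalized embeddings; the margin value and batching are illustrative, not the notebook's exact training code:

```python
import torch

def cosine_triplet_loss(anchor: torch.Tensor,
                        positive: torch.Tensor,
                        negative: torch.Tensor,
                        margin: float = 0.2) -> torch.Tensor:
    """loss = max(0, sim(a, n) - sim(a, p) + margin), averaged over the batch.

    All inputs are L2-normalized embeddings of shape (batch, hidden_dim),
    so cosine similarity reduces to a row-wise dot product.
    """
    sim_pos = (anchor * positive).sum(dim=-1)   # cos(anchor, positive)
    sim_neg = (anchor * negative).sum(dim=-1)   # cos(anchor, negative)
    return torch.clamp(sim_neg - sim_pos + margin, min=0.0).mean()
```

The loss is zero whenever the positive is more similar to the anchor than the negative by at least `margin`, so well-separated triplets contribute no gradient.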
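The `cosine_similarity` helper in the inference snippet relies on the outputs being unit-length, which lets a dot product stand in for the full cosine formula. A quick sanity check of that equivalence, using random vectors in place of real model outputs:

```python
import torch
import torch.nn.functional as F

# Random vectors normalized the same way the model normalizes its outputs.
a = F.normalize(torch.randn(512), dim=-1)
b = F.normalize(torch.randn(512), dim=-1)

dot = (a * b).sum()
cos = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).squeeze()
assert torch.allclose(dot, cos, atol=1e-6)  # identical for unit vectors
```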