---
license: mit
---
# Cosine-Embed
Cosine-Embed is a PyTorch sentence embedding model trained to place similar texts close together in an embedding space. The model outputs L2-normalized vectors so cosine similarity is computed as a dot product.
## What it produces
- Input: tokenized text (`input_ids`, `attention_mask`)
- Output: an L2-normalized embedding vector of size `hidden_dim`
- Cosine similarity: `cos(a, b) = embedding(a) · embedding(b)`
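Because the outputs are unit vectors, the dot product equals cosine similarity. A quick sanity check with random vectors (the dimension 512 is only chosen to match `hidden_dim`):

```python
import torch
import torch.nn.functional as F

# Two random vectors, L2-normalized to unit length.
a = F.normalize(torch.randn(512), dim=-1)
b = F.normalize(torch.randn(512), dim=-1)

# For unit vectors, the dot product is exactly the cosine similarity.
dot = torch.dot(a, b)
ref = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).squeeze(0)
```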
## Model details
- Transformer blocks (custom implementation using RMSNorm, RoPE positional encoding, and SwiGLU feed-forward)
- Masked mean pooling over token embeddings
- Final L2 normalization
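The pooling and normalization steps above can be sketched as follows; `masked_mean_pool` is an illustrative helper name, not necessarily the function used in the repo:

```python
import torch
import torch.nn.functional as F

def masked_mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero on empty masks
    pooled = summed / counts
    # Final L2 normalization: each output row is a unit vector.
    return F.normalize(pooled, p=2, dim=-1)
```

Masking before averaging ensures padding tokens contribute nothing to the sentence embedding.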
## Default configuration
These parameters are used in `Notebooks/Training.ipynb`:
- `vocab_size`: 30522
- `seq_len`: 128
- `hidden_dim`: 512
- `n_heads`: 8
- `n_layer`: 3
- `ff_dim`: 2048
- `eps`: 1e-5
- `dropout`: 0.1
## Training objective
The model is trained with triplet loss on cosine similarity:
`loss = max(0, sim(anchor, negative) - sim(anchor, positive) + margin)`
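On L2-normalized embeddings this objective can be written directly with dot products; the default `margin=0.2` below is a placeholder, not necessarily the value used in the notebook:

```python
import torch

def triplet_cosine_loss(anchor, positive, negative, margin=0.2):
    # Inputs are assumed L2-normalized, so row-wise dot products are cosine similarities.
    sim_pos = (anchor * positive).sum(dim=-1)
    sim_neg = (anchor * negative).sum(dim=-1)
    # Hinge loss: penalize whenever the negative is not at least `margin` less similar.
    return torch.clamp(sim_neg - sim_pos + margin, min=0.0).mean()
```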
## Checkpoints
- `checkpoints/checkpoint.pt`: training checkpoint (model, optimizer, losses, and configs)
- `checkpoints/model.safetensors`: weights-only export for inference
## Minimal inference
```python
import torch
from transformers import AutoTokenizer
from safetensors.torch import load_file

from Architecture import EmbeddingModel, ModelConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the weights-only export.
state_dict = load_file("checkpoints/model.safetensors")

cfg = ModelConfig(
    vocab_size=30522,
    seq_len=128,
    hidden_dim=512,
    n_heads=8,
    n_layer=3,
    eps=1e-5,
    ff_dim=2048,
    dropout=0.1,
)
model = EmbeddingModel(cfg).to(device)
model.load_state_dict(state_dict)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        return model(enc["input_ids"], enc["attention_mask"])  # L2-normalized

def cosine_similarity(a, b):
    # Embeddings are unit vectors, so the dot product is the cosine similarity.
    ea = embed([a])[0]
    eb = embed([b])[0]
    return float((ea * eb).sum().item())
```
## Notes
- At inference, use the same tokenizer (`bert-base-uncased`) and `max_length=128` as in training, or otherwise keep `seq_len` and preprocessing consistent.