---
license: mit
---
# Cosine-Embed

Cosine-Embed is a PyTorch sentence embedding model trained to place similar texts close together in an embedding space. The model outputs L2-normalized vectors, so cosine similarity reduces to a plain dot product.

## What it produces
- Input: tokenized text (`input_ids`, `attention_mask`)
- Output: an L2-normalized embedding vector of size `hidden_dim`
- Cosine similarity: `cos(a, b) = embedding(a) · embedding(b)`

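Because the outputs are unit-length, the dot product of two embeddings equals their cosine similarity. A minimal sanity check with plain tensors (arbitrary toy vectors, not model outputs):

```python
import torch
import torch.nn.functional as F

# Two arbitrary vectors standing in for raw (pre-normalization) embeddings.
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([-2.0, 0.5, 1.0])

# L2-normalize, as the model does in its final step.
a_n = F.normalize(a, p=2, dim=0)
b_n = F.normalize(b, p=2, dim=0)

dot = (a_n * b_n).sum()                 # dot product of unit vectors
cos = F.cosine_similarity(a, b, dim=0)  # reference cosine similarity

assert torch.allclose(dot, cos, atol=1e-6)
```
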
## Model details
- Transformer blocks (custom implementation with RMSNorm, RoPE positional encoding, and SwiGLU feed-forward layers)
- Masked mean pooling over token embeddings
- Final L2 normalization

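The last two steps can be sketched as follows. This is an illustrative implementation with toy tensor shapes, assuming a `(batch, seq_len, hidden_dim)` layout; the repository's `Architecture` module is the authoritative version.

```python
import torch
import torch.nn.functional as F

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden_dim)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # (batch, hidden_dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)       # avoid division by zero
    return summed / counts

# Toy example: batch of 2, seq_len 4, hidden_dim 3.
tokens = torch.randn(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])

pooled = masked_mean_pool(tokens, mask)
embeddings = F.normalize(pooled, p=2, dim=1)  # final L2 normalization

print(embeddings.norm(dim=1))  # each row has unit length
```

Masking before averaging matters: without it, padding positions would dilute the sentence representation for short inputs in a padded batch.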
## Default configuration
These parameters are used in `Notebooks/Training.ipynb`:
- `vocab_size`: 30522
- `seq_len`: 128
- `hidden_dim`: 512
- `n_heads`: 8
- `n_layer`: 3
- `ff_dim`: 2048
- `eps`: 1e-5
- `dropout`: 0.1

## Training objective
The model is trained with a triplet loss over cosine similarity:

`loss = max(0, sim(anchor, negative) - sim(anchor, positive) + margin)`

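The objective above can be sketched with normalized toy tensors. The `margin` value here is illustrative, not the repository's training setting; the real training loop lives in `Notebooks/Training.ipynb`.

```python
import torch
import torch.nn.functional as F

def triplet_cosine_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss over cosine similarity of L2-normalized embeddings.

    All inputs: (batch, hidden_dim), already unit-length, so the
    row-wise dot product is the cosine similarity.
    """
    sim_pos = (anchor * positive).sum(dim=1)
    sim_neg = (anchor * negative).sum(dim=1)
    return F.relu(sim_neg - sim_pos + margin).mean()

# Toy batch: normalize random vectors to mimic model outputs.
a = F.normalize(torch.randn(4, 8), dim=1)
p = F.normalize(torch.randn(4, 8), dim=1)
n = F.normalize(torch.randn(4, 8), dim=1)

loss = triplet_cosine_loss(a, p, n)
print(loss)  # non-negative scalar tensor
```

The loss is zero whenever the anchor is at least `margin` more similar to the positive than to the negative, so training focuses on triplets that violate that ranking.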
## Checkpoints
- `checkpoints/checkpoint.pt`: full training checkpoint (model, optimizer, losses, and configs)
- `checkpoints/model.safetensors`: weights-only export for inference

## Minimal inference
```python
import torch
from transformers import AutoTokenizer
from safetensors.torch import load_file

from Architecture import EmbeddingModel, ModelConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

state_dict = load_file("checkpoints/model.safetensors")

cfg = ModelConfig(
    vocab_size=30522,
    seq_len=128,
    hidden_dim=512,
    n_heads=8,
    n_layer=3,
    eps=1e-5,
    ff_dim=2048,
    dropout=0.1,
)

model = EmbeddingModel(cfg).to(device)
model.load_state_dict(state_dict)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        return model(enc["input_ids"], enc["attention_mask"])  # already L2-normalized

def cosine_similarity(a, b):
    # Embeddings are unit-length, so the dot product is the cosine similarity.
    ea, eb = embed([a, b])
    return float((ea * eb).sum())
```

## Notes
- At inference, use the same tokenizer (`bert-base-uncased`) and the same `max_length=128` as during training, or otherwise keep `seq_len` and preprocessing consistent.