Veyra-Embed-300M: Sentence Similarity Embedding Model

Veyra-Embed-300M is a from-scratch sentence embedding model developed by Dl26. It maps text into normalized dense vectors for sentence similarity, semantic search, clustering, and retrieval-style experiments.

The model is trained with a contrastive objective over real sentence pairs using in-batch negatives. It is designed as a compact encoder-style embedding model rather than a generative language model.
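
As a rough illustration of the objective described above, the following NumPy sketch computes a symmetric InfoNCE loss with in-batch negatives. It is not the training code: the function names and the `temperature` value of 0.05 are assumptions made for illustration, since the card does not state them.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products equal cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.05):
    """a[i] and b[i] embed a positive pair; every b[j] with j != i is a negative for a[i]."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature              # (batch, batch) similarity matrix
    idx = np.arange(len(a))                     # positives sit on the diagonal
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)    # average the two directions

# Toy check with random vectors (hypothetical data, not model outputs).
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
loss_matched = symmetric_info_nce(anchors, anchors)
loss_shuffled = symmetric_info_nce(anchors, anchors[::-1])
print(loss_matched < loss_shuffled)
```

Averaging both directions is what makes the loss symmetric: each sentence in a pair has to pick out its partner among the other sentences in the batch.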

Model Details

| Property | Value |
|---|---|
| Developer | Dl26 |
| Model type | Sentence embedding encoder |
| Parameters | 305,113,088 |
| Hidden size | 1024 |
| Layers | 4 |
| Tokenizer vocab size | 32,430 |
| Model vocab size | 260,000 |
| Embedding size | 1024 |
| Max length | 64 |
| Objective | Symmetric contrastive InfoNCE |
| Training data | sentence-transformers/all-nli, sentence-transformers/stsb, sentence-transformers/quora-duplicates, embedding-data/QQP_triplets |
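
The figures above hint at where most of the parameters sit. The arithmetic below assumes a single token embedding table of shape (model vocab size, hidden size) that is neither factorized nor shared with another matrix. The card does not confirm this, so treat the split as an estimate.

```python
total_params = 305_113_088
model_vocab_size = 260_000
hidden_size = 1024

# Assumption: one embedding matrix of shape (model_vocab_size, hidden_size).
embedding_params = model_vocab_size * hidden_size
remaining_params = total_params - embedding_params

print(f"embedding table: {embedding_params:,}")
print(f"everything else: {remaining_params:,}")
print(f"embedding share: {embedding_params / total_params:.1%}")
```

Under that assumption, roughly 266M of the 305M parameters would belong to the embedding table, leaving about 39M for the 4 encoder layers and any remaining components.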

Usage

This checkpoint contains raw PyTorch weights in safetensors format plus tokenizer files. To use it, a compatible implementation must build the encoder architecture described in config.json, load the weights from model.safetensors, then mean-pool the token outputs and normalize the result to produce the sentence embedding.

from safetensors.torch import load_file

# Load the raw weights as a dict mapping tensor names to torch.Tensor objects.
state_dict = load_file("model.safetensors")
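
The pooling step described above can be sketched as follows. This is a minimal NumPy illustration, not the reference implementation: `token_states` stands in for the encoder's per-token outputs and `attention_mask` for the tokenizer's padding mask (both hypothetical names). Excluding padding positions from the mean and using L2 normalization are assumptions the card does not spell out.

```python
import numpy as np

def mean_pool_and_normalize(token_states, attention_mask, eps=1e-12):
    """token_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1."""
    mask = attention_mask[..., None].astype(token_states.dtype)
    summed = (token_states * mask).sum(axis=1)       # sum over real tokens only
    counts = np.maximum(mask.sum(axis=1), 1.0)       # guard against empty rows
    pooled = summed / counts                         # mean over real tokens
    norms = np.maximum(np.linalg.norm(pooled, axis=-1, keepdims=True), eps)
    return pooled / norms                            # unit-length embeddings

# Random stand-in for encoder outputs: 2 sentences, max length 64, hidden size 1024.
rng = np.random.default_rng(0)
token_states = rng.normal(size=(2, 64, 1024)).astype(np.float32)
attention_mask = np.zeros((2, 64), dtype=np.int64)
attention_mask[0, :10] = 1   # first sentence has 10 real tokens
attention_mask[1, :25] = 1   # second sentence has 25 real tokens

embeddings = mean_pool_and_normalize(token_states, attention_mask)
print(embeddings.shape)  # (2, 1024)
```

Masking before averaging keeps padding from diluting short sentences, which matters when a batch mixes very different lengths.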

Intended Use

  • sentence similarity
  • semantic search
  • duplicate question retrieval
  • clustering
  • lightweight embedding research
  • ranking experiments
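
Most of the uses listed above reduce to comparing embeddings. Because the vectors are normalized, cosine similarity is a plain dot product. The sketch below ranks a small corpus against a query. The `top_k` and `unit` helpers and the toy 3-dimensional vectors are made up for illustration and do not come from the model.

```python
import numpy as np

def unit(rows):
    # L2-normalize along the last axis (works for one vector or a matrix of rows).
    rows = np.asarray(rows, dtype=np.float64)
    return rows / np.linalg.norm(rows, axis=-1, keepdims=True)

def top_k(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus rows most similar to the query."""
    scores = corpus_vecs @ query_vec        # cosine similarity for unit vectors
    order = np.argsort(-scores)[:k]         # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for sentence embeddings.
corpus = unit([[1.0, 0.0, 0.0],
               [0.9, 0.1, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
query = unit([0.1, 1.0, 0.0])

hits = top_k(query, corpus, k=2)
print(hits)
```

The same score matrix can feed duplicate detection (threshold the scores) or clustering (use the similarities as input to a clustering algorithm).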

Limitations

  • It is not a generative language model.
  • It should be evaluated on target retrieval and similarity datasets before use.
  • It will not match large production embedding models trained on much larger curated mixtures.
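
One common way to run the evaluation suggested above on a similarity dataset is to compare the model's cosine scores with human ratings using Spearman rank correlation. The sketch below is a small NumPy version under that assumption. `predicted` and `gold` are made-up numbers, not results from this model.

```python
import numpy as np

def average_ranks(values):
    # Rank from 1..n, giving tied values the mean of the ranks they span.
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(predicted, gold):
    # Spearman correlation is the Pearson correlation of the two rank vectors.
    return float(np.corrcoef(average_ranks(predicted), average_ranks(gold))[0, 1])

# Hypothetical cosine scores and human similarity ratings for four sentence pairs.
predicted = [0.91, 0.15, 0.60, 0.42]
gold = [4.8, 0.5, 3.9, 2.0]
print(spearman(predicted, gold))
```

A rank-based metric is a reasonable fit here because cosine scores and human ratings live on different scales; only their ordering needs to agree.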