agraharr/telecom-snowflake-arctic-embed-s

Task: Domain-adapted Sentence Embeddings — Telecom Retrieval, QA Similarity, Semantic Search

Model Overview

This model is a telecom-domain sentence embedding model fine-tuned from Snowflake/snowflake-arctic-embed-s on telecom query/passage pairs, with hard-negative triplets used for evaluation.

It is intended for telecom-focused retrieval and semantic matching tasks such as:

Telecom RAG retrieval
Telecom question-answer retrieval
KPI / 3GPP / ORAN / radio-network concept search
Semantic similarity over telecom text
Clustering and deduplication of telecom questions, procedures, and knowledge snippets
Candidate retrieval before reranking in telecom assistant pipelines

Base model summary:

Base: Snowflake/snowflake-arctic-embed-s
Architecture: Sentence Transformer / bi-encoder embedding model
Embedding dimension: 384
Base model size: ~33M parameters
Similarity function: Cosine similarity
Primary language: English
Domain: Telecom / 5G / ORAN / 3GPP / network operations

The base snowflake-arctic-embed-s model is a compact retrieval-oriented embedding model based on intfloat/e5-small-unsupervised, with 33M parameters and 384-dimensional embeddings. The telecom fine-tuning adapts this model toward domain-specific telecom retrieval while preserving general retrieval behavior through general-domain replay data.

Intended Use

Use this model when you need compact, fast telecom-domain embeddings for:

Dense retrieval in RAG
Query-to-document search
Query-to-answer matching
Telecom FAQ / standards retrieval
Skill/tool retrieval for telecom agents
Similarity search over domain documents
Vector search in FAISS, ChromaDB, pgvector, OpenSearch, Vespa, or similar systems

This model is especially useful when you want a small 384-dimensional embedding model that fits vector indexes sized for sentence-transformers/all-MiniLM-L6-v2. The dimensions match, but the embedding spaces do not, so existing documents must be re-embedded before switching models.

Not Intended For

This model is not a generative language model. It does not generate answers directly.

It should not be used as the only component for:

Factual answer generation
Legal, safety-critical, or regulatory decisions
Final response generation without a downstream LLM or verifier
Non-English multilingual retrieval without separate evaluation

For best results in RAG, use it for first-stage retrieval, optionally rerank the candidates, and then pass the retrieved context to an LLM.

How to Use

Installation

pip install -U sentence-transformers

Direct Usage with Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("agraharr/telecom-snowflake-arctic-embed-s")

sentences = [
    "What is handover success rate in LTE?",
    "Handover success rate measures successful handovers divided by attempted handovers.",
    "CPU utilization measures how much processing capacity is being used."
]

embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384)

similarities = model.similarity(embeddings, embeddings)
print(similarities)

Recommended Retrieval Format

During fine-tuning, telecom examples were formatted in query/passage style. For retrieval, use the same convention where possible:

queries = [
    "query: What is the role of AMF in 5G?",
    "query: Explain handover failure causes"
]

passages = [
    "passage: The AMF handles access and mobility management functions in the 5G core.",
    "passage: Handover failures can be caused by radio conditions, neighbor relation issues, PCI conflicts, or parameter misconfiguration."
]

query_embeddings = model.encode(queries, normalize_embeddings=True)
passage_embeddings = model.encode(passages, normalize_embeddings=True)

If your application already indexes raw text without prefixes, evaluate both variants. For new retrieval systems, the query: / passage: format is recommended.
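A minimal sketch of applying the prefixes consistently, assuming the plain string prefixes shown above. The helper names (with_prefix, format_queries, format_passages) are hypothetical and not part of the model or of sentence-transformers. The helpers skip text that already carries the prefix, so they are safe to call on mixed inputs.

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(text: str, prefix: str) -> str:
    """Prepend `prefix` unless the text already starts with it."""
    text = text.strip()
    return text if text.startswith(prefix) else prefix + text


def format_queries(texts):
    """Hypothetical helper: add the query prefix to each input string."""
    return [with_prefix(t, QUERY_PREFIX) for t in texts]


def format_passages(texts):
    """Hypothetical helper: add the passage prefix to each input string."""
    return [with_prefix(t, PASSAGE_PREFIX) for t in texts]


queries = format_queries([
    "What is the role of AMF in 5G?",
    "query: Explain handover failure causes",  # already prefixed, left unchanged
])
passages = format_passages([
    "The AMF handles access and mobility management functions in the 5G core.",
])

print(queries)
print(passages)
```

The formatted lists can then be passed to model.encode exactly as in the example above.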

Training Details

Base Model

Base model: Snowflake/snowflake-arctic-embed-s
Base embedding dimension: 384
Base model size: ~33M parameters
Base model family: Arctic Embed / E5-small style retrieval model

Training Objective

The model was fine-tuned using contrastive retrieval training with query-positive pairs:

Primary loss: CachedMultipleNegativesRankingLoss
Optional nested loss: MatryoshkaLoss if enabled during training
Training format: (query, positive_passage)
Negative strategy: in-batch negatives from other positives in the batch
Input formatting: query: ... and passage: ...
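The in-batch negative objective listed above can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the training code: it computes a cross-entropy over scaled cosine similarities where each query's own passage is the target and every other passage in the batch acts as a negative. The scale of 20 and the random vectors are assumptions for illustration, and the gradient caching that CachedMultipleNegativesRankingLoss adds to allow large batches is not reproduced here.

```python
import numpy as np


def in_batch_negatives_loss(query_emb, passage_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities; row i's target is column i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    logits = scale * (q @ p.T)                            # shape (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 384))
aligned = queries + 0.1 * rng.normal(size=(8, 384))  # each passage close to its query
shuffled = aligned[::-1].copy()                       # positives deliberately mismatched

loss_aligned = in_batch_negatives_loss(queries, aligned)
loss_shuffled = in_batch_negatives_loss(queries, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is small when each query is closest to its own passage and large when the pairing is broken, which is why larger batches (more negatives per query) make the task harder and usually more informative.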

Example training pair:

query: What is handover success rate?
passage: Handover success rate is calculated as the ratio of successful handovers to attempted handovers.

Domain Data

The telecom training mix can include:

Telco-DPR query-document relevance pairs
3GPP / 5G NR QA pairs
ORANBench question-answer pairs
GSMA TeleQnA question-answer pairs
Curated telecom sentence pairs / triplets

General-Domain Replay

To reduce catastrophic forgetting, the training pipeline can mix telecom-domain data with a general-domain retrieval replay set, such as MS-MARCO query-positive pairs.

Typical ratio:

80% telecom-domain retrieval data
20% general-domain retrieval data

This helps the model adapt to telecom terminology while preserving broader semantic retrieval behavior.
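One way to build such a mix is sketched below, assuming both datasets are in-memory lists of (query, passage) tuples. The function name mix_with_replay and the toy data are hypothetical; the actual pipeline may sample differently (for example per batch rather than per dataset).

```python
import random


def mix_with_replay(telecom_pairs, general_pairs, replay_ratio=0.20, seed=42):
    """Return a shuffled training list in which about `replay_ratio` of the
    examples come from the general-domain set."""
    if not 0.0 <= replay_ratio < 1.0:
        raise ValueError("replay_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_telecom + n_general) = replay_ratio for n_general.
    n_general = round(len(telecom_pairs) * replay_ratio / (1.0 - replay_ratio))
    n_general = min(n_general, len(general_pairs))
    mixed = list(telecom_pairs) + rng.sample(list(general_pairs), n_general)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the telecom and general-domain (e.g. MS-MARCO) pairs.
telecom = [(f"query: telecom question {i}", f"passage: telecom answer {i}") for i in range(80)]
general = [(f"query: general question {i}", f"passage: general answer {i}") for i in range(500)]

mixed = mix_with_replay(telecom, general)
n_general = sum(1 for q, _ in mixed if "general" in q)
print(len(mixed), n_general / len(mixed))
```

All telecom pairs are kept and the general-domain set is subsampled, so the replay data never outweighs the domain data.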

Hyperparameters

The exact values depend on the run. A typical training configuration is:

Hyperparameter	Value
Epochs	2
Batch size	32–512 depending on GPU
Learning rate	5e-6 to 1e-5
Max sequence length	512
Warmup ratio	0.10
Weight decay	0.01
AMP / mixed precision	Enabled when supported
Frozen lower layers	Optional, commonly 0–2 layers
General replay ratio	0.20
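As a rough worked example of how the epoch count, batch size, and warmup ratio interact, the sketch below derives the number of optimizer steps and warmup steps. The dataset size of 100,000 pairs and the batch size of 256 are assumptions for illustration, and gradient accumulation is assumed to be off.

```python
import math


def schedule_steps(num_examples, batch_size, epochs=2, warmup_ratio=0.10):
    """Return (total_steps, warmup_steps) for a simple warmup schedule."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps


# Hypothetical run: 100,000 training pairs at batch size 256.
total_steps, warmup_steps = schedule_steps(num_examples=100_000, batch_size=256)
print(total_steps, warmup_steps)
```

With large batches and only two epochs the total step count is small, so the warmup phase covers only a few dozen steps; this is worth checking before lowering the warmup ratio further.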

Evaluation

This model should be evaluated on both telecom-domain and general-domain retrieval tasks.

Recommended telecom metrics:

Recall@1
Recall@3
Recall@5
MRR@10
nDCG@10
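The metrics above can be computed per query from a ranked list of document IDs and a set of relevant IDs, then averaged over all queries. The sketch below uses binary relevance and assumes the ranked list contains no duplicate IDs; the function names and the toy IDs are illustrative and not tied to any evaluation library.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: the only relevant document, "d2", is ranked second.
ranked = ["d7", "d2", "d9", "d4"]
relevant = {"d2"}

print(recall_at_k(ranked, relevant, 1))  # relevant doc is not at rank 1
print(recall_at_k(ranked, relevant, 3))  # relevant doc is within the top 3
print(mrr_at_k(ranked, relevant))        # first hit at rank 2
print(ndcg_at_k(ranked, relevant))
```

When every query has exactly one relevant passage, as in most QA-pair test sets, Recall@k reduces to a hit rate.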

Recommended evaluation sets:

Held-out Telco-DPR relevance mappings
Held-out telecom QA pairs
Hard-negative telecom triplets
General-domain retrieval validation set to monitor forgetting
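For the hard-negative triplets, a simple score is triplet accuracy: the share of (query, positive, negative) triplets in which the positive passage has a higher cosine similarity to the query than the hard negative. The sketch below uses small hand-written vectors as stand-ins for model.encode output; the function name is hypothetical.

```python
import numpy as np


def triplet_accuracy(query_emb, pos_emb, neg_emb):
    """Fraction of triplets where the positive outscores the hard negative (cosine)."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    q, p, n = normalize(query_emb), normalize(pos_emb), normalize(neg_emb)
    pos_scores = np.sum(q * p, axis=1)  # row-wise cosine(query, positive)
    neg_scores = np.sum(q * n, axis=1)  # row-wise cosine(query, negative)
    return float(np.mean(pos_scores > neg_scores))


# Toy 2-d embeddings; the third triplet is constructed to fail.
q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
pos = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]])
neg = np.array([[0.5, 0.5], [0.9, 0.1], [0.9, 0.1]])

acc = triplet_accuracy(q, pos, neg)
print(acc)
```

Tracking this number for the base model and the fine-tuned model on the same triplets gives a quick check that fine-tuning helps on confusable telecom passages.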

Example Evaluation Table

Replace the values below with metrics from your training_report.json.

Model	Dimension	Telecom Recall@1	Telecom Recall@3	MRR@10	nDCG@10	Notes
`sentence-transformers/all-MiniLM-L6-v2`	384	TODO	TODO	TODO	TODO	Generic baseline
`Snowflake/snowflake-arctic-embed-s`	384	TODO	TODO	TODO	TODO	Base model
`agraharr/telecom-snowflake-arctic-embed-s`	384	TODO	TODO	TODO	TODO	Telecom fine-tuned
`agraharr/telecom-gte-modernbert-matryoshka`	768 / truncated	TODO	TODO	TODO	TODO	Larger comparison model

Matryoshka / Truncated Embeddings

If this model was trained with MatryoshkaLoss, it can be evaluated at smaller embedding dimensions.

Example:

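import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("agraharr/telecom-snowflake-arctic-embed-s")

# Full 384d embedding
emb_384 = model.encode(["query: What is gNodeB?"], normalize_embeddings=True)

# If Matryoshka training was used, evaluate truncation in your retrieval stack.
# Truncated vectors are no longer unit length, so re-normalize them before
# cosine or dot-product search.
emb_128 = emb_384[:, :128]
emb_128 = emb_128 / np.linalg.norm(emb_128, axis=1, keepdims=True)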

Note: truncation should be used only after validating retrieval quality at the target dimension.
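One lightweight check before adopting a smaller dimension is to measure how often the top-1 result stays the same after truncation. The sketch below runs on synthetic vectors so it can be executed without the model; in practice the arrays would come from model.encode on a held-out query and passage set. The function names are hypothetical, and agreement with the full-dimension ranking is only a proxy: labeled relevance data gives the real answer.

```python
import numpy as np


def truncate_and_renormalize(embeddings, dim):
    """Keep the first `dim` components and rescale each row to unit length."""
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)


def top1_agreement(query_emb, doc_emb, dim):
    """Share of queries whose top-1 document is unchanged after truncation to `dim`."""
    full_top1 = np.argmax(query_emb @ doc_emb.T, axis=1)
    q_small = truncate_and_renormalize(query_emb, dim)
    d_small = truncate_and_renormalize(doc_emb, dim)
    small_top1 = np.argmax(q_small @ d_small.T, axis=1)
    return float(np.mean(full_top1 == small_top1))


# Synthetic stand-ins: 50 unit-length "documents" and 10 noisy "queries".
rng = np.random.default_rng(0)
docs = rng.normal(size=(50, 384))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
queries = docs[:10] + 0.05 * rng.normal(size=(10, 384))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

agreement_384 = top1_agreement(queries, docs, 384)
agreement_128 = top1_agreement(queries, docs, 128)
print(agreement_384, agreement_128)
```

Random vectors spread information evenly across dimensions, so they say nothing about whether this particular model front-loads information the way Matryoshka training intends; only real embeddings can show that.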

Limitations

The model is specialized for telecom-domain semantic retrieval and may not improve every general-domain task.
It should be evaluated on your own corpus before production deployment.
It may encode telecom-specific terminology better than generic embeddings, but retrieval results still require downstream ranking, filtering, or answer verification for high-stakes use cases.
If trained primarily on QA-style pairs, it may perform best on query-answer retrieval and less strongly on long-document retrieval unless trained/evaluated for that use case.
The model is English-focused unless additional multilingual data was used and evaluated.

Bias, Risks, and Responsible Use

This model inherits limitations from both the base embedding model and the telecom datasets used for fine-tuning. Retrieved passages may be incomplete, outdated, or semantically similar but operationally incorrect.

For telecom operations, recommendations, root-cause analysis, or configuration changes, retrieved content should be validated against authoritative documentation, engineering procedures, or human expert review.

Example RAG Usage

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("agraharr/telecom-snowflake-arctic-embed-s")

query = "query: What can cause handover failure in LTE?"
docs = [
    "passage: Handover failure can be caused by poor radio conditions, missing neighbor relations, PCI conflicts, or incorrect mobility parameters.",
    "passage: CPU utilization measures how much processing capacity is used on a server.",
    "passage: The AMF handles registration and mobility management in the 5G core."
]

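# Embeddings are L2-normalized, so the dot product equals cosine similarity.
q_emb = model.encode([query], normalize_embeddings=True)
d_emb = model.encode(docs, normalize_embeddings=True)

scores = np.dot(q_emb, d_emb.T)[0]

# Rank passages from most to least similar to the query.
ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)

for score, doc in ranked:
    print(round(float(score), 4), doc)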

Citation

If you use this model, please cite the base model and relevant embedding-training work:

@misc{merrick2024arcticembed,
  title={Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models},
  author={Luke Merrick and Danmei Xu and Gaurav Nuti and Daniel Campos},
  year={2024},
  eprint={2405.05374},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@inproceedings{kusupati2022matryoshka,
  title={Matryoshka Representation Learning},
  author={Kusupati, Aditya and Bhatt, Gantavya and Rege, Aniket and Wallingford, Matthew and Sinha, Aditya and Ramanujan, Vivek and Howard-Snyder, Will and Chen, Kaifeng and Kakade, Sham and Jain, Prateek and Farhadi, Ali},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}