Sentence Embedding Model - Production Release

📊 Model Performance

Semantic Understanding: Spearman 67.74 / Pearson 67.21 on cosine similarity (self-reported)
Model Parameters: 3,299,584
Model Size: 12.6MB
Vocabulary Size: 164 tokens (automatically built from stopwords + domain words)
Max Sequence Length: 128 tokens
Embedding Dimensions: 384
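
The parameter count and the file size can be cross-checked with a little arithmetic. The sketch below assumes every parameter is stored as a 32-bit float (4 bytes each); the storage format is not stated in this README, so treat that as an assumption.

```python
# Back-of-the-envelope check of the reported model size.
NUM_PARAMETERS = 3_299_584   # parameter count listed above
BYTES_PER_PARAM = 4          # float32 storage (assumed, not confirmed here)

size_bytes = NUM_PARAMETERS * BYTES_PER_PARAM
size_mib = size_bytes / (1024 ** 2)   # binary megabytes (MiB)
size_mb = size_bytes / 1_000_000      # decimal megabytes (MB)

print(f"{size_bytes:,} bytes = {size_mib:.1f} MiB = {size_mb:.1f} MB")
# prints: 13,198,336 bytes = 12.6 MiB = 13.2 MB
```

Under that assumption the 12.6MB figure corresponds to binary megabytes, and the same weights come to roughly 13MB in decimal units.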

🚀 Quick Start

Installation

pip install -r api/requirements.txt

Basic Usage

from api.inference_api import SentenceEmbeddingInference

# Initialize model
model = SentenceEmbeddingInference("./")

# Generate embeddings
texts = ["Your text here", "Another text"]
embeddings = model.get_embeddings(texts)

# Compute similarity
similarity = model.compute_similarity("Text 1", "Text 2")

# Find similar texts
query = "Search query"
candidates = ["Text A", "Text B", "Text C"]
results = model.find_similar_texts(query, candidates, top_k=3)

Alternative Usage with Sentence Transformers

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('LNTTushar/sentence-embedding-model-production-release')

# Generate embeddings
sentences = ["Machine learning is transforming AI", "AI includes machine learning"]
embeddings = model.encode(sentences)

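# Compute similarity (similarity() expects embeddings, not raw strings,
# and returns a 1x1 tensor, so .item() extracts the float)
similarity = model.similarity(embeddings[0], embeddings[1]).item()
print(f"Similarity: {similarity:.4f}")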

🔧 Automatic Tokenizer Features

Stopwords Integration: Uses comprehensive English stopwords
Technical Vocabulary: Includes ML/AI domain-specific terms
Character Fallback: Handles unknown words with character-level encoding
Dynamic Building: Automatically extracts vocabulary from training data
No Manual Lists: Eliminates need for manual word curation
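
To illustrate how a word-level vocabulary with character fallback can work, here is a minimal sketch. The vocabulary, the special tokens `[PAD]` and `[UNK]`, and the `tokenize` function are hypothetical stand-ins, not the contents of the shipped tokenizer/ directory.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; the real one lives in tokenizer/.
SPECIAL = ["[PAD]", "[UNK]"]
WORDS = ["the", "is", "model", "learning"]    # stopwords + domain terms
CHARS = list("abcdefghijklmnopqrstuvwxyz")    # character-level fallback
VOCAB: Dict[str, int] = {tok: i for i, tok in enumerate(SPECIAL + WORDS + CHARS)}


def tokenize(text: str, max_len: int = 128) -> List[int]:
    """Map known words to ids; spell unknown words out character by character."""
    ids: List[int] = []
    for word in text.lower().split():
        if word in VOCAB:
            ids.append(VOCAB[word])
        else:
            # Fallback: one id per character; unseen characters map to [UNK].
            ids.extend(VOCAB.get(ch, VOCAB["[UNK]"]) for ch in word)
    return ids[:max_len]   # truncate to the maximum sequence length


print(tokenize("The model is deep"))
```

Here "deep" is not in the toy vocabulary, so it is encoded as four character ids instead of a single unknown token.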

📁 Package Structure

├── models/           # Model weights and configuration
├── tokenizer/        # Auto-generated vocabulary and mappings
├── exports/          # Optimized model exports (TorchScript)
├── api/              # Python inference API
│   ├── inference_api.py
│   └── requirements.txt
└── README.md         # This file

⚡ Performance Benchmarks

Inference Speed: ~500-1000 sentences/second (CPU)
Memory Usage: ~13MB base model
Vocabulary: Auto-built with 164 tokens
Export Formats: PyTorch, TorchScript (optimized)
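
Throughput depends heavily on hardware and batch size, so it is worth measuring on your own machine. The helper below is a generic timing sketch: `sentences_per_second` and `dummy_encode` are illustrative names, and the dummy encoder only exists so the snippet runs without the model. Pass `model.get_embeddings` instead to time real inference.

```python
import time
from typing import Callable, List, Sequence


def sentences_per_second(encode: Callable[[Sequence[str]], object],
                         texts: List[str], batch_size: int = 8) -> float:
    """Run `encode` over `texts` in batches and return sentences per second."""
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        encode(texts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed if elapsed > 0 else float("inf")


# Stand-in encoder (hypothetical): returns one number per sentence.
def dummy_encode(batch: Sequence[str]) -> List[List[float]]:
    return [[float(len(t))] for t in batch]


rate = sentences_per_second(dummy_encode, ["example sentence"] * 100)
print(f"{rate:.0f} sentences/second")
```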

🎯 Development Highlights

This model was developed entirely from scratch:

✅ Automated tokenizer with stopwords + technical terms
✅ No manual vocabulary curation required
✅ Dynamic vocabulary building from training data
✅ Comprehensive fallback mechanisms
✅ Production-ready deployment package

📞 API Reference

SentenceEmbeddingInference Class

Methods:

get_embeddings(texts, batch_size=8): Generate sentence embeddings
compute_similarity(text1, text2): Calculate cosine similarity
find_similar_texts(query, candidates, top_k=5): Find most similar texts
benchmark_performance(num_texts=100): Run performance benchmarks
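
For readers who want to see what cosine similarity and top-k retrieval amount to, here is a dependency-free sketch on toy vectors. It is not the code inside inference_api.py; `cosine_similarity` and `top_k_similar` are illustrative helpers that operate on vectors you already have.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0   # convention for zero vectors
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec: Sequence[float],
                  candidate_vecs: Sequence[Sequence[float]],
                  top_k: int = 5) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v))
              for i, v in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-D "embeddings"; real ones would come from get_embeddings().
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
ranked = top_k_similar(query, candidates, top_k=2)
print(ranked)
```

Candidate 2 ranks first because it points in the same direction as the query; cosine similarity ignores vector length.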

📋 System Requirements

Python: 3.7+
PyTorch: 1.9.0+
NumPy: 1.20.0+
Memory: ~512MB RAM recommended
Storage: ~50MB for model files

🏷️ Version Information

Model Version: 1.0
Export Date: 2025-07-22
Tokenizer: Auto-generated with stopwords
Status: Production-ready

🔬 Technical Details

Architecture

Custom Transformer: Built from scratch with 3.3M parameters
Embedding Dimension: 384
Attention Heads: 6 per layer
Transformer Layers: 4 layers optimized for sentence embeddings
Pooling Strategy: Mean pooling for sentence-level representations
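
Mean pooling turns per-token vectors into one sentence vector by averaging over the real (non-padding) tokens. The sketch below shows the idea with plain Python lists; `mean_pool` and the attention-mask convention (1 = real token, 0 = padding) are assumptions for illustration, not the model's actual code.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]],
              attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    count = max(count, 1)   # avoid division by zero on all-padding input
    return [t / count for t in total]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]   # last row is padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))   # [2.0, 3.0]
```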

Training

Dataset: STS Benchmark + synthetic similarity pairs
Loss Function: Multi-objective (MSE + ranking + contrastive)
Optimization: Custom training pipeline with advanced techniques
Vocabulary Building: Automated from training corpus + stopwords
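
The exact loss formulation is not documented here. As one possible reading of "MSE + ranking + contrastive", the sketch below combines three common terms over predicted similarities and gold scores scaled to [0, 1]. The function names, the weights, margins, and threshold are all hypothetical choices for illustration.

```python
from typing import Sequence


def mse_loss(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean squared error between predicted and gold similarity."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(pred)


def ranking_loss(pred: Sequence[float], gold: Sequence[float],
                 margin: float = 0.1) -> float:
    """Pairwise hinge: a pair with higher gold score should score higher."""
    losses = [max(0.0, margin - (pred[i] - pred[j]))
              for i in range(len(pred)) for j in range(len(pred))
              if gold[i] > gold[j]]
    return sum(losses) / len(losses) if losses else 0.0


def contrastive_loss(pred: Sequence[float], gold: Sequence[float],
                     threshold: float = 0.5, margin: float = 0.5) -> float:
    """Pull similar pairs together, push dissimilar pairs past a margin."""
    total = 0.0
    for p, g in zip(pred, gold):
        dist = 1.0 - p                        # cosine distance
        if g >= threshold:                    # treated as a similar pair
            total += dist ** 2
        else:                                 # treated as a dissimilar pair
            total += max(0.0, margin - dist) ** 2
    return total / len(pred)


def combined_loss(pred: Sequence[float], gold: Sequence[float],
                  w_mse: float = 1.0, w_rank: float = 0.5,
                  w_con: float = 0.5) -> float:
    """Weighted sum of the three terms (weights are assumed values)."""
    return (w_mse * mse_loss(pred, gold)
            + w_rank * ranking_loss(pred, gold)
            + w_con * contrastive_loss(pred, gold))


gold = [0.9, 0.1]
print(combined_loss([0.9, 0.1], gold))   # predictions agree with gold
print(combined_loss([0.1, 0.9], gold))   # predictions are reversed
```

The reversed predictions yield a much larger loss because all three terms penalize them.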

Performance Metrics

Spearman Correlation: 67.74 on cosine similarity (self-reported)
Processing Speed: 500-1000 sentences/second on CPU
Memory Efficiency: 13MB model size vs 90MB+ for comparable models
Deployment Ready: Optimized for production environments

Built with an automated tokenizer that uses comprehensive stopwords and domain vocabulary.

🎉 No more manual word lists - fully automated vocabulary building!



Evaluation results

cos_sim_spearman (self-reported): 67.740
cos_sim_pearson (self-reported): 67.210
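
These two metrics correlate the model's cosine similarities with human-assigned scores; values like the ones above are conventionally reported multiplied by 100. The sketch below shows how both can be computed from scratch on made-up numbers. The evaluation set behind the reported figures is not specified here, and the toy data is purely illustrative.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance over the product of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: human scores vs. model cosine similarities.
human_scores = [5.0, 3.2, 1.0, 4.1]
cosine_scores = [0.92, 0.40, 0.55, 0.80]
print(f"spearman x100: {100 * spearman(human_scores, cosine_scores):.2f}")
print(f"pearson  x100: {100 * pearson(human_scores, cosine_scores):.2f}")
```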