---
language: en
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- pytorch
- semantic-search
- custom-architecture
- automated-tokenizer
datasets:
- mteb/stsbenchmark-sts
- synthetic-similarity-data
metrics:
- spearman_correlation
- pearson_correlation
model-index:
- name: Sentence Embedding Model
  results:
  - task:
      type: STS
    dataset:
      type: mteb/stsbenchmark-sts
      name: MTEB STSBenchmark
      config: default
      split: test
    metrics:
    - type: cos_sim_spearman
      value: 67.74
    - type: cos_sim_pearson
      value: 67.21
---

# Sentence Embedding Model - Production Release

## Model Performance

- **Semantic Understanding**: Spearman correlation of 67.74 with human judgments on the STS Benchmark test set
- **Model Parameters**: 3,299,584
- **Model Size**: 12.6 MB
- **Vocabulary Size**: 164 tokens (automatically built from stopwords + domain words)
- **Max Sequence Length**: 128 tokens
- **Embedding Dimensions**: 384

## Quick Start

### Installation

```bash
pip install -r api/requirements.txt
```

### Basic Usage

```python
from api.inference_api import SentenceEmbeddingInference

# Initialize model
model = SentenceEmbeddingInference("./")

# Generate embeddings
texts = ["Your text here", "Another text"]
embeddings = model.get_embeddings(texts)

# Compute similarity
similarity = model.compute_similarity("Text 1", "Text 2")

# Find similar texts
query = "Search query"
candidates = ["Text A", "Text B", "Text C"]
results = model.find_similar_texts(query, candidates, top_k=3)
```

### Alternative Usage with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('LNTTushar/sentence-embedding-model-production-release')

# Generate embeddings
sentences = ["Machine learning is transforming AI", "AI includes machine learning"]
embeddings = model.encode(sentences)

# Compute cosine similarity between the two embeddings
similarity = model.similarity(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

## Automatic Tokenizer Features

- **Stopwords Integration**: Uses a comprehensive English stopword list
- **Technical Vocabulary**: Includes ML/AI domain-specific terms
- **Character Fallback**: Handles unknown words with character-level encoding
- **Dynamic Building**: Automatically extracts vocabulary from the training data (see the sketch below)
- **No Manual Lists**: Eliminates the need for manual word curation
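
A minimal sketch of how such an automated vocabulary could be assembled and how character-level fallback behaves. The actual builder ships in `tokenizer/`; all names below (`build_vocab`, `encode`, the abbreviated stopword and domain sets) are illustrative assumptions, not the packaged implementation.

```python
from collections import Counter

# Abbreviated illustrative sets; the packaged tokenizer uses a comprehensive
# English stopword list plus ML/AI domain terms.
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}
DOMAIN_WORDS = {"machine", "learning", "model", "embedding", "transformer"}

def build_vocab(corpus, min_freq=2):
    """Merge stopwords, domain terms, and frequent corpus words into one vocab."""
    counts = Counter(tok for text in corpus for tok in text.lower().split())
    frequent = {tok for tok, n in counts.items() if n >= min_freq}
    words = sorted(ENGLISH_STOPWORDS | DOMAIN_WORDS | frequent)
    vocab = {"<pad>": 0, "<unk>": 1}
    # Single characters are included so unknown words can fall back to
    # character-level encoding instead of collapsing to <unk>.
    chars = sorted({ch for w in words for ch in w})
    for token in chars + words:
        vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Encode a text, splitting unknown words into per-character ids."""
    ids = []
    for token in text.lower().split():
        if token in vocab:
            ids.append(vocab[token])
        else:
            ids.extend(vocab.get(ch, vocab["<unk>"]) for ch in token)
    return ids
```

With a vocabulary built this way, any word outside the 164-token vocabulary is encoded as a sequence of character ids rather than a single `<unk>` token.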

## Package Structure

```
├── models/                 # Model weights and configuration
├── tokenizer/              # Auto-generated vocabulary and mappings
├── exports/                # Optimized model exports (TorchScript)
├── api/                    # Python inference API
│   ├── inference_api.py
│   └── requirements.txt
└── README.md               # This file
```

## Performance Benchmarks

- **Inference Speed**: ~500-1000 sentences/second (CPU)
- **Memory Usage**: ~13 MB base model
- **Vocabulary**: Auto-built with 164 tokens
- **Export Formats**: PyTorch, TorchScript (optimized); see the loading sketch below
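
A minimal sketch of loading the TorchScript export for inference. The file name inside `exports/` is an assumption here; check the directory for the actual artifact and its expected input format.

```python
import torch

# Hypothetical file name; the actual TorchScript artifact lives in exports/.
scripted = torch.jit.load("exports/sentence_embedding_model.pt")
scripted.eval()

# TorchScript modules are invoked like regular nn.Modules; the expected inputs
# (token ids, attention mask, etc.) are defined by the exported model.
```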

## Development Highlights

This model was developed entirely from scratch:

1. Automated tokenizer with stopwords + technical terms
2. No manual vocabulary curation required
3. Dynamic vocabulary building from training data
4. Comprehensive fallback mechanisms
5. Production-ready deployment package

## API Reference

### SentenceEmbeddingInference Class

#### Methods:

- `get_embeddings(texts, batch_size=8)`: Generate sentence embeddings
- `compute_similarity(text1, text2)`: Calculate cosine similarity between two texts
- `find_similar_texts(query, candidates, top_k=5)`: Find the most similar candidate texts for a query
- `benchmark_performance(num_texts=100)`: Run performance benchmarks (see the example below)
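
The first three methods are demonstrated in the Quick Start above; below is a minimal sketch for `benchmark_performance`. The structure of its return value is defined in `api/inference_api.py`, so it is simply printed here.

```python
from api.inference_api import SentenceEmbeddingInference

model = SentenceEmbeddingInference("./")

# Run the built-in throughput benchmark over 100 texts and inspect the result.
stats = model.benchmark_performance(num_texts=100)
print(stats)
```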

## System Requirements

- **Python**: 3.7+
- **PyTorch**: 1.9.0+
- **NumPy**: 1.20.0+
- **Memory**: ~512 MB RAM recommended
- **Storage**: ~50 MB for model files

## Version Information

- **Model Version**: 1.0
- **Export Date**: 2025-07-22
- **Tokenizer**: Auto-generated with stopwords
- **Status**: Production-ready

## Technical Details

### Architecture

- **Custom Transformer**: Built from scratch with 3.3M parameters
- **Embedding Dimension**: 384
- **Attention Heads**: 6 per layer
- **Transformer Layers**: 4 layers optimized for sentence embeddings
- **Pooling Strategy**: Mean pooling for sentence-level representations (see the sketch below)
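
A hypothetical PyTorch sketch consistent with the numbers above (164-token vocabulary, 384-dim embeddings, 6 heads, 4 layers, 128-token context, mean pooling). The shipped weights in `models/` define the real module layout; the feed-forward width of 256 is an assumption chosen so the total parameter count lands near the stated ~3.3M.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=164, dim=384, heads=6, layers=4,
                 ff_dim=256, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=ff_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, input_ids, attention_mask):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        x = self.encoder(x, src_key_padding_mask=(attention_mask == 0))
        # Mean pooling over non-padding tokens yields the sentence embedding.
        mask = attention_mask.unsqueeze(-1).float()
        return (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

model = SentenceEncoder()
# Roughly 3.3M parameters with these assumed settings.
print(sum(p.numel() for p in model.parameters()))
```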

### Training

- **Dataset**: STS Benchmark + synthetic similarity pairs
- **Loss Function**: Multi-objective (MSE + ranking + contrastive); a hedged sketch follows this list
- **Optimization**: Custom training pipeline
- **Vocabulary Building**: Automated from the training corpus + stopwords
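
The exact objective is not published with this card; below is a hedged sketch of one common way to combine MSE, ranking, and contrastive terms for STS-style training. The weights and margin are illustrative assumptions, not the values used here.

```python
import torch
import torch.nn.functional as F

def multi_objective_loss(emb_a, emb_b, gold, margin=0.5,
                         w_mse=1.0, w_rank=1.0, w_con=1.0):
    """emb_a, emb_b: (batch, dim) sentence embeddings; gold: (batch,) similarity
    labels rescaled to [0, 1]. Weights and margin are illustrative assumptions."""
    cos = F.cosine_similarity(emb_a, emb_b)

    # Regression term: predicted cosine should match the gold similarity.
    mse = F.mse_loss(cos, gold)

    # Pairwise ranking term: where one pair is labeled more similar than
    # another, its predicted cosine should be higher by at least the margin.
    diff_pred = cos.unsqueeze(0) - cos.unsqueeze(1)
    diff_gold = gold.unsqueeze(0) - gold.unsqueeze(1)
    mask = diff_gold > 0
    rank = F.relu(margin - diff_pred[mask]).mean() if mask.any() else cos.new_zeros(())

    # Contrastive term: pull similar pairs together, push dissimilar ones apart.
    con = (gold * (1 - cos) + (1 - gold) * F.relu(cos - margin)).mean()

    return w_mse * mse + w_rank * rank + w_con * con
```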

### Performance Metrics

- **Spearman Correlation**: 67.74 (cosine similarity) on the STS Benchmark test set
- **Processing Speed**: 500-1000 sentences/second on CPU
- **Memory Efficiency**: ~13 MB model size vs. 90 MB+ for comparable models
- **Deployment Ready**: Optimized for production environments

---

**Built with an automated tokenizer using comprehensive stopwords and domain vocabulary**

**No more manual word lists - fully automated vocabulary building!**