---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- information-retrieval
- semantic-search
base_model: BAAI/bge-base-en-v1.5
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: mit
language:
- en
metrics:
- ndcg
- recall
- precision
---

# VMware Technical Documentation Embeddings

A specialized sentence-transformers model fine-tuned for semantic search and information retrieval over technical documentation, with a focus on enterprise infrastructure and virtualization technologies.

## Model Details

### Description

This model extends [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) with domain-specific fine-tuning for technical documentation retrieval. It generates 768-dimensional dense embeddings optimized for semantic similarity in enterprise technology contexts.

- **Model Type:** Sentence Transformer (BERT-based)
- **Base Model:** [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5)
- **Embedding Dimension:** 768
- **Max Sequence Length:** 512 tokens
- **Language:** English
- **License:** MIT

### Intended Use

**Primary Use Cases:**
- Semantic search over technical documentation
- Information retrieval for enterprise infrastructure queries
- RAG (Retrieval-Augmented Generation) pipelines
- Technical support knowledge bases
- Enterprise search systems

**Optimized For:**
- Natural language queries about technical topics
- Documentation retrieval and ranking
- Question answering systems
- Knowledge management platforms

### Out-of-Scope Use

This model is specialized for technical documentation and may not perform well on:
- General-domain text
- Non-English languages
- Code search or generation
- Creative writing or entertainment content

## Quick Start

### Installation

```bash
pip install sentence-transformers
```

### Basic Usage

```python
from sentence_transformers import SentenceTransformer, util

# Load model
model = SentenceTransformer('BarraHome/vmware-embeddings-large-v1')

# Example queries and documents
queries = [
    "How to configure high availability?",
    "Steps to install guest tools"
]

documents = [
    "High availability can be configured through the management interface...",
    "To install guest tools, first mount the ISO image..."
]

# Generate embeddings
query_embeddings = model.encode(queries)
doc_embeddings = model.encode(documents)

# Calculate similarity: a 2x2 tensor (rows = queries, cols = documents);
# the diagonal entries should score highest, since each query matches
# the corresponding document.
similarities = util.cos_sim(query_embeddings, doc_embeddings)
print(similarities)
```

### Semantic Search Example

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('BarraHome/vmware-embeddings-large-v1')

# Your document corpus
corpus = [
    "Documentation about high availability features...",
    "Guide for load balancing configuration...",
    "Instructions for live migration procedures..."
]

# Encode corpus
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query
query = "How to enable high availability?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Search for the top-3 most similar corpus entries
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)

# Display results (hits[0] holds the ranked matches for the first query)
for hit in hits[0]:
    print(f"Score: {hit['score']:.4f}")
    print(f"Document: {corpus[hit['corpus_id']]}\n")
```

## Performance

### Evaluation Metrics

Evaluated on a held-out test set of 2,000 diverse technical queries:

| Metric | Base Model | Fine-tuned | Improvement |
|--------|------------|------------|-------------|
| **Recall@1** | 0.637 | **0.759** | +19.2% |
| **Recall@3** | 0.805 | **0.927** | +15.2% |
| **Recall@5** | 0.853 | **0.956** | +12.1% |
| **Recall@10** | 0.906 | **0.979** | +8.0% |
| **NDCG@10** | 0.775 | **0.879** | +13.4% |
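
For reference, a minimal sketch of how Recall@k and NDCG@k reduce when each query has exactly one relevant document, which the protocol here appears to assume; the IDs below are illustrative:

```python
import math

def recall_at_k(ranked_ids, relevant_id, k):
    # With a single relevant document, Recall@k is 1 if it appears in the top k.
    return float(relevant_id in ranked_ids[:k])

def ndcg_at_k(ranked_ids, relevant_id, k):
    # Binary relevance: the ideal DCG is 1 (relevant document at rank 1),
    # so NDCG@k is just the discounted gain at the document's actual rank.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

ranked = [17, 42, 3, 8]            # retrieval order for one query
print(recall_at_k(ranked, 42, 1))  # 0.0 -- not in the top 1
print(recall_at_k(ranked, 42, 3))  # 1.0 -- found within the top 3
print(ndcg_at_k(ranked, 42, 10))   # 1/log2(3) ~= 0.631
```

Under that setup, the table's numbers would be these per-query values averaged over the 2,000 test queries.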

### Key Performance Indicators

- ✅ **75.9%** top-1 accuracy
- ✅ **92.7%** top-3 recall
- ✅ **97.9%** top-10 recall
- ✅ **0.879** NDCG@10 (excellent ranking quality)
|
| | ### Comparison with Base Model |
| |
|
| | The fine-tuned model shows consistent improvements across all metrics: |
| | - Higher recall at all k values |
| | - Better ranking quality (NDCG) |
| | - More accurate top-1 predictions |
| |
|
| | #### Performance Visualizations |
| |
|
| | **Detailed Metric Comparison:** |
| |
|
| |  |
| |
|
| | **Percentage Improvements:** |
| |
|
| |  |
| |
|
## Training Details

### Training Configuration

- **Framework:** sentence-transformers
- **Loss Function:** MultipleNegativesRankingLoss
- **Training Strategy:** Contrastive learning with hard negative mining
- **Epochs:** 1
- **Batch Size:** 64
- **Learning Rate:** 2e-5 (with 10% warmup)
- **Training Samples:** 671,972 query-document pairs
- **Total Steps:** 10,500
- **Training Duration:** 4 hours 6 minutes
- **Throughput:** 45.4 samples/second
- **Final Loss:** 2.245
- **Precision:** FP16
- **Hardware:** NVIDIA RTX A6000 (48GB VRAM)
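
A minimal sketch of that configuration using sentence-transformers' classic `fit` API; the actual training script and dataset are not published, so the pairs below are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# (query, positive passage) pairs; MultipleNegativesRankingLoss treats the
# other in-batch passages as negatives. A mined hard negative can be passed
# as a third text in each InputExample.
train_examples = [
    InputExample(texts=["How to configure high availability?",
                        "High availability can be configured through..."]),
    # ... 671,972 pairs in the run described above
]

loader = DataLoader(train_examples, shuffle=True, batch_size=64)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    warmup_steps=1050,              # ~10% of the 10,500 total steps
    optimizer_params={"lr": 2e-5},
    use_amp=True,                   # FP16 mixed precision
)
```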

### Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True})
  (1): Pooling({'pooling_mode_cls_token': True})
  (2): Normalize()
)
```

## Limitations

### Known Limitations

- **Domain-Specific:** Optimized for technical documentation; general-domain performance is not guaranteed
- **English Only:** No multilingual support
- **Context Length:** Limited to 512 tokens; longer inputs are silently truncated (see the check below)
- **Recency:** Knowledge current as of the training date
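
A quick way to spot inputs that will be truncated before encoding them; the document text here is a placeholder:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BarraHome/vmware-embeddings-large-v1")

doc = "Very long installation guide text ..."  # placeholder document
n_tokens = len(model.tokenizer(doc)["input_ids"])
if n_tokens > model.max_seq_length:
    print(f"{n_tokens} tokens: only the first {model.max_seq_length} will be embedded")
```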

### Recommendations

For optimal results:

1. **Query Formulation:**
   - Use natural language questions
   - Include relevant technical terms
   - Keep queries under 512 tokens

2. **Hybrid Search:**
   - Combine with keyword search (BM25) for best results
   - Use semantic search to capture paraphrases and intent, keyword search for exact-term precision (see the combined sketch after this list)

3. **Batch Processing:**
   - Use `encode(..., batch_size=32)` for large collections
   - Enable `convert_to_tensor=True` for GPU acceleration

4. **Reranking:**
   - Consider using a cross-encoder for final reranking
   - Retrieve top-100 with this model, rerank to top-10, as in the sketch below
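
One way to put recommendations 2 and 4 together, sketched under assumptions: the `rank_bm25` package, naive whitespace tokenization, reciprocal rank fusion, and the `cross-encoder/ms-marco-MiniLM-L-6-v2` reranker are illustrative choices, not part of this model's release.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer, util

corpus = [
    "High availability can be configured through the management interface...",
    "To install guest tools, first mount the ISO image...",
    "Guide for load balancing configuration...",
]
query = "How to enable high availability?"

# Dense retrieval with this model (top-100 in a realistic corpus)
bi_encoder = SentenceTransformer("BarraHome/vmware-embeddings-large-v1")
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
dense_hits = util.semantic_search(query_emb, corpus_emb, top_k=100)[0]
dense_rank = {hit["corpus_id"]: rank for rank, hit in enumerate(dense_hits)}

# Keyword retrieval with BM25 (whitespace tokenization for brevity)
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
bm25_order = sorted(range(len(corpus)), key=lambda i: -scores[i])
bm25_rank = {doc_id: rank for rank, doc_id in enumerate(bm25_order)}

# Reciprocal rank fusion of the two candidate lists
def rrf_score(doc_id, rankings, k=60):
    return sum(1.0 / (k + r[doc_id]) for r in rankings if doc_id in r)

candidates = set(dense_rank) | set(bm25_rank)
fused = sorted(candidates, key=lambda d: -rrf_score(d, [dense_rank, bm25_rank]))

# Cross-encoder reranking of the fused candidates down to the final top-10
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = reranker.predict([(query, corpus[d]) for d in fused])
for doc_id, score in sorted(zip(fused, ce_scores), key=lambda t: -t[1])[:10]:
    print(f"{score:.3f}  {corpus[doc_id]}")
```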

## Technical Specifications

### Model Information

- **Parameters:** ~110M
- **Architecture:** BERT-base
- **Pooling:** CLS token
- **Normalization:** L2
- **Similarity Function:** Cosine similarity
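
Because the final `Normalize()` module L2-normalizes every embedding, the dot product of two embeddings equals their cosine similarity; a quick check:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BarraHome/vmware-embeddings-large-v1")
emb = model.encode(["configure high availability", "install guest tools"])

print(np.linalg.norm(emb, axis=1))  # ~[1.0, 1.0]: unit-length vectors
dot = float(emb[0] @ emb[1])
cos = dot / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(dot, cos)                     # identical up to floating-point error
```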

### Performance Benchmarks

| Hardware | Batch Size | Throughput |
|----------|------------|------------|
| RTX 3090 | 32 | ~850 docs/sec |
| A100 | 128 | ~2,100 docs/sec |
| CPU (16 cores) | 8 | ~180 docs/sec |
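
Throughput varies with document length and tokenization, so treat these numbers as indicative; a minimal sketch for measuring it on your own hardware (the corpus here is synthetic):

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BarraHome/vmware-embeddings-large-v1")
docs = ["Steps to configure high availability on a two-node cluster."] * 5_000

start = time.perf_counter()
model.encode(docs, batch_size=32, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"{len(docs) / elapsed:,.0f} docs/sec")
```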

### Resource Requirements

**Minimum:**
- GPU: 4GB VRAM (batch size 16)
- CPU: 4 cores, 8GB RAM

**Recommended:**
- GPU: 8GB+ VRAM (batch size 32+)
- CPU: 8+ cores, 16GB+ RAM
|
| | ## Citation |
| |
|
| | ```bibtex |
| | @misc{vmware-embeddings-2024, |
| | author = {Alberto Ferrer}, |
| | title = {VMware Technical Documentation Embeddings}, |
| | year = {2024}, |
| | publisher = {Hugging Face}, |
| | howpublished = {\url{https://huggingface.co/BarraHome/vmware-embeddings-large-v1}} |
| | } |
| | ``` |
| |
|
| | ### Base Model Citation |
| |
|
| | ```bibtex |
| | @misc{bge-base-en-v1.5, |
| | author = {BAAI}, |
| | title = {BGE Base English v1.5}, |
| | year = {2023}, |
| | publisher = {Hugging Face}, |
| | howpublished = {\url{https://huggingface.co/BAAI/bge-base-en-v1.5}} |
| | } |
| | ``` |

## Acknowledgments

- **Base Model:** [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) by the Beijing Academy of Artificial Intelligence
- **Framework:** [sentence-transformers](https://www.sbert.net/) by UKPLab

## License

MIT License

Copyright (c) 2024 Alberto Ferrer

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

---

**Note:** This model is intended for research and development. For production use, ensure compliance with your organization's policies and applicable regulations.