| license: mit | |
| sdk: gradio | |
| emoji: π | |
| colorFrom: gray | |
| sdk_version: 5.34.0 | |
| # 🔥 Hybrid Search RAGtim Bot | |
| A hybrid search system that combines semantic vector search with BM25 keyword matching to retrieve information from a markdown knowledge base. | |
| ## Features | |
| - **Hybrid Search**: Combines transformer-based semantic similarity with BM25 keyword ranking | |
| - **Multiple Search Modes**: Vector-only search, BM25-only search, and a weighted fusion of the two | |
| - **Real-time API**: RESTful endpoints for integration | |
| - **Interactive UI**: Three interfaces - Chat, Advanced Search, and Statistics | |
| - **Knowledge Base**: Comprehensive markdown-based knowledge system | |
| ## 🔧 Technology Stack | |
| - **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384-dim) | |
| - **Search**: Custom BM25 implementation + Vector similarity | |
| - **Framework**: Gradio 5.34.0 (the version pinned by `sdk_version` in the Space metadata) | |
| - **ML**: Transformers, PyTorch, NumPy | |
| - **Deployment**: Hugging Face Spaces | |
| ## Knowledge Base Structure | |
| The system processes markdown files from the `knowledge_base/` directory: | |
| - `about.md` - Personal information and professional summary | |
| - `research_details.md` - Research projects and methodologies | |
| - `publications_detailed.md` - Publications with technical details | |
| - `skills_expertise.md` - Technical skills and expertise | |
| - `experience_detailed.md` - Professional experience | |
| - `statistics.md` - Statistical methods and biostatistics | |
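
As a rough illustration of how markdown files in `knowledge_base/` could be turned into searchable chunks, here is a minimal sketch. The function name `load_knowledge_base` and the heading-based chunking are assumptions for illustration; the project's actual loader may split documents differently.

```python
from pathlib import Path


def load_knowledge_base(directory: str) -> list[dict]:
    """Read every .md file in `directory` and split it into heading-delimited chunks.

    Hypothetical helper: chunk boundaries are markdown headings (an assumption).
    """
    chunks = []
    for path in sorted(Path(directory).glob("*.md")):
        title, lines = path.stem, []
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.startswith("#"):
                # A heading closes the previous chunk (if non-empty) and starts a new one.
                if any(l.strip() for l in lines):
                    chunks.append({"source": path.name, "title": title,
                                   "text": "\n".join(lines).strip()})
                title, lines = line.lstrip("#").strip(), []
            else:
                lines.append(line)
        # Flush the final chunk of the file.
        if any(l.strip() for l in lines):
            chunks.append({"source": path.name, "title": title,
                           "text": "\n".join(lines).strip()})
    return chunks
```

Each chunk keeps its source file name and nearest heading, so a search result can point back to where it came from.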
| ## Search Methods | |
| ### Hybrid Search (Recommended) | |
| Combines semantic and keyword search with configurable weights: | |
| - Default: 60% vector + 40% BM25 | |
| - A sensible default for most queries | |
| - Balances meaning and exact term matching | |
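
The weighting described above can be sketched as a weighted sum of per-document scores. The min-max normalization step and the function names `min_max` and `hybrid_scores` are assumptions made for this example; the README only states the 60/40 default weights, not how the two score scales are reconciled.

```python
import numpy as np


def min_max(scores: np.ndarray) -> np.ndarray:
    """Rescale scores to [0, 1]; a constant array maps to all zeros."""
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores, dtype=float)
    return (scores - scores.min()) / span


def hybrid_scores(vector_scores, bm25_scores, vector_weight=0.6, bm25_weight=0.4):
    """Weighted sum of normalized vector and BM25 scores (one entry per document)."""
    v = min_max(np.asarray(vector_scores, dtype=float))
    b = min_max(np.asarray(bm25_scores, dtype=float))
    return vector_weight * v + bm25_weight * b


# Three documents: the first wins on semantics, the second on keywords.
combined = hybrid_scores([0.9, 0.5, 0.1], [0.0, 4.0, 2.0])
ranking = np.argsort(-combined)  # document indices, best first
```

Normalizing first matters because cosine similarities and raw BM25 scores live on different scales; without it, the BM25 term would typically dominate the sum.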
| ### Vector Search | |
| Pure semantic similarity using transformer embeddings: | |
| - Best for conceptual questions | |
| - Finds semantically related content | |
| - Matches paraphrases and synonyms rather than exact wording | |
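
Semantic similarity between embeddings is commonly measured with cosine similarity. The sketch below uses toy 2-D vectors in place of the 384-dimensional model embeddings; `cosine_top_k` is a hypothetical helper name, and the use of cosine similarity here is an assumption about the implementation.

```python
import numpy as np


def cosine_top_k(query_vec, doc_matrix, top_k=5):
    """Return (index, similarity) pairs for the top_k rows most similar to the query.

    Assumes all vectors are non-zero.
    """
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_matrix, dtype=float)
    # After normalization, the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in order]


# Toy document vectors; real ones would come from the embedding model.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = cosine_top_k([1.0, 0.0], docs, top_k=2)
```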
| ### BM25 Search | |
| Traditional keyword-based ranking: | |
| - Excellent for specific terms | |
| - TF-IDF-style scoring with term-frequency saturation and document length normalization | |
| - Fast and interpretable | |
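
For reference, a compact BM25 scorer might look like the following. This is a sketch and not the project's custom implementation: the function name, the pre-tokenized input, and the Lucene-style IDF variant are all assumptions. The `k1` and `b` defaults mirror the values listed under Configuration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Lucene-style IDF (an assumption); it stays positive for common terms.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # k1 caps the gain from repeated terms; b scales the length penalty.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    ["hybrid", "search", "bm25"],
    ["vector", "search"],
    ["bm25", "bm25", "ranking"],
]
scores = bm25_scores(["bm25"], corpus)
```

In this toy corpus the third document scores highest for the query `bm25` because the term appears twice, while the second document scores zero because it does not contain the term at all.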
| ## 🛠️ API Endpoints | |
| ### Search API | |
| - `GET /api/search` - accepts the query parameters `query`, `top_k`, and `search_type` | |
| - `GET /api/stats` | |
| ## Configuration | |
| Key parameters in `config.py`: | |
| - `BM25_K1 = 1.5` - Term frequency saturation | |
| - `BM25_B = 0.75` - Document length normalization | |
| - `DEFAULT_VECTOR_WEIGHT = 0.6` - Hybrid search weighting | |
| - `DEFAULT_BM25_WEIGHT = 0.4` - Hybrid search weighting | |
| ## Deployment | |
| 1. Clone to Hugging Face Spaces | |
| 2. Ensure all markdown files are in `knowledge_base/` | |
| 3. The system auto-initializes on startup | |
| 4. Access via the provided Space URL | |
| ## 💡 Usage Examples | |
| **Chat Interface:** | |
| - "What is Raktim's LLM research?" | |
| - "Tell me about statistical methods" | |
| - "Describe multimodal AI capabilities" | |
| **Advanced Search:** | |
| - Adjust vector/BM25 weights | |
| - Compare search methods | |
| - Fine-tune result count | |
| **API Integration:** | |
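| ```python | |
| import requests | |
| # Replace the placeholder host with your Space's URL. | |
| response = requests.get( | |
|     "https://your-space.hf.space/api/search", | |
|     params={ | |
|         "query": "machine learning research", | |
|         "top_k": 5, | |
|         "search_type": "hybrid", | |
|     }, | |
|     timeout=30, | |
| ) | |
| response.raise_for_status()  # raise on 4xx/5xx responses | |
| print(response.json())  # assumes the endpoint returns JSON | |
| ``` | |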