---
title: Vaultwise Knowledge
emoji: "\U0001F4DA"
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
license: mit
---

# Vaultwise -- Knowledge Management Platform

**Interactive demo for [Vaultwise](https://github.com/dbhavery/vaultwise), a knowledge management platform with document ingestion, vector search, AI-powered Q&A, training generation, and analytics.**

Vaultwise is a full-stack application (FastAPI + React) designed for teams that need to organize, search, and learn from their internal knowledge base. This demo showcases the core search and analytics capabilities using a built-in 30-article corpus for a fictional SaaS company.

## Demo Tabs

| Tab | What It Does |
|-----|--------------|
| **Knowledge Search** | TF-IDF vector search over 30 knowledge base articles. Enter a query and get ranked results with relevance scores and highlighted matching terms. |
| **AI Q&A** | Natural language question answering grounded in the knowledge base. Finds the best-matching article via TF-IDF, then generates an answer with a source citation and a relevant excerpt. |
| **Training Generator** | Select any article to auto-generate a training module: learning objectives, a structured content outline, and a 5-question multiple-choice quiz. |
| **Knowledge Gap Analytics** | Dashboard with article distribution by category, freshness scores, view counts, and search query frequency analysis. |

## Search Algorithm

The TF-IDF search engine is implemented from scratch using only Python and NumPy -- no scikit-learn, no external NLP libraries.

### How It Works

**1. Tokenization**

Input text is lowercased, stripped of punctuation, and split into tokens. A stop word list then filters out common English words that carry little semantic weight.
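
The tokenization step can be sketched as follows. This is a minimal illustration, not the demo's actual code: the function name `tokenize` and the tiny `STOP_WORDS` set are assumptions made for the example, and the real stop word list is presumably longer.

```python
import re

# Hypothetical, deliberately small stop word list (an assumption for this
# sketch; the demo's actual list is not shown in this README).
STOP_WORDS = {"a", "an", "and", "are", "the", "is", "to", "of", "in",
              "for", "how", "do", "i", "my"}

def tokenize(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    # Replace every character that is not a letter, digit, or whitespace
    # with a space, so "password?" becomes "password ".
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

print(tokenize("How do I reset my password?"))  # ['reset', 'password']
```

The same function would be applied to both articles and queries so that they share one vocabulary.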

**2. Term Frequency (TF)**

Term frequency is augmented, meaning each raw count is normalized by the count of the most frequent term in the same document, to prevent bias toward longer documents:

```
TF(t, d) = 0.5 + 0.5 * (count(t, d) / max_count(d))
```
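
As a rough sketch of the formula above, the function below computes augmented TF for the terms that occur in a document. The name `augmented_tf` is hypothetical, and leaving absent terms out of the result (rather than giving them the 0.5 floor) is an assumption of this sketch.

```python
from collections import Counter

def augmented_tf(tokens: list[str]) -> dict[str, float]:
    """Augmented term frequency for every term present in `tokens`."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    # The most frequent term scores 1.0; every other present term
    # falls between 0.5 and 1.0.
    return {term: 0.5 + 0.5 * (count / max_count)
            for term, count in counts.items()}

print(augmented_tf(["billing", "invoice", "billing", "refund"]))
# {'billing': 1.0, 'invoice': 0.75, 'refund': 0.75}
```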

**3. Inverse Document Frequency (IDF)**

Measures how rare a term is across the corpus. Terms appearing in fewer documents receive higher weight:

```
IDF(t) = log(N / (1 + df(t)))
```

Where N is the total number of documents and df(t) is the number of documents containing term t. The +1 smoothing prevents division by zero.
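
A minimal sketch of the IDF computation, assuming documents are already tokenized. The function name and the four-document toy corpus are made up for illustration.

```python
import math

def inverse_document_frequency(docs: list[list[str]]) -> dict[str, float]:
    """IDF(t) = log(N / (1 + df(t))) for every term in the corpus."""
    n_docs = len(docs)
    df: dict[str, int] = {}
    for tokens in docs:
        # set() ensures each document counts a term at most once.
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / (1 + count)) for term, count in df.items()}

# Toy corpus (hypothetical), already tokenized.
docs = [
    ["billing", "invoice"],
    ["billing", "refund"],
    ["billing", "password"],
    ["sso", "password"],
]
idf = inverse_document_frequency(docs)
print(idf["billing"], idf["password"], idf["invoice"])
```

With this form of smoothing, a term that appears in N - 1 documents gets a weight of exactly 0, and a term that appears in all N documents gets a slightly negative weight, because N / (1 + N) < 1. Clamping such weights at zero is one common option, though this README does not say whether the demo does so.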

**4. TF-IDF Weight**

The final weight for each term in each document:

```
W(t, d) = TF(t, d) * IDF(t)
```
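
Putting the two factors together, a document can be turned into a dense vector over a fixed vocabulary. The sketch below is illustrative only: `tfidf_vector`, the three-term vocabulary, and the IDF values are all hypothetical, and giving absent terms a weight of 0 is an assumption rather than something stated above.

```python
import math
from collections import Counter

def tfidf_vector(tokens, vocabulary, idf):
    """Build a dense TF-IDF vector for one document over a fixed vocabulary."""
    counts = Counter(tokens)
    max_count = max(counts.values(), default=0)
    vector = []
    for term in vocabulary:
        count = counts.get(term, 0)
        if count == 0:
            # Assumption: absent terms get weight 0, not the 0.5 TF floor.
            vector.append(0.0)
        else:
            tf = 0.5 + 0.5 * (count / max_count)
            vector.append(tf * idf[term])
    return vector

# Hypothetical vocabulary and IDF values, for illustration only.
vocabulary = ["invoice", "password", "refund"]
idf = {"invoice": math.log(2), "password": math.log(4 / 3), "refund": math.log(2)}

weights = tfidf_vector(["refund", "refund", "invoice"], vocabulary, idf)
print(weights)
```

Here "refund" is the most frequent term, so its TF is 1.0 and its weight equals its IDF. "invoice" has a TF of 0.75, and "password" does not occur, so its weight is 0.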

**5. Cosine Similarity**

Queries are converted to TF-IDF vectors using the same vocabulary and IDF values. Ranking uses cosine similarity between the query vector and each document vector:

```
similarity(q, d) = (q . d) / (||q|| * ||d||)
```

Cosine similarity depends only on the angle between the two vectors, not on their magnitudes, which makes the score largely insensitive to document length.
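
The ranking step can be sketched without dependencies as shown below. The demo itself uses NumPy, so this pure-Python version illustrates the formula and is not the actual implementation. The function names and the zero-vector guard are assumptions.

```python
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0 or norm_d == 0:
        # Assumption: a query or document with no known terms scores 0.
        return 0.0
    return dot / (norm_q * norm_d)

def rank(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 2-dimensional vectors (hypothetical). Document 1 points in the same
# direction as the query but is three times longer.
results = rank([1.0, 0.0], [[0.0, 1.0], [3.0, 0.0], [1.0, 1.0]])
print([index for index, _ in results])  # [1, 2, 0]
```

Document 1 scores 1.0 even though its vector is three times longer than the query's. This is the length independence described above.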

### Architecture (Full Platform)

```
Frontend (React + Vite)
        |
        v
API Gateway (FastAPI)
        |
        +-- Document Ingestion Pipeline
        |       PDF, HTML, Markdown parsing
        |       Chunking and metadata extraction
        |
        +-- Search Engine
        |       TF-IDF vectorization
        |       Cosine similarity ranking
        |       Query expansion and filtering
        |
        +-- AI Q&A Module
        |       Context retrieval via search
        |       LLM-powered answer generation
        |       Source citation and grounding
        |
        +-- Training Generator
        |       Article analysis
        |       Outline and quiz generation
        |       Learning objective extraction
        |
        +-- Analytics Engine
                Usage tracking
                Freshness scoring
                Gap identification
```

## Running Locally

```bash
pip install gradio numpy matplotlib
python app.py
```

## Links

- **Source code:** [github.com/dbhavery/vaultwise](https://github.com/dbhavery/vaultwise)
- **Author:** [Don Havery](https://github.com/dbhavery)