---
title: Vaultwise Knowledge
emoji: "\U0001F4DA"
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
license: mit
---
# Vaultwise -- Knowledge Management Platform
**Interactive demo for [Vaultwise](https://github.com/dbhavery/vaultwise), a knowledge management platform with document ingestion, vector search, AI-powered Q&A, training generation, and analytics.**
Vaultwise is a full-stack application (FastAPI + React) designed for teams that need to organize, search, and learn from their internal knowledge base. This demo showcases the core search and analytics capabilities using a built-in 30-article corpus for a fictional SaaS company.
## Demo Tabs
| Tab | What It Does |
|-----|--------------|
| **Knowledge Search** | TF-IDF vector search over 30 knowledge base articles. Enter a query, get ranked results with relevance scores and highlighted matching terms. |
| **AI Q&A** | Natural language question answering grounded in the knowledge base. Finds the best-matching article via TF-IDF, then generates an answer with source citation and relevant excerpt. |
| **Training Generator** | Select any article to auto-generate a training module: learning objectives, structured content outline, and a 5-question multiple-choice quiz. |
| **Knowledge Gap Analytics** | Dashboard with article distribution by category, freshness scores, view counts, and search query frequency analysis. |
## Search Algorithm
The TF-IDF search engine is implemented from scratch using only Python and NumPy -- no scikit-learn, no external NLP libraries.
### How It Works
**1. Tokenization**
Input text is lowercased, punctuation-stripped, and split into tokens. A stop word list filters out common English words that carry no semantic weight.
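As a rough illustration of this step, the sketch below lowercases, strips punctuation, splits on whitespace, and filters stop words. The function name `tokenize` and the short `STOP_WORDS` set are assumptions made for the example; the demo's actual stop word list and helper names may differ.

```python
import re

# Hypothetical, abbreviated stop word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "how", "do", "i"}


def tokenize(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    # Replace every character that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [token for token in cleaned.split() if token not in STOP_WORDS]


print(tokenize("How do I reset the API key?"))  # ['reset', 'api', 'key']
```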
**2. Term Frequency (TF)**
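The engine uses augmented term frequency, which divides each term's raw count by the count of the most frequent term in the same document. This reduces the bias toward longer documents, where raw counts are naturally higher. Here `count(t, d)` is the number of times term `t` occurs in document `d`, and `max_count(d)` is the count of the most frequent term in `d`: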
```
TF(t, d) = 0.5 + 0.5 * (count(t, d) / max_count(d))
```
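A minimal sketch of the formula above in plain Python. The function name `augmented_tf` is hypothetical, and the sketch assumes that terms absent from a document get no entry (an implicit weight of 0) rather than the 0.5 floor; the demo may handle that case differently.

```python
from collections import Counter


def augmented_tf(tokens: list[str]) -> dict[str, float]:
    """Return 0.5 + 0.5 * count / max_count for every term present in the document."""
    counts = Counter(tokens)
    if not counts:
        return {}  # empty document: no terms to weight
    max_count = max(counts.values())
    return {term: 0.5 + 0.5 * (n / max_count) for term, n in counts.items()}


tf = augmented_tf(["billing", "invoice", "billing", "refund"])
print(tf["billing"])  # 1.0  (most frequent term)
print(tf["invoice"])  # 0.75 (appears half as often)
```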
**3. Inverse Document Frequency (IDF)**
Measures how rare a term is across the corpus. Terms appearing in fewer documents receive higher weight:
```
IDF(t) = log(N / (1 + df(t)))
```
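Here N is the total number of documents and df(t) is the number of documents containing term t. The +1 in the denominator prevents division by zero for a term that appears in no document, such as a query word outside the corpus vocabulary. One consequence of this smoothing is that a term present in every document gets a slightly negative value, log(N / (N + 1)).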
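The formula can be sketched in a few lines of Python. The function name and the three-document toy corpus below are made up for illustration, and the natural logarithm is an assumption, since the base is not stated.

```python
import math


def inverse_document_frequency(docs: list[list[str]]) -> dict[str, float]:
    """Compute log(N / (1 + df(t))) for every term in a tokenized corpus."""
    n_docs = len(docs)
    doc_freq: dict[str, int] = {}
    for tokens in docs:
        for term in set(tokens):  # count each term at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / (1 + df)) for term, df in doc_freq.items()}


# Toy corpus (hypothetical): "billing" is in every document, "refund" in one.
corpus = [["billing", "refund"], ["billing", "login"], ["billing", "sso"]]
idf = inverse_document_frequency(corpus)
print(round(idf["refund"], 3))   # 0.405  -> log(3 / 2)
print(round(idf["billing"], 3))  # -0.288 -> log(3 / 4), negative for a ubiquitous term
```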
**4. TF-IDF Weight**
The final weight for each term in each document:
```
W(t, d) = TF(t, d) * IDF(t)
```
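To make the combination concrete, the sketch below multiplies per-document TF values by corpus-wide IDF values. The function name `tfidf_weights` and the sample numbers are illustrative assumptions, as is the choice to give a weight of 0 to terms missing from the IDF table.

```python
def tfidf_weights(tf: dict[str, float], idf: dict[str, float]) -> dict[str, float]:
    """Compute W(t, d) = TF(t, d) * IDF(t) for each term in one document."""
    # Terms with no IDF entry (not in the corpus vocabulary) are weighted 0 here.
    return {term: tf_value * idf.get(term, 0.0) for term, tf_value in tf.items()}


# Hypothetical values: "billing" is common across the corpus, "refund" is rare.
doc_tf = {"billing": 1.0, "refund": 0.75}
corpus_idf = {"billing": 0.1, "refund": 2.0}
weights = tfidf_weights(doc_tf, corpus_idf)
print(weights)  # {'billing': 0.1, 'refund': 1.5}
```

Even though "billing" has the higher term frequency in this document, the rarer term "refund" ends up with the larger weight.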
**5. Cosine Similarity**
Queries are converted to TF-IDF vectors using the same vocabulary and IDF values. Ranking uses cosine similarity between the query vector and each document vector:
```
similarity(q, d) = (q . d) / (||q|| * ||d||)
```
Cosine similarity depends only on the angle between the two vectors, not on their magnitudes, so a long document is not ranked higher simply because it contains more terms.
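The following sketch shows the similarity computation on sparse vectors stored as dictionaries. It uses plain Python for portability, whereas the demo is described as using NumPy; the function name and the sample vectors are assumptions for illustration.

```python
import math


def cosine_similarity(q: dict[str, float], d: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * d.get(term, 0.0) for term, weight in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an all-zero vector has no direction; treat as no match
    return dot / (norm_q * norm_d)


query = {"reset": 1.0, "password": 1.0}
short_doc = {"reset": 1.0, "password": 1.0, "email": 1.0}
# Same direction as short_doc, ten times the magnitude.
long_doc = {term: 10.0 * weight for term, weight in short_doc.items()}

print(round(cosine_similarity(query, short_doc), 4))  # 0.8165
print(round(cosine_similarity(query, long_doc), 4))   # 0.8165
```

Both documents receive the same score because scaling a vector changes its length but not its direction.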
### Architecture (Full Platform)
```
Frontend (React + Vite)
        |
        v
API Gateway (FastAPI)
        |
        +-- Document Ingestion Pipeline
        |       PDF, HTML, Markdown parsing
        |       Chunking and metadata extraction
        |
        +-- Search Engine
        |       TF-IDF vectorization
        |       Cosine similarity ranking
        |       Query expansion and filtering
        |
        +-- AI Q&A Module
        |       Context retrieval via search
        |       LLM-powered answer generation
        |       Source citation and grounding
        |
        +-- Training Generator
        |       Article analysis
        |       Outline and quiz generation
        |       Learning objective extraction
        |
        +-- Analytics Engine
                Usage tracking
                Freshness scoring
                Gap identification
```
## Running Locally
```bash
pip install gradio numpy matplotlib
python app.py
```
## Links
- **Source code:** [github.com/dbhavery/vaultwise](https://github.com/dbhavery/vaultwise)
- **Author:** [Don Havery](https://github.com/dbhavery)