metadata
title: Vaultwise Knowledge
emoji: 📚
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
license: mit

Vaultwise -- Knowledge Management Platform

Interactive demo for Vaultwise, a knowledge management platform with document ingestion, vector search, AI-powered Q&A, training generation, and analytics.

Vaultwise is a full-stack application (FastAPI + React) designed for teams that need to organize, search, and learn from their internal knowledge base. This demo showcases the core search and analytics capabilities using a built-in 30-article corpus for a fictional SaaS company.

Demo Tabs

| Tab | What It Does |
| --- | --- |
| Knowledge Search | TF-IDF vector search over 30 knowledge base articles. Enter a query, get ranked results with relevance scores and highlighted matching terms. |
| AI Q&A | Natural language question answering grounded in the knowledge base. Finds the best-matching article via TF-IDF, then generates an answer with source citation and relevant excerpt. |
| Training Generator | Select any article to auto-generate a training module: learning objectives, structured content outline, and a 5-question multiple-choice quiz. |
| Knowledge Gap Analytics | Dashboard with article distribution by category, freshness scores, view counts, and search query frequency analysis. |
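
As an illustration of the "highlighted matching terms" behavior in the Knowledge Search tab, here is a minimal sketch. The function name `highlight` and the use of `**` markers are assumptions made for this example; the demo's actual markup is not described here.

```python
import re

def highlight(text: str, terms: list[str]) -> str:
    """Wrap whole-word, case-insensitive matches of `terms` in ** markers.

    Hypothetical helper: the ** markers are an assumption, not the demo's markup.
    """
    if not terms:
        return text
    # Escape each term so punctuation in a term cannot alter the pattern.
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    # Keep the original casing of the matched text.
    return pattern.sub(lambda m: f"**{m.group(0)}**", text)

print(highlight("Rotate your API key from the Settings page.", ["api", "key"]))
# → Rotate your **API** **key** from the Settings page.
```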

Search Algorithm

The TF-IDF search engine is implemented from scratch using only Python and NumPy -- no scikit-learn, no external NLP libraries.

How It Works

1. Tokenization

Input text is lowercased, stripped of punctuation, and split into tokens. A stop word list filters out common English words that carry little semantic weight.
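
The tokenization step could look roughly like the sketch below. The `tokenize` name, the regular expression, and the abbreviated `STOP_WORDS` set are assumptions for illustration; the demo's real stop word list is not shown here.

```python
import re

# Hypothetical, abbreviated stop word list (the demo's actual list is not shown).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "how", "do", "i"}

def tokenize(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

print(tokenize("How do I reset the API key?"))
# → ['reset', 'api', 'key']
```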

2. Term Frequency (TF)

The engine uses augmented term frequency, which normalizes each raw count by the count of the most frequent term in the document to reduce bias toward longer documents:

TF(t, d) = 0.5 + 0.5 * (count(t, d) / max_count(d))
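
A minimal sketch of this formula in plain Python follows. The `augmented_tf` name is hypothetical, and the sketch assumes that terms absent from a document get no entry (a weight of zero) rather than the 0.5 floor.

```python
from collections import Counter

def augmented_tf(tokens: list[str]) -> dict[str, float]:
    """Augmented term frequency for the terms present in one document."""
    counts = Counter(tokens)
    if not counts:
        return {}  # empty document: no terms to weight
    max_count = max(counts.values())
    # Assumption: only terms that occur in the document are weighted.
    return {term: 0.5 + 0.5 * (n / max_count) for term, n in counts.items()}

tf = augmented_tf(["billing", "invoice", "billing", "refund"])
print(tf)
# → {'billing': 1.0, 'invoice': 0.75, 'refund': 0.75}
```

Every term that occurs in the document scores between 0.5 and 1.0, so repeating a word many times in a long article cannot inflate its weight without bound.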

3. Inverse Document Frequency (IDF)

Measures how rare a term is across the corpus. Terms appearing in fewer documents receive higher weight:

IDF(t) = log(N / (1 + df(t)))

Where N is the total number of documents and df(t) is the number of documents containing term t. The +1 smoothing prevents division by zero.
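
The IDF formula can be sketched as below, assuming each document has already been tokenized into a list of strings. The function name and the four-document toy corpus are made up for this example.

```python
import math

def inverse_document_frequency(docs: list[list[str]]) -> dict[str, float]:
    """IDF(t) = log(N / (1 + df(t))) for every term in the corpus."""
    n_docs = len(docs)
    df: dict[str, int] = {}
    for tokens in docs:
        # set() so a term counts once per document, however often it repeats.
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / (1 + count)) for term, count in df.items()}

# Toy corpus (illustrative only).
docs = [
    ["billing", "invoice"],
    ["billing", "refund"],
    ["billing", "sso"],
    ["sso", "login"],
]
idf = inverse_document_frequency(docs)
print(round(idf["invoice"], 4), round(idf["sso"], 4), idf["billing"])
# → 0.6931 0.2877 0.0
```

One consequence of this smoothing is worth knowing: a term found in N - 1 documents gets a weight of exactly zero (as `billing` does here), and a term found in every document gets a slightly negative weight.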

4. TF-IDF Weight

The final weight for each term in each document:

W(t, d) = TF(t, d) * IDF(t)
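
Combining the two quantities is a per-term multiplication. The sketch below uses sparse dictionaries for readability; a NumPy version would hold the same values in dense arrays indexed by vocabulary position. The `tfidf_weights` name and the sample values are assumptions.

```python
def tfidf_weights(tf: dict[str, float], idf: dict[str, float]) -> dict[str, float]:
    """W(t, d) = TF(t, d) * IDF(t) for one document, as a sparse vector."""
    # Assumption: terms missing from the corpus vocabulary (no IDF) are skipped.
    return {term: tf_val * idf[term] for term, tf_val in tf.items() if term in idf}

# Hypothetical TF and IDF values for a single document.
tf = {"billing": 1.0, "invoice": 0.75, "typo": 0.75}
idf = {"billing": 0.0, "invoice": 0.6931471805599453}

weights = tfidf_weights(tf, idf)
print(weights)
```

Here `billing` ends up with weight 0.0 because its IDF is zero, and `typo` is dropped because it has no IDF entry.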

5. Cosine Similarity

Queries are converted to TF-IDF vectors using the same vocabulary and IDF values. Ranking uses cosine similarity between the query vector and each document vector:

similarity(q, d) = (q . d) / (||q|| * ||d||)

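This measures the cosine of the angle between the two vectors, so the score depends on their direction rather than their magnitude, which makes it largely insensitive to document length.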
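
A small sketch of cosine ranking over sparse TF-IDF vectors follows. The function names, document IDs, and weights are invented for illustration, and returning 0.0 for a zero-length vector is an assumption about how the edge case might be handled.

```python
import math

def cosine_similarity(q: dict[str, float], d: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_q * norm_d)

def rank(query_vec: dict[str, float], doc_vecs: dict[str, dict[str, float]]):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical document vectors; the third has the same direction as the
# first but twice the magnitude, standing in for a longer document.
doc_vecs = {
    "reset-api-key": {"reset": 0.9, "api": 0.7, "key": 0.7},
    "billing-faq": {"billing": 1.1, "invoice": 0.8},
    "long-api-guide": {"reset": 1.8, "api": 1.4, "key": 1.4},
}
query_vec = {"api": 0.7, "key": 0.7}

results = rank(query_vec, doc_vecs)
for doc_id, score in results:
    print(f"{doc_id}: {score:.4f}")
```

The two API documents receive the same score (about 0.74) despite the difference in magnitude, while `billing-faq` shares no terms with the query and scores 0.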

Architecture (Full Platform)

Frontend (React + Vite)
    |
    v
API Gateway (FastAPI)
    |
    +-- Document Ingestion Pipeline
    |       PDF, HTML, Markdown parsing
    |       Chunking and metadata extraction
    |
    +-- Search Engine
    |       TF-IDF vectorization
    |       Cosine similarity ranking
    |       Query expansion and filtering
    |
    +-- AI Q&A Module
    |       Context retrieval via search
    |       LLM-powered answer generation
    |       Source citation and grounding
    |
    +-- Training Generator
    |       Article analysis
    |       Outline and quiz generation
    |       Learning objective extraction
    |
    +-- Analytics Engine
            Usage tracking
            Freshness scoring
            Gap identification

Running Locally

pip install gradio numpy matplotlib
python app.py
