	---
	title: Vaultwise Knowledge
	emoji: "\U0001F4DA"
	colorFrom: indigo
	colorTo: blue
	sdk: gradio
	sdk_version: 5.29.0
	app_file: app.py
	pinned: false
	license: mit
	---

	# Vaultwise -- Knowledge Management Platform

	Interactive demo for [Vaultwise](https://github.com/dbhavery/vaultwise), a knowledge management platform with document ingestion, vector search, AI-powered Q&A, training generation, and analytics.

	Vaultwise is a full-stack application (FastAPI + React) designed for teams that need to organize, search, and learn from their internal knowledge base. This demo showcases the core search and analytics capabilities using a built-in 30-article corpus for a fictional SaaS company.

	## Demo Tabs

	| Tab | What It Does |
	|-----|--------------|
	| Knowledge Search | TF-IDF vector search over 30 knowledge base articles. Enter a query, get ranked results with relevance scores and highlighted matching terms. |
	| AI Q&A | Natural language question answering grounded in the knowledge base. Finds the best-matching article via TF-IDF, then generates an answer with source citation and relevant excerpt. |
	| Training Generator | Select any article to auto-generate a training module: learning objectives, structured content outline, and a 5-question multiple-choice quiz. |
	| Knowledge Gap Analytics | Dashboard with article distribution by category, freshness scores, view counts, and search query frequency analysis. |

	## Search Algorithm

	The TF-IDF search engine is implemented from scratch using only Python and numpy -- no sklearn, no external NLP libraries.

	### How It Works

	1. Tokenization

	Input text is lowercased, stripped of punctuation, and split into tokens. A stop word list then filters out common English words that carry little semantic weight.
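
The tokenization step can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's code: the `tokenize` name and the short `STOP_WORDS` set are hypothetical, and replacing punctuation with spaces (rather than deleting it) is an assumption.

```python
import string

# Hypothetical stop word list; the project's actual list is not shown in this README.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for",
              "how", "do", "i", "my"}

def tokenize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    # Assumption: punctuation becomes a space, so "reset-password" yields two tokens.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = lowered.translate(table)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

print(tokenize("How do I reset my password?"))  # ['reset', 'password']
```

Applying the same function to documents and queries keeps both in the same token space, which the later vector comparison relies on.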

	2. Term Frequency (TF)

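	The engine uses augmented term frequency, which divides each term's raw count by the count of the most frequent term in the same document, to prevent bias toward longer documents: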

	```
	TF(t, d) = 0.5 + 0.5 * (count(t, d) / max_count(d))
	```
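
As a rough sketch of the formula above, the function below computes augmented TF for one tokenized document. The `augmented_tf` name is hypothetical, and it assumes that only terms present in the document receive a weight (absent terms are treated as 0 rather than 0.5).

```python
from collections import Counter

def augmented_tf(tokens: list[str]) -> dict[str, float]:
    """Return TF(t, d) = 0.5 + 0.5 * count(t, d) / max_count(d) for terms in d."""
    counts = Counter(tokens)
    if not counts:
        return {}  # empty document: no terms, no weights
    max_count = max(counts.values())
    # Assumption: terms that do not occur in the document are simply omitted.
    return {term: 0.5 + 0.5 * (c / max_count) for term, c in counts.items()}

tf = augmented_tf(["billing", "invoice", "billing", "refund"])
print(tf)  # {'billing': 1.0, 'invoice': 0.75, 'refund': 0.75}
```

The most frequent term always scores 1.0 and every other present term lands between 0.5 and 1.0, regardless of how long the document is.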

	3. Inverse Document Frequency (IDF)

	Measures how rare a term is across the corpus. Terms appearing in fewer documents receive higher weight:

	```
	IDF(t) = log(N / (1 + df(t)))
	```

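	Here N is the total number of documents and df(t) is the number of documents containing term t. The +1 smoothing prevents division by zero for terms that appear in no document. As a side effect, a term that appears in every document receives a slightly negative weight, log(N / (N + 1)).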
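
A minimal sketch of the IDF computation over a tokenized corpus follows. The function name and the four toy documents are made up for illustration, and the natural logarithm is an assumption, since the README does not state the log base.

```python
import math

def inverse_document_frequency(docs: list[list[str]]) -> dict[str, float]:
    """Return IDF(t) = log(N / (1 + df(t))) for every term in the corpus."""
    n_docs = len(docs)
    df: dict[str, int] = {}
    for tokens in docs:
        for term in set(tokens):  # count each term at most once per document
            df[term] = df.get(term, 0) + 1
    # Assumption: natural log; another base would only rescale the weights.
    return {term: math.log(n_docs / (1 + count)) for term, count in df.items()}

# Toy corpus (illustrative only): "billing" appears in 3 of 4 documents.
docs = [["billing", "invoice"], ["billing", "refund"],
        ["login", "password"], ["api", "billing"]]
idf = inverse_document_frequency(docs)
print(idf["billing"])  # 0.0, because log(4 / (1 + 3)) = log(1)
print(idf["invoice"])  # log(4 / 2), about 0.693
```

With this smoothing, a term found in all but one document scores exactly 0, so very common terms contribute nothing to the ranking.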

	4. TF-IDF Weight

	The final weight for each term in each document:

	```
	W(t, d) = TF(t, d) * IDF(t)
	```
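
To show how the two factors combine, here is a small sketch that turns per-document TF values and corpus-wide IDF values into a dense vector over a fixed vocabulary. The `tfidf_vector` name and the sample numbers are hypothetical; treating missing terms as weight 0 is an assumption.

```python
def tfidf_vector(tf: dict[str, float], idf: dict[str, float],
                 vocabulary: list[str]) -> list[float]:
    """Return W(t, d) = TF(t, d) * IDF(t) for each term, in vocabulary order."""
    # Assumption: a term missing from either mapping contributes 0.
    return [tf.get(term, 0.0) * idf.get(term, 0.0) for term in vocabulary]

# Illustrative values only, not taken from the demo corpus.
vocabulary = ["billing", "invoice", "refund"]
tf = {"billing": 1.0, "invoice": 0.75}
idf = {"billing": 0.0, "invoice": 0.5, "refund": 0.5}

print(tfidf_vector(tf, idf, vocabulary))  # [0.0, 0.375, 0.0]
```

A fixed vocabulary order is what lets document vectors and query vectors be compared position by position.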

	5. Cosine Similarity

	Queries are converted to TF-IDF vectors using the same vocabulary and IDF values. Ranking uses cosine similarity between the query vector and each document vector:

	```
	similarity(q, d) = (q . d) / (||q|| * ||d||)
	```

	Cosine similarity depends only on the angle between the two vectors, not on their magnitudes, so scores are independent of document length.
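
The ranking step can be sketched as follows. The project uses numpy for this; the version below sticks to the standard library so it runs without dependencies. The function names, the toy vectors, and the choice to score zero vectors as 0.0 are all assumptions for illustration.

```python
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """Return (q . d) / (||q|| * ||d||), or 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: no shared signal means no similarity
    return dot / (norm_q * norm_d)

def rank(query_vec: list[float], doc_vecs: list[list[float]]) -> list[tuple[int, float]]:
    """Return (document index, score) pairs sorted by descending similarity."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors: document 1 points the same way as the query but is longer.
doc_vecs = [[1.0, 0.0, 0.0], [2.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
for index, score in rank([1.0, 1.0, 0.0], doc_vecs):
    print(index, round(score, 3))
```

Document 1 ranks first even though its vector is twice as long as the query, which is the length independence described above.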

	## Architecture (Full Platform)

	```
	Frontend (React + Vite)
	    |
	    v
	API Gateway (FastAPI)
	    |
	    +-- Document Ingestion Pipeline
	    |       PDF, HTML, Markdown parsing
	    |       Chunking and metadata extraction
	    |
	    +-- Search Engine
	    |       TF-IDF vectorization
	    |       Cosine similarity ranking
	    |       Query expansion and filtering
	    |
	    +-- AI Q&A Module
	    |       Context retrieval via search
	    |       LLM-powered answer generation
	    |       Source citation and grounding
	    |
	    +-- Training Generator
	    |       Article analysis
	    |       Outline and quiz generation
	    |       Learning objective extraction
	    |
	    +-- Analytics Engine
	            Usage tracking
	            Freshness scoring
	            Gap identification
	```

	## Running Locally

	```bash
	pip install gradio numpy matplotlib
	python app.py
	```

	## Links

	- Source code: [github.com/dbhavery/vaultwise](https://github.com/dbhavery/vaultwise)
	- Author: [Don Havery](https://github.com/dbhavery)