Building a RAG Knowledge Hub on Hugging Face Spaces with NVIDIA Nemotron & Gradio

Community Article
Published June 14, 2026

km

We all have too many tabs open. Between insightful Medium articles hidden behind paywalls, dense arXiv papers, and a folder full of PDFs, keeping track of knowledge is a mess. What if you could throw all of these sources into a single application and just... chat with them?

Enter BuildSmall KnowledgeHubβ€”an open-source AI knowledge management tool built for Hugging Face Spaces. It acts as a modular, local-first (where it counts) Retrieval-Augmented Generation (RAG) pipeline powered by Gradio, Qdrant, and NVIDIA's Nemotron models.

Here is a look at how we built it, the tech stack, and how you can deploy your own instance.

🌟 What does it do?

BuildSmall KnowledgeHub is designed to ingest multi-modal, real-world data sources seamlessly:

  • Medium Articles: By leveraging the Freedium mirror, the app bypasses paywalls to extract readable text, along with image references and captions.
  • arXiv Papers: Just drop in an arXiv link or ID, and the app automatically downloads and parses the PDF.
  • Local PDFs: Standard document uploads for your personal files.

Once ingested, the app chunks the content, embeds it locally, stores it in a Qdrant vector database, and uses an LLM to generate highly accurate, grounded answers to your queries.

πŸ› οΈ The Tech Stack: Powered by NVIDIA & ZeroGPU

To make this app fast and accurate, we split the workload between local models running on Hugging Face's ZeroGPU infrastructure and cloud APIs.

1. Embedding Pipeline (Local on ZeroGPU) We use nvidia/llama-nemotron-colembed-vl-3b-v2 for generating embeddings. Because Hugging Face Spaces offers ZeroGPU support, we wrap our Gradio ingestion callbacks with the @spaces.GPU decorator. This dynamically allocates GPU resources exactly when the embedding model needs them, keeping the app efficient and cost-effective.

2. Visual Parsing For handling complex documents, we rely on the lightweight but powerful Qwen/Qwen2-VL-2B-Instruct model to parse visual and text elements.

3. Chat & Generation (NVIDIA API) Once the relevant chunks are retrieved from our Qdrant database, we pass the context to the nvidia/nvidia-nemotron-nano-9b-v2 model via NVIDIA's OpenAI-compatible API (integrate.api.nvidia.com). This ensures high-speed generation without needing massive VRAM on the Space itself.

πŸ—οΈ Overcoming the Medium Extraction Hurdle

One of the coolest features of this hub is the Medium extraction. Writing a scraper for Medium is notoriously difficult due to paywalls and dynamic content.

Instead of reinventing the wheel, we integrated Freedium (freedium-mirror.cfd). When a user inputs a Medium URL, the app translates it into a Freedium mirror link, scrapes the clean HTML, and intelligently extracts not just the text, but the alt tags and image URLs. This means the LLM actually knows what images were in the article, preserving crucial context that standard text scrapers lose.

πŸš€ Deploying Your Own Knowledge Hub

Deploying this to your own Hugging Face Space is incredibly straightforward.

1. Create a Space Create a new Gradio Space (select ZeroGPU if you have access, or standard CPU/GPU).

2. Add Your Secrets In your Space settings, add the following under Variables and secrets:

  • NVIDIA_API_KEY: Your key from the NVIDIA developer portal.
  • QDRANT_URL: Your hosted Qdrant cluster URL.
  • QDRANT_API_KEY: Your Qdrant API key.

(For ZeroGPU spaces, ensure ENABLE_ZEROGPU=true and EMBEDDING_DEVICE=cuda are set).

3. Push the Code

git remote add space https://huggingface.co/spaces/build-small-hackathon/KnowledgeMesh
git push space main

That’s it! You now have a personal, multi-modal research assistant.

πŸ”— Try it out

Built for the BuildSmall Hackathon / Backyard AI Track.

More from this author

Community

This is a really thoughtful and useful project. Loved the idea and how it makes managing knowledge feel much simpler and more accessible!

Sign up or log in to comment