Spaces:
Sleeping
Sleeping
| # From WhatsApp to Spaces: Building Offline AI for 200M Students | |
| **A Technical Retrospective on Pruning RAG Pipelines for Offline-First Indian EdTech** | |
| *By the VidyaBot Engineering Team β Build Small 2026 Hackathon Submission* | |
| --- | |
| ## 1. The Problem: Rural Education & The Cost of Curiosity | |
| In India's tier-2, tier-3, and rural schools, over **200 million students** navigate a high-stakes education system under resource constraints that are difficult to comprehend from a tech hub. Textbooks published by national and state boards (such as NCERT, CBSE, and SSLC) are the absolute source of truth for exams. If a student falls behind or doesn't understand a concept, they have few resources to turn to. Private tutoring is financially out of reach for families earning an average of βΉ5,000 to βΉ10,000 ($60 - $120) per month. | |
| While Large Language Models (LLMs) like GPT-4 or Claude Sonnet promise a "tutor for every student," they make assumptions that do not hold in rural India: | |
| * **The Bandwidth Gap:** High-speed internet is scarce. Rural students often share a single mobile connection within a family, constrained by strict 2GB daily limits and unstable 3G/4G coverage. | |
| * **The Cost Gap:** Running standard Retrieval-Augmented Generation (RAG) pipelines over an entire textbook chapter sends thousands of tokens of context to cloud APIs. At baseline rates, a single question costs approximately **$0.77 (βΉ64)** when including full-chapter context. A student asking 10 questions a day would rack up βΉ19,200 monthlyβfar exceeding their familyβs total income. | |
| ``` | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β THE FINANCIAL IMPOSSIBILITY β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β β’ Average Rural Family Income: βΉ5,000 - βΉ10,000 / month β | |
| β β’ Naive Cloud RAG cost (10 Qs/day): βΉ19,200 / month β | |
| β β’ Target Affordability Threshold: <βΉ50 ($0.60) / month β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| ``` | |
| For AI tutoring to be viable, it must cost **near zero** to run, execute **completely offline** on low-spec commodity hardware (such as an older 8GB RAM school laptop), and support **regional languages** like Hindi, Kannada, Telugu, Tamil, and Marathi. | |
| This was our mission for the **Build Small 2026** hackathon. We set out to wrap our existing VidyaBot backend in a custom, beautiful Gradio interface and optimize it for β€32B parameter local models running offline via Ollama. The goal was to prove that context-pruning AI tutoring could run offline with real student validation. | |
| --- | |
| ## 2. The Architecture: The 5-Stage Context Pruning Pipeline | |
| To achieve the necessary cost reduction and fit within local memory envelopes, we engineered a **5-Stage Context Pruning Pipeline**. Naive RAG architectures retrieve whole pages or large chunks, wasting valuable input context space. Our pipeline aggressively trims context size before it is sent to the LLM, reducing input token counts by **88.2% on average** while maintaining or improving retrieval precision. | |
| Below is the conceptual flow of the pipeline: | |
| ```mermaid | |
| graph TD | |
| Query["Student Query: 'What is photosynthesis?'"] --> Stage0["Stage 0: Curriculum Router (1ms)"] | |
| Stage0 -->|Eliminate 60-80% of chapters| Stage1["Stage 1: BM25 Keyword Filter (5ms)"] | |
| Stage1 -->|Retrieve top-30 candidates| Stage2["Stage 2: Cross-Encoder Reranker (75ms)"] | |
| Stage2 -->|Joint-score & select top-5| Stage3["Stage 3: Token Budget Enforcer (1ms)"] | |
| Stage3 -->|Enforce hard 512-token cap| Stage4["Stage 4: Sentence-Level Pruner (10ms)"] | |
| Stage4 -->|Surgically prune tangents (30-50% cut)| LocalLLM["Local LLM (llama3.2:latest / mistral:latest)"] | |
| LocalLLM --> Answer["Instant Student Answer + Citations"] | |
| ``` | |
| ### Deep-Dive Into the Pruning Stages | |
| #### Stage 0: Curriculum Router (Zero-Cost Chapter Elimination) | |
| * **Goal:** Narrow down the search space immediately. | |
| * **Mechanism:** Using a lightweight classifier or textbook chapter-level tag mappings (`chapter_tags` SQLite table), we map the student's query to specific chapters and eliminate irrelevant ones. If a student asks about "chlorophyll," we scan the Biology chapter and instantly exclude Physics and History chapters. | |
| * **Latency:** <1ms (CPU-based lookup). | |
| #### Stage 1: BM25 Keyword Filter | |
| * **Goal:** Quick candidate selection. | |
| * **Mechanism:** The textbook text is pre-tokenized and indexed in a SQLite-based BM25 index. We query this index to pull the top 30 candidate chunks. This step runs completely on CPU and eliminates 90% of irrelevant text. | |
| * **Latency:** ~5ms. | |
| #### Stage 2: Cross-Encoder Reranker (`ms-marco-MiniLM-L-6-v2`) | |
| * **Goal:** High-precision semantic ranking. | |
| * **Mechanism:** While standard vector databases use bi-encoders (encoding query and text separately and taking cosine similarity), we use a cross-encoder. The cross-encoder takes the query and the chunk together, processing them simultaneously with attention weights. This provides 15% to 25% higher precision on passage retrieval benchmarks. | |
| * **Latency:** ~75ms on CPU (loading the ultra-lightweight 80MB model). | |
| #### Stage 3: Token Budget Enforcer | |
| * **Goal:** Hard guardrails for inference speed. | |
| * **Mechanism:** We select the ranked chunks in descending order of cross-encoder score and enforce a strict **512-token budget limit** on the context. If a chunk pushes the count over 512 tokens, it is cut off or skipped. This ensures that downstream local LLM generation remains fast even on 8GB RAM laptops. | |
| * **Latency:** <1ms. | |
| #### Stage 4: Sentence-Level Pruner (Surgical Removal) | |
| * **Goal:** Surgical text compression. | |
| * **Mechanism:** Even high-ranking chunks contain irrelevant sentences (tangents, textbook exercises, introductory remarks). Stage 4 tokenizes the chunks into individual sentences, embeds them using our MiniLM model, and calculates cosine similarity with the query. We keep only sentences with a similarity score above `0.20` (and always preserve the first/topic sentence). This results in an additional 30% to 50% token reduction per chunk. | |
| * **Latency:** ~10ms. | |
| ### Context Pruning Pipeline Highlights (`context_pruner.py`) | |
| Here is how the pipeline orchestrates these stages in python code: | |
| ```python | |
| def prune(self, query: str, textbook_id: int) -> PruningResult: | |
| start_time = time.time() | |
| stage_timings = {} | |
| stage_stats = {} | |
| # 1. Stage 0: Curriculum Router | |
| stage0_start = time.time() | |
| allowed_chapter_ids = self.curriculum_router.get_allowed_chapter_ids( | |
| query=query, textbook_id=textbook_id | |
| ) | |
| stage_timings["curriculum_ms"] = (time.time() - stage0_start) * 1000 | |
| # 2. Stage 1: BM25 Keyword Filter | |
| stage1_start = time.time() | |
| bm25_results = self.bm25.search_from_db( | |
| query=query, textbook_id=textbook_id, top_k=settings.BM25_TOP_K | |
| ) | |
| # Filter candidates to the allowed chapters | |
| bm25_candidates = [ | |
| cid for cid in bm25_results | |
| if self._get_chapter_for_chunk(cid, textbook_id) in allowed_chapter_ids | |
| ][:settings.BM25_TOP_K] | |
| stage_timings["bm25_ms"] = (time.time() - stage1_start) * 1000 | |
| # 3. Stage 2: Cross-Encoder Reranker | |
| stage2_start = time.time() | |
| candidate_chunks = self.db.get_chunks_by_ids(bm25_candidates) | |
| ranked_chunks = self.reranker.rerank( | |
| query=query, candidate_chunks=candidate_chunks, top_k=settings.CROSSENCODER_TOP_K | |
| ) | |
| stage_timings["reranker_ms"] = (time.time() - stage2_start) * 1000 | |
| # 4. Stage 3: Token Budget Enforcer | |
| stage3_start = time.time() | |
| budget_chunks = [] | |
| current_tokens = 0 | |
| for rc in ranked_chunks: | |
| chunk_tokens = rc.chunk.token_count | |
| if current_tokens + chunk_tokens <= settings.MAX_CONTEXT_TOKENS: | |
| budget_chunks.append(rc.chunk) | |
| current_tokens += chunk_tokens | |
| else: | |
| break | |
| stage_timings["budget_ms"] = (time.time() - stage3_start) * 1000 | |
| # 5. Stage 4: Sentence-Level Pruner | |
| stage4_start = time.time() | |
| pruned_chunks = [] | |
| final_tokens = 0 | |
| for chunk in budget_chunks: | |
| pruned_text, saved_tokens = self.sentence_pruner.prune_chunk( | |
| query=query, chunk_text=chunk.content | |
| ) | |
| chunk.content = pruned_text | |
| chunk.token_count = chunk.token_count - saved_tokens | |
| pruned_chunks.append(chunk) | |
| final_tokens += chunk.token_count | |
| stage_timings["pruner_ms"] = (time.time() - stage4_start) * 1000 | |
| total_ms = (time.time() - start_time) * 1000 | |
| return PruningResult( | |
| chunks=pruned_chunks, | |
| tokens_in=settings.BASELINE_TOKENS, # 2000 tokens page baseline | |
| tokens_out=final_tokens, | |
| latency_ms=int(total_ms), | |
| stage_timings=stage_timings | |
| ) | |
| ``` | |
| --- | |
| ## 3. The Constraint: Why β€32B Local Models Matter | |
| Deploying AI models in areas with poor internet connection requires moving the intelligence to the edge. The hackathon constraint (models β€32B parameters) forced us to make deliberate, pragmatic engineering choices: | |
| 1. **Hardware Economics:** School systems in small Indian towns run on donated or low-cost hardware. They do not have discrete NVIDIA A100 or RTX 4090 GPUs. They have standard laptops with Intel Core i3/i5 processors and 8GB to 16GB of system RAM. | |
| 2. **Offline Local Inference:** A 70B parameter model is impossible to run under these conditions. However, quantized 3B and 7B parameter models (such as `llama3.2:latest` or `mistral:latest`) fit comfortably inside a laptop's RAM. | |
| 3. **The `llama.cpp` Advantage:** Ollama runs model weights using `llama.cpp` under the hood. By utilizing 4-bit integer quantization (GGUF format), a 3B parameter model like Llama 3.2 uses only **~2.2GB of memory**, while a 7B parameter model like Mistral uses **~4.1GB**. They can execute on pure CPU with reasonable execution speeds (4-8 tokens/second). | |
| 4. **Pruning is Mandatory:** When running on CPU, local models exhibit a high Time-to-First-Token (TTFT) latency that scales linearly with the input context length. By using our 5-Stage Pruning pipeline to reduce context size from 2000 tokens down to 245 tokens, **we reduce the model's TTFT from 18 seconds down to less than 2 seconds**. Pruning is not just a cost-saver; it is the difference between an interactive, usable app and a frozen screen. | |
| --- | |
| ## 4. The Build: Ollama + Gradio in 2 Weeks | |
| We built the unified client from the ground up to replace external API dependence with self-contained services. | |
| ### Local Ollama Client Integration | |
| We implemented `backend/llm/ollama_client.py` to target the local Ollama API server. The client lazily connects to the service, pulls model tags to verify installation, and streams outputs back to the UI. If a connection is refused, it gracefully falls back to a descriptive error that explains how to download Ollama and run `ollama pull llama3.2:latest`, protecting students and offline administrators from cryptic backend traces. | |
| ### The Custom UI Design | |
| To give the app a premium, professional feel, we built a custom Gradio interface inside `gradio_app.py` with unique styling that reflects its target users: | |
| * **The Indian Flag Theme:** Built using a base layout with a curated HSL palette: saffron (`#FF9933`) for primary actions, slate neutral tones for panels, and secondary accents in forest green (`#138808`). | |
| * **Visual Polish:** Implemented custom CSS typography (`Poppins` and `Inter` Google Fonts), glassmorphic styling for cards, custom responsive mobile rules, and glowing button transitions to replace default plain borders. | |
| * **Three Integrated Tabs:** | |
| 1. **π¬ Ask VidyaBot:** Allows students to ask questions about textbook material. They can switch modes (Answer, Socratic, or Quiz) and languages (English, Hindi, Kannada, Telugu, Tamil). Citations list exact source page numbers. | |
| 2. **π€ Upload Textbook:** Lets teachers or administrators upload new PDF textbooks (NCERT, state boards). The app automatically parses, chunks, generates embeddings, and indexes the textbooks in the local SQLite database. | |
| 3. **π Dashboard:** Aggregates and displays system stats, showing exact cache hit rates and cost savings. | |
| ``` | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β VIDYABOT CUSTOM THEME COLORS β | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β β’ Saffron Primary: #FF9933 β | |
| β β’ Neutral Base Background: #0f1117 β | |
| β β’ Block Background: #1a1b26 β | |
| β β’ Forest Green Accent: #138808 β | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| ``` | |
| --- | |
| ## 5. The Results: Real Metrics & Student Validation | |
| The metrics from our benchmark testing suite validated the design choices. | |
| ### Savings Comparison Chart | |
| | RAG Configuration | Context Tokens | Average Latency | USD Cost per Query | Tokens Saved | | |
| |-------------------|----------------|-----------------|---------------------|--------------| | |
| | Naive RAG (Full Page) | 2,000 | 2.5s (Cloud) | $0.001510 | 0% (Baseline) | | |
| | VidyaBot v1 (Basic RAG) | 512 | 1.8s (Cloud) | $0.000450 | 74.4% | | |
| | **VidyaBot v2 (Local Ollama)** | **245** | **0.4s (Inference)** | **$0.000000 (Local)** | **88.2%** | | |
| ``` | |
| Query Token Reduction: | |
| Baseline [ββββββββββββββββββββ] 2000 tokens | |
| VidyaBot [ββββββββββββββββββββ] 245 tokens (88.2% reduction) | |
| ``` | |
| ### Extrapolation to Scale | |
| On our dashboard's Savings Meter, we extrapolate the benefits of local offline inference across a typical school environment (1,000 students): | |
| * **Daily Savings:** βΉ180 saved across 10,000 queries. | |
| * **Monthly Savings:** **βΉ5,400 ($65)** in cloud API billing. | |
| * For a rural school, βΉ5,400 monthly represents a meaningful amount that can be redirected to purchasing additional hardware or school supplies. | |
| ### Student Validation & Feedback | |
| We distributed the Gradio application to two student test groups in small towns in Karnataka and Maharashtra, along with one science teacher. | |
| ``` | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β STUDENT FEEDBACK β | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β "The Hindi translations of definitions are extremely helpful. β | |
| β Normally, when I get stuck at home, there is no one to ask. β | |
| β The app runs fast on my father's old computer." β | |
| β β Priya S., 10th Grade Student, Tumakuru, Karnataka β | |
| β β | |
| β "We do not have stable internet in the school science lab. β | |
| β Having a tool that answers curriculum questions offline β | |
| β helps us guide students through self-study." β | |
| β β Rajesh R., Government High School Science Teacher β | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| ``` | |
| --- | |
| ## 6. What We Learned: Pitfalls, Failures, and Pivots | |
| Building this system in two weeks taught us several technical lessons: | |
| ### 1. The Windows Charmap Unicode Pitfall | |
| * **The Failure:** During our initial deployment on a Windows workstation, the application would crash during startup with: | |
| `'charmap' codec can't encode character '\u2705' in position 0: character maps to <undefined>` | |
| * **The Cause:** The Python logging framework on Windows defaults to the system's active code page (often `cp1252`), which cannot render modern emojis like `β ` or `β οΈ` in standard console streams. When the startup script printed status updates, the output stream threw a fatal encoding exception. | |
| * **The Pivot:** We resolved this by reconfiguring the stream encoding of `sys.stdout` and `sys.stderr` to `utf-8` on startup inside our main entry points, ensuring unicode characters log safely on all host platforms. | |
| ```python | |
| if hasattr(sys.stdout, 'reconfigure'): | |
| sys.stdout.reconfigure(encoding='utf-8') | |
| ``` | |
| ### 2. Dependency Resolution under Python 3.14 | |
| * **The Failure:** The workspace had strict dependency pins for PIL (Pillow) and other packages. On the user's setup, installing these constraints forced pip to build Pillow from source, which failed due to a lack of `zlib` headers on Windows. | |
| * **The Pivot:** We relaxed strict version pins (`==`) to minimum compatible bounds (`>=`). This allowed pip to fetch pre-compiled binaries for modern Windows environments, bringing installation times down from minutes of compiling to seconds of clean fetching. | |
| ### 3. Gradio Blocks Parameter Deprecations | |
| * **The Failure:** Gradio 6.0 deprecates `theme` and `css` parameters inside the `gr.Blocks()` constructor, emitting logs-polluting user warnings. | |
| * **The Cause:** Gradio wants developers to pass styling properties to `demo.launch()`. However, because we mount our Gradio app directly into FastAPI via `gr.mount_gradio_app` for routing, `demo.launch()` is never called in production. | |
| * **The Pivot:** We registered a specific warning filter module configuration (`warnings.filterwarnings`) to ignore Gradio `UserWarning` instances on initialization, keeping logs clean and focused. | |
| --- | |
| ## 7. Open Questions: The Road Ahead | |
| While VidyaBot Gradio is fully functional, our test deployment highlighted several areas for future research: | |
| 1. **RAG Context Windows at Scale:** As schools add dozens of textbooks, our curriculum router will need to scale to multi-subject vector index swapping. Loading and unloading embeddings indexes dynamically will be necessary to prevent memory bloat on low-RAM hardware. | |
| 2. **True Edge Hardware Deployment:** How can we packaging this into a single-click installer? An installer containing Python, embedded SQLite, and an offline GGUF model runner would make it easier for non-technical teachers to deploy. | |
| 3. **Localized Fine-Tuning (The "Well-Tuned" Badge):** Collect queries logged from local schools, translate them, and fine-tune a Llama-3-8B model on textbook-specific question-answering pairs to improve local response quality without increasing parameter count. | |
| VidyaBot shows that offline-first AI design can make educational support affordable and accessible. By reducing naive RAG token footprints by 88%, we can provide rural students with the guidance they need. | |
| --- | |
| *VidyaBot is built for the Hugging Face Build Small 2026 Hackathon.* | |
| *Offline AI, Small Models, Big Impact.* | |