FocusFlow Assistant
fix: pre-download embedding model during Docker build to avoid DNS failures on HF Spaces
3c23e28 | # FocusFlow - Technical Documentation | |
| ## π Project Overview | |
| ### Problem Statement | |
| Students face significant challenges in managing self-paced learning: | |
| - **Information Overload**: PDFs, videos, and notes scattered across sources make it difficult to create coherent study plans | |
| - **Lack of Personalization**: Generic study materials don't adapt to individual learning pace or mastery level | |
| - **No Progress Tracking**: Students can't easily measure improvement or identify knowledge gaps | |
| - **Verification Gap**: No way to trace AI-generated answers back to source materials | |
| ### Solution Description | |
| FocusFlow is an **intelligent, local-first study companion** that transforms unstructured learning materials into personalized, adaptive study experiences. It combines RAG (Retrieval-Augmented Generation) with synthetic student profiling to create customized learning paths that evolve based on performance. | |
| **Key Innovation**: Synthetic student profiles enable the app to "remember" progress across sessions and dynamically adjust difficulty, review frequency, and content depth based on demonstrated mastery. | |
| ### Target Users | |
| - **Self-paced learners** preparing for exams or certifications | |
| - **Students** managing multiple subjects with varied materials | |
| - **Knowledge workers** building expertise in new domains | |
| - **Anyone** seeking structured, verifiable learning from diverse sources | |
| --- | |
| ## π― Core Features | |
| ### 1. Multi-Subject Study Roadmap Generation | |
| - **Automated topic extraction** from uploaded PDFs and documents | |
| - **Multi-day planning** with topics distributed across subjects | |
| - **Subject identification** from document content and metadata | |
| - **Round-robin scheduling** ensures balanced coverage across all subjects | |
| - **Progressive unlocking** - topics unlock as previous ones are completed | |
| **Example**: Upload 3 PDFs β Get 5-day plan with 3 topics/day (one from each subject) | |
| ### 2. RAG-Based Q&A System | |
| - **Context-aware retrieval** using ChromaDB vector search | |
| - **Conversational memory** with chat history rewriting for follow-up questions | |
| - **Multi-source search** across all uploaded documents | |
| - **Streaming responses** with source citation | |
| - **Focus Mode** for distraction-free studying with side-by-side lesson/chat | |
| ### 3. Adaptive Quiz Generation | |
| - **Context-based questions** generated from actual course material | |
| - **Realistic distractors** using common misconceptions | |
| - **Guaranteed 3-question format** with intelligent fallbacks | |
| - **Score-based adaptation**: | |
| - Perfect score (3/3) β Future topics marked "Advanced" | |
| - Low score (1-2/3) β Future topics include review materials | |
| - **Automatic unlocking** of next topic upon quiz completion | |
| ### 4. Knowledge Tracking & Mastery System | |
| - **Subject-level mastery tracking** (High/Medium/Low) | |
| - **Historical quiz performance** with timestamps | |
| - **Average score calculation** across all attempts | |
| - **Mastery-based difficulty adjustment** for future content | |
| - **Analytics dashboard** with performance classification | |
| ### 5. Synthetic Student Profiles | |
| - **Persistent JSON storage** in `~/.focusflow/student_profile.json` | |
| - **Comprehensive tracking**: | |
| - Study plan with topic metadata | |
| - Quiz history with scores and timestamps | |
| - Mastery levels per subject | |
| - Time tracking per topic | |
| - Incomplete task queue | |
| - **Atomic writes** with backup for data integrity | |
| - **Thread-safe operations** for concurrent access | |
| ### 6. Data Persistence & Session Resumption | |
| - **Auto-save on key events**: | |
| - Plan generation | |
| - Quiz completion | |
| - Topic transitions | |
| - **Auto-load on startup** restores: | |
| - Active study plan | |
| - Quiz scores and progress | |
| - Mastery levels | |
| - Current position | |
| - **Toast notifications** for save/load feedback | |
| ### 7. Citations & Source Verification | |
| - **Expandable source references** with every AI response | |
| - **File + page number** for each citation | |
| - **Lesson content references** section at bottom | |
| - **Inline citation prompts** to LLM for accurate attribution | |
| - **Numbered citation format** for easy lookup | |
| --- | |
| ## ποΈ Technical Architecture | |
| ### Frontend Components (Streamlit) | |
| ``` | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β Control Center β Workspace β Calendar β | |
| β (Sidebar) β (Main) β (Sidebar) β | |
| ββββββββββββββββββββΌββββββββββββββΌβββββββββββββββββ€ | |
| β - Study Timer β - Chat UI β - Date Picker β | |
| β - Sources List β - Lessons β - Plan View β | |
| β - File Upload β - Analytics β - Today's β | |
| β - Plan Gen β - Quizzes β Topics β | |
| ββββββββββββββββββββ΄ββββββββββββββ΄βββββββββββββββββ | |
| ``` | |
| **Key UI Panels**: | |
| - **Control Center**: Timer, source management, plan generation | |
| - **Intelligent Workspace**: RAG chat, lesson viewer, analytics modal | |
| - **Calendar**: Date-based topic navigation, today's task list | |
| - **Focus Mode**: Immersive 2-column layout (chat | lesson) | |
| **Chat Interface**: | |
| - Message history with role differentiation (user/assistant) | |
| - Source citation expandables | |
| - Streaming/static responses | |
| - Contained scrollable area (600px height) | |
| **Lesson Viewer**: | |
| - Markdown rendering with references section | |
| - Scrollable document container (650px height) | |
| - Inline citations and code examples | |
| ### Backend Logic (FastAPI) | |
| **Planning Engine** (`generate_study_plan`): | |
| 1. Query vector store for topic-related content | |
| 2. Group documents by source (each source = subject) | |
| 3. Extract subject names from content/metadata | |
| 4. Round-robin topic selection across subjects | |
| 5. Generate multi-day schedule with metadata | |
| **Retrieval System** (`query_knowledge_base`): | |
| 1. Context rewriting for multi-turn conversations | |
| 2. Similarity search across vector database | |
| 3. LLM synthesis with retrieved context | |
| 4. Source metadata extraction and return | |
| **Quiz Generator** (`generate_quiz_data`): | |
| 1. Retrieve relevant content chunks for topic | |
| 2. Prompt LLM for context-based questions | |
| 3. Fallback question generation from raw content | |
| 4. Ensure exact 3-question output | |
| 5. Return structured quiz data | |
| **Lesson Generator** (`generate_lesson_content`): | |
| 1. Retrieve 8 context chunks (500 chars each) | |
| 2. Extract source citations from metadata | |
| 3. Prompt for structured lesson (600-800 words) | |
| 4. Append references section with file + page | |
| 5. Return formatted Markdown | |
| ### Data Storage | |
| **Vector Database (ChromaDB)**: | |
| - Local storage at `./chroma_db` | |
| - Nomic-embed-text embeddings | |
| - Metadata: source path, page number | |
| - Persistent across sessions | |
| **Student Profiles (JSON)**: | |
| ```json | |
| { | |
| "student_id": "student_20260105_233537", | |
| "study_plan": { | |
| "plan_id": "plan_...", | |
| "topics": [...], | |
| "num_days": 5 | |
| }, | |
| "quiz_history": [...], | |
| "mastery_tracker": {...}, | |
| "time_tracking": {...}, | |
| "incomplete_tasks": [...] | |
| } | |
| ``` | |
| **File Storage**: | |
| - Uploaded PDFs in `./data/` | |
| - Profile at `~/.focusflow/student_profile.json` | |
| - Backup at `~/.focusflow/student_profile.backup.json` | |
| ### Agentic Behaviors | |
| **Multi-Step Planning**: | |
| - Query β Retrieval β Topic Extraction β Subject Grouping β Schedule Generation | |
| - 5+ steps with intermediate reasoning | |
| **Tool Use**: | |
| - Vector DB search | |
| - LLM generation | |
| - Profile read/write | |
| - PDF ingestion | |
| **Memory**: | |
| - Chat history (5 last messages) | |
| - Student profile persistence | |
| - Quiz performance tracking | |
| - Mastery levels | |
| **Reflection**: | |
| - Score-based plan adaptation | |
| - Context quality assessment | |
| - Fallback strategies for generation failures | |
| --- | |
| ## π» Tech Stack | |
| ### Languages & Frameworks | |
| - **Frontend**: Python + Streamlit 1.x | |
| - **Backend**: FastAPI + Uvicorn | |
| - **Vector DB**: ChromaDB | |
| - **LLM Orchestration**: LangChain | |
| ### Libraries & APIs | |
| **Core Dependencies**: | |
| ```python | |
| streamlit # Frontend UI | |
| fastapi # Backend API | |
| uvicorn # ASGI server | |
| langchain # LLM orchestration | |
| chromadb # Vector database | |
| ollama # Local LLM inference | |
| requests # API communication | |
| pydantic # Data validation | |
| ``` | |
| **Document Processing**: | |
| ```python | |
| pypdf # PDF parsing | |
| beautifulsoup4 # Web scraping | |
| youtube-transcript-api # Video transcripts | |
| ``` | |
| **Data & Visualization**: | |
| ```python | |
| pandas # Data manipulation | |
| plotly # Analytics charts | |
| ``` | |
| ### Models | |
| - **Embedding**: `nomic-embed-text` (local via Ollama) | |
| - **Generation**: `llama3.2:1b` (local via Ollama) | |
| ### Storage Methods | |
| - **Vector Store**: ChromaDB (local, persistent) | |
| - **Profiles**: JSON files (atomic writes) | |
| - **PDF Files**: Local filesystem (`./data/`) | |
| - **Session State**: Streamlit session storage | |
| --- | |
| ## π Key Workflows | |
| ### Workflow 1: Study Plan Generation | |
| ``` | |
| User uploads PDFs | |
| β | |
| Backend ingests β Chunks β Embeds β Stores in ChromaDB | |
| β | |
| User: "Create 5-day plan" | |
| β | |
| retrieve_topics(k=20) | |
| β | |
| group_by_source() β identify_subjects() | |
| β | |
| round_robin_schedule(num_days=5) | |
| β | |
| save_to_profile() β Return plan | |
| β | |
| Frontend displays Today's Topics (Day 1 unlocked) | |
| ``` | |
| ### Workflow 2: RAG Retrieval | |
| ``` | |
| User asks: "What is encapsulation?" | |
| β | |
| history_exists? β rewrite_query() [Context rewriting] | |
| β | |
| similarity_search(question, k=3) | |
| β | |
| build_prompt(context + history + question) | |
| β | |
| llm.invoke() β extract_sources() | |
| β | |
| Return {answer: str, sources: [{file, page}]} | |
| β | |
| Frontend displays answer + expandable citations | |
| ``` | |
| ### Workflow 3: Adaptive Quiz Flow | |
| ``` | |
| User unlocks Topic | |
| β | |
| load_lesson() β display_markdown() | |
| β | |
| User clicks "Take Quiz" | |
| β | |
| retrieve_context(topic, k=8) | |
| β | |
| generate_quiz(3_questions) | |
| β | |
| fallback if < 3? β context_based_fallback() | |
| β | |
| User answers β calculate_score() | |
| β | |
| score==3? β mark_advanced() | |
| score<3? β mark_review() | |
| β | |
| update_mastery_tracker() β save_profile() | |
| β | |
| unlock_next_topic() β rerun() | |
| ``` | |
| ### Workflow 4: Mastery Tracking Adaptation | |
| ``` | |
| Quiz completed with score X | |
| β | |
| update_subject_mastery({ | |
| scores: [..., X], | |
| avg_score: calculate_average(), | |
| mastery_level: determine_level() // High: β₯75%, Medium: β₯50%, Low: <50% | |
| }) | |
| β | |
| mastery_level==HIGH? | |
| β Future topics: Faster pace, advanced examples | |
| mastery_level==LOW? | |
| β Future topics: More review, foundational content | |
| β | |
| save_to_profile() | |
| β | |
| Next plan generation uses mastery data for difficulty | |
| ``` | |
| --- | |
| ## π Evaluation Metrics | |
| ### 1. Plan Quality Assessment | |
| **Metrics**: | |
| - **Subject Coverage**: % of uploaded subjects represented daily | |
| - **Balance Score**: Standard deviation of topics per subject | |
| - **Unlocking Logic**: % of topics that unlock correctly after quiz | |
| **Target**: | |
| - 100% subject coverage (all PDFs represented) | |
| - StdDev < 0.5 (even distribution) | |
| - 100% unlock success rate | |
| ### 2. Answer Accuracy Measurement | |
| **Metrics**: | |
| - **Source Relevance**: Cosine similarity of retrieved chunks | |
| - **Citation Accuracy**: % of answers with valid file+page citations | |
| - **Hallucination Rate**: Manual review of 50 Q&A pairs | |
| **Target**: | |
| - Avg similarity > 0.7 | |
| - 95%+ citation accuracy | |
| - <5% hallucination rate | |
| ### 3. Quiz Discrimination | |
| **Metrics**: | |
| - **Question Validity**: % of questions answerable from provided context | |
| - **Distractor Quality**: % of students choosing incorrect options | |
| - **Difficulty Spread**: Distribution across easy/medium/hard | |
| **Target**: | |
| - 100% context-answerable | |
| - 25-40% distractor selection rate (not too easy/hard) | |
| - Balanced difficulty distribution | |
| ### 4. User Mastery Gains | |
| **Metrics**: | |
| - **Score Progression**: Ξ average score from Day 1 to Day N | |
| - **Mastery Level Changes**: % of subjects moving from Low β Medium β High | |
| - **Retention Rate**: Quiz score on repeated topics after 1 week | |
| **Target**: | |
| - +15% average score improvement over 5 days | |
| - 60%+ mastery level improvement | |
| - 80%+ retention on repeated topics | |
| ### 5. System Performance | |
| **Metrics**: | |
| - **Plan Generation Time**: Seconds to generate 5-day plan | |
| - **Query Response Time**: Seconds from question to answer | |
| - **Profile Save Latency**: Milliseconds for atomic write | |
| **Target**:r | |
| - Plan gen: <10 seconds | |
| - Query response: <5 seconds | |
| - Save latency: <100ms | |
| --- | |
| ## π Future Enhancements | |
| - **Spaced Repetition**: Intelligent review scheduling using SM-2 algorithm | |
| - **Multi-User Support**: Authentication + isolated student profiles | |
| - **Cloud Deployment**: Oracle Cloud + Supabase for persistence | |
| - **Advanced Analytics**: Learning curve visualization, weak area identification | |
| - **Mobile Responsive**: Material Design responsive UI for mobile devices | |
| --- | |
| ## π Project Repository | |
| **GitHub**: [thesivarohith/hack](https://github.com/thesivarohith/hack) | |
| **Status**: Production-ready, cleaned codebase (commit: 9a8a489) | |
| --- | |
| *Documentation generated: 2026-01-06* | |