# FocusFlow - Technical Documentation ## 📋 Project Overview ### Problem Statement Students face significant challenges in managing self-paced learning: - **Information Overload**: PDFs, videos, and notes scattered across sources make it difficult to create coherent study plans - **Lack of Personalization**: Generic study materials don't adapt to individual learning pace or mastery level - **No Progress Tracking**: Students can't easily measure improvement or identify knowledge gaps - **Verification Gap**: No way to trace AI-generated answers back to source materials ### Solution Description FocusFlow is an **intelligent, local-first study companion** that transforms unstructured learning materials into personalized, adaptive study experiences. It combines RAG (Retrieval-Augmented Generation) with synthetic student profiling to create customized learning paths that evolve based on performance. **Key Innovation**: Synthetic student profiles enable the app to "remember" progress across sessions and dynamically adjust difficulty, review frequency, and content depth based on demonstrated mastery. ### Target Users - **Self-paced learners** preparing for exams or certifications - **Students** managing multiple subjects with varied materials - **Knowledge workers** building expertise in new domains - **Anyone** seeking structured, verifiable learning from diverse sources --- ## 🎯 Core Features ### 1. Multi-Subject Study Roadmap Generation - **Automated topic extraction** from uploaded PDFs and documents - **Multi-day planning** with topics distributed across subjects - **Subject identification** from document content and metadata - **Round-robin scheduling** ensures balanced coverage across all subjects - **Progressive unlocking** - topics unlock as previous ones are completed **Example**: Upload 3 PDFs → Get 5-day plan with 3 topics/day (one from each subject) ### 2. RAG-Based Q&A System - **Context-aware retrieval** using ChromaDB vector search - **Conversational memory** with chat history rewriting for follow-up questions - **Multi-source search** across all uploaded documents - **Streaming responses** with source citation - **Focus Mode** for distraction-free studying with side-by-side lesson/chat ### 3. Adaptive Quiz Generation - **Context-based questions** generated from actual course material - **Realistic distractors** using common misconceptions - **Guaranteed 3-question format** with intelligent fallbacks - **Score-based adaptation**: - Perfect score (3/3) → Future topics marked "Advanced" - Low score (1-2/3) → Future topics include review materials - **Automatic unlocking** of next topic upon quiz completion ### 4. Knowledge Tracking & Mastery System - **Subject-level mastery tracking** (High/Medium/Low) - **Historical quiz performance** with timestamps - **Average score calculation** across all attempts - **Mastery-based difficulty adjustment** for future content - **Analytics dashboard** with performance classification ### 5. Synthetic Student Profiles - **Persistent JSON storage** in `~/.focusflow/student_profile.json` - **Comprehensive tracking**: - Study plan with topic metadata - Quiz history with scores and timestamps - Mastery levels per subject - Time tracking per topic - Incomplete task queue - **Atomic writes** with backup for data integrity - **Thread-safe operations** for concurrent access ### 6. Data Persistence & Session Resumption - **Auto-save on key events**: - Plan generation - Quiz completion - Topic transitions - **Auto-load on startup** restores: - Active study plan - Quiz scores and progress - Mastery levels - Current position - **Toast notifications** for save/load feedback ### 7. Citations & Source Verification - **Expandable source references** with every AI response - **File + page number** for each citation - **Lesson content references** section at bottom - **Inline citation prompts** to LLM for accurate attribution - **Numbered citation format** for easy lookup --- ## 🏗️ Technical Architecture ### Frontend Components (Streamlit) ``` ┌─────────────────────────────────────────────────┐ │ Control Center │ Workspace │ Calendar │ │ (Sidebar) │ (Main) │ (Sidebar) │ ├──────────────────┼─────────────┼────────────────┤ │ - Study Timer │ - Chat UI │ - Date Picker │ │ - Sources List │ - Lessons │ - Plan View │ │ - File Upload │ - Analytics │ - Today's │ │ - Plan Gen │ - Quizzes │ Topics │ └──────────────────┴─────────────┴────────────────┘ ``` **Key UI Panels**: - **Control Center**: Timer, source management, plan generation - **Intelligent Workspace**: RAG chat, lesson viewer, analytics modal - **Calendar**: Date-based topic navigation, today's task list - **Focus Mode**: Immersive 2-column layout (chat | lesson) **Chat Interface**: - Message history with role differentiation (user/assistant) - Source citation expandables - Streaming/static responses - Contained scrollable area (600px height) **Lesson Viewer**: - Markdown rendering with references section - Scrollable document container (650px height) - Inline citations and code examples ### Backend Logic (FastAPI) **Planning Engine** (`generate_study_plan`): 1. Query vector store for topic-related content 2. Group documents by source (each source = subject) 3. Extract subject names from content/metadata 4. Round-robin topic selection across subjects 5. Generate multi-day schedule with metadata **Retrieval System** (`query_knowledge_base`): 1. Context rewriting for multi-turn conversations 2. Similarity search across vector database 3. LLM synthesis with retrieved context 4. Source metadata extraction and return **Quiz Generator** (`generate_quiz_data`): 1. Retrieve relevant content chunks for topic 2. Prompt LLM for context-based questions 3. Fallback question generation from raw content 4. Ensure exact 3-question output 5. Return structured quiz data **Lesson Generator** (`generate_lesson_content`): 1. Retrieve 8 context chunks (500 chars each) 2. Extract source citations from metadata 3. Prompt for structured lesson (600-800 words) 4. Append references section with file + page 5. Return formatted Markdown ### Data Storage **Vector Database (ChromaDB)**: - Local storage at `./chroma_db` - Nomic-embed-text embeddings - Metadata: source path, page number - Persistent across sessions **Student Profiles (JSON)**: ```json { "student_id": "student_20260105_233537", "study_plan": { "plan_id": "plan_...", "topics": [...], "num_days": 5 }, "quiz_history": [...], "mastery_tracker": {...}, "time_tracking": {...}, "incomplete_tasks": [...] } ``` **File Storage**: - Uploaded PDFs in `./data/` - Profile at `~/.focusflow/student_profile.json` - Backup at `~/.focusflow/student_profile.backup.json` ### Agentic Behaviors **Multi-Step Planning**: - Query → Retrieval → Topic Extraction → Subject Grouping → Schedule Generation - 5+ steps with intermediate reasoning **Tool Use**: - Vector DB search - LLM generation - Profile read/write - PDF ingestion **Memory**: - Chat history (5 last messages) - Student profile persistence - Quiz performance tracking - Mastery levels **Reflection**: - Score-based plan adaptation - Context quality assessment - Fallback strategies for generation failures --- ## 💻 Tech Stack ### Languages & Frameworks - **Frontend**: Python + Streamlit 1.x - **Backend**: FastAPI + Uvicorn - **Vector DB**: ChromaDB - **LLM Orchestration**: LangChain ### Libraries & APIs **Core Dependencies**: ```python streamlit # Frontend UI fastapi # Backend API uvicorn # ASGI server langchain # LLM orchestration chromadb # Vector database ollama # Local LLM inference requests # API communication pydantic # Data validation ``` **Document Processing**: ```python pypdf # PDF parsing beautifulsoup4 # Web scraping youtube-transcript-api # Video transcripts ``` **Data & Visualization**: ```python pandas # Data manipulation plotly # Analytics charts ``` ### Models - **Embedding**: `nomic-embed-text` (local via Ollama) - **Generation**: `llama3.2:1b` (local via Ollama) ### Storage Methods - **Vector Store**: ChromaDB (local, persistent) - **Profiles**: JSON files (atomic writes) - **PDF Files**: Local filesystem (`./data/`) - **Session State**: Streamlit session storage --- ## 🔄 Key Workflows ### Workflow 1: Study Plan Generation ``` User uploads PDFs ↓ Backend ingests → Chunks → Embeds → Stores in ChromaDB ↓ User: "Create 5-day plan" ↓ retrieve_topics(k=20) ↓ group_by_source() → identify_subjects() ↓ round_robin_schedule(num_days=5) ↓ save_to_profile() → Return plan ↓ Frontend displays Today's Topics (Day 1 unlocked) ``` ### Workflow 2: RAG Retrieval ``` User asks: "What is encapsulation?" ↓ history_exists? → rewrite_query() [Context rewriting] ↓ similarity_search(question, k=3) ↓ build_prompt(context + history + question) ↓ llm.invoke() → extract_sources() ↓ Return {answer: str, sources: [{file, page}]} ↓ Frontend displays answer + expandable citations ``` ### Workflow 3: Adaptive Quiz Flow ``` User unlocks Topic ↓ load_lesson() → display_markdown() ↓ User clicks "Take Quiz" ↓ retrieve_context(topic, k=8) ↓ generate_quiz(3_questions) ↓ fallback if < 3? → context_based_fallback() ↓ User answers → calculate_score() ↓ score==3? → mark_advanced() score<3? → mark_review() ↓ update_mastery_tracker() → save_profile() ↓ unlock_next_topic() → rerun() ``` ### Workflow 4: Mastery Tracking Adaptation ``` Quiz completed with score X ↓ update_subject_mastery({ scores: [..., X], avg_score: calculate_average(), mastery_level: determine_level() // High: ≥75%, Medium: ≥50%, Low: <50% }) ↓ mastery_level==HIGH? → Future topics: Faster pace, advanced examples mastery_level==LOW? → Future topics: More review, foundational content ↓ save_to_profile() ↓ Next plan generation uses mastery data for difficulty ``` --- ## 📊 Evaluation Metrics ### 1. Plan Quality Assessment **Metrics**: - **Subject Coverage**: % of uploaded subjects represented daily - **Balance Score**: Standard deviation of topics per subject - **Unlocking Logic**: % of topics that unlock correctly after quiz **Target**: - 100% subject coverage (all PDFs represented) - StdDev < 0.5 (even distribution) - 100% unlock success rate ### 2. Answer Accuracy Measurement **Metrics**: - **Source Relevance**: Cosine similarity of retrieved chunks - **Citation Accuracy**: % of answers with valid file+page citations - **Hallucination Rate**: Manual review of 50 Q&A pairs **Target**: - Avg similarity > 0.7 - 95%+ citation accuracy - <5% hallucination rate ### 3. Quiz Discrimination **Metrics**: - **Question Validity**: % of questions answerable from provided context - **Distractor Quality**: % of students choosing incorrect options - **Difficulty Spread**: Distribution across easy/medium/hard **Target**: - 100% context-answerable - 25-40% distractor selection rate (not too easy/hard) - Balanced difficulty distribution ### 4. User Mastery Gains **Metrics**: - **Score Progression**: Δ average score from Day 1 to Day N - **Mastery Level Changes**: % of subjects moving from Low → Medium → High - **Retention Rate**: Quiz score on repeated topics after 1 week **Target**: - +15% average score improvement over 5 days - 60%+ mastery level improvement - 80%+ retention on repeated topics ### 5. System Performance **Metrics**: - **Plan Generation Time**: Seconds to generate 5-day plan - **Query Response Time**: Seconds from question to answer - **Profile Save Latency**: Milliseconds for atomic write **Target**:r - Plan gen: <10 seconds - Query response: <5 seconds - Save latency: <100ms --- ## 🚀 Future Enhancements - **Spaced Repetition**: Intelligent review scheduling using SM-2 algorithm - **Multi-User Support**: Authentication + isolated student profiles - **Cloud Deployment**: Oracle Cloud + Supabase for persistence - **Advanced Analytics**: Learning curve visualization, weak area identification - **Mobile Responsive**: Material Design responsive UI for mobile devices --- ## 📝 Project Repository **GitHub**: [thesivarohith/hack](https://github.com/thesivarohith/hack) **Status**: Production-ready, cleaned codebase (commit: 9a8a489) --- *Documentation generated: 2026-01-06*