FocusFlow - Technical Documentation

📋 Project Overview

Problem Statement

Students face significant challenges in managing self-paced learning:

  • Information Overload: PDFs, videos, and notes scattered across sources make it difficult to create coherent study plans
  • Lack of Personalization: Generic study materials don't adapt to individual learning pace or mastery level
  • No Progress Tracking: Students can't easily measure improvement or identify knowledge gaps
  • Verification Gap: No way to trace AI-generated answers back to source materials

Solution Description

FocusFlow is an intelligent, local-first study companion that transforms unstructured learning materials into personalized, adaptive study experiences. It combines RAG (Retrieval-Augmented Generation) with synthetic student profiling to create customized learning paths that evolve based on performance.

Key Innovation: Synthetic student profiles enable the app to "remember" progress across sessions and dynamically adjust difficulty, review frequency, and content depth based on demonstrated mastery.

Target Users

  • Self-paced learners preparing for exams or certifications
  • Students managing multiple subjects with varied materials
  • Knowledge workers building expertise in new domains
  • Anyone seeking structured, verifiable learning from diverse sources

🎯 Core Features

1. Multi-Subject Study Roadmap Generation

  • Automated topic extraction from uploaded PDFs and documents
  • Multi-day planning with topics distributed across subjects
  • Subject identification from document content and metadata
  • Round-robin scheduling ensures balanced coverage across all subjects
  • Progressive unlocking - topics unlock as previous ones are completed

Example: Upload 3 PDFs → get a 5-day plan with 3 topics/day (one from each subject)

2. RAG-Based Q&A System

  • Context-aware retrieval using ChromaDB vector search
  • Conversational memory with chat history rewriting for follow-up questions
  • Multi-source search across all uploaded documents
  • Streaming responses with source citation
  • Focus Mode for distraction-free studying with side-by-side lesson/chat

3. Adaptive Quiz Generation

  • Context-based questions generated from actual course material
  • Realistic distractors using common misconceptions
  • Guaranteed 3-question format with intelligent fallbacks
  • Score-based adaptation:
    • Perfect score (3/3) → Future topics marked "Advanced"
    • Low score (1-2/3) → Future topics include review materials
  • Automatic unlocking of next topic upon quiz completion
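The score-based adaptation above can be sketched roughly as follows. The function name, the `difficulty` and `unlocked` keys, and the choice to tag every remaining topic (rather than only topics in the same subject) are assumptions made for illustration; scores below 3 are treated as "review", matching the `score<3` branch in the quiz workflow.

```python
def adapt_future_topics(topics: list[dict], completed_index: int,
                        score: int, total: int = 3) -> list[dict]:
    """Tag the topics after `completed_index` from the quiz score, then unlock the next one."""
    # Perfect score -> "Advanced"; anything lower -> include review material.
    label = "Advanced" if score == total else "Review"
    for topic in topics[completed_index + 1:]:
        topic["difficulty"] = label  # hypothetical key
    # Completing the quiz unlocks the next topic regardless of the score.
    if completed_index + 1 < len(topics):
        topics[completed_index + 1]["unlocked"] = True  # hypothetical key
    return topics


plan = [
    {"title": "Encapsulation", "unlocked": True},
    {"title": "Inheritance", "unlocked": False},
    {"title": "Polymorphism", "unlocked": False},
]
adapt_future_topics(plan, completed_index=0, score=2)
```

After this call, the two remaining topics carry the "Review" label and only the second topic is unlocked.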

4. Knowledge Tracking & Mastery System

  • Subject-level mastery tracking (High/Medium/Low)
  • Historical quiz performance with timestamps
  • Average score calculation across all attempts
  • Mastery-based difficulty adjustment for future content
  • Analytics dashboard with performance classification

5. Synthetic Student Profiles

  • Persistent JSON storage in ~/.focusflow/student_profile.json
  • Comprehensive tracking:
    • Study plan with topic metadata
    • Quiz history with scores and timestamps
    • Mastery levels per subject
    • Time tracking per topic
    • Incomplete task queue
  • Atomic writes with backup for data integrity
  • Thread-safe operations for concurrent access

6. Data Persistence & Session Resumption

  • Auto-save on key events:
    • Plan generation
    • Quiz completion
    • Topic transitions
  • Auto-load on startup restores:
    • Active study plan
    • Quiz scores and progress
    • Mastery levels
    • Current position
  • Toast notifications for save/load feedback

7. Citations & Source Verification

  • Expandable source references with every AI response
  • File + page number for each citation
  • Lesson content references section at bottom
  • Inline citation prompts to LLM for accurate attribution
  • Numbered citation format for easy lookup

🏗️ Technical Architecture

Frontend Components (Streamlit)

┌─────────────────────────────────────────────────┐
│  Control Center  │  Workspace  │   Calendar     │
│  (Sidebar)       │  (Main)     │   (Sidebar)    │
├──────────────────┼─────────────┼────────────────┤
│ - Study Timer    │ - Chat UI   │ - Date Picker  │
│ - Sources List   │ - Lessons   │ - Plan View    │
│ - File Upload    │ - Analytics │ - Today's      │
│ - Plan Gen       │ - Quizzes   │   Topics       │
└──────────────────┴─────────────┴────────────────┘

Key UI Panels:

  • Control Center: Timer, source management, plan generation
  • Intelligent Workspace: RAG chat, lesson viewer, analytics modal
  • Calendar: Date-based topic navigation, today's task list
  • Focus Mode: Immersive 2-column layout (chat | lesson)

Chat Interface:

  • Message history with role differentiation (user/assistant)
  • Source citation expandables
  • Streaming/static responses
  • Contained scrollable area (600px height)

Lesson Viewer:

  • Markdown rendering with references section
  • Scrollable document container (650px height)
  • Inline citations and code examples

Backend Logic (FastAPI)

Planning Engine (generate_study_plan):

  1. Query vector store for topic-related content
  2. Group documents by source (each source = subject)
  3. Extract subject names from content/metadata
  4. Round-robin topic selection across subjects
  5. Generate multi-day schedule with metadata
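Steps 4 and 5 can be illustrated with a small round-robin scheduler. This is a minimal sketch, assuming topics have already been grouped by subject; the function signature and the `day`, `subject`, `topic`, and `status` keys are hypothetical and not taken from the codebase.

```python
from collections import deque


def round_robin_schedule(topics_by_subject: dict[str, list[str]],
                         num_days: int) -> list[dict]:
    """Assign at most one topic per subject per day, cycling through subjects in order."""
    queues = {subject: deque(topics) for subject, topics in topics_by_subject.items()}
    plan = []
    for day in range(1, num_days + 1):
        for subject, queue in queues.items():
            if queue:  # a subject with no topics left is skipped for this day
                plan.append({
                    "day": day,
                    "subject": subject,
                    "topic": queue.popleft(),
                    # Assumption: only Day 1 starts unlocked.
                    "status": "unlocked" if day == 1 else "locked",
                })
    return plan
```

With three subjects that each have at least five topics, `round_robin_schedule(topics, num_days=5)` yields three entries per day, one per subject.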

Retrieval System (query_knowledge_base):

  1. Context rewriting for multi-turn conversations
  2. Similarity search across vector database
  3. LLM synthesis with retrieved context
  4. Source metadata extraction and return
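The four retrieval steps can be sketched without tying the example to a specific LangChain or ChromaDB call. In the sketch below, `rewrite`, `search`, and `generate` are hypothetical callables standing in for the query rewriter, the vector search, and the LLM; the hit structure and prompt wording are also assumptions.

```python
from typing import Callable

HISTORY_WINDOW = 5  # the Memory section lists the last 5 messages as chat history


def answer_question(
    question: str,
    history: list[dict],
    rewrite: Callable[[str, list[dict]], str],
    search: Callable[[str, int], list[dict]],
    generate: Callable[[str], str],
    k: int = 3,
) -> dict:
    recent = history[-HISTORY_WINDOW:]
    # 1. Rewrite follow-ups into standalone questions only when history exists.
    standalone = rewrite(question, recent) if recent else question
    # 2. Similarity search; each hit is assumed to look like
    #    {"text": ..., "metadata": {"source": ..., "page": ...}}.
    hits = search(standalone, k)
    # 3. LLM synthesis over the retrieved context.
    context = "\n\n".join(hit["text"] for hit in hits)
    prompt = f"Context:\n{context}\n\nQuestion: {standalone}\nAnswer using only the context."
    answer = generate(prompt)
    # 4. Return source metadata alongside the answer.
    sources = [
        {"file": hit["metadata"].get("source"), "page": hit["metadata"].get("page")}
        for hit in hits
    ]
    return {"answer": answer, "sources": sources}
```

Passing the dependencies in as callables keeps the pipeline testable with stubs, at the cost of a longer signature.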

Quiz Generator (generate_quiz_data):

  1. Retrieve relevant content chunks for topic
  2. Prompt LLM for context-based questions
  3. Fallback question generation from raw content
  4. Ensure the output contains exactly 3 questions
  5. Return structured quiz data
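The "exactly 3 questions" guarantee in steps 3 and 4 can be sketched as a filter-then-pad step. The question schema (`question`, `options`, `answer`) and the helper names are assumptions for illustration; the real quiz format is not shown in this document.

```python
QUIZ_SIZE = 3


def is_valid(question: dict) -> bool:
    """Assumed schema: non-empty text, at least two options, answer among the options."""
    options = question.get("options", [])
    return (
        bool(question.get("question"))
        and len(options) >= 2
        and question.get("answer") in options
    )


def finalize_quiz(llm_questions: list[dict], fallback_questions: list[dict]) -> list[dict]:
    """Keep valid LLM questions, pad from the fallback pool, and return exactly QUIZ_SIZE."""
    quiz = [q for q in llm_questions if is_valid(q)][:QUIZ_SIZE]
    for candidate in fallback_questions:
        if len(quiz) == QUIZ_SIZE:
            break
        if is_valid(candidate) and candidate not in quiz:
            quiz.append(candidate)
    if len(quiz) < QUIZ_SIZE:
        raise ValueError(f"only {len(quiz)} valid questions available, need {QUIZ_SIZE}")
    return quiz
```

Raising when the fallback pool is also too small is one possible policy; an application could instead retry generation.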

Lesson Generator (generate_lesson_content):

  1. Retrieve 8 context chunks (500 chars each)
  2. Extract source citations from metadata
  3. Prompt for structured lesson (600-800 words)
  4. Append references section with file + page
  5. Return formatted Markdown
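Step 4 (appending a references section with file and page) might look like the sketch below. It assumes each chunk's metadata carries `source` and `page` keys, as listed for the vector database; the heading text and the numbered `[n]` format are illustrative choices.

```python
import os


def build_references(chunks_metadata: list[dict]) -> str:
    """Build a numbered, de-duplicated references section from chunk metadata."""
    seen: list[tuple] = []
    for meta in chunks_metadata:
        key = (os.path.basename(meta.get("source", "unknown")), meta.get("page"))
        if key not in seen:  # keep first-seen order, drop repeats
            seen.append(key)
    lines = ["## References", ""]
    for number, (filename, page) in enumerate(seen, start=1):
        page_part = f", page {page}" if page is not None else ""
        lines.append(f"[{number}] {filename}{page_part}")
    return "\n".join(lines)


print(build_references([
    {"source": "./data/oop.pdf", "page": 3},
    {"source": "./data/oop.pdf", "page": 3},
    {"source": "./data/db.pdf", "page": 1},
]))
```

The file names in the usage example are made up; the duplicate chunk collapses into a single entry.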

Data Storage

Vector Database (ChromaDB):

  • Local storage at ./chroma_db
  • Nomic-embed-text embeddings
  • Metadata: source path, page number
  • Persistent across sessions

Student Profiles (JSON):

{
  "student_id": "student_20260105_233537",
  "study_plan": {
    "plan_id": "plan_...",
    "topics": [...],
    "num_days": 5
  },
  "quiz_history": [...],
  "mastery_tracker": {...},
  "time_tracking": {...},
  "incomplete_tasks": [...]
}

File Storage:

  • Uploaded PDFs in ./data/
  • Profile at ~/.focusflow/student_profile.json
  • Backup at ~/.focusflow/student_profile.backup.json

Agentic Behaviors

Multi-Step Planning:

  • Query → Retrieval → Topic Extraction → Subject Grouping → Schedule Generation
  • 5+ steps with intermediate reasoning

Tool Use:

  • Vector DB search
  • LLM generation
  • Profile read/write
  • PDF ingestion

Memory:

  • Chat history (last 5 messages)
  • Student profile persistence
  • Quiz performance tracking
  • Mastery levels

Reflection:

  • Score-based plan adaptation
  • Context quality assessment
  • Fallback strategies for generation failures

💻 Tech Stack

Languages & Frameworks

  • Frontend: Python + Streamlit 1.x
  • Backend: FastAPI + Uvicorn
  • Vector DB: ChromaDB
  • LLM Orchestration: LangChain

Libraries & APIs

Core Dependencies:

streamlit              # Frontend UI
fastapi               # Backend API
uvicorn              # ASGI server
langchain            # LLM orchestration
chromadb             # Vector database
ollama               # Local LLM inference
requests             # API communication
pydantic             # Data validation

Document Processing:

pypdf                # PDF parsing
beautifulsoup4       # Web scraping
youtube-transcript-api  # Video transcripts

Data & Visualization:

pandas               # Data manipulation
plotly               # Analytics charts

Models

  • Embedding: nomic-embed-text (local via Ollama)
  • Generation: llama3.2:1b (local via Ollama)

Storage Methods

  • Vector Store: ChromaDB (local, persistent)
  • Profiles: JSON files (atomic writes)
  • PDF Files: Local filesystem (./data/)
  • Session State: Streamlit session storage

🔄 Key Workflows

Workflow 1: Study Plan Generation

User uploads PDFs
      ↓
Backend ingests → Chunks → Embeds → Stores in ChromaDB
      ↓
User: "Create 5-day plan"
      ↓
retrieve_topics(k=20)
      ↓
group_by_source() → identify_subjects()
      ↓
round_robin_schedule(num_days=5)
      ↓
save_to_profile() → Return plan
      ↓
Frontend displays Today's Topics (Day 1 unlocked)

Workflow 2: RAG Retrieval

User asks: "What is encapsulation?"
      ↓
history_exists? → rewrite_query()  [Context rewriting]
      ↓
similarity_search(question, k=3)
      ↓
build_prompt(context + history + question)
      ↓
llm.invoke() → extract_sources()
      ↓
Return {answer: str, sources: [{file, page}]}
      ↓
Frontend displays answer + expandable citations

Workflow 3: Adaptive Quiz Flow

User unlocks Topic
      ↓
load_lesson() → display_markdown()
      ↓
User clicks "Take Quiz"
      ↓
retrieve_context(topic, k=8)
      ↓
generate_quiz(3_questions)
      ↓
fallback if < 3? → context_based_fallback()
      ↓
User answers → calculate_score()
      ↓
score==3? → mark_advanced()
score<3?  → mark_review()
      ↓
update_mastery_tracker() → save_profile()
      ↓
unlock_next_topic() → rerun()

Workflow 4: Mastery Tracking Adaptation

Quiz completed with score X
      ↓
update_subject_mastery({
  scores: [..., X],
  avg_score: calculate_average(),
  mastery_level: determine_level()  // High: ≥75%, Medium: ≥50%, Low: <50%
})
      ↓
mastery_level==HIGH?
  → Future topics: Faster pace, advanced examples
mastery_level==LOW?
  → Future topics: More review, foundational content
      ↓
save_to_profile()
      ↓
Next plan generation uses mastery data for difficulty
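The thresholds in this workflow translate into a short classification function. This sketch assumes each quiz is scored out of 3 and that a subject with no attempts starts at "Low"; both the default and the exact signature are assumptions.

```python
def determine_level(scores: list[int], max_score: int = 3) -> str:
    """Classify mastery from the average quiz score: High >= 75%, Medium >= 50%, else Low."""
    if not scores:
        return "Low"  # assumption: no attempts yet counts as Low
    avg_pct = 100 * sum(scores) / (len(scores) * max_score)
    if avg_pct >= 75:
        return "High"
    if avg_pct >= 50:
        return "Medium"
    return "Low"
```

For example, scores of 3, 3, and 2 average about 89% and classify as "High", while 2 and 1 average exactly 50% and classify as "Medium".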

📊 Evaluation Metrics

1. Plan Quality Assessment

Metrics:

  • Subject Coverage: % of uploaded subjects represented daily
  • Balance Score: Standard deviation of topics per subject
  • Unlocking Logic: % of topics that unlock correctly after quiz

Target:

  • 100% subject coverage (all PDFs represented)
  • StdDev < 0.5 (even distribution)
  • 100% unlock success rate
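The coverage and balance metrics could be computed from a generated plan along these lines. The plan entry keys (`day`, `subject`) are hypothetical, and the use of the population standard deviation is an assumption, since the document does not say which variant is intended.

```python
from collections import Counter
from statistics import pstdev


def plan_quality(plan: list[dict], subjects: list[str]) -> tuple[float, float]:
    """Return (average daily subject coverage in %, std dev of topics per subject)."""
    days = sorted({entry["day"] for entry in plan})
    # Fraction of subjects that appear on each day.
    daily = [
        len({e["subject"] for e in plan if e["day"] == day}) / len(subjects)
        for day in days
    ]
    coverage_pct = 100 * sum(daily) / len(daily)
    # Topics per subject over the whole plan, including subjects with none.
    counts = Counter(entry["subject"] for entry in plan)
    balance = pstdev([counts.get(subject, 0) for subject in subjects])
    return coverage_pct, balance
```

A plan that schedules every subject once per day returns `(100.0, 0.0)`, which meets both targets.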

2. Answer Accuracy Measurement

Metrics:

  • Source Relevance: Cosine similarity of retrieved chunks
  • Citation Accuracy: % of answers with valid file+page citations
  • Hallucination Rate: Manual review of 50 Q&A pairs

Target:

  • Avg similarity > 0.7
  • 95%+ citation accuracy
  • <5% hallucination rate
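The source relevance metric relies on cosine similarity between the query embedding and each retrieved chunk embedding. A dependency-free sketch is shown below; in practice the vectors would come from the embedding model, and the helper names here are illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_similarity(query_vec: list[float], chunk_vecs: list[list[float]]) -> float:
    """Mean cosine similarity of the retrieved chunks; compare against the 0.7 target."""
    return sum(cosine_similarity(query_vec, v) for v in chunk_vecs) / len(chunk_vecs)
```

Note that vector stores often report a distance rather than a similarity, so scores may need converting before they are compared with the 0.7 target.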

3. Quiz Discrimination

Metrics:

  • Question Validity: % of questions answerable from provided context
  • Distractor Quality: % of students choosing incorrect options
  • Difficulty Spread: Distribution across easy/medium/hard

Target:

  • 100% context-answerable
  • 25-40% distractor selection rate (not too easy/hard)
  • Balanced difficulty distribution
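The distractor selection rate can be measured from logged answers. The sketch below assumes responses are recorded as `(chosen_option, correct_option)` pairs, which is a hypothetical format.

```python
def distractor_selection_rate(responses: list[tuple[str, str]]) -> float:
    """Percentage of responses where the chosen option was not the correct one."""
    if not responses:
        raise ValueError("no responses recorded")
    wrong = sum(1 for chosen, correct in responses if chosen != correct)
    return 100 * wrong / len(responses)


def in_target_band(rate: float, low: float = 25.0, high: float = 40.0) -> bool:
    """True when the rate falls inside the 25-40% target range."""
    return low <= rate <= high
```

One wrong answer out of four gives a rate of 25.0, which sits at the lower edge of the target band.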

4. User Mastery Gains

Metrics:

  • Score Progression: Δ average score from Day 1 to Day N
  • Mastery Level Changes: % of subjects moving from Low → Medium → High
  • Retention Rate: Quiz score on repeated topics after 1 week

Target:

  • +15% average score improvement over 5 days
  • 60%+ mastery level improvement
  • 80%+ retention on repeated topics

5. System Performance

Metrics:

  • Plan Generation Time: Seconds to generate 5-day plan
  • Query Response Time: Seconds from question to answer
  • Profile Save Latency: Milliseconds for atomic write

Target:

  • Plan gen: <10 seconds
  • Query response: <5 seconds
  • Save latency: <100ms
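These latency targets can be checked with a small timing helper. This is a sketch only: the label names and the `timed` context manager are hypothetical, and the targets are copied from the list above.

```python
import time
from contextlib import contextmanager

# Targets from the list above, in seconds.
TARGETS_SECONDS = {"plan_generation": 10.0, "query_response": 5.0, "profile_save": 0.1}


@contextmanager
def timed(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def within_target(label: str, results: dict) -> bool:
    """True when the measured duration is below the target for that label."""
    return results[label] < TARGETS_SECONDS[label]
```

Usage would wrap the call being measured, for example `with timed("query_response", results): ...`, and then pass the same `results` dictionary to `within_target`.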

🚀 Future Enhancements

  • Spaced Repetition: Intelligent review scheduling using SM-2 algorithm
  • Multi-User Support: Authentication + isolated student profiles
  • Cloud Deployment: Oracle Cloud + Supabase for persistence
  • Advanced Analytics: Learning curve visualization, weak area identification
  • Mobile Responsive: Material Design responsive UI for mobile devices

📁 Project Repository

GitHub: thesivarohith/hack

Status: Production-ready, cleaned codebase (commit: 9a8a489)


Documentation generated: 2026-01-06